
Mathematicians see the world differently. I built a One Hot Encoder using an identity matrix.

  • Jason Ismail
  • Apr 4, 2021
  • 2 min read

Updated: May 24, 2021

The notebook can be found here:


Recently I was asked to do some text cleanup for a Natural Language Processing assignment. We were asked to build the functions ourselves and not rely on libraries like NLTK.


Eventually we were asked to build a One Hot Encoder to vectorize the words in our text. So I decided to start with an identity matrix.



I chose a simple sentence that could prove that my function was working as desired.


sentence = "Apple baNana Pear orange, pear!"

I chose this nonsense sentence to prove a few things. I should not get extra words when a word appears more than once or with different capitalization. I also need to ensure that no punctuation is included with the words in the sentence.


Here is my logic flow for the problem.

  • Turn the sentence into tokens.

  • Make the tokens lowercase.

  • Find the unique set of words in my sentence.

  • Find the number of words in my set. (Used to build my identity matrix)

  • Build the identity matrix.

  • Save it to a DataFrame with the index set to the unique words in my set.
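The steps above can be sketched roughly as follows. This is not the notebook's code: it assumes NumPy and pandas for the matrix and the table, and the `tokenize` helper and its letters-only regex are hypothetical stand-ins for whatever cleanup functions the assignment required. A Python set has no guaranteed order, so the vocabulary is sorted here to keep the rows reproducible, which means they may appear in a different order than in my output.

```python
import re

import numpy as np
import pandas as pd

sentence = "Apple baNana Pear orange, pear!"


def tokenize(text):
    """Split text into lowercase word tokens, discarding punctuation.

    Hypothetical helper: it keeps only runs of the letters a-z, which is
    enough for this test sentence but not for general text.
    """
    return re.findall(r"[a-z]+", text.lower())


# Steps 1 and 2: tokens, lowercased.
tokens = tokenize(sentence)

# Step 3: unique words (sorted only so the order is reproducible).
vocab = sorted(set(tokens))

# Steps 4 and 5: one row and one column per unique word.
identity = np.identity(len(vocab), dtype=int)

# Step 6: label the rows with the words so each row can be looked up by token.
key = pd.DataFrame(identity, index=vocab, columns=vocab)
print(key)
```

The duplicate "pear", the odd capitalization in "baNana", and the trailing punctuation all collapse away, leaving four unique words and a 4 x 4 identity matrix.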

This gives me the following:

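But you may notice that I did not include the first column, which belonged to banana. The reason for this is that banana does not need its own column in the dataset since it can be represented as the vector [0, 0, 0]. Every other word still has a 1 in its own column, so the all-zero vector can only mean banana.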
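Here is a minimal sketch of dropping that column, again assuming NumPy and pandas. The vocabulary order is hard-coded with banana first to match my output; with a different order, a different word would become the all-zero row.

```python
import numpy as np
import pandas as pd

# Assumed order: banana first, as in the post's output.
vocab = ["banana", "apple", "orange", "pear"]

# Full identity matrix, one row and one column per word.
full = pd.DataFrame(np.identity(len(vocab), dtype=int), index=vocab, columns=vocab)

# Drop the first column. The first word loses its only 1 and becomes
# the all-zero row; every other row keeps the 1 in its own column.
key = full.iloc[:, 1:]

print(key)
print(key.loc["banana"].tolist())  # [0, 0, 0]
```

Each of the four rows is still distinct, so three columns are enough to tell four words apart.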


What I have essentially built is a key that I can quickly reference by its index: each word points to its own vector.

Then I simply take my sentence and go token by token: match the token to the index, grab the row, and flatten it into a list.
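A rough sketch of that lookup, under the same assumptions as before: NumPy and pandas, a simple letters-only regex for tokenizing, and a vocabulary order hard-coded with banana first so that banana is the all-zero row.

```python
import re

import numpy as np
import pandas as pd

sentence = "Apple baNana Pear orange, pear!"

# Lowercase, then keep only runs of letters (an assumed, simple rule).
tokens = re.findall(r"[a-z]+", sentence.lower())

# Assumed order: banana first, so its column is the one that is dropped.
vocab = ["banana", "apple", "orange", "pear"]
key = pd.DataFrame(
    np.identity(len(vocab), dtype=int), index=vocab, columns=vocab
).iloc[:, 1:]

# For each token, grab its row from the key and flatten it into a list.
encoded = [key.loc[token].tolist() for token in tokens]

for token, vector in zip(tokens, encoded):
    print(token, vector)
# apple [1, 0, 0]
# banana [0, 0, 0]
# pear [0, 0, 1]
# orange [0, 1, 0]
# pear [0, 0, 1]
```

Both occurrences of "pear" map to the same vector, which is exactly what the test sentence was meant to show.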


Now I have turned my words into one hot encoded vectors as required.




