Text vectorization is vital in building NLP models. It’s the process of converting raw textual data into numerical representations in vector form. The vectors serve as inputs to various ML algorithms.
Text vectorization is important because ML and deep learning models require numerical inputs to accomplish their tasks. These models can’t process raw text directly the way humans do. Here are some of the methodologies used to perform text vectorization.
One hot encoding is a vectorization algorithm that takes the unique words from textual data and generates a vector for each word. The vectors are of length N, where N is the count of unique words in the data. The position corresponding to the represented word takes a value of 1, while the rest are 0. For example, suppose your data contains the three words “I love dogs.” Then “I” would be 100, “love” would be 010, and “dogs” would be 001.
This method is used less than others because it’s not memory efficient—especially when the data is large. It also doesn’t represent the relationships between words, which can result in lost nuance and meaning.
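The scheme above can be sketched in a few lines of plain Python. The sentence and vocabulary here are just the “I love dogs” example from the text:

```python
# Minimal one-hot encoding sketch for the sentence "I love dogs".
words = "I love dogs".split()
vocab = {word: i for i, word in enumerate(words)}  # unique word -> index
N = len(vocab)  # vector length = number of unique words

def one_hot(word):
    vec = [0] * N
    vec[vocab[word]] = 1  # mark only the position of this word
    return vec

print(one_hot("I"))     # [1, 0, 0]
print(one_hot("love"))  # [0, 1, 0]
print(one_hot("dogs"))  # [0, 0, 1]
```

Note that the vectors grow with the vocabulary: a corpus with 50,000 unique words yields 50,000-dimensional vectors, which is exactly the memory problem mentioned above.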
Count vectorizers are similar to one hot encoding with one key difference. They can capture the frequency of word appearance in your textual data. The algorithm creates a document term matrix where the columns represent the unique words in the data, and the rows represent the documents. Each cell in this matrix represents the frequency of words in each document. This method is also memory inefficient if the amount of data is enormous.
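A count vectorizer can be sketched with the standard library’s `Counter`. The two example documents below are hypothetical, chosen only to show a count greater than 1:

```python
from collections import Counter

# Hypothetical documents; rows of the document term matrix.
docs = ["I love dogs", "dogs love dogs"]

# Columns: the sorted unique words across all documents.
vocab = sorted({w for d in docs for w in d.lower().split()})

def count_vector(doc):
    counts = Counter(doc.lower().split())
    return [counts[w] for w in vocab]  # frequency of each vocab word

matrix = [count_vector(d) for d in docs]
print(vocab)   # ['dogs', 'i', 'love']
print(matrix)  # [[1, 1, 1], [2, 0, 1]]
```

The second row shows the key difference from one hot encoding: “dogs” appears twice in the second document, so its cell holds 2 rather than just 1.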
Bag of Words is a text-processing methodology that extracts features from textual data. It uses a pre-defined dictionary of words to measure the presence of known words in your data and doesn’t consider the order of word appearance.
The algorithm uses this dictionary to loop through all the documents in the data and can use a simple scoring method to create the vectors. For example, it can mark the presence of a word in a vocabulary as 1 or 0 if absent. Additional scoring methods include looking at the frequency of each word appearing in the document.
When using this method, it’s essential to manage your dictionary of words and ensure the vocabulary is relevant to your use case.
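The binary scoring method described above can be sketched as follows; the dictionary here is a small hypothetical vocabulary, not one from the text:

```python
# Hypothetical pre-defined dictionary of known words.
vocabulary = ["dog", "cat", "love", "walk"]

def bow_binary(doc):
    """Score each vocabulary word 1 if present in the document, 0 if absent."""
    tokens = set(doc.lower().split())  # word order is discarded
    return [1 if w in tokens else 0 for w in vocabulary]

print(bow_binary("I love my dog"))  # [1, 0, 1, 0]
```

Because the tokens go into a set, the order of appearance is ignored, matching the bag-of-words assumption; swapping the binary score for a count would give the frequency variant mentioned above.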
N-grams represent sequences of N adjacent words or tokens in a sentence. For example, consider the sentence “I love my dog.” If N = 2, the bigrams would be “I love,” “love my,” and “my dog.” This algorithm creates a document term matrix where columns represent the counts of neighbouring word sequences of length N.
The choice of the N value is critical to ensure the best performance for your NLP models. Smaller N values might not be sufficient to provide conclusive vectors, while larger N values could produce a large matrix with many features.
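Extracting the n-grams themselves is a simple sliding window over the tokens, sketched here with the “I love my dog” example from the text:

```python
def ngrams(text, n):
    """Return all sequences of n adjacent words in the text."""
    tokens = text.split()
    # Slide a window of size n across the token list.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

print(ngrams("I love my dog", 2))  # ['I love', 'love my', 'my dog']
print(ngrams("I love my dog", 3))  # ['I love my', 'love my dog']
```

The trade-off above is visible here: with N = 3 the sentence already yields fewer, more specific features, and across a large corpus the number of distinct n-gram columns grows quickly as N increases.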
Term frequency-inverse document frequency (TF-IDF) combines two statistics. Term frequency (TF) is the ratio of the number of times a word appears in a document to the total number of words in that document. Inverse document frequency (IDF) measures how important a word is across the textual data: it’s the logarithmic ratio of the total number of documents to the number of documents containing that word.
The algorithm creates a document term matrix where columns represent each unique word and rows represent the documents in the data. Each cell represents the weight that indicates the importance of that word in the document.
The multiplication of the TF and IDF values produces the weight used in the document term matrix. This methodology is computationally cost-effective, making it a great choice.
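The TF and IDF formulas above can be sketched directly; the three-document corpus here is hypothetical:

```python
import math

# Hypothetical corpus of three documents.
docs = ["the dog barks", "the cat meows", "dog and cat play"]

def tf(word, doc):
    """Ratio of the word's count to the total words in the document."""
    tokens = doc.split()
    return tokens.count(word) / len(tokens)

def idf(word):
    """Log ratio of total documents to documents containing the word."""
    n_containing = sum(1 for d in docs if word in d.split())
    return math.log(len(docs) / n_containing)

def tfidf(word, doc):
    return tf(word, doc) * idf(word)  # the weight in the document term matrix

# "the" appears in two of three documents, so its IDF is low;
# "barks" appears in only one, so it receives a higher weight.
print(tfidf("the", "the dog barks") < tfidf("barks", "the dog barks"))  # True
```

This illustrates why TF-IDF down-weights common words: a term that occurs in every document has IDF = log(1) = 0 and contributes nothing, while rare, distinctive terms dominate the vectors.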
In this chapter, you’ve learned various strategies for turning text into vectors that can serve as inputs to your NLP models. Move on to the next chapter to learn some of the traditional methods used to build NLP models and see how to use preprocessed text to build a text classifier.
Many different models can be used for predicting and analyzing text. Learn some of the most important ones in the next chapter!