Feature Representation for Text: From TF-IDF to Modern Embeddings (Word2Vec, GloVe)

In natural language processing (NLP), one of the most critical steps is converting raw text into numerical features that machine learning algorithms can understand. This process is called feature representation. In this lesson, we'll explore both classic and modern methods for representing text data.

Traditional Approach: TF-IDF

TF-IDF (Term Frequency-Inverse Document Frequency) has been a cornerstone of text feature representation for decades. It assigns weights to words based on their frequency within a document relative to their occurrence across all documents in a corpus.

How TF-IDF Works

TF-IDF scores a term t in a document d as tf(t, d) × idf(t), where tf(t, d) is how often t appears in d and idf(t) = log(N / df(t)), with N the total number of documents and df(t) the number of documents containing t. Words that are frequent in one document but rare across the corpus therefore receive the highest weights. Here's an example of computing TF-IDF vectors with scikit-learn (note that TfidfVectorizer adds smoothing and L2 normalization on top of this basic formula):

from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus of three short documents
corpus = ['I love Python programming', 'Python is versatile', 'Data science with Python']

# Learn the vocabulary and IDF weights, then transform the corpus
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(corpus)

# The result is a sparse (n_documents x n_terms) matrix; convert it for display
print(tfidf_matrix.toarray())
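
Each row of the resulting matrix corresponds to one document and each column to one vocabulary term; you can check which term each column represents with vectorizer.get_feature_names_out().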

Modern Approaches: Word Embeddings

While TF-IDF is effective, it treats every word as an independent feature and therefore captures no semantic similarity between words. Modern techniques like Word2Vec and GloVe address this by learning dense vector representations in which related words end up close together in the vector space.

Word2Vec: Capturing Semantic Relationships

Word2Vec learns dense vector representations of words from the contexts in which they appear. It uses two architectures: CBOW (Continuous Bag of Words), which predicts a word from its surrounding context, and Skip-gram, which predicts the surrounding context from a word.

Here's how you can use Word2Vec with Gensim:

from gensim.models import Word2Vec

# Tokenized toy corpus: a list of sentences, each a list of words
sentences = [['I', 'love', 'Python'], ['Python', 'is', 'fun']]

# vector_size: embedding dimensionality, window: context size,
# min_count: ignore rarer words, sg=0 selects CBOW (sg=1 would select Skip-gram)
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, workers=4, sg=0)

# The learned 100-dimensional vector for the word 'Python'
print(model.wv['Python'])
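
Once trained on a realistically sized corpus, the model can answer similarity queries. The toy corpus above is far too small to give meaningful results, but the calls look like this:

# Words whose vectors are closest to 'Python' (by cosine similarity)
print(model.wv.most_similar('Python', topn=3))

# Cosine similarity between two specific word vectors
print(model.wv.similarity('Python', 'fun'))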

GloVe: Global Vectors for Word Representation

GloVe combines the strengths of matrix factorization and local context-window methods. Unlike Word2Vec, which learns from individual local context windows, GloVe is trained directly on a global word-word co-occurrence matrix built from the entire corpus.
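
The reference GloVe implementation is a standalone tool rather than a Python library, but pretrained GloVe vectors are easy to use from Python. Here is a minimal sketch, assuming the gensim downloader and its 'glove-wiki-gigaword-100' packaged vectors are available in your environment:

import gensim.downloader as api

# Download (on first use) and load 100-dimensional GloVe vectors
# trained on Wikipedia and Gigaword
glove = api.load('glove-wiki-gigaword-100')

# Keys are lowercase; each lookup returns a dense 100-dimensional vector
print(glove['python'])
print(glove.most_similar('python', topn=3))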

To summarize, while TF-IDF remains valuable for simpler tasks, embeddings like Word2Vec and GloVe are indispensable for advanced NLP applications requiring semantic understanding.