NLP Basics¶

How do we encode / pre-process text data?

sentence = "The quick brown fox jumped over the lazy dog"

Tokenisation¶

Character or Word
Unigram or Bigram or ...
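
To make the two choices above concrete, here is a minimal pure-Python sketch. The helper names `char_tokens` and `word_ngrams` are made up for illustration and are not part of Keras.

```python
def char_tokens(text):
    """Character-level tokens: every character becomes one token."""
    return list(text)


def word_ngrams(words, n):
    """Slide a window of n words over the list and join each window."""
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


words = "the quick brown fox".split()
unigrams = word_ngrams(words, 1)
bigrams = word_ngrams(words, 2)

print(char_tokens("fox"))  # ['f', 'o', 'x']
print(unigrams)            # ['the', 'quick', 'brown', 'fox']
print(bigrams)             # ['the quick', 'quick brown', 'brown fox']
```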

Pre-processing:

Split on whitespace
Filter out basic punctuation
Convert to lowercase
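
The three steps above can be sketched without any library. This is an illustration only: `simple_word_sequence` is a hypothetical helper, and it assumes `string.punctuation` as the set of characters to filter, which is not exactly the filter list Keras uses by default.

```python
import string


def simple_word_sequence(text):
    # Lower-case first so that "The" and "the" become the same token.
    text = text.lower()
    # Replace every punctuation character with a space
    # (assumption: string.punctuation is the set we want to drop).
    table = str.maketrans({ch: " " for ch in string.punctuation})
    text = text.translate(table)
    # split() with no argument splits on any run of whitespace.
    return text.split()


tokens = simple_word_sequence("The quick brown fox, jumped over the lazy dog!")
print(tokens)
```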

import numpy as np
import pandas as pd

from keras.preprocessing.text import text_to_word_sequence

text_to_word_sequence(sentence)

['the', 'quick', 'brown', 'fox', 'jumped', 'over', 'the', 'lazy', 'dog']

Vectorisation¶

Frequency Based Embedding (Sparse)
Prediction Based Embedding (Dense)

from keras.preprocessing.text import Tokenizer

simple_tokenizer = Tokenizer(num_words=50)

simple_tokenizer.fit_on_texts([sentence])

print(simple_tokenizer.word_index)

{'the': 1, 'quick': 2, 'brown': 3, 'fox': 4, 'jumped': 5, 'over': 6, 'lazy': 7, 'dog': 8}

Frequency Based Embedding using Sentences¶

sentences = ["The quick brown fox jumped over the lazy dog",
             "The dog woke up lazily and barked at the fox",
             "the fox looked back and just ignored the dog"]

tokenizer = Tokenizer()

tokenizer.fit_on_texts(sentences)

tokenizer.word_index

{'the': 1,
 'fox': 2,
 'dog': 3,
 'and': 4,
 'quick': 5,
 'brown': 6,
 'jumped': 7,
 'over': 8,
 'lazy': 9,
 'woke': 10,
 'up': 11,
 'lazily': 12,
 'barked': 13,
 'at': 14,
 'looked': 15,
 'back': 16,
 'just': 17,
 'ignored': 18}

tokenizer.texts_to_sequences(sentences)

[[1, 5, 6, 2, 7, 8, 1, 9, 3],
 [1, 3, 10, 11, 12, 4, 13, 14, 1, 2],
 [1, 2, 15, 16, 4, 17, 18, 1, 3]]
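
The word index and the sequences above can be reproduced by hand, which shows where the numbers come from: words are ranked by how often they occur, ties keep the order of first appearance, and index 0 is left unused. This is a sketch with `collections.Counter`, not the Keras implementation.

```python
from collections import Counter

sentences = ["The quick brown fox jumped over the lazy dog",
             "The dog woke up lazily and barked at the fox",
             "the fox looked back and just ignored the dog"]

# Lower-case and split on whitespace (these sentences have no punctuation).
tokenised = [s.lower().split() for s in sentences]

# Count every word across all sentences.
counts = Counter(word for sent in tokenised for word in sent)

# most_common() sorts by count; equal counts keep first-seen order.
# Indices start at 1, leaving 0 free.
word_index = {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

# Replace every word by its index.
sequences = [[word_index[w] for w in sent] for sent in tokenised]

print(word_index)
print(sequences)
```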

Frequency Embeddings¶

Binary (Word Occurrence)
Count (The number of times that word occurs)
Frequency (Count divided by the number of words in the text)
TF-IDF
Co-occurrence Matrix
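
The co-occurrence matrix is not available as a `texts_to_matrix` mode, so here is a rough pure-Python sketch of the idea: count how often two words appear within a fixed window of each other. The function name, the symmetric counting, and the window size of 1 are all assumptions chosen for illustration.

```python
from collections import defaultdict


def cooccurrence_counts(tokenised_sentences, window=1):
    """Count symmetric word pairs that appear within `window` positions."""
    counts = defaultdict(int)
    for sent in tokenised_sentences:
        for i, word in enumerate(sent):
            # Look only to the right, then record the pair in both
            # directions so the resulting matrix is symmetric.
            for j in range(i + 1, min(i + window + 1, len(sent))):
                counts[(word, sent[j])] += 1
                counts[(sent[j], word)] += 1
    return counts


sents = [["the", "quick", "brown", "fox"],
         ["the", "lazy", "dog"]]
cooc = cooccurrence_counts(sents, window=1)

print(cooc[("the", "quick")])          # 1
print(cooc.get(("the", "fox"), 0))     # 0, too far apart for window=1
```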

tokenizer.texts_to_matrix(sentences, mode="binary")

array([[0., 1., 1., 1., 0., 1., 1., 1., 1., 1., 0., 0., 0., 0., 0., 0.,
        0., 0., 0.],
       [0., 1., 1., 1., 1., 0., 0., 0., 0., 0., 1., 1., 1., 1., 1., 0.,
        0., 0., 0.],
       [0., 1., 1., 1., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1.,
        1., 1., 1.]])

sent_count = tokenizer.texts_to_matrix(sentences, mode="count")

sent_freq = tokenizer.texts_to_matrix(sentences, mode="freq")

sent_count[0]

array([0., 2., 1., 1., 0., 1., 1., 1., 1., 1., 0., 0., 0., 0., 0., 0., 0.,
       0., 0.])

sent_freq[0]

array([0.        , 0.22222222, 0.11111111, 0.11111111, 0.        ,
       0.11111111, 0.11111111, 0.11111111, 0.11111111, 0.11111111,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        ])

# "freq" mode is the count divided by the total number of words in the sentence
sent_count[0] / sum(sent_count[0])

array([0.        , 0.22222222, 0.11111111, 0.11111111, 0.        ,
       0.11111111, 0.11111111, 0.11111111, 0.11111111, 0.11111111,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        ])
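
For reference, the binary, count and frequency rows for the first sentence can be rebuilt from its integer sequence. A minimal sketch, assuming a vocabulary of 18 words and an unused column 0 as in the outputs above; `count_row` is a hypothetical helper.

```python
def count_row(sequence, vocab_size):
    """One row of the count matrix; column 0 is reserved and stays 0."""
    row = [0.0] * (vocab_size + 1)
    for idx in sequence:
        row[idx] += 1.0
    return row


seq = [1, 5, 6, 2, 7, 8, 1, 9, 3]   # first sentence as word indices

counts = count_row(seq, vocab_size=18)
binary = [1.0 if c > 0 else 0.0 for c in counts]   # did the word occur at all?
freq = [c / len(seq) for c in counts]              # share of the sentence

print(counts[:10])
print(binary[:10])
print(freq[1])   # "the" occurs 2 times out of 9 words
```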

tokenizer.texts_to_matrix(sentences, mode="tfidf")

array([[0.        , 0.94751189, 0.55961579, 0.55961579, 0.        ,
        0.91629073, 0.91629073, 0.91629073, 0.91629073, 0.91629073,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        ],
       [0.        , 0.94751189, 0.55961579, 0.55961579, 0.69314718,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.91629073, 0.91629073, 0.91629073, 0.91629073, 0.91629073,
        0.        , 0.        , 0.        , 0.        ],
       [0.        , 0.94751189, 0.55961579, 0.55961579, 0.69314718,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.91629073, 0.91629073, 0.91629073, 0.91629073]])
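
The following weighting reproduces the numbers above: term frequency is 1 + ln(count), and inverse document frequency is ln(1 + N / (1 + df)), where N is the number of sentences and df the number of sentences containing the word. Treat it as a sketch for checking the values, with `tfidf_weight` being a made-up helper name.

```python
import math


def tfidf_weight(count, doc_freq, n_docs):
    """TF-IDF weight for one word in one document (0 if the word is absent)."""
    if count == 0:
        return 0.0
    tf = 1 + math.log(count)                     # dampened term frequency
    idf = math.log(1 + n_docs / (1 + doc_freq))  # smoothed inverse doc frequency
    return tf * idf


# "the": appears twice in the sentence, present in all 3 sentences
print(tfidf_weight(2, 3, 3))
# "fox": appears once, present in all 3 sentences
print(tfidf_weight(1, 3, 3))
# "and": appears once, present in 2 of 3 sentences
print(tfidf_weight(1, 2, 3))
# "quick": appears once, present in 1 of 3 sentences
print(tfidf_weight(1, 1, 3))
```

Words that occur in every sentence ("fox", "dog") get the lowest weight, while words unique to one sentence get the highest.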

Prediction Embedding¶

Learnt on a corpus
- Word2Vec (Word level, learnt using skip-gram)
- fastText (Character n-gram level)
- GloVe (Co-occurrence Matrix)
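
To give a feel for what skip-gram learns from, here is a small sketch that generates the (centre word, context word) training pairs. The function name and the window size are assumptions for illustration; the actual training of the embedding is not shown.

```python
def skipgram_pairs(tokens, window=2):
    """Return (centre, context) pairs for every word within the window."""
    pairs = []
    for i, centre in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:  # a word is not its own context
                pairs.append((centre, tokens[j]))
    return pairs


pairs = skipgram_pairs(["the", "quick", "brown", "fox"], window=1)
for pair in pairs:
    print(pair)
```

The model is then trained to predict the context word from the centre word, and the learnt weights become the dense vectors.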

import spacy

#!python -m spacy download en_core_web_lg

nlp = spacy.load("en_core_web_lg")

doc1 = nlp("fox")
doc2 = nlp("dog")

doc1.vector.shape, doc2.vector.shape

((300,), (300,))

doc1.similarity(doc2)

0.48585482527991497

king = nlp("king")
queen = nlp("queen")
man = nlp("man")
woman = nlp("woman")

pred_queen = king.vector - man.vector + woman.vector

# Cosine similarity between the predicted vector and the actual "queen" vector
np.dot(pred_queen, queen.vector) / np.linalg.norm(pred_queen) / np.linalg.norm(queen.vector)

0.78808445
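
The expression above is the cosine similarity. Here is a self-contained sketch of it on toy vectors, so the analogy arithmetic can be followed by hand. The 3-dimensional vectors (royalty, male, female) are invented for illustration and are not real embeddings.

```python
import math


def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical toy vectors: (royalty, male, female)
king = [1.0, 1.0, 0.0]
man = [0.0, 1.0, 0.0]
woman = [0.0, 0.0, 1.0]
queen = [1.0, 0.0, 1.0]

# king - man + woman, computed element by element
pred_queen = [k - m + w for k, m, w in zip(king, man, woman)]

print(pred_queen)
print(cosine_similarity(pred_queen, queen))
print(cosine_similarity(king, woman))
```

With real embeddings the match is never perfect, which is why the similarity above is about 0.79 rather than 1.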