NLP Basics

How do we encode / pre-process text data?

In [16]:
sentence = "The quick brown fox jumped over the lazy dog"

Tokenisation

  • Character or Word
  • Unigram or Bi-Gram or ...
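
As a rough illustration of these choices, the sketch below splits text into word-level or character-level tokens and then groups them into n-grams. The helper name `ngrams` is hypothetical and not part of Keras.

```python
def ngrams(tokens, n):
    """Return all n-grams (as tuples) over a sequence of tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

sentence = "The quick brown fox jumped over the lazy dog"

word_tokens = sentence.lower().split()  # word-level tokens
char_tokens = list("fox")               # character-level tokens

word_unigrams = ngrams(word_tokens, 1)  # one word per gram
word_bigrams = ngrams(word_tokens, 2)   # pairs of adjacent words
char_bigrams = ngrams(char_tokens, 2)   # pairs of adjacent characters

print(word_bigrams[:3])
print(char_bigrams)
```

A sentence of 9 words yields 9 unigrams and 8 bigrams, since each n-gram needs n consecutive tokens.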

Pre-processing:

  • Split on whitespace
  • Filter out basic punctuation
  • Convert to lower case
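
The three steps above can be sketched in plain Python. This is a simplified stand-in, not the Keras implementation: the function name `simple_word_sequence` is hypothetical, and it treats every character in `string.punctuation` as a separator, which may differ from the filter set a library uses.

```python
import string

def simple_word_sequence(text):
    """Lowercase, replace basic punctuation with spaces, split on whitespace."""
    # Map every ASCII punctuation character to a space.
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    return text.lower().translate(table).split()

tokens = simple_word_sequence("The quick, brown fox; jumped over the lazy dog!")
print(tokens)
```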
In [2]:
import numpy as np
import pandas as pd
In [3]:
from keras.preprocessing.text import text_to_word_sequence
In [4]:
text_to_word_sequence(sentence)
Out[4]:
['the', 'quick', 'brown', 'fox', 'jumped', 'over', 'the', 'lazy', 'dog']

Vectorisation

  • Frequency Based Embedding (Sparse)
  • Prediction Based Embedding (Dense)
In [17]:
from keras.preprocessing.text import Tokenizer
In [18]:
simple_tokenizer = Tokenizer(num_words=50)
In [19]:
simple_tokenizer.fit_on_texts([sentence])
In [20]:
print(simple_tokenizer.word_index)
{'the': 1, 'quick': 2, 'brown': 3, 'fox': 4, 'jumped': 5, 'over': 6, 'lazy': 7, 'dog': 8}

Frequency Based Embedding using Sentences

In [25]:
sentences = ["The quick brown fox jumped over the lazy dog",
             "The dog woke up lazily and barked at the fox",
             "the fox looked back and just ignored the dog "]
In [35]:
tokenizer = Tokenizer()
In [36]:
tokenizer.fit_on_texts(sentences)
In [37]:
tokenizer.word_index
Out[37]:
{'the': 1,
 'fox': 2,
 'dog': 3,
 'and': 4,
 'quick': 5,
 'brown': 6,
 'jumped': 7,
 'over': 8,
 'lazy': 9,
 'woke': 10,
 'up': 11,
 'lazily': 12,
 'barked': 13,
 'at': 14,
 'looked': 15,
 'back': 16,
 'just': 17,
 'ignored': 18}
In [38]:
tokenizer.texts_to_sequences(sentences)
Out[38]:
[[1, 5, 6, 2, 7, 8, 1, 9, 3],
 [1, 3, 10, 11, 12, 4, 13, 14, 1, 2],
 [1, 2, 15, 16, 4, 17, 18, 1, 3]]
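
To make the mapping less of a black box, here is a minimal pure-Python sketch that reproduces the word index and sequences shown above for these three sentences. It assumes words are ranked by descending count, with ties kept in first-seen order; the function name `build_word_index` is hypothetical.

```python
from collections import Counter

sentences = ["The quick brown fox jumped over the lazy dog",
             "The dog woke up lazily and barked at the fox",
             "the fox looked back and just ignored the dog "]

def build_word_index(texts):
    """Rank words by descending count; index 0 is left unused."""
    counts = Counter(word for text in texts for word in text.lower().split())
    # most_common() sorts by count and keeps first-seen order among ties.
    return {word: rank for rank, (word, _) in enumerate(counts.most_common(), start=1)}

word_index = build_word_index(sentences)
sequences = [[word_index[w] for w in text.lower().split()] for text in sentences]

print(word_index)
print(sequences)
```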

Frequency Embeddings

  • Binary (word occurrence)
  • Count (the number of times the word occurs)
  • Frequency (count divided by the number of words in the sentence)
  • tf-idf
  • Co-occurrence Matrix
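
The co-occurrence matrix is listed here but not computed in the cells that follow, so here is a small sketch of one way to build the counts. It assumes a symmetric window measured in word positions; the function name `cooccurrence_counts` and the `window` parameter are hypothetical choices for illustration.

```python
from collections import defaultdict

def cooccurrence_counts(texts, window=1):
    """Count how often two words appear within `window` positions of each other."""
    counts = defaultdict(int)
    for text in texts:
        tokens = text.lower().split()
        for i, word in enumerate(tokens):
            # Look only forward; sorting the pair makes the count symmetric.
            for j in range(i + 1, min(i + window + 1, len(tokens))):
                pair = tuple(sorted((word, tokens[j])))
                counts[pair] += 1
    return dict(counts)

texts = ["The quick brown fox jumped over the lazy dog",
         "The dog woke up lazily and barked at the fox",
         "the fox looked back and just ignored the dog "]

counts = cooccurrence_counts(texts, window=1)
print(counts[("fox", "the")])
print(counts[("dog", "lazy")])
```

Each key is an alphabetically sorted word pair, so the dictionary holds one half of the symmetric matrix in sparse form.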
In [41]:
tokenizer.texts_to_matrix(sentences, mode="binary")
Out[41]:
array([[0., 1., 1., 1., 0., 1., 1., 1., 1., 1., 0., 0., 0., 0., 0., 0.,
        0., 0., 0.],
       [0., 1., 1., 1., 1., 0., 0., 0., 0., 0., 1., 1., 1., 1., 1., 0.,
        0., 0., 0.],
       [0., 1., 1., 1., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1.,
        1., 1., 1.]])
In [44]:
sent_count = tokenizer.texts_to_matrix(sentences, mode="count")
In [45]:
sent_freq = tokenizer.texts_to_matrix(sentences, mode="freq")
In [46]:
sent_count[0]
Out[46]:
array([0., 2., 1., 1., 0., 1., 1., 1., 1., 1., 0., 0., 0., 0., 0., 0., 0.,
       0., 0.])
In [47]:
sent_freq[0]
Out[47]:
array([0.        , 0.22222222, 0.11111111, 0.11111111, 0.        ,
       0.11111111, 0.11111111, 0.11111111, 0.11111111, 0.11111111,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        ])
In [48]:
sent_count[0] / sum(sent_count[0])
Out[48]:
array([0.        , 0.22222222, 0.11111111, 0.11111111, 0.        ,
       0.11111111, 0.11111111, 0.11111111, 0.11111111, 0.11111111,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        ])
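
The relationship between the binary, count and frequency modes can also be sketched without Keras. The version below works on a single sentence and reuses the eight-word index printed earlier; the helper name `count_vector` is hypothetical. Position 0 stays empty because word indices start at 1.

```python
from collections import Counter

word_index = {'the': 1, 'quick': 2, 'brown': 3, 'fox': 4,
              'jumped': 5, 'over': 6, 'lazy': 7, 'dog': 8}

def count_vector(tokens, word_index):
    """Return a vector whose i-th entry is the count of the word with index i."""
    vec = [0.0] * (len(word_index) + 1)  # slot 0 is never used
    for word, c in Counter(tokens).items():
        vec[word_index[word]] = float(c)
    return vec

tokens = "the quick brown fox jumped over the lazy dog".split()

count = count_vector(tokens, word_index)
binary = [1.0 if c > 0 else 0.0 for c in count]  # did the word occur at all?
freq = [c / sum(count) for c in count]           # count / sentence length

print(count)
print(binary)
print(freq)
```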
In [51]:
tokenizer.texts_to_matrix(sentences, mode="tfidf")
Out[51]:
array([[0.        , 0.94751189, 0.55961579, 0.55961579, 0.        ,
        0.91629073, 0.91629073, 0.91629073, 0.91629073, 0.91629073,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        ],
       [0.        , 0.94751189, 0.55961579, 0.55961579, 0.69314718,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.91629073, 0.91629073, 0.91629073, 0.91629073, 0.91629073,
        0.        , 0.        , 0.        , 0.        ],
       [0.        , 0.94751189, 0.55961579, 0.55961579, 0.69314718,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.91629073, 0.91629073, 0.91629073, 0.91629073]])
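
The tf-idf values above are consistent with the weighting tf = 1 + ln(count) and idf = ln(1 + N / (1 + df)), where N is the number of sentences and df is the number of sentences containing the word. The sketch below checks that assumption against four of the printed values. The formula is inferred from the output rather than quoted from the library, and the function name `tfidf_weight` is hypothetical.

```python
import math

def tfidf_weight(count, doc_freq, n_docs):
    """tf-idf under the assumed weighting; zero when the word is absent."""
    if count == 0:
        return 0.0
    tf = 1 + math.log(count)
    idf = math.log(1 + n_docs / (1 + doc_freq))
    return tf * idf

n_docs = 3
w_the = tfidf_weight(count=2, doc_freq=3, n_docs=n_docs)    # 'the' in sentence 1
w_fox = tfidf_weight(count=1, doc_freq=3, n_docs=n_docs)    # 'fox' in sentence 1
w_and = tfidf_weight(count=1, doc_freq=2, n_docs=n_docs)    # 'and' in sentence 2
w_quick = tfidf_weight(count=1, doc_freq=1, n_docs=n_docs)  # 'quick' in sentence 1

print(round(w_the, 8), round(w_fox, 8), round(w_and, 8), round(w_quick, 8))
```

Words that appear in every sentence ('the', 'fox', 'dog') get the smallest idf, while words unique to one sentence get the largest.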

Prediction Embedding

  • Learnt on a corpus
    • Word2Vec (word level, learnt using skip-gram)
    • fastText (character n-gram level)
    • GloVe (co-occurrence matrix)
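
To give a feel for what the first two models train on, the sketch below generates skip-gram (centre, context) pairs from a token list and character n-grams from a single word, using `<` and `>` as word-boundary markers. It only prepares training inputs and does not learn any vectors; the function names and default sizes are hypothetical.

```python
def skipgram_pairs(tokens, window=2):
    """Pair each word with every neighbour within `window` positions."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

def char_ngrams(word, n=3):
    """Character n-grams of a word padded with boundary markers."""
    padded = f"<{word}>"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

pairs = skipgram_pairs(["the", "quick", "brown", "fox"], window=1)
grams = char_ngrams("fox", n=3)

print(pairs)
print(grams)
```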
In [52]:
import spacy
In [57]:
#!python -m spacy download en_core_web_lg
In [56]:
nlp = spacy.load("en_core_web_lg")
In [63]:
doc1 = nlp("fox")
doc2 = nlp("dog")
In [64]:
doc1.vector.shape, doc2.vector.shape
Out[64]:
((300,), (300,))
In [66]:
doc1.similarity(doc2)
Out[66]:
0.48585482527991497
In [67]:
king = nlp("king")
queen = nlp("queen")
man = nlp("man")
woman = nlp("woman")
In [69]:
pred_queen = king.vector - man.vector + woman.vector
In [75]:
# Cosine similarity between the predicted vector and the actual "queen" vector
np.dot(pred_queen, queen.vector) / (np.linalg.norm(pred_queen) * np.linalg.norm(queen.vector))
Out[75]:
0.78808445
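
The expression above is the cosine similarity between two vectors. The sketch below spells it out in plain Python and applies the same king - man + woman arithmetic to toy 2-D vectors. The toy vectors are made up for illustration (first axis "royalty", second axis "gender") and are not taken from spaCy.

```python
import math

def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up toy vectors: [royalty, gender]
king, queen = [1.0, 1.0], [1.0, -1.0]
man, woman = [0.0, 1.0], [0.0, -1.0]

# Element-wise king - man + woman
pred_queen = [k - m + w for k, m, w in zip(king, man, woman)]

sim_queen = cosine_similarity(pred_queen, queen)
sim_king = cosine_similarity(pred_queen, king)
print(sim_queen, sim_king)
```

In this toy setup the predicted vector points in the same direction as `queen` and is orthogonal to `king`; real embeddings, as the spaCy result shows, only approximate this.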