NLP Lecture 1

by Javantea
Oct 5, 2020

KNN Pipeline

I'm starting to work on Amazon Web Services (AWS)'s Natural Language Processing (NLP) course through Machine Learning University. This blog post wrote itself, not literally. My aims are far more practical. I want to be able to classify text for projects that build up a set of information that people would be able to use to create interesting solutions for their group. Would you like to input this blog to a python script and find out what I'm talking about? That may become a reality.

I'm writing this blog in a Jupyter notebook. I realize that you might resist the idea of installing a piece of software to run my examples, but I hope you will sandbox it and see what happens. Jupyter is surprisingly well-designed even if it takes time to learn how to use it properly. Since I realize you might not want to run Jupyter, I've also released my code as python scripts. Jupyter allows export to executable scripts and I highly recommend using this feature.

Lecture 1 of the NLP course is a mess. It impressed me to see what terrible code they could release as a course. But after providing a serious demotivation, it was able to reverse that. I am now motivated to show the shortcomings of this course and how easily they can be fixed.

MLA-NLP-Lecture1-KNN.ipynb is our first mess.

The stemmer converts sensible sentences into jibberish losing the original meaning of the sentence. How could this result in sentiment analysis? The answer is that it focuses on words that when stemmed coincide with sentiment. That is -- any nuance in your speech is lost and the classifier simply counts how many times you say "great" or "good" or "bad". If you use this system to classify nuanced speech, it will not work. At best you get 50% chance of correct classification of 2-class prediction. At worst, you classify text incorrectly. Let's take a quick look.

In [1]:
import nltk, re
from nltk.corpus import stopwords
from nltk.stem import SnowballStemmer
from nltk.tokenize import word_tokenize

# Let's get a list of stop words from the NLTK library
stop = stopwords.words('english')

# These words are important for our problem. We don't want to remove them.
excluding = ['against', 'not', 'don', "don't",'ain', 'aren', "aren't", 'couldn', "couldn't",
             'didn', "didn't", 'doesn', "doesn't", 'hadn', "hadn't", 'hasn', "hasn't", 
             'haven', "haven't", 'isn', "isn't", 'mightn', "mightn't", 'mustn', "mustn't",
             'needn', "needn't",'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', 
             "weren't", 'won', "won't", 'wouldn', "wouldn't"]

# New stop word list
stop_words = [word for word in stop if word not in excluding]

snow = SnowballStemmer('english')

def process_text(texts): 
    final_text_list=[]
    for sent in texts:
        
        # Check if the sentence is a missing value
        if isinstance(sent, str) == False:
            sent = ""
            
        filtered_sentence=[]
        
        sent = sent.lower() # Lowercase 
        sent = sent.strip() # Remove leading/trailing whitespace
        sent = re.sub(r'\s+', ' ', sent) # Collapse whitespace (spaces, tabs, newlines) into single spaces
        sent = re.compile('<.*?>').sub('', sent) # Remove HTML tags/markup
        
        for w in word_tokenize(sent):
            # We are applying some custom filtering here, feel free to try different things
            # Check if it is not numeric and its length>2 and not in stop words
            if(not w.isnumeric()) and (len(w)>2) and (w not in stop_words):  
                # Stem and add to filtered list
                filtered_sentence.append(snow.stem(w))
        final_string = " ".join(filtered_sentence) #final string of cleaned words
 
        final_text_list.append(final_string)
        
    return final_text_list
In [5]:
blog_part1 = process_text(["I'm starting to work on Amazon Web Services (AWS)'s Natural Language Processing (NLP) course through Machine Learning University.", 
"This blog post wrote itself, not literally.", 
"My aims are far more practical.", 
"I want to be able to classify text for projects that build up a set of information that people would be able to use to create interesting solutions for their group.", 
"Would you like to input this blog to a python script and find out what I'm talking about?", 
"That may become a reality.", 

"I'm writing this blog in a Jupyter notebook.", 
"I realize that you might resist the idea of installing a piece of software to run my examples, but I hope you will sandbox it and see what happens.", 
"Jupyter is surprisingly well-designed even if it takes time to learn how ot use it properly.", 
"Since I realize you might not want to run Jupyter, I've also released my code as python scripts.", 
"Jupyter allows export to executable scripts and I highly recommend using this feature.", 

"Lecture 1 of the NLP course is a mess.", 
"It impressed me to see what terrible code they could release as a course.", 
"But after providing a serious demotivation, it was able to reverse that.", 
"I am now motivated to show the shortcomings of this course and how easily they can be fixed.", 

"MLA-NLP-Lecture1-KNN.ipynb is our first mess.", 

"The stemmer converts sensible sentences into jibberish losing the original meaning of the sentence.", 
"How could this result in sentiment analysis?", 
"The answer is that it focuses on words that when stemmed coincide with sentiment.", 
"That is -- any nuance in your speech is lost and the classifier simply counts how many times you say \"great\" or \"good\" or \"bad\".", 
"If you use this system to classify nuanced speech, it will not work.", 
"At best you get 50% chance of correct classification of 2-class prediction.", 
"At worst, you classify text incorrectly.", 
"Let's take a quick look."])
print(blog_part1)
['start work amazon web servic aw natur languag process nlp cours machin learn univers', 'blog post wrote not liter', 'aim far practic', 'want abl classifi text project build set inform peopl would abl use creat interest solut group', 'would like input blog python script find talk', 'may becom realiti', 'write blog jupyt notebook', 'realiz might resist idea instal piec softwar run exampl hope sandbox see happen', 'jupyt surpris well-design even take time learn use proper', 'sinc realiz might not want run jupyt ve also releas code python script', 'jupyt allow export execut script high recommend use featur', 'lectur nlp cours mess', 'impress see terribl code could releas cours', 'provid serious demotiv abl revers', 'motiv show shortcom cours easili fix', 'mla-nlp-lecture1-knn.ipynb first mess', 'stemmer convert sensibl sentenc jibberish lose origin mean sentenc', 'could result sentiment analysi', 'answer focus word stem coincid sentiment', 'nuanc speech lost classifi simpli count mani time say great good bad', 'use system classifi nuanc speech not work', 'best get chanc correct classif 2-class predict', 'worst classifi text incorrect', 'let take quick look']

As you can see, most of the nuance is lost. How could we possibly fix this? Well, we could actually skip stemming. The reason they have chosen stemming is to get more information out of a small corpus. If you walked through the KNN notebook, you found the data has shape (70000, 6), which means 70,000 rows and 6 columns. 70,000 rows of text seems like a lot when you are considering Amazon reviews, but as a corpus, 70,000 reviews is small. The table is 33MB, which works out to an average of about 470 bytes per review, and that is enough raw text to build a good classifier from. But most of that data never gets counted toward sentiment; it's thrown away. That's right, our second bug in the KNN lecture is the CountVectorizer.
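Before we get to that: skipping the stemmer really is a one-line change to process_text. Here is a sketch of a no-stemming variant (it reuses re, word_tokenize, and stop_words from the cell above; this is my variant, not something from the course):

def process_text_no_stem(texts):
    """Same cleanup as process_text, but keeps whole words instead of stems."""
    final_text_list = []
    for sent in texts:
        # Replace missing values with an empty string
        if not isinstance(sent, str):
            sent = ""
        sent = sent.lower().strip()
        sent = re.sub(r'\s+', ' ', sent)          # collapse whitespace
        sent = re.compile('<.*?>').sub('', sent)  # strip HTML tags
        words = [w for w in word_tokenize(sent)
                 if (not w.isnumeric()) and len(w) > 2 and w not in stop_words]
        final_text_list.append(" ".join(words))
    return final_text_list

Swapping this in keeps words like "incorrectly" and "surprisingly" intact instead of reducing them to stems.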

In [6]:
import pandas as pd

df = pd.read_csv('../data/examples/AMAZON-REVIEW-DATA-CLASSIFICATION.csv')

print('The shape of the dataset is:', df.shape)
The shape of the dataset is: (70000, 6)
In [7]:
df["isPositive"].value_counts()
Out[7]:
1.0    43692
0.0    26308
Name: isPositive, dtype: int64
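As a quick sanity check on the 470-bytes-per-review estimate above, the average length of the raw review text can be computed directly. A one-liner (my addition, not part of the original notebook):

# Mean character length of the raw reviews; missing reviews are skipped by mean()
print(df["reviewText"].str.len().mean())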
In [8]:
from sklearn.model_selection import train_test_split


X_train, X_val, y_train, y_val = train_test_split(df[["reviewText"]],
                                                  df["isPositive"],
                                                  test_size=0.10,
                                                  shuffle=True,
                                                  random_state=324
                                                 )
In [9]:
print("Processing the reviewText fields")
train_text_list = process_text(X_train["reviewText"].tolist())
val_text_list = process_text(X_val["reviewText"].tolist())
Processing the reviewText fields
In [10]:
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.neighbors import KNeighborsClassifier

### PIPELINE ###
##########################

pipeline = Pipeline([
    ('text_vect', CountVectorizer(binary=True,
                                  max_features=15)),
    ('knn', KNeighborsClassifier())  
                                ])

# Visualize the pipeline
# This will come in handy especially when building more complex pipelines, stringing together multiple preprocessing steps
from sklearn import set_config
set_config(display='diagram')
pipeline
Out[10]:
Pipeline(steps=[('text_vect', CountVectorizer(binary=True, max_features=15)),
                ('knn', KNeighborsClassifier())])
CountVectorizer(binary=True, max_features=15)
KNeighborsClassifier()

The bug is in the above code. Can you see it? max_features=15 is a pretty subtle bug. What do you think it does? Let's find out!

In [11]:
# We use the lists of processed text fields 
X_train = train_text_list
X_val = val_text_list

# Fit the Pipeline to training data
pipeline.fit(X_train, y_train.values)
Out[11]:
Pipeline(steps=[('text_vect', CountVectorizer(binary=True, max_features=15)),
                ('knn', KNeighborsClassifier())])
CountVectorizer(binary=True, max_features=15)
KNeighborsClassifier()
In [12]:
pipeline.steps[0]
Out[12]:
('text_vect', CountVectorizer(binary=True, max_features=15))
In [13]:
pipeline.steps[0][1].vocabulary_
Out[13]:
{'not': 4,
 'time': 9,
 'version': 11,
 'would': 13,
 'great': 2,
 'program': 7,
 'get': 1,
 'like': 3,
 'use': 10,
 'one': 5,
 'work': 12,
 'comput': 0,
 'product': 6,
 'year': 14,
 'softwar': 8}

If you went through MLA-NLP-Lecture1-BOW.ipynb, you know exactly what this means: the only words that actually count towards sentiment are not, time, version, would, great, program, get, like, use, one, work, comput, product, year, and softwar.
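If max_features is unfamiliar: it tells CountVectorizer to keep only the N terms with the highest total frequency across the corpus and to silently drop everything else. A toy sketch (my own example, not from the course):

from sklearn.feature_extraction.text import CountVectorizer

toy = ["good good product", "bad product", "good price", "bad bad support"]
toy_vect = CountVectorizer(max_features=3)
toy_vect.fit(toy)
# Only the three most frequent tokens survive; 'price' and 'support' are dropped entirely.
print(toy_vect.vocabulary_)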

In [16]:
", ".join(pipeline.steps[0][1].vocabulary_.keys())
Out[16]:
'not, time, version, would, great, program, get, like, use, one, work, comput, product, year, softwar'

Does that sound like a sentiment analysis classifier to you? No. It isn't. Let's run the CountVectorizer on all 70,000 reviews and see how many actually end up with any counted data that can be sent on to the next step, the KNN.

In [20]:
len(X_train)
Out[20]:
63000
In [22]:
df[["reviewText"]][:10]
Out[22]:
reviewText
0 PURCHASED FOR YOUNGSTER WHO\nINHERITED MY "TOO...
1 unable to open or use
2 Waste of money!!! It wouldn't load to my system.
3 I attempted to install this OS on two differen...
4 I've spent 14 fruitless hours over the past tw...
5 I purchased the home and business because I wa...
6 The download doesn't take long at all. And it'...
7 This program is positively wonderful for word ...
8 Fantastic protection!! Great customer support!!
9 Obviously Win 7 now the last great operating s...
In [25]:
df[["reviewText"]][:10].values
Out[25]:
array([['PURCHASED FOR YOUNGSTER WHO\nINHERITED MY "TOO sMALL FOR ME"\nLAPTOP.  IDEAL FOR LEARNING A\nFUTURE GOOD SKILL.  HER CHOICE\nOF BOOKS IS A PLUS AS WAS THIS BOOK!'],
       ['unable to open or use'],
       ["Waste of money!!! It wouldn't load to my system."],
       ['I attempted to install this OS on two different PCs. it will not complete the install.\nWhen it gets to the page to select the language, and country the mouse and keyboard become non-functional.'],
       ["I've spent 14 fruitless hours over the past two days fruitlessly attempting to install this software on my computer and nothing I've found has worked. I need the software to type proficiently due to disability, and it will not install. The download itself seems to be a corrupted file, I have a fair amount of computer skills and no amount of tinkering has made the program work, so, judging by other reviews, it must be the program itself."],
       ["I purchased the home and business because I was going back into business.  I found it very hart to use as I was used to using\n\nQuick Books.  I have had it a couple months and still can't get it to do business income and expenses like it says it will.\n\nAs far as Intuit product help, that is a joke."],
       ["The download doesn't take long at all. And it's extremely fast, so you can use it ASAP for school or work."],
       ['This program is positively wonderful for word practice and pronunciation with a number of flash-card style game and  a very elegant listen-record-replay interface. While the two characters do pronounce words slightly differently, hearing those variations really helps to understand the phonetics in a way that a single voice would lack.\n\nHowever, this really is more of a companion tool for a more traditional course. I purchased this with "Colloquial Icelandic" and would definitely recommend purchasing them together. There are some extremely strange spelling/pronunciation variations and without having a written explanation of these things, you may be left staring in great confusion as the characters read phrases on screen -- "Hva" is pronounced "kwah?" WHAT? Having "Colluquial Icelandic" to turn to for explanation helped a great deal. That package has two audio CDs of alphabet, number, vocabulary and conversation exercises with a very well-written 370 page coursebook.\n\nI gave this product a 5-star rating as it is superb for its intended purpose, but it does not stand alone as a complete course.'],
       ['Fantastic protection!!  Great customer support!!'],
       ['Obviously Win 7 now the last great operating system since XP. Change is not always good.']],
      dtype=object)
In [28]:
process_text(df[["reviewText"]][:10].values)
Out[28]:
['', '', '', '', '', '', '', '', '', '']
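The all-empty result here isn't a stemming problem; it's the shape of the input. A quick check (the single-bracket alternative at the end is my suggestion, not what the next cell does):

# Each row of df[["reviewText"]].values is a length-1 numpy array, not a str,
# so process_text's isinstance(sent, str) check blanks the whole review.
print(type(df[["reviewText"]][:10].values[0]))

# Passing the column as a Series of strings avoids the problem entirely:
print(process_text(df["reviewText"][:10].tolist())[:2])

The next cell works around it by unwrapping each one-element row instead.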
In [31]:
process_text([x[0] for x in df[["reviewText"]][:10].values])
Out[31]:
['purchas youngster inherit small laptop ideal learn futur good skill choic book plus book',
 'unabl open use',
 "wast money would n't load system",
 'attempt instal two differ pcs not complet instal get page select languag countri mous keyboard becom non-funct',
 've spent fruitless hour past two day fruitless attempt instal softwar comput noth ve found work need softwar type profici due disabl not instal download seem corrupt file fair amount comput skill amount tinker made program work judg review must program',
 "purchas home busi go back busi found hart use use use quick book coupl month still n't get busi incom expens like say far intuit product help joke",
 "download n't take long extrem fast use asap school work",
 'program posit wonder word practic pronunci number flash-card style game eleg listen-record-replay interfac two charact pronounc word slight differ hear variat realli help understand phonet way singl voic would lack howev realli companion tool tradit cours purchas colloqui iceland would definit recommend purchas togeth extrem strang spelling/pronunci variat without written explan thing may left stare great confus charact read phrase screen hva pronounc kwah colluqui iceland turn explan help great deal packag two audio cds alphabet number vocabulari convers exercis well-written page coursebook gave product 5-star rate superb intend purpos not stand alon complet cours',
 'fantast protect great custom support',
 'obvious win last great oper system sinc chang not alway good']
In [32]:
pipeline.steps[0][1].transform(process_text([x[0] for x in df[["reviewText"]][:10].values]))
Out[32]:
<10x15 sparse matrix of type '<class 'numpy.int64'>'
	with 23 stored elements in Compressed Sparse Row format>
In [33]:
result = pipeline.steps[0][1].transform(process_text([x[0] for x in df[["reviewText"]].values]))
In [34]:
result
Out[34]:
<70000x15 sparse matrix of type '<class 'numpy.int64'>'
	with 227755 stored elements in Compressed Sparse Row format>
In [36]:
result[:10].todense()
Out[36]:
matrix([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0],
        [0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0],
        [0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0],
        [0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0],
        [0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]])

So now we want to know which of these reviews end up with at least one counted feature (or, inversely, which end up with none).

In [39]:
count_of_features = sum(result[:10]).todense()
In [41]:
[x for x in count_of_features if x.all()]
Out[41]:
[]
In [42]:
[x for x in count_of_features]
Out[42]:
[matrix([[1, 2, 3, 1, 4, 0, 2, 2, 1, 0, 3, 0, 2, 2, 0]])]
In [48]:
[x for x in count_of_features.tolist()[0] if x]
Out[48]:
[1, 2, 3, 1, 4, 2, 2, 1, 3, 2, 2]

Out of the first 10 reviews we got 11 values, which means that we summed the wrong way: that's a count per feature (15 columns), not per review. Luckily it didn't happen to hand us a plausible-looking 9, right?
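For reference, scipy sparse matrices can also sum along an axis directly, which gives the per-review counts without any transposing. A quick sketch (my addition, not from the lecture):

import numpy as np

# Sum each row (axis=1) to count vocabulary hits per review,
# then count how many reviews have at least one hit.
per_review = np.asarray(result[:10].sum(axis=1)).ravel()
print(per_review)
print(int((per_review > 0).sum()), "of", len(per_review), "reviews have at least one feature")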

In [50]:
count_of_features = sum(result[:10].T).todense()
In [51]:
[x for x in count_of_features.tolist()[0] if x]
Out[51]:
[1, 1, 2, 5, 4, 2, 5, 1, 2]

We got 9 out of 10. Let's check manually to make sure that matches. It matches our matrix: the first row is empty, the rest have values. So our CountVectorizer converts the first review from its original text to 'purchas youngster inherit small laptop ideal learn futur good skill choic book plus book' and then to nothing at all. The second sentence is bad as well: it first converts the original text 'unable to open or use' to 'unabl open use' and then to just "use". What does our brilliant KNN classify "use" as?

In [55]:
import numpy as np
pipeline.steps[1][1].predict(np.matrix([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0]]))
Out[55]:
array([1.])

It predicts that this is a positive review. This is wrong. Now let's count how many reviews are thrown out entirely.

In [60]:
count_of_features_full = sum(result.T).todense()
reviews_counted = len([x for x in count_of_features_full.tolist()[0] if x])
print("Number of reviews counted", reviews_counted)
print("Number of reviews not counted", 70000 - reviews_counted)
Number of reviews counted 61117
Number of reviews not counted 8883

So our count vectorizer discards 8883 reviews out of 70000. How do we fix this? We increase the number of features. This will cost significantly more CPU, but let's find out how much!

In [61]:
### PIPELINE ###
##########################

pipeline = Pipeline([
    ('text_vect', CountVectorizer(binary=True,
                                  max_features=150)),
    ('knn', KNeighborsClassifier())  
                                ])

# Visualize the pipeline
# This will come in handy especially when building more complex pipelines, stringing together multiple preprocessing steps
from sklearn import set_config
set_config(display='diagram')
pipeline
Out[61]:
Pipeline(steps=[('text_vect', CountVectorizer(binary=True, max_features=150)),
                ('knn', KNeighborsClassifier())])
CountVectorizer(binary=True, max_features=150)
KNeighborsClassifier()
In [62]:
# We use the lists of processed text fields 
X_train = train_text_list
X_val = val_text_list

# Fit the Pipeline to training data
pipeline.fit(X_train, y_train.values)
Out[62]:
Pipeline(steps=[('text_vect', CountVectorizer(binary=True, max_features=150)),
                ('knn', KNeighborsClassifier())])
CountVectorizer(binary=True, max_features=150)
KNeighborsClassifier()

It was very fast: it took a few seconds, maybe five.
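If you want a real number instead of "maybe five", wrap the fit in a timer. A minimal sketch (this just refits the same pipeline on the same data):

import time

t0 = time.perf_counter()
pipeline.fit(X_train, y_train.values)
print(f"fit took {time.perf_counter() - t0:.1f} seconds")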

In [63]:
print("Vocabulary", ", ".join(pipeline.steps[0][1].vocabulary_.keys()))
Vocabulary not, expect, price, time, upgrad, window, version, problem, run, everi, alway, find, file, could, mac, well, would, great, first, thing, good, account, chang, open, start, old, set, made, see, come, excel, back, still, program, allow, make, think, better, get, like, right, even, edit, new, use, put, everyth, take, littl, way, custom, want, anoth, sure, one, work, give, seem, review, comput, someth, download, never, got, tri, tax, product, year, mani, bought, home, busi, abl, sinc, purchas, differ, bit, need, updat, realli, far, instal, worth, complet, read, amazon, softwar, also, servic, word, featur, user, ve, buy, quick, help, reason, love, easi, day, intuit, fix, sever, system, issu, recommend, friend, includ, re, lot, without, support, look, inform, creat, two, hour, best, keep, know, microsoft, quicken, much, option, call, found, last, return, long, say, simpl, compani, secur, high, howev, money, learn, peopl, norton, pay, fine, game, go, ever, hard, load, save, free, play, onlin

This seems pretty good, actually. These are the types of words a person might use to describe a piece of software, hardware, or stuff in between. Let's see if we get better accuracy.

In [73]:
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score

# Use the fitted pipeline to make predictions on the validation dataset
val_predictions = pipeline.predict(X_val)
print(confusion_matrix(y_val.values, val_predictions))
print(classification_report(y_val.values, val_predictions))
print("Accuracy (validation):", accuracy_score(y_val.values, val_predictions))
[[1494 1111]
 [ 648 3747]]
              precision    recall  f1-score   support

         0.0       0.70      0.57      0.63      2605
         1.0       0.77      0.85      0.81      4395

    accuracy                           0.75      7000
   macro avg       0.73      0.71      0.72      7000
weighted avg       0.74      0.75      0.74      7000

Accuracy (validation): 0.7487142857142857

Precision and recall are both better; the only value that stayed the same is the recall of the negative (0.0) class. Our accuracy increased from 0.688 to 0.749, which is a pretty substantial improvement. Let's try it on our blog again.

In [65]:
pipeline.predict(blog_part1)
Out[65]:
array([0., 0., 1., 1., 0., 1., 1., 0., 1., 0., 1., 1., 1., 1., 1., 1., 1.,
       1., 1., 1., 0., 1., 1., 1.])

Let's do manual sentiment analysis of our text and see which it got right and wrong. To avoid bias, I'm not going to look at predictions while I do my manual sentiment analysis. This of course has some drawbacks, but it is what it is.

My results:  [1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1]
Predictions: [0, 0, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1]
Incorrect:   [1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0]
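For completeness, here is a sketch of one way the Incorrect row and an agreement rate can be computed (manual and predicted are my variable names for the two lists above):

manual    = [1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1]
predicted = [0, 0, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1]

# 1 where the prediction disagrees with the manual label, 0 where it agrees
incorrect = [int(m != p) for m, p in zip(manual, predicted)]
print(incorrect)
print("agreement:", round(1 - sum(incorrect) / len(incorrect), 3))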
As the mismatch row shows, it's pretty bad, but that is because of the words it trains on. We could go in depth on each sentence and see exactly why it got the second half wrong, but let's check one that seems like it should be easy to classify.

In [70]:
processed = process_text(["At worst, you classify text incorrectly."])
print(processed)
pipeline.predict(processed)
['worst classifi text incorrect']
Out[70]:
array([1.])
In [72]:
pipeline.steps[0][1].transform(processed)
Out[72]:
<1x150 sparse matrix of type '<class 'numpy.int64'>'
	with 0 stored elements in Compressed Sparse Row format>

As you can see, the CountVectorizer did not find a single word from its 150-word vocabulary in the sentence, so it transformed it into a blank record. Let's add more features, since multiplying the feature count by 10 barely cost us any CPU last time.
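Before we do, here is a quick way to confirm that none of the stemmed tokens made it into the 150-word vocabulary, using the already-fitted vectorizer and the processed list from the cell above (a sketch, not from the course):

vocab = pipeline.steps[0][1].vocabulary_
tokens = processed[0].split()
print(tokens)                               # the stemmed tokens
print([w for w in tokens if w in vocab])    # expected: an empty list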

In [74]:
### PIPELINE ###
##########################

pipeline = Pipeline([
    ('text_vect', CountVectorizer(binary=True,
                                  max_features=1500)),
    ('knn', KNeighborsClassifier())  
                                ])

# Visualize the pipeline
# This will come in handy especially when building more complex pipelines, stringing together multiple preprocessing steps
from sklearn import set_config
set_config(display='diagram')
pipeline
Out[74]:
Pipeline(steps=[('text_vect', CountVectorizer(binary=True, max_features=1500)),
                ('knn', KNeighborsClassifier())])
CountVectorizer(binary=True, max_features=1500)
KNeighborsClassifier()
In [75]:
# We use the lists of processed text fields 
X_train = train_text_list
X_val = val_text_list

# Fit the Pipeline to training data
pipeline.fit(X_train, y_train.values)
Out[75]:
Pipeline(steps=[('text_vect', CountVectorizer(binary=True, max_features=1500)),
                ('knn', KNeighborsClassifier())])
CountVectorizer(binary=True, max_features=1500)
KNeighborsClassifier()

Again it was very fast. Let's check the accuracy again.

In [76]:
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score

# Use the fitted pipeline to make predictions on the validation dataset
val_predictions = pipeline.predict(X_val)
print(confusion_matrix(y_val.values, val_predictions))
print(classification_report(y_val.values, val_predictions))
print("Accuracy (validation):", accuracy_score(y_val.values, val_predictions))
[[1393 1212]
 [ 458 3937]]
              precision    recall  f1-score   support

         0.0       0.75      0.53      0.63      2605
         1.0       0.76      0.90      0.83      4395

    accuracy                           0.76      7000
   macro avg       0.76      0.72      0.73      7000
weighted avg       0.76      0.76      0.75      7000

Accuracy (validation): 0.7614285714285715

From 0.749 with 150 features to 0.761 with 1500 features. I'm willing to argue that this is where diminishing returns set in, so let's stop here. I think we're ready for our final project. The final project just uses a different data set, IMDB's review data, to build a similar classifier. What could possibly go wrong?
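Since the final project really is the same pipeline pointed at different data, here is a minimal sketch of the setup (the CSV path and the text/label column names are placeholders, not the course's actual names; process_text is reused from the top of the post):

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report, accuracy_score

# Placeholder path and column names -- adjust to the actual IMDB files.
imdb = pd.read_csv('../data/final_project/imdb_train.csv')

X_tr, X_va, y_tr, y_va = train_test_split(imdb['text'], imdb['label'],
                                          test_size=0.10, shuffle=True,
                                          random_state=324)

imdb_pipeline = Pipeline([
    ('text_vect', CountVectorizer(binary=True, max_features=1500)),
    ('knn', KNeighborsClassifier())
])

imdb_pipeline.fit(process_text(X_tr.tolist()), y_tr.values)
preds = imdb_pipeline.predict(process_text(X_va.tolist()))
print(classification_report(y_va.values, preds))
print("Accuracy (validation):", accuracy_score(y_va.values, preds))

Here is the validation report from my run of the final project with 1500 features: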

              precision    recall  f1-score   support

          0       0.73      0.44      0.55      1259
          1       0.60      0.83      0.69      1241

    accuracy                           0.64      2500

   macro avg       0.66      0.64      0.62      2500
weighted avg       0.66      0.64      0.62      2500
Accuracy (validation): 0.6364

The accuracy with 15 words was abysmal. The accuracy with 1500 words is pretty bad, but at least it's better than random chance. What does our vocabulary look like?

Vocabulary somewher, read, film, suppos, comedi, see, call, anyth, point, movi, dialogu, extrem, absurd, mani, set, seem, despit, nuditi, sexual, content, noth, leav, wonder, thing, titl, premis, could, fun, polit, instead, re, treat, cheap, weird, want, ll, buy, grace, jone, first, comment, imdb, reason, write, talk, one, best, ever, make, laugh, cri, time, fall, love, not, fan, yet, sing, tradit, possibl, accept, exist, feel, good, watch, like, lot, actor, non, lead, role, creat, memor, charact, interest, import, central, star, great, rememb, everi, singl, face, around, find, line, becom, name, speak, peopl, live, simpl, deep, guess, meant, sort, beauti, say, ve, began, sever, minut, graphic, sex, wow, anyway, young, woman, travel, marri, son, owner, man, take, care, sure, get, rock, legend, area, rather, ladi, least, particular, kind, refer, away, regard, soon, to, curious, littl, hand, father, sight, goe, sleep, famili, wait, show, member, dream, wood, way, effect, desir, attempt, know, much, laughabl, especi, certain, featur, appear, realist, bed, post, strang, twist, end, realli, clue, build, fit, definit, expect, tough, moment, think, everyth, theater, death, awesom, use, dull, stori, follow, girl, hope, th, part, came, straight, video, kill, boy, base, mother, visit, mom, boyfriend, move, tri, writer, turn, anyon, hous, danger, give, worth, blood, dvd, disappoint, european, horror, almost, redeem, qualiti, except, thorough, entertain, bad, open, alon, heroin, dr, arriv, help, scientist, experi, dead, rape, fight, final, near, next, day, someth, tell, would, place, decid, stay, put, even, enjoy, event, over, the, top, knew, go, this, but, repeat, drug, night, nake, tortur, mild, convinc, angri, complet, negat, includ, obvious, fact, popular, women, work, mind, pretti, someon, basic, physic, brother, respons, long, wind, scene, warn, involv, suffer, well, cours, brief, marriag, shot, local, surround, hurt, two, killer, whole, quick, blame, idea, subtl, ridicul, predict, climax, bore, director, anoth, hit, head, made, realiz, hear, funniest, bit, dub, endless, product, valu, close, credit, red, class, music, sound, sometim, tim, add, odd, cloth, standard, rate, solid, pleas, must, left, taken, aw, victim, may, dare, lie, so, posit, messag, shown, need, societi, throw, stand, went, wrong, portray, weak, strong, against, famous, usual, full, hide, learn, mean, clean, though, light, whether, daughter, thus, confus, subject, simpli, funni, wooden, mediocr, track, bright, pointless, overal, viewer, god, save, match, edg, also, michael, main, honest, fell, worthi, joke, big, miss, earli, probabl, found, still, warm, wast, back, floor, half, crap, screen, that, wit, field, hour, might, better, spent, act, plot, saw, cut, seen, manag, escap, ride, background, old, king, desper, clear, heavi, will, front, money, review, pain, done, justic, type, word, let, look, flick, pass, walk, car, start, anymor, buddi, mine, sent, girlfriend, come, what, wors, twice, heard, ball, month, real, life, struggl, win, emot, friend, sister, jane, thi, focus, loos, special, rise, incred, intens, ultim, america, memori, childhood, promis, remark, perform, establish, claim, support, youth, sport, program, award, alway, told, age, proper, key, never, recommend, within, believ, achiev, rest, piec, bottom, cinemat, food, play, in, anim, aliv, spot, often, small, eat, monster, slowli, crazi, stupid, destroy, new, ground, gore, element, clever, script, plenti, lin, cult, violenc, middl, fine, italian, pick, deliv, 
express, suggest, mak, wife, job, beyond, hold, you, scream, teenag, horribl, teen, poor, male, femal, absolut, hate, sit, garbag, dad, room, depress, worst, detect, happen, mysteri, fiction, die, year, ago, wish, finish, mix, releas, cast, gave, everyon, els, state, latter, amaz, quit, career, ahead, question, person, last, festiv, huge, storylin, potenti, power, excit, reveng, check, conclus, stick, classic, yes, dialog, flat, right, depict, colleg, world, serv, purpos, serious, lack, lame, photographi, impress, inde, break, either, admit, mayb, club, longer, version, bar, american, fire, blue, general, decad, atmospher, prove, terribl, ask, abus, polic, jame, less, regular, execut, and, children, profession, home, enough, countri, perfect, got, road, howev, black, magic, comparison, meet, display, alreadi, develop, understand, visual, silent, forward, drama, without, talent, stun, imag, fill, arm, box, handl, of, view, far, insight, later, object, parti, offer, happi, relationship, ring, episod, artist, behind, rare, mistak, affair, free, annoy, cute, hero, coupl, bond, etc, previous, therefor, maker, mad, creatur, nightmar, histor, chang, industri, normal, impact, upon, audienc, refus, satisfi, sudden, pathet, opportun, million, five, earlier, took, imagin, entir, disappear, eye, actual, run, down, system, consid, biggest, guy, along, begin, problem, crime, respect, deserv, amount, keep, deal, exact, surpris, besid, david, order, pull, distract, up, nice, inform, action, outsid, hilari, case, discov, now, white, matter, direct, thank, exampl, due, abl, jump, high, on, design, doubt, aspect, camera, second, said, bare, stage, decent, low, budget, craft, mention, equal, modern, sequenc, although, abil, easi, thought, bill, differ, style, origin, commentari, intellig, answer, band, natur, assum, truth, requir, pop, cultur, group, public, fascin, agre, sinc, remain, most, art, produc, anywher, record, theme, nobodi, it, cop, initi, shock, fate, cheesi, soundtrack, cool, suit, song, fantast, pay, hill, whatev, martin, pleasant, formula, japanes, stuff, insan, camp, level, genius, ther, allow, discuss, edit, today, favorit, appeal, don, german, unlik, eventu, dark, frighten, fail, greatest, disney, heaven, masterpiec, damn, suck, chill, realiti, practic, if, concern, york, child, hidden, sick, explor, various, flashback, other, gorgeous, men, fast, true, church, spoiler, boss, drag, forc, rip, human, market, psycholog, excel, angl, late, avoid, pictur, touch, hard, romant, older, issu, okay, kid, adult, doctor, extra, bring, fear, evil, brought, faith, seri, scari, return, 10, among, unfortun, stretch, movie, notic, wrote, screenplay, consist, materi, appar, result, cliché, exploit, co, three, school, trip, enter, past, color, spend, togeth, none, student, husband, date, immedi, hire, actress, continu, attract, lik, bodi, voic, spoil, nation, given, creativ, concept, describ, hollywood, soldier, armi, govern, murder, detail, door, obsess, across, compar, critic, sorri, shine, unbeliev, compel, theatr, explan, credibl, drawn, draw, addit, unit, team, shoot, stereotyp, plan, space, genr, land, written, richard, pair, tale, approach, out, situat, sens, wild, iron, held, pre, georg, round, compani, chanc, share, worri, fashion, inspir, piti, reach, avail, sequel, ghost, babi, figur, tire, sad, outstand, crimin, story, known, gem, alien, possess, reveal, for, sign, note, send, third, war, suspens, explain, english, languag, depth, spirit, whose, sub, success, readi, tom, imposs, noir, narrat, 
villain, lord, giant, pure, hint, loud, stop, summer, somehow, short, brain, burn, test, john, side, violent, report, somewhat, crew, offic, cost, born, appropri, rule, phone, hair, slasher, mid, lesson, troubl, west, control, similar, tast, soul, humour, favourit, prefer, christian, grant, four, nonsens, indian, cross, danc, progress, perhap, thriller, jack, master, mental, tension, term, air, week, appreci, comed, dramat, rent, store, opinion, tragedi, number, cameo, silli, fantasi, generat, comic, book, ad, onto, score, contrast, romanc, engag, sell, otherwis, mari, self, seek, ruin, provid, mess, paid, locat, hospit, amus, filmmak, scienc, doubl, larg, attack, grow, caus, mouth, constant, throughout, charm, cover, water, wall, forget, excus, insid, surviv, truli, chemistri, former, remind, technic, highlight, bunch, heart, fulli, difficult, vision, led, familiar, recogn, intent, cinematographi, breath, deepli, south, gay, kept, intern, cinema, hole, felt, apart, plus, receiv, center, flaw, accent, histori, remak, reaction, superb, slight, disgust, moral, essenti, skill, ten, succeed, project, forev, relat, zombi, effort, length, frame, aim, current, thrown, angel, magnific, occasion, celebr, lock, rais, accid, brutal, nasti, frank, neither, nowher, shame, british, zero, mi, citi, sceneri, major, documentari, motion, toward, tend, pace, haunt, utter, model, creepi, christoph, terrif, drop, christma, dread, western, island, mood, opera, caught, unknown, present, tie, earth, accur, frustrat, captur, necessari, studi, mark, fellow, vampir, lost, recent, trailer, bought, list, costum, process, period, desert, robert, street, town, strike, steve, smith, ex, evid, trust, intend, occur, easili, choic, common, adventur, combin, manner, smile, search, train, battl, teacher, board, yeah, push, charg, convers, step, form, grade, typic, footag, drive, militari, brilliant, chris, joe, fair, van, afraid, do, humor, unexpect, wide, innoc, thrill, insult, commerci, cold, collect, bother, sentiment, pleasur, parent, rich, off, total, al, season, sum, scare, tone, steal, fresh, era, roll, rat, sexi, dog, toni, hell, stone, skip, beat, sam, scott, busi, fool, tune, sweet, game, individu, william, mr, listen, hey, jean, park, prepar, bloodi, hang, univers, logic, contain, folk, anti, everybodi, met, rang, separ, somebodi, choos, unusu, wear, favor, blow, remot, protagonist, accord, random, averag, gun, peter, ben, connect, idiot, prison, super, hunt, dollar, trash, televis, catch, conflict, interview, author, station, london, copi, challeng, disast, gone, replac, seat, bizarr, plain, trick, asid, tragic, smart, adapt, repres, paul, embarrass, all, paint, attitud, won, awar, wise, kick, influenc, superior, tear, irrit, improv, vote, commit, suicid, thin, oscar, news, anybodi, character, crash, planet, watchabl, hot, cat, glad, futur, rush, carri, blond, decis, likabl, root, sci, fi, becam, stuck, partner, judg, forgotten, slow, window, trap, limit, complex, unless, cartoon, player, attent, agent, investig, chase, join, fake, shop, channel, race, gang, harri, six, uniqu, journey, affect, centuri, teach, seven, count, lover, lose, drink, encount, gag, ignor, satir, quiet, novel, studio, serial, bland, bear, law, protect, correct, secret, tape, mere, comput, lee, minor, charl, delight, ann, resembl, machin, no, her, ill, intrigu, ship, younger, opposit, stephen, energi, billi, social, motiv, photograph, fortun, ident, liter, fli, passion, green, ugli, load, suspect, signific, disturb, target, instanc, mile, 
strength, marvel, unnecessari, bomb, them, genuin, introduc, dress, dumb, presenc, independ, rescu, french, epic, admir, invent, jim, reflect, ray, nomin, joy, pack, fault, convey, whatsoev

Does this look like a list of words that would determine sentiment? Yeah, more or less. Because it's 1500 words, it's pretty nuanced I bet.

So that's the result of a few hours of my work on NLP. K Nearest Neighbors (KNN) is an interesting but low-quality classifier for NLP. It's pretty efficient, so I suspect I'll use it to classify speech as time goes on. It makes a lot more sense to use KNN on things like spam, hate speech, politics, search, and topic instead of sentiment. Why is that? Because

  • spam has patterns (one of those patterns is to avoid patterns)
  • hate speech has characteristic wording embedded in it, and hate speech is annoying for humans to have to read
  • politics is rapidly evolving, but makes sense for a KNN because people copy one another's speech
  • search would benefit from the nearest-neighbor part, giving a possible match on a sentence (with the limitations discussed earlier)
  • topic classification would be easier for KNN because people are trying to provide metadata about what they are talking about in their text.