#!/usr/bin/env python
# coding: utf-8

# # NLP Lecture 1
# *by Javantea*
# *Oct 5, 2020*
# 
# I'm starting to work on Amazon Web Services (AWS)'s [Natural Language Processing (NLP) course](https://www.youtube.com/playlist?list=PL8P_Z6C4GcuWfAq8Pt6PBYlck4OprHXsw) through Machine Learning University. This blog post wrote itself, not literally. My aims are far more practical. I want to be able to classify text for projects that build up a set of information that people would be able to use to create interesting solutions for their group. Would you like to input this blog to a python script and find out what I'm talking about? That may become a reality.
# 
# I'm writing this blog in a Jupyter notebook. I realize that you might resist the idea of installing a piece of software to run my examples, but I hope you will sandbox it and see what happens. Jupyter is surprisingly well-designed even if it takes time to learn how to use it properly. Since I realize you might not want to run Jupyter, I've also released my code as python scripts. Jupyter allows export to executable scripts and I highly recommend using this feature.
# 
# Lecture 1 of the NLP course is a mess. It impressed me to see what terrible code they could release as a course. But after providing a serious demotivation, it was able to reverse that. I am now motivated to show the shortcomings of this course and how easily they can be fixed.
# 
# MLA-NLP-Lecture1-KNN.ipynb is our first mess.
# The stemmer converts sensible sentences into gibberish, losing the original meaning of the sentence. How could this result in sentiment analysis? The answer is that it focuses on words that when stemmed coincide with sentiment. That is -- any nuance in your speech is lost and the classifier simply counts how many times you say "great" or "good" or "bad". If you use this system to classify nuanced speech, it will not work. At best you get a 50% chance of correct classification in a 2-class prediction. At worst, you classify text incorrectly. Let's take a quick look.

# In[1]:


import nltk, re

from nltk.corpus import stopwords
from nltk.stem import SnowballStemmer
from nltk.tokenize import word_tokenize

# Let's get a list of stop words from the NLTK library
stop = stopwords.words('english')

# These words are important for our problem. We don't want to remove them.
excluding = ['against', 'not', 'don', "don't", 'ain', 'aren', "aren't", 'couldn', "couldn't",
             'didn', "didn't", 'doesn', "doesn't", 'hadn', "hadn't", 'hasn', "hasn't",
             'haven', "haven't", 'isn', "isn't", 'mightn', "mightn't", 'mustn', "mustn't",
             'needn', "needn't", 'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', "weren't",
             'won', "won't", 'wouldn', "wouldn't"]

# New stop word list
stop_words = [word for word in stop if word not in excluding]

snow = SnowballStemmer('english')

def process_text(texts):
    final_text_list = []
    for sent in texts:
        # Check if the sentence is a missing value
        if isinstance(sent, str) == False:
            sent = ""
        filtered_sentence = []
        sent = sent.lower()                       # Lowercase
        sent = sent.strip()                       # Remove leading/trailing whitespace
        sent = re.sub(r'\s+', ' ', sent)          # Remove extra spaces and tabs
        sent = re.compile('<.*?>').sub('', sent)  # Remove HTML tags/markup
        for w in word_tokenize(sent):
            # We are applying some custom filtering here, feel free to try different things
            # Check that it is not numeric, its length is > 2, and it is not a stop word
            if (not w.isnumeric()) and (len(w) > 2) and (w not in stop_words):
                # Stem and add to filtered list
                filtered_sentence.append(snow.stem(w))
        final_string = " ".join(filtered_sentence)  # Final string of cleaned words
        final_text_list.append(final_string)
    return final_text_list


# In[5]:


blog_part1 = process_text(["I'm starting to work on Amazon Web Services (AWS)'s Natural Language Processing (NLP) course through Machine Learning University.",
                           "This blog post wrote itself, not literally.",
                           "My aims are far more practical.",
                           "I want to be able to classify text for projects that build up a set of information that people would be able to use to create interesting solutions for their group.",
                           "Would you like to input this blog to a python script and find out what I'm talking about?",
                           "That may become a reality.",
                           "I'm writing this blog in a Jupyter notebook.",
                           "I realize that you might resist the idea of installing a piece of software to run my examples, but I hope you will sandbox it and see what happens.",
                           "Jupyter is surprisingly well-designed even if it takes time to learn how to use it properly.",
                           "Since I realize you might not want to run Jupyter, I've also released my code as python scripts.",
                           "Jupyter allows export to executable scripts and I highly recommend using this feature.",
                           "Lecture 1 of the NLP course is a mess.",
                           "It impressed me to see what terrible code they could release as a course.",
                           "But after providing a serious demotivation, it was able to reverse that.",
                           "I am now motivated to show the shortcomings of this course and how easily they can be fixed.",
                           "MLA-NLP-Lecture1-KNN.ipynb is our first mess.",
                           "The stemmer converts sensible sentences into gibberish, losing the original meaning of the sentence.",
                           "How could this result in sentiment analysis?",
                           "The answer is that it focuses on words that when stemmed coincide with sentiment.",
                           "That is -- any nuance in your speech is lost and the classifier simply counts how many times you say \"great\" or \"good\" or \"bad\".",
                           "If you use this system to classify nuanced speech, it will not work.",
                           "At best you get a 50% chance of correct classification in a 2-class prediction.",
                           "At worst, you classify text incorrectly.",
                           "Let's take a quick look."])
print(blog_part1)


# As you can see, most of the nuance is lost. How could we possibly fix this? Well, we could actually skip stemming. The reason they chose stemming is to get more information out of a small corpus.
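# To see what that would look like, here is a minimal sketch (mine, not the lecture's) of the same cleaning function with the stemming step removed. Everything else -- lowercasing, whitespace and HTML cleanup, the stop word filter -- is unchanged, so you can see how much more of the original wording survives.

# In[ ]:


def process_text_no_stem(texts):
    """Same cleaning as process_text, but keeps the original word forms."""
    final_text_list = []
    for sent in texts:
        if not isinstance(sent, str):  # Missing values become empty strings
            sent = ""
        sent = sent.lower().strip()
        sent = re.sub(r'\s+', ' ', sent)          # Remove extra spaces and tabs
        sent = re.compile('<.*?>').sub('', sent)  # Remove HTML tags/markup
        filtered_sentence = [w for w in word_tokenize(sent)
                             if (not w.isnumeric()) and (len(w) > 2) and (w not in stop_words)]
        final_text_list.append(" ".join(filtered_sentence))
    return final_text_list

print(process_text_no_stem(["The stemmer converts sensible sentences into gibberish, losing the original meaning of the sentence."]))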
# If you walked through the KNN notebook, you found the data has shape `(70000, 6)`, which means 70,000 rows and 6 columns. 70,000 rows of text seems like a lot when you are considering Amazon reviews, but as corpus sizes go, 70,000 is insufficient. The table is 33MB, which gives an average of 470 bytes per review -- which you might consider really valuable. It could actually be used to create a good classifier. But most of that is not data that gets counted toward sentiment. It's thrown away. That's right, our second bug in the KNN lecture is the CountVectorizer.

# In[6]:


import pandas as pd

df = pd.read_csv('../data/examples/AMAZON-REVIEW-DATA-CLASSIFICATION.csv')
print('The shape of the dataset is:', df.shape)


# In[7]:


df["isPositive"].value_counts()


# In[8]:


from sklearn.model_selection import train_test_split

X_train, X_val, y_train, y_val = train_test_split(df[["reviewText"]],
                                                  df["isPositive"],
                                                  test_size=0.10,
                                                  shuffle=True,
                                                  random_state=324
                                                  )


# In[9]:


print("Processing the reviewText fields")
train_text_list = process_text(X_train["reviewText"].tolist())
val_text_list = process_text(X_val["reviewText"].tolist())


# In[10]:


from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.neighbors import KNeighborsClassifier

### PIPELINE ###
##########################

pipeline = Pipeline([
    ('text_vect', CountVectorizer(binary=True, max_features=15)),
    ('knn', KNeighborsClassifier())
])

# Visualize the pipeline
# This will come in handy especially when building more complex pipelines,
# stringing together multiple preprocessing steps
from sklearn import set_config
set_config(display='diagram')
pipeline


# The bug is in the above code. Can you see it? `max_features=15` is a pretty subtle bug. What do you think it does? Let's find out!

# In[11]:


# We are using lists of processed text fields
X_train = train_text_list
X_val = val_text_list

# Fit the Pipeline to training data
pipeline.fit(X_train, y_train.values)


# In[12]:


pipeline.steps[0]


# In[13]:


pipeline.steps[0][1].vocabulary_


# If you went through MLA-NLP-Lecture1-BOW.ipynb, you know exactly what this means. Do you know what it means? It means that the only words that actually count towards sentiment are: not, time, version, would, great, program, get, like, use, one, work, comput, product, year, softwar.

# In[16]:


", ".join(pipeline.steps[0][1].vocabulary_.keys())


# Does that sound like a sentiment analysis classifier to you? No. It isn't. Let's run the CountVectorizer on all 70,000 reviews and see how many actually have any data counted that can be sent on to the next step, the KNN.

# In[20]:


len(X_train)


# In[22]:


df[["reviewText"]][:10]


# In[25]:


df[["reviewText"]][:10].values


# In[28]:


process_text(df[["reviewText"]][:10].values)


# In[31]:


process_text([x[0] for x in df[["reviewText"]][:10].values])


# In[32]:


pipeline.steps[0][1].transform(process_text([x[0] for x in df[["reviewText"]][:10].values]))


# In[33]:


result = pipeline.steps[0][1].transform(process_text([x[0] for x in df[["reviewText"]].values]))


# In[34]:


result


# In[36]:


result[:10].todense()


# So now we want a list of the reviews that still have counted features (or the inverse).

# In[39]:


count_of_features = sum(result[:10]).todense()


# In[41]:


[x for x in count_of_features if x.all()]


# In[42]:


[x for x in count_of_features]


# In[48]:


[x for x in count_of_features.tolist()[0] if x]


# Out of the first 10, we got 11, which means that we summed along the wrong axis (one total per vocabulary word instead of one per review). Luckily it didn't give us 9, right?
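# Before redoing the sum the original way, a quick aside (my sketch, not the lecture's): since `result` is a scipy sparse matrix, `getnnz(axis=1)` counts the stored entries per row directly, which sidesteps the axis confusion entirely.

# In[ ]:


# result has shape (n_reviews, 15) with binary counts, so the number of
# non-zero entries per row is the number of vocabulary words that survived
# into each review.
hits_per_review = result[:10].getnnz(axis=1)
print(hits_per_review)
print("Reviews with at least one counted word:", (hits_per_review > 0).sum())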
# In[50]:


count_of_features = sum(result[:10].T).todense()


# In[51]:


[x for x in count_of_features.tolist()[0] if x]


# We got 9 out of 10. Let's do it manually to make sure that matches. It matches our matrix: the first row is empty, the rest have values. So our CountVectorizer converts the first review from its original text to 'purchas youngster inherit small laptop ideal learn futur good skill choic book plus book' and then to nothing at all. The second sentence is bad as well. It first converts its original text 'unable to open or use' to 'unabl open use' and then to "use". What does our brilliant KNN classify "use" as?

# In[55]:


import numpy as np

pipeline.steps[1][1].predict(np.matrix([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0]]))


# It predicts that this is a positive review. This is wrong. Now let's count how many reviews are thrown out entirely.

# In[60]:


count_of_features_full = sum(result.T).todense()
reviews_counted = len([x for x in count_of_features_full.tolist()[0] if x])
print("Number of reviews counted", reviews_counted)
print("Number of reviews not counted", 70000 - reviews_counted)


# So our CountVectorizer discards 8883 reviews out of 70000. How do we fix this? We increase the number of features. This will cost significantly more CPU, but let's find out how much!

# In[61]:


### PIPELINE ###
##########################

pipeline = Pipeline([
    ('text_vect', CountVectorizer(binary=True, max_features=150)),
    ('knn', KNeighborsClassifier())
])

# Visualize the pipeline
# This will come in handy especially when building more complex pipelines,
# stringing together multiple preprocessing steps
from sklearn import set_config
set_config(display='diagram')
pipeline


# In[62]:


# We are using lists of processed text fields
X_train = train_text_list
X_val = val_text_list

# Fit the Pipeline to training data
pipeline.fit(X_train, y_train.values)


# It was very fast: a few seconds, maybe five.

# In[63]:


print("Vocabulary", ", ".join(pipeline.steps[0][1].vocabulary_.keys()))


# This seems pretty good actually. These are the type of words a person might use to describe a piece of software, hardware, or stuff in between. Let's see if we get better accuracy.

# In[73]:


from sklearn.metrics import confusion_matrix, classification_report, accuracy_score

# Use the fitted pipeline to make predictions on the validation dataset
val_predictions = pipeline.predict(X_val)
print(confusion_matrix(y_val.values, val_predictions))
print(classification_report(y_val.values, val_predictions))
print("Accuracy (validation):", accuracy_score(y_val.values, val_predictions))


# Precision and recall are both better; the only value that stayed the same is the recall of the False class. Our accuracy increased from 0.688 to 0.749. This is a pretty substantial increase. Let's try it with our blog again.

# In[65]:


pipeline.predict(blog_part1)


# Let's do manual sentiment analysis of our text and see which it got right and wrong. To avoid bias, I'm not going to look at the predictions while I do my manual sentiment analysis. This of course has some drawbacks, but it is what it is.
#
# My results:  [1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1]
# Predictions: [0, 0, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1]
# Incorrect:   [1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0]
# 
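# To put a number on that comparison (my addition, using the two lists above): only 9 of the 24 predictions match my labels.

# In[ ]:


# Comparing my manual labels to the model's predictions from the lists above.
manual_labels = [1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1]
knn_predictions = [0, 0, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1]
incorrect = [int(m != p) for m, p in zip(manual_labels, knn_predictions)]
print(incorrect)
print("Agreement:", (len(incorrect) - sum(incorrect)) / len(incorrect))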
# As you can see, it's pretty bad, but that is because of the words it trains on. We could go in depth on each sentence and see exactly why it got the second half wrong, but let's do a check of one that seems like it should be easy to classify.

# In[70]:


processed = process_text(["At worst, you classify text incorrectly."])
print(processed)
pipeline.predict(processed)


# In[72]:


pipeline.steps[0][1].transform(processed)


# As you can see, the CountVectorizer could not find a single word of this sentence in its vocabulary, so it transformed it to a blank record. Let's add more features, because it didn't hurt our CPU to multiply the feature count by 10 last time.

# In[74]:


### PIPELINE ###
##########################

pipeline = Pipeline([
    ('text_vect', CountVectorizer(binary=True, max_features=1500)),
    ('knn', KNeighborsClassifier())
])

# Visualize the pipeline
# This will come in handy especially when building more complex pipelines,
# stringing together multiple preprocessing steps
from sklearn import set_config
set_config(display='diagram')
pipeline


# In[75]:


# We are using lists of processed text fields
X_train = train_text_list
X_val = val_text_list

# Fit the Pipeline to training data
pipeline.fit(X_train, y_train.values)


# Again it was very fast. Let's check the accuracy again.

# In[76]:


from sklearn.metrics import confusion_matrix, classification_report, accuracy_score

# Use the fitted pipeline to make predictions on the validation dataset
val_predictions = pipeline.predict(X_val)
print(confusion_matrix(y_val.values, val_predictions))
print(classification_report(y_val.values, val_predictions))
print("Accuracy (validation):", accuracy_score(y_val.values, val_predictions))


# From 0.749 with 150 features to 0.761 with 1500 features. I'm willing to argue that this is where diminishing returns set in, so let's stop here. I think we're ready for our final project. The final project is just using a different data set, IMDB's review data, and creating a similar classifier. What could possibly go wrong?
#
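# Before looking at the results, here is a rough sketch of what that final-project run looks like. The file path and column names below are placeholders (not the actual course files), but the moving parts are the same `process_text`, `CountVectorizer(max_features=1500)`, and `KNeighborsClassifier` used above.

# In[ ]:


# Placeholder path and column names -- adjust to wherever the IMDB review
# data actually lives and however its columns are named.
df_imdb = pd.read_csv('../data/final_project/imdb_reviews.csv')

X_tr, X_va, y_tr, y_va = train_test_split(df_imdb[["reviewText"]],
                                          df_imdb["isPositive"],
                                          test_size=0.10,
                                          shuffle=True,
                                          random_state=324)

imdb_pipeline = Pipeline([
    ('text_vect', CountVectorizer(binary=True, max_features=1500)),
    ('knn', KNeighborsClassifier())
])
imdb_pipeline.fit(process_text(X_tr["reviewText"].tolist()), y_tr.values)

imdb_predictions = imdb_pipeline.predict(process_text(X_va["reviewText"].tolist()))
print(confusion_matrix(y_va.values, imdb_predictions))
print(classification_report(y_va.values, imdb_predictions))
print("Accuracy (validation):", accuracy_score(y_va.values, imdb_predictions))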
#               precision    recall  f1-score   support
# 
#            0       0.73      0.44      0.55      1259
#            1       0.60      0.83      0.69      1241
# 
#     accuracy                           0.64      2500
#    macro avg       0.66      0.64      0.62      2500
# weighted avg       0.66      0.64      0.62      2500
# 
# Accuracy (validation): 0.6364
# 
# The accuracy with 15 words was abysmal. The accuracy with 1500 words is pretty bad, but at least it's better than random chance. What does our vocabulary look like? # # Vocabulary somewher, read, film, suppos, comedi, see, call, anyth, point, movi, dialogu, extrem, absurd, mani, set, seem, despit, nuditi, sexual, content, noth, leav, wonder, thing, titl, premis, could, fun, polit, instead, re, treat, cheap, weird, want, ll, buy, grace, jone, first, comment, imdb, reason, write, talk, one, best, ever, make, laugh, cri, time, fall, love, not, fan, yet, sing, tradit, possibl, accept, exist, feel, good, watch, like, lot, actor, non, lead, role, creat, memor, charact, interest, import, central, star, great, rememb, everi, singl, face, around, find, line, becom, name, speak, peopl, live, simpl, deep, guess, meant, sort, beauti, say, ve, began, sever, minut, graphic, sex, wow, anyway, young, woman, travel, marri, son, owner, man, take, care, sure, get, rock, legend, area, rather, ladi, least, particular, kind, refer, away, regard, soon, to, curious, littl, hand, father, sight, goe, sleep, famili, wait, show, member, dream, wood, way, effect, desir, attempt, know, much, laughabl, especi, certain, featur, appear, realist, bed, post, strang, twist, end, realli, clue, build, fit, definit, expect, tough, moment, think, everyth, theater, death, awesom, use, dull, stori, follow, girl, hope, th, part, came, straight, video, kill, boy, base, mother, visit, mom, boyfriend, move, tri, writer, turn, anyon, hous, danger, give, worth, blood, dvd, disappoint, european, horror, almost, redeem, qualiti, except, thorough, entertain, bad, open, alon, heroin, dr, arriv, help, scientist, experi, dead, rape, fight, final, near, next, day, someth, tell, would, place, decid, stay, put, even, enjoy, event, over, the, top, knew, go, this, but, repeat, drug, night, nake, tortur, mild, convinc, angri, complet, negat, includ, obvious, fact, popular, women, work, mind, pretti, someon, basic, physic, brother, respons, long, wind, scene, warn, involv, suffer, well, cours, brief, marriag, shot, local, surround, hurt, two, killer, whole, quick, blame, idea, subtl, ridicul, predict, climax, bore, director, anoth, hit, head, made, realiz, hear, funniest, bit, dub, endless, product, valu, close, credit, red, class, music, sound, sometim, tim, add, odd, cloth, standard, rate, solid, pleas, must, left, taken, aw, victim, may, dare, lie, so, posit, messag, shown, need, societi, throw, stand, went, wrong, portray, weak, strong, against, famous, usual, full, hide, learn, mean, clean, though, light, whether, daughter, thus, confus, subject, simpli, funni, wooden, mediocr, track, bright, pointless, overal, viewer, god, save, match, edg, also, michael, main, honest, fell, worthi, joke, big, miss, earli, probabl, found, still, warm, wast, back, floor, half, crap, screen, that, wit, field, hour, might, better, spent, act, plot, saw, cut, seen, manag, escap, ride, background, old, king, desper, clear, heavi, will, front, money, review, pain, done, justic, type, word, let, look, flick, pass, walk, car, start, anymor, buddi, mine, sent, girlfriend, come, what, wors, twice, heard, ball, month, real, life, struggl, win, emot, friend, sister, jane, thi, focus, loos, special, rise, incred, intens, ultim, america, memori, childhood, promis, remark, perform, establish, claim, support, youth, sport, program, award, alway, told, age, proper, key, never, recommend, within, believ, achiev, rest, piec, bottom, cinemat, food, play, in, anim, aliv, spot, 
often, small, eat, monster, slowli, crazi, stupid, destroy, new, ground, gore, element, clever, script, plenti, lin, cult, violenc, middl, fine, italian, pick, deliv, express, suggest, mak, wife, job, beyond, hold, you, scream, teenag, horribl, teen, poor, male, femal, absolut, hate, sit, garbag, dad, room, depress, worst, detect, happen, mysteri, fiction, die, year, ago, wish, finish, mix, releas, cast, gave, everyon, els, state, latter, amaz, quit, career, ahead, question, person, last, festiv, huge, storylin, potenti, power, excit, reveng, check, conclus, stick, classic, yes, dialog, flat, right, depict, colleg, world, serv, purpos, serious, lack, lame, photographi, impress, inde, break, either, admit, mayb, club, longer, version, bar, american, fire, blue, general, decad, atmospher, prove, terribl, ask, abus, polic, jame, less, regular, execut, and, children, profession, home, enough, countri, perfect, got, road, howev, black, magic, comparison, meet, display, alreadi, develop, understand, visual, silent, forward, drama, without, talent, stun, imag, fill, arm, box, handl, of, view, far, insight, later, object, parti, offer, happi, relationship, ring, episod, artist, behind, rare, mistak, affair, free, annoy, cute, hero, coupl, bond, etc, previous, therefor, maker, mad, creatur, nightmar, histor, chang, industri, normal, impact, upon, audienc, refus, satisfi, sudden, pathet, opportun, million, five, earlier, took, imagin, entir, disappear, eye, actual, run, down, system, consid, biggest, guy, along, begin, problem, crime, respect, deserv, amount, keep, deal, exact, surpris, besid, david, order, pull, distract, up, nice, inform, action, outsid, hilari, case, discov, now, white, matter, direct, thank, exampl, due, abl, jump, high, on, design, doubt, aspect, camera, second, said, bare, stage, decent, low, budget, craft, mention, equal, modern, sequenc, although, abil, easi, thought, bill, differ, style, origin, commentari, intellig, answer, band, natur, assum, truth, requir, pop, cultur, group, public, fascin, agre, sinc, remain, most, art, produc, anywher, record, theme, nobodi, it, cop, initi, shock, fate, cheesi, soundtrack, cool, suit, song, fantast, pay, hill, whatev, martin, pleasant, formula, japanes, stuff, insan, camp, level, genius, ther, allow, discuss, edit, today, favorit, appeal, don, german, unlik, eventu, dark, frighten, fail, greatest, disney, heaven, masterpiec, damn, suck, chill, realiti, practic, if, concern, york, child, hidden, sick, explor, various, flashback, other, gorgeous, men, fast, true, church, spoiler, boss, drag, forc, rip, human, market, psycholog, excel, angl, late, avoid, pictur, touch, hard, romant, older, issu, okay, kid, adult, doctor, extra, bring, fear, evil, brought, faith, seri, scari, return, 10, among, unfortun, stretch, movie, notic, wrote, screenplay, consist, materi, appar, result, cliché, exploit, co, three, school, trip, enter, past, color, spend, togeth, none, student, husband, date, immedi, hire, actress, continu, attract, lik, bodi, voic, spoil, nation, given, creativ, concept, describ, hollywood, soldier, armi, govern, murder, detail, door, obsess, across, compar, critic, sorri, shine, unbeliev, compel, theatr, explan, credibl, drawn, draw, addit, unit, team, shoot, stereotyp, plan, space, genr, land, written, richard, pair, tale, approach, out, situat, sens, wild, iron, held, pre, georg, round, compani, chanc, share, worri, fashion, inspir, piti, reach, avail, sequel, ghost, babi, figur, tire, sad, outstand, crimin, story, known, gem, 
alien, possess, reveal, for, sign, note, send, third, war, suspens, explain, english, languag, depth, spirit, whose, sub, success, readi, tom, imposs, noir, narrat, villain, lord, giant, pure, hint, loud, stop, summer, somehow, short, brain, burn, test, john, side, violent, report, somewhat, crew, offic, cost, born, appropri, rule, phone, hair, slasher, mid, lesson, troubl, west, control, similar, tast, soul, humour, favourit, prefer, christian, grant, four, nonsens, indian, cross, danc, progress, perhap, thriller, jack, master, mental, tension, term, air, week, appreci, comed, dramat, rent, store, opinion, tragedi, number, cameo, silli, fantasi, generat, comic, book, ad, onto, score, contrast, romanc, engag, sell, otherwis, mari, self, seek, ruin, provid, mess, paid, locat, hospit, amus, filmmak, scienc, doubl, larg, attack, grow, caus, mouth, constant, throughout, charm, cover, water, wall, forget, excus, insid, surviv, truli, chemistri, former, remind, technic, highlight, bunch, heart, fulli, difficult, vision, led, familiar, recogn, intent, cinematographi, breath, deepli, south, gay, kept, intern, cinema, hole, felt, apart, plus, receiv, center, flaw, accent, histori, remak, reaction, superb, slight, disgust, moral, essenti, skill, ten, succeed, project, forev, relat, zombi, effort, length, frame, aim, current, thrown, angel, magnific, occasion, celebr, lock, rais, accid, brutal, nasti, frank, neither, nowher, shame, british, zero, mi, citi, sceneri, major, documentari, motion, toward, tend, pace, haunt, utter, model, creepi, christoph, terrif, drop, christma, dread, western, island, mood, opera, caught, unknown, present, tie, earth, accur, frustrat, captur, necessari, studi, mark, fellow, vampir, lost, recent, trailer, bought, list, costum, process, period, desert, robert, street, town, strike, steve, smith, ex, evid, trust, intend, occur, easili, choic, common, adventur, combin, manner, smile, search, train, battl, teacher, board, yeah, push, charg, convers, step, form, grade, typic, footag, drive, militari, brilliant, chris, joe, fair, van, afraid, do, humor, unexpect, wide, innoc, thrill, insult, commerci, cold, collect, bother, sentiment, pleasur, parent, rich, off, total, al, season, sum, scare, tone, steal, fresh, era, roll, rat, sexi, dog, toni, hell, stone, skip, beat, sam, scott, busi, fool, tune, sweet, game, individu, william, mr, listen, hey, jean, park, prepar, bloodi, hang, univers, logic, contain, folk, anti, everybodi, met, rang, separ, somebodi, choos, unusu, wear, favor, blow, remot, protagonist, accord, random, averag, gun, peter, ben, connect, idiot, prison, super, hunt, dollar, trash, televis, catch, conflict, interview, author, station, london, copi, challeng, disast, gone, replac, seat, bizarr, plain, trick, asid, tragic, smart, adapt, repres, paul, embarrass, all, paint, attitud, won, awar, wise, kick, influenc, superior, tear, irrit, improv, vote, commit, suicid, thin, oscar, news, anybodi, character, crash, planet, watchabl, hot, cat, glad, futur, rush, carri, blond, decis, likabl, root, sci, fi, becam, stuck, partner, judg, forgotten, slow, window, trap, limit, complex, unless, cartoon, player, attent, agent, investig, chase, join, fake, shop, channel, race, gang, harri, six, uniqu, journey, affect, centuri, teach, seven, count, lover, lose, drink, encount, gag, ignor, satir, quiet, novel, studio, serial, bland, bear, law, protect, correct, secret, tape, mere, comput, lee, minor, charl, delight, ann, resembl, machin, no, her, ill, intrigu, ship, younger, 
opposit, stephen, energi, billi, social, motiv, photograph, fortun, ident, liter, fli, passion, green, ugli, load, suspect, signific, disturb, target, instanc, mile, strength, marvel, unnecessari, bomb, them, genuin, introduc, dress, dumb, presenc, independ, rescu, french, epic, admir, invent, jim, reflect, ray, nomin, joy, pack, fault, convey, whatsoev
# 
# Does this look like a list of words that would determine sentiment? Yeah, more or less. Because it's 1500 words, I'd bet it's reasonably nuanced.

# So that's the result of a few hours of my work on NLP. K-Nearest Neighbors (KNN) is an interesting but low-quality classifier for NLP. It's pretty efficient, so I suspect I'll use it to classify speech as time goes on. It makes a lot more sense to use KNN on stuff like spam, hate speech, politics, search, and topics instead of sentiment. Why is that? Because
# * spam has patterns (one of those patterns is avoiding patterns)
# * hate speech has its wording embedded in it, and hate speech is annoying for humans to read
# * politics is rapidly evolving, but makes sense for KNN because people copy one another's speech
# * search would benefit from the nearest-neighbor part, giving a possible match on a sentence (with the limitations discussed earlier)
# * topic classification would be easier for KNN because people try to provide metadata about what they are talking about in their text (see the sketch after this list).

# In[ ]:
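# As a small illustration of that last point, here is a toy sketch (mine, not from the lecture) that reuses the same pipeline shape for topic classification instead of sentiment. The four documents and their labels are invented purely for illustration.


topic_pipeline = Pipeline([
    ('text_vect', CountVectorizer(binary=True)),
    ('knn', KNeighborsClassifier(n_neighbors=1))
])

# A made-up corpus: two complaints about shipping, two about software.
toy_docs = process_text(["My package arrived late and the box was damaged.",
                         "Shipping took three weeks and the tracking never updated.",
                         "The install crashed and my license key was rejected.",
                         "The software would not activate after the update."])
toy_labels = ["shipping", "shipping", "software", "software"]

topic_pipeline.fit(toy_docs, toy_labels)
print(topic_pipeline.predict(process_text(["The download kept failing during installation."])))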