-----BEGIN PGP SIGNED MESSAGE----- Hash: SHA512
Sept 1, 2016
Yesterday I published a small piece of software to Small Wide World's git to very little fanfare. It was a generalization of a bad piece of software I wrote the day before. It uses NLTK to perform a simple task: parse a simple sentence which follows the form "subject verb object" with optional additional information starting with "because". Examples of this grammar include:
GnuPG is software
IRC is a protocol
software implements a protocol
Javantea is human
AI3 is software
Javantea wrote AI3
Javantea writes software
Javantea writes English
Javantea reads German
Javantea reads Japanese
Javantea reads Portuguese
Javantea reads Spanish
nlp1.py creates this graph of the relationships:
How does it parse? It uses NLTK to find parts of speech and splits the sentence at its verb: everything before the verb is the subject, and everything after it is the object. First of all, let's be honest that there are a bunch of bugs in NLTK's part of speech tagger out of the box. If you want to make something that does something simple or complex, you will run into them, or you will not test it well enough to run into them.

To work around the tagger's numerous inaccuracies I chose to hardcode a few fixes. If a sentence has three words, the script assumes subject verb object. In this grammar no other option is possible, so this is accurate. If there are more than three words and it can't find a verb, it looks at verbs from previous sentences, and if one of them appears in the sentence, it splits on that. This doesn't always work, and it fails badly when a sentence uses "does not" with a verb that isn't detected correctly, but it works for all of the text I have given it so far (about 66 lines).
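The splitting rules above can be sketched roughly as follows. This is not the code from nlp1.py, just a minimal illustration under a few assumptions: the input is already a list of (word, tag) pairs with Penn Treebank tags (the kind NLTK's tagger produces), the function name `split_svo` is made up, and verbs from three-word sentences are remembered for the fallback. The "because" clause is not handled.

```python
def split_svo(tagged, known_verbs):
    """Split a tagged sentence into (subject, verb, object), or return None.

    tagged      -- list of (word, tag) pairs using Penn Treebank tags
    known_verbs -- set of verbs seen in earlier sentences; updated in place
    """
    words = [word for word, _ in tagged]

    # Hardcoded fix: a three-word sentence can only be subject verb object.
    if len(words) == 3:
        known_verbs.add(words[1])  # assumption: remember it for the fallback
        return words[0], words[1], words[2]

    def split_at(i):
        known_verbs.add(words[i])
        return " ".join(words[:i]), words[i], " ".join(words[i + 1:])

    # Usual case: split on the first interior token tagged as a verb (VB*).
    for i in range(1, len(words) - 1):
        if tagged[i][1].startswith("VB"):
            return split_at(i)

    # Fallback: the tagger found no verb, so try verbs seen previously.
    for i in range(1, len(words) - 1):
        if words[i] in known_verbs:
            return split_at(i)

    return None


known = set()
first = split_svo([("Hearst", "NNP"), ("owns", "VBZ"), ("ELLE", "NNP")], known)
# Hypothetical mis-tagging of "owns" as a plural noun, to exercise the fallback.
second = split_svo(
    [("Time", "NNP"), ("Warner", "NNP"), ("owns", "NNS"), ("CNN", "NNP")], known
)
print(first)
print(second)
```

The second call only succeeds because "owns" was seen in the first sentence, which is the same order dependence the fallback described above has.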
So what practical use does this have? Let's say that there's a conflict between ten people, and that the conflict is really complex: for example, someone has divorced someone and married someone else. Then their friends became unruly and insulted someone random in the vicinity. Let's say that you don't actually know what is going on besides a few statements of fact that don't make much sense. Normally a person might avoid the conflict completely because they don't want to get involved unintentionally in something they don't understand. But graphing this might make it possible to navigate the minefield of social hypocrisy, misunderstanding, and delusion that exists in any conflict. As Lady Grantham put it, "however much the couple may strive to be honest, no one is ever in possession of the facts".

Let's put a few rules down. Don't connect anyone who is disagreeing in the first graph. Don't put anyone who is agreeing into the second graph. The first graph is alliances and neutral parties; the second graph is active conflicts. Since we're not ready to take on Syria just yet, let's draw something a little simpler.
Comcast owns NBC
NBC owns SyFy
NBC owns MSNBC
Microsoft owned MSNBC
Microsoft divested MSNBC
Disney owns ABC
Disney owns ESPN
Disney owns A&E
Hearst owns A&E
Disney owns History Channel
Hearst owns History Channel
NBC owns USA
NBC owns Weather Channel
NBC owns Telemundo
Russian government owns RT
United Kingdom owns BBC
United Kingdom owns CBC
Rupert Murdoch owns Fox
CBS owns Showtime
CBS owns Viacom
CBS owns Westinghouse
CBS sold nuclear power plants to BNFL
United Kingdom owns [BNFL](https://en.wikipedia.org/wiki/British_Nuclear_Fuels_Ltd)
Viacom owns MTV
MTV owns Nickelodeon
MTV owns Comedy Central
MTV owns CMT
MTV owns VH1
MTV owns MTV2
Viacom owns BET
Rupert Murdoch owned News of the World
IBA founded Channel Four
United Kingdom operates IBA
Channel Four Television Corporation owns Channel Four
Scott Trust Limited owns The Guardian
Scott Trust Limited owns The Observer
Time Warner owns CNN
Time Warner owns TBS
Time Warner owns TNT
Time Warner owns HBO
Time Warner owns Cartoon Network
Time Warner owns Adult Swim
Time Warner merged AOL
Time Warner owns WB
Time Warner owns DC Comics
Time Warner owns New Line Cinema
Time Warner owns Time
Time owns Sports Illustrated
Time owns Travel + Leisure
Time owns Food & Wine
Time owns Fortune
Time owns People
Time owns InStyle
Time owns Life
Time owns Golf Magazine
Time owns Southern Living
Time owns Essence
Time owns Real Simple
Time owns Entertainment Weekly
Time owns Myspace
Hearst owns Popular Mechanics
Hearst owns Car and Driver
Hearst owns Cosmopolitan
Hearst owns Country Living
Hearst owns Dr. Oz
Hearst owns ELLE
Hearst owns Elle Decor
Hearst owns Esquire
Hearst owns Food Network Magazine
Hearst owns Good Housekeeping
Hearst owns Harper's Bazaar
Hearst owns House Beautiful
Hearst owns Marie Claire
Hearst owns Nat Mags
Hearst owns O
Hearst owns Red
Hearst owns Redbook
Hearst owns Road & Track
Hearst owns Seventeen
Hearst owns Town & Country
Hearst owns Veranda
Hearst owns Woman's Day
Hearst owns ESPN
Hearst owns Seattle Post-Intelligencer
Bonnier owns Popular Science
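A list like this maps onto a graph once each sentence is reduced to a (subject, verb, object) triple. Here is a minimal sketch of that step, assuming the triples are already parsed. It writes Graphviz DOT text purely as an illustration; the function name `triples_to_dot` is made up, and this is not necessarily the format nlp1.py or Small Wide World uses.

```python
def triples_to_dot(triples, verb):
    """Return Graphviz DOT text with one undirected edge per matching triple."""
    lines = ["graph relationships {"]
    for subject, triple_verb, obj in triples:
        # Keep only one kind of relationship per graph, e.g. ownership.
        if triple_verb == verb:
            lines.append('  "%s" -- "%s";' % (subject, obj))
    lines.append("}")
    return "\n".join(lines)


# A few triples taken from the list above.
triples = [
    ("Comcast", "owns", "NBC"),
    ("NBC", "owns", "SyFy"),
    ("Microsoft", "divested", "MSNBC"),
    ("Bonnier", "owns", "Popular Science"),
]
dot = triples_to_dot(triples, "owns")
print(dot)
```

Filtering on the verb keeps "owns" and "divested" out of the same picture, which matters once the data mixes current and former ownership.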
Only one of the sentences was too complex for the script to parse: "CBS sold nuclear power plants to BNFL". To fix this, I changed it to two sentences: "CBS sold nuclear power plants" and "BNFL bought nuclear power plants". Clearly this is an oversimplification, but it shows the limitation of my simple 152-line Python script. If we wanted to handle more complex grammars that involve a subject, an object, and a second object, we would need to incrementally add more complex parsers. This isn't rocket science or brain surgery, but it is time consuming. I won't be publishing a generic English parser any sooner than I'll be publishing AI3. Funny that AI3 actually contains a considerable amount of English.

From my work, my guess is that a generic parser would take a person a few months of pretty intense work. Regular people could help in this task by coming up with reasonable sentences that they would want parsed and what information they would want extracted from each sentence. This may seem easy, but if you ask for all the data in a sentence, you get no secondary information, meaning nothing about how the words relate. Allow me to explain. If I parse "Bonnier owns Popular Science" into

owns("Bonnier", "Popular Science")

I can graph all ownership on a map:

Bonnier -- Popular Science

Okay, let's parse it in a different way:

["Bonnier", "owns", "Popular Science"]

This doesn't help the computer system nearly as much because it doesn't say whether "owns" is a noun, verb, preposition, adjective, or adverb. Luckily the word "owns" can only be a verb, but what about the word "fights"? This is where NLTK fails. Let's look at WordNet's database entries for "fights":
$ abs fights
Everything about 'fights' :

found battle, conflict, fight, engagement
battle.n.01 noun.act
a hostile meeting of opposing military forces in the course of a war
Examples:
  Grant won a decisive victory in the battle of Chickamauga
  he lost his romantic ideas about war when he got into a real engagement
Hyper: military_action, action
Hypo: Armageddon
Hypo: assault
Hypo: combat, armed_combat
Hypo: dogfight
Hypo: naval_battle
Hypo: pitched_battle

found fight, fighting, combat, scrap
fight.n.02 noun.act
the act of fighting; any contest or struggle
Examples:
  a fight broke out at the hockey game
  there was fighting in the streets
  the unhappy couple got into a terrible scrap
Hyper: conflict, struggle, battle
Hypo: affray, disturbance, fray, ruffle
Hypo: battering, banging
Hypo: beating, whipping
Hypo: brawl, free-for-all
Hypo: brush, clash, encounter, skirmish
Hypo: close-quarter_fighting
Hypo: dogfight
Hypo: duel, affaire_d'honneur
Hypo: fencing
Hypo: fistfight, fisticuffs, slugfest
Hypo: gunfight, gunplay, shootout
Hypo: hassle, scuffle, tussle, dogfight, rough-and-tumble
Hypo: in-fighting
Hypo: knife_fight, snickersnee, cut-and-thrust
Hypo: rumble, gang_fight
Hypo: set-to
Hypo: shock, impact
Hypo: single_combat

found competitiveness, fight
competitiveness.n.01 noun.attribute
an aggressive willingness to compete
Examples:
  the team was full of fight
Hyper: aggressiveness

found fight
fight.n.04 noun.communication
an intense verbal dispute
Examples:
  a violent fight over the bill is expected in the Senate
Hyper: controversy, contention, contestation, disputation, disceptation, tilt, argument, arguing

found fight
fight.n.05 noun.act
a boxing or wrestling match
Examples:
  the fight was on television last night
Hyper: boxing, pugilism, fisticuffs

found contend, fight, struggle
contend.v.06 verb.competition
be engaged in a fight; carry on a fight
Examples:
  the tribesmen fought each other
  Siblings are always fighting
  Militant groups are contending for control of the country
Hypo: attack, assail
Hypo: bandy
Hypo: battle, combat
Hypo: bear_down
Hypo: box
Hypo: chicken-fight, chickenfight
Hypo: duel
Hypo: engage, wage
Hypo: fence
Hypo: feud
Hypo: fight, oppose, fight_back, fight_down, defend
Hypo: fight_back
Hypo: fistfight
Hypo: join_battle
Hypo: joust
Hypo: scuffle, tussle
Hypo: settle, get_back
Hypo: skirmish
Hypo: spar
Hypo: tourney
Hypo: tug
Hypo: war
Hypo: wrestle

found fight, oppose, fight_back, fight_down, defend
fight.v.02 verb.competition
fight against or resist strongly
Examples:
  The senator said he would oppose the bill
  Don't fight it!
Hyper: contend, fight, struggle
Hypo: recalcitrate
Hypo: repel, repulse, fight_off, rebuff, drive_back
Hypo: resist, hold_out, withstand, stand_firm
Hypo: resist, stand, fend

found fight, struggle
fight.v.03 verb.social
make a strenuous or labored effort
Examples:
  She struggled for years to survive without welfare
  He fought for breath
Hyper: try, seek, attempt, essay, assay
Hypo: flounder
Hypo: tug, labor, labour, push, drive

found crusade, fight, press, campaign, push, agitate
crusade.v.01 verb.social
exert oneself continuously, vigorously, or obtrusively to gain an end or engage in a crusade for a certain cause or person; be an advocate for
Examples:
  The liberal party pushed for reforms
  She is crusading for women's rights
  The Dean is pushing for his favorite candidate
Hyper: advertise, advertize, promote, push
This is a bit too much information, but you can see that WordNet has an incredibly dense set of relationships for each word as well as synonym information. Let's condense this to just the part of speech of each sense and how common it is.
|Synset||Part of speech||Popularity|
As you can see, "fights" as a verb can mean 4 different things, and "fights" as a noun can mean 5 things. So when I say "USA fights ISIL in Syria", I'm being imprecise. If you put that sentence into your college essay, even with clarification or qualification, you should be marked down no matter how articulate the rest of the paper is. And if you're creating a catchy slogan for your social justice cause, don't use the word "fight" unless you're aiming for confusion. Let's give a good example: "Fight MS" is doubly confusing. Both words are easily confused unless the context is fully given. Using this we can understand the historical significance of slogans, but are slogans really valuable? Who reading this remembers HOPE?
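The 4 and 5 can be checked by tallying the synset names from the listing above. A small sketch follows, with the names typed in by hand so that it runs without the WordNet corpus installed. With the corpus available, `nltk.corpus.wordnet.synsets("fights")` should give the corresponding synset objects, though I haven't verified that it returns exactly this list. The helper name `count_by_pos` is made up.

```python
from collections import Counter

# Synset names copied from the output above, in the order they appear.
FIGHTS_SYNSETS = [
    "battle.n.01",
    "fight.n.02",
    "competitiveness.n.01",
    "fight.n.04",
    "fight.n.05",
    "contend.v.06",
    "fight.v.02",
    "fight.v.03",
    "crusade.v.01",
]

POS_NAMES = {"n": "noun", "v": "verb"}


def count_by_pos(synset_names):
    """Count senses per part of speech from names shaped like 'fight.n.02'."""
    return Counter(POS_NAMES[name.split(".")[1]] for name in synset_names)


counts = count_by_pos(FIGHTS_SYNSETS)
print(counts["noun"], "noun senses,", counts["verb"], "verb senses")
```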
Unfortunately, language gives us a really poor method of precise communication, and communicating with computers is even more difficult because they lack common sense. A little bit of effort on our part can create machine parsable logic that is capable of communicating our ideas clearly and articulately. Those statements can also be parsed effectively by intelligent people, but for the most part people will be able to understand the drivel that we communicate to them if we are verbose enough. Here's a question for the reader: limiting yourself to 140 characters, can you express your feelings about this paragraph without reading the next paragraph? Please post your answer in the comment section before reading the next paragraph, but don't worry about the 140 character limit, I won't be grading.
But is this my encouragement for the reader to spend a day reducing their most important thoughts into machine parsable logic statements? Spending an hour working on this problem would be beneficial if you wish to become a more intelligent person, so I recommend it to everyone. But there's absolutely no reason for us to spend copious amounts of time communicating and parsing vast quantities of human text until we have a purpose. Who here has a good purpose for a huge quantity of parsed natural language? I wrote a relationship graph because I want people to use my graph layout software. What do you want to do with parsed natural language?
I believe that is all. If you'd like to play with nlp1.py, you can clone my git repository for Small Wide World.
git clone https://www.altsci.com/repo/smallwideworld.git
-----BEGIN PGP SIGNATURE----- Version: GnuPG v2 iQIcBAEBCgAGBQJXyKGnAAoJEDxoyNvLp4PvCKEP/2uQk8ltI7L1+Edf2W0+Z4xy kEDQi9H5cSGstte7nrN/JXnimUZbjpgvIWLan/koqXWRXSSIc/22hq28Pn0uq5hU 3G14T2zJgPSIf18snIkE2wgFul85mz7AmtgEhkrYP7AqZDUALF0DrfDD6dQg9LeH GcriodOFMuBxO9/k4g1iBBmwNBiu709qXFhzn0CrSTKxK6f7wD2F4bFVyusEgEf8 TrnPdvTA+w/NmEHtGFhLEdAwp+c+t+HhMftN3Pv/1Q9Fdr6kj1cInCifoFkvcEfn 4ITdasM/Nb7S4KNHLaGNEYxxk245Do1rJw4TEUEHSbzz0Gdvbmk6okKDk3qYKoHB jjvIM3m20+esQS16LwOF5ECw3dF9R4crAFqHN9U+2pl2avjV/cU1iFN0ouIu9gy2 VfYQKXSceQIuGdJM6znWz8d9oCWYvMLGd6dDRqY8VqG1EyGIWGQXffItk4vnlEXl xcLQSMWy7pm1M4+gV8mAKlsc81GRe8K68LaJCwMrDzlDnFem5mEfgBTyJZa29hbP 8Tc0lbjcODNE0JQFwZuyxVhZwp3m4yrbQrA+/y/S/0GwfEDXPcaYFXQN/iOVeJ+T Hej7eTfiiOhOWcGwrmsEFDXiRiyh5OW/KqW2QMSnjpV4WLUSpSTeOikm/jn2E3rC Kdb/MzsVszuK5MdVPyuK =ZjkD -----END PGP SIGNATURE-----