Natural Language in Small Wide World

by Javantea
Sept 1, 2016

Yesterday I published a small piece of software to Small Wide World's git to very little fanfare. It was a generalization of a bad piece of software I wrote the day before. It uses NLTK to perform a simple task: parse a simple sentence which follows the form "subject verb object" with optional additional information starting with "because". Examples of this grammar include:

GnuPG is software
IRC is a protocol
software implements a protocol
Javantea is human
AI3 is software
Javantea wrote AI3
Javantea writes software
Javantea writes English
Javantea reads German
Javantea reads Japanese
Javantea reads Portuguese
Javantea reads Spanish

Parsing these sentences creates this graph of the relationships:

Natural Language Parsed graph

How does it parse? It uses NLTK to find parts of speech, splits the sentence at its verb, and assumes that the first part of the sentence is the subject, the second part is the verb, and the third part is the object. First of all, let's be honest: NLTK's part-of-speech tagger has a bunch of bugs out of the box. If you build anything with it, simple or complex, you will either run into them or you haven't tested well enough to run into them. To work around the tagger's numerous inaccuracies I chose to hardcode a few fixes. If a sentence has three words, the script assumes subject verb object. In this grammar no other option is possible, so this is accurate. If there are more than three words and it can't find a verb, it looks at the verbs it has seen previously and, if one of them appears in the sentence, splits on that. This doesn't always work, and it fails badly when a sentence uses "does not" together with a verb that isn't detected correctly, but it works for all of the text I have given it so far (about 66 lines).
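
The splitting rules described above can be sketched roughly like this. This is not the script from the repository, just a minimal illustration under a few assumptions: the function name `split_svo` is made up, the input is already tagged with Penn Treebank tags (in practice something like `nltk.pos_tag(nltk.word_tokenize(sentence))` would produce them), and the optional "because" clause is left out.

```python
def split_svo(tagged, known_verbs):
    """Split a tagged sentence into (subject, verb, object), or return None.

    tagged      -- list of (word, tag) pairs with Penn Treebank style tags
    known_verbs -- set of verbs found in earlier sentences, updated in place
    (Both names are hypothetical; they only illustrate the heuristic.)
    """
    words = [word for word, _ in tagged]
    if len(words) < 3:
        return None
    # The verb can be neither the first nor the last word.
    inner = range(1, len(words) - 1)
    if len(words) == 3:
        # Hardcoded fix: three words are always subject, verb, object.
        verb_at = 1
    else:
        # Trust the tagger first: take the first word tagged as a verb.
        verb_at = next((i for i in inner if tagged[i][1].startswith("VB")), None)
        if verb_at is None:
            # Fallback: split on a verb seen in a previous sentence.
            verb_at = next((i for i in inner if words[i] in known_verbs), None)
    if verb_at is None:
        return None
    known_verbs.add(words[verb_at])
    return " ".join(words[:verb_at]), words[verb_at], " ".join(words[verb_at + 1:])


seen = set()
first = split_svo([("NBC", "NNP"), ("owns", "VBZ"), ("SyFy", "NNP")], seen)
# Hand-written tags: "owns" is deliberately mis-tagged as a plural noun here
# to show the fallback; this is not real tagger output.
second = split_svo(
    [("Rupert", "NNP"), ("Murdoch", "NNP"), ("owns", "NNS"), ("Fox", "NNP")], seen
)
```

With the tags above, the second sentence only parses because "owns" was remembered from the first one. With an empty `known_verbs` it would return `None`.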

So what practical use does this have? Let's say that there's a conflict between ten people. Let's say that conflict is really complex, for example someone has divorced someone else and married someone else. Then their friends became unruly and insulted someone random in the vicinity. So let's say that you don't actually know what is going on besides a few statements of fact that don't actually make much sense. Normally a person might completely avoid the conflict because they don't want to get involved unintentionally with something they don't understand. But graphing this might make it possible to navigate the minefield of social hypocrisy, misunderstanding, and delusion that exists in any conflict. As Lady Grantham put it, "however much the couple may strive to be honest, no one is ever in possession of the facts". Let's put a few rules down. Don't connect anyone who is disagreeing into the first graph. Don't put anyone who is agreeing into the second graph. The first graph is alliances and neutral parties, the second graph is active conflicts. Since we're not ready to take on Syria just yet, let's draw something a little simpler.

Comcast owns NBC
NBC owns SyFy
Microsoft owned MSNBC
Microsoft divested MSNBC
Disney owns ABC
Disney owns ESPN
Disney owns A&E
Hearst owns A&E
Disney owns History Channel
Hearst owns History Channel
NBC owns USA
NBC owns Weather Channel
NBC owns Telemundo
Russian government owns RT
United Kingdom owns BBC
United Kingdom owns CBC
Rupert Murdoch owns Fox
CBS owns Showtime
CBS owns Viacom
CBS owns Westinghouse
CBS sold nuclear power plants to BNFL
United Kingdom owns BNFL
Viacom owns MTV
MTV owns Nickelodeon
MTV owns Comedy Central
MTV owns CMT
MTV owns VH1
MTV owns MTV2
Viacom owns BET
Rupert Murdoch owned News of the World
IBA founded Channel Four
United Kingdom operates IBA
Channel Four Television Corporation owns Channel Four
Scott Trust Limited owns The Guardian
Scott Trust Limited owns The Observer
Time Warner owns CNN
Time Warner owns TBS
Time Warner owns TNT
Time Warner owns HBO
Time Warner owns Cartoon Network
Time Warner owns Adult Swim
Time Warner merged AOL
Time Warner owns WB
Time Warner owns DC Comics
Time Warner owns New Line Cinema
Time Warner owns Time
Time owns Sports Illustrated
Time owns Travel + Leisure
Time owns Food & Wine
Time owns Fortune
Time owns People
Time owns InStyle
Time owns Life
Time owns Golf Magazine
Time owns Southern Living
Time owns Essence
Time owns Real Simple
Time owns Entertainment Weekly
Time owns Myspace
Hearst owns Popular Mechanics
Hearst owns Car and Driver
Hearst owns Cosmopolitan
Hearst owns Country Living
Hearst owns Dr. Oz
Hearst owns ELLE
Hearst owns Elle Decor
Hearst owns Esquire
Hearst owns Food Network Magazine
Hearst owns Good Housekeeping
Hearst owns Harper's Bazaar
Hearst owns House Beautiful
Hearst owns Marie Claire
Hearst owns Nat Mags
Hearst owns O
Hearst owns Red
Hearst owns Redbook
Hearst owns Road & Track
Hearst owns Seventeen
Hearst owns Town & Country
Hearst owns Veranda
Hearst owns Woman's Day
Hearst owns ESPN
Hearst owns Seattle Post-Intelligencer
Bonnier owns Popular Science
Media Ownership Consolidation
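
Once every statement is split into subject, verb, and object, the graph is little more than a list of edges labeled by verb. Here is a minimal sketch of that step using a handful of the statements above. The names `group_by_verb` and `owners_of` are invented for illustration and are not part of Small Wide World, and the layout and drawing are not shown.

```python
from collections import defaultdict


def group_by_verb(triples):
    """Collect (subject, object) edges under each verb."""
    graphs = defaultdict(list)
    for subject, verb, obj in triples:
        graphs[verb].append((subject, obj))
    return dict(graphs)


def owners_of(graphs, thing):
    """Everyone with an "owns" edge pointing at the given thing."""
    return sorted(s for s, o in graphs.get("owns", []) if o == thing)


# A few of the parsed statements from the list above.
triples = [
    ("Comcast", "owns", "NBC"),
    ("NBC", "owns", "SyFy"),
    ("Microsoft", "owned", "MSNBC"),
    ("Microsoft", "divested", "MSNBC"),
    ("Disney", "owns", "A&E"),
    ("Hearst", "owns", "A&E"),
]
graphs = group_by_verb(triples)
shared = owners_of(graphs, "A&E")
```

Keeping "owns" separate from "owned" and "divested" means past ownership doesn't get drawn as a current edge, which is one reason to key the edges on the exact verb rather than on its stem.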

Only one of the sentences was too complex for our script to parse: "CBS sold nuclear power plants to BNFL". To fix this, I changed it to two sentences: "CBS sold nuclear power plants" and "BNFL bought nuclear power plants". Clearly this is an oversimplification, but it shows the limitation of my simple 152-line Python script. If we wanted to handle complex grammars that involve a subject, an object, and a second object, we would need to incrementally add complex parsers. This isn't rocket science or brain surgery, but it is time consuming. I won't be publishing a generic English parser any sooner than I'll be publishing AI3. Funny that AI3 actually contains a considerable amount of English. From my work, my guess is that a generic parser would take a person a few months of pretty intense work.

Regular people could help in this task by coming up with reasonable sentences that they would want parsed and what information they would want parsed from each sentence. This may seem easy, but if you want all the data from a sentence, you get no secondary information. Allow me to explain. If I parse "Bonnier owns Popular Science" into owns("Bonnier", "Popular Science"), I can graph all ownership on a map: Bonnier -- Popular Science. Okay, let's parse it in a different way: ["Bonnier", "owns", "Popular Science"]. This doesn't help the computer system nearly as much because it doesn't say whether "owns" is a noun, verb, preposition, adjective, or adverb. Luckily the word "owns" can only be a verb, but what about the word "fights"? This is where NLTK fails. Let's look at WordNet's database entries for "fights":

$ abs fights
Everything about 'fights' :
found battle, conflict, fight, engagement
    a hostile meeting of opposing military forces in the course of a war
    Grant won a decisive victory in the battle of Chickamauga
    he lost his romantic ideas about war when he got into a real engagement
   Hyper: military_action, action
   Hypo:  Armageddon
   Hypo:  assault
   Hypo:  combat, armed_combat
   Hypo:  dogfight
   Hypo:  naval_battle
   Hypo:  pitched_battle
found fight, fighting, combat, scrap
    the act of fighting; any contest or struggle
    a fight broke out at the hockey game
    there was fighting in the streets
    the unhappy couple got into a terrible scrap
   Hyper: conflict, struggle, battle
   Hypo:  affray, disturbance, fray, ruffle
   Hypo:  battering, banging
   Hypo:  beating, whipping
   Hypo:  brawl, free-for-all
   Hypo:  brush, clash, encounter, skirmish
   Hypo:  close-quarter_fighting
   Hypo:  dogfight
   Hypo:  duel, affaire_d'honneur
   Hypo:  fencing
   Hypo:  fistfight, fisticuffs, slugfest
   Hypo:  gunfight, gunplay, shootout
   Hypo:  hassle, scuffle, tussle, dogfight, rough-and-tumble
   Hypo:  in-fighting
   Hypo:  knife_fight, snickersnee, cut-and-thrust
   Hypo:  rumble, gang_fight
   Hypo:  set-to
   Hypo:  shock, impact
   Hypo:  single_combat
found competitiveness, fight
    an aggressive willingness to compete
    the team was full of fight
   Hyper: aggressiveness
found fight
    an intense verbal dispute
    a violent fight over the bill is expected in the Senate
   Hyper: controversy, contention, contestation, disputation, disceptation, tilt, argument, arguing
found fight
    a boxing or wrestling match
    the fight was on television last night
   Hyper: boxing, pugilism, fisticuffs
found contend, fight, struggle
    be engaged in a fight; carry on a fight
    the tribesmen fought each other
    Siblings are always fighting
    Militant groups are contending for control of the country
   Hypo:  attack, assail
   Hypo:  bandy
   Hypo:  battle, combat
   Hypo:  bear_down
   Hypo:  box
   Hypo:  chicken-fight, chickenfight
   Hypo:  duel
   Hypo:  engage, wage
   Hypo:  fence
   Hypo:  feud
   Hypo:  fight, oppose, fight_back, fight_down, defend
   Hypo:  fight_back
   Hypo:  fistfight
   Hypo:  join_battle
   Hypo:  joust
   Hypo:  scuffle, tussle
   Hypo:  settle, get_back
   Hypo:  skirmish
   Hypo:  spar
   Hypo:  tourney
   Hypo:  tug
   Hypo:  war
   Hypo:  wrestle
found fight, oppose, fight_back, fight_down, defend
    fight against or resist strongly
    The senator said he would oppose the bill
    Don't fight it!
   Hyper: contend, fight, struggle
   Hypo:  recalcitrate
   Hypo:  repel, repulse, fight_off, rebuff, drive_back
   Hypo:  resist, hold_out, withstand, stand_firm
   Hypo:  resist, stand, fend
found fight, struggle
    make a strenuous or labored effort
    She struggled for years to survive without welfare
    He fought for breath
   Hyper: try, seek, attempt, essay, assay
   Hypo:  flounder
   Hypo:  tug, labor, labour, push, drive
found crusade, fight, press, campaign, push, agitate
    exert oneself continuously, vigorously, or obtrusively to gain an end or engage in a crusade for a certain cause or person; be an advocate for
    The liberal party pushed for reforms
    She is crusading for women's rights
    The Dean is pushing for his favorite candidate
   Hyper: advertise, advertize, promote, push

This is a bit too much information, but you can see that WordNet has an incredibly dense set of relationships for each word as well as synonym information. Let's condense this to just the part of speech of each sense and how common it is.

Synset                   Part of speech   Popularity
battle.n.01              noun             1
fight.n.02               noun             2
competitiveness.n.01     noun             1
fight.n.04               noun             4
fight.n.05               noun             5
contend.v.06             verb             6
fight.v.02               verb             2
fight.v.03               verb             3
crusade.v.01             verb             1
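
Counting senses per part of speech from a table like this is easy to automate, since the part of speech is the middle field of each synset name. The sketch below works on the names copied from the table, so it needs no WordNet data. The function name and the `POS_NAMES` mapping are my own, and with NLTK's WordNet corpus installed the names could presumably come from `wordnet.synsets("fights")` instead.

```python
from collections import Counter

# WordNet's one-letter part-of-speech codes as they appear in synset names.
POS_NAMES = {"n": "noun", "v": "verb", "a": "adjective",
             "s": "adjective satellite", "r": "adverb"}


def count_parts_of_speech(synset_names):
    """Count how many senses fall under each part of speech."""
    counts = Counter()
    for name in synset_names:
        # A synset name looks like "fight.n.02": lemma, POS code, sense number.
        _lemma, pos, _sense = name.rsplit(".", 2)
        counts[POS_NAMES.get(pos, pos)] += 1
    return counts


# Synset names copied from the table above.
FIGHTS = [
    "battle.n.01", "fight.n.02", "competitiveness.n.01", "fight.n.04",
    "fight.n.05", "contend.v.06", "fight.v.02", "fight.v.03", "crusade.v.01",
]
counts = count_parts_of_speech(FIGHTS)
```

Any word with more than one key in `counts` is one that a bare list of tokens cannot classify without extra context.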

As you can see, "fights" as a verb can mean 4 different things and "fights" as a noun can mean 5. Thus when I say "USA fights ISIL in Syria", I'm being imprecise. If you put that sentence, even with clarification or qualification, into your college essay, you should be marked down no matter how articulate the rest of the paper is. So if you're creating a catchy slogan for your social justice cause, don't use the word "fight" unless you're aiming for confusion. Let's give a good example: "Fight MS" is doubly confusing. Both words are easily confused unless the context is fully given. Using this we can understand the historical significance of slogans, but are slogans really valuable? Who reading this remembers HOPE?

Unfortunately, language gives us a really poor method of precise communication, and communicating with computers is even more difficult because they lack common sense. A little bit of effort on our part can create machine-parsable logic that is capable of communicating our ideas clearly and articulately. Those statements can also be parsed effectively by intelligent people, but for the most part people will be able to understand the drivel that we communicate to them if we are verbose enough. Here's a question for the reader: limiting yourself to 140 characters, can you express your feelings about this paragraph without reading the next paragraph? Please post your answer in the comment section before reading the next paragraph, but don't worry about the 140 character limit, I won't be grading.

But is this my encouragement for the reader to spend a day reducing their most important thoughts into machine parsable logic statements? Spending an hour working on this problem would be beneficial if you wish to become a more intelligent person, so I recommend it to everyone. But there's absolutely no reason for us to spend copious amounts of time communicating and parsing vast quantities of human text until we have a purpose. Who here has a good purpose for a huge quantity of parsed natural language? I wrote a relationship graph because I want people to use my graph layout software. What do you want to do with parsed natural language?

I believe that is all. If you'd like to play with it, you can clone my git repository for Small Wide World.

git clone

Javantea out.
