Natural Language in Small Wide World

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA512

by Javantea
Sept 1, 2016

Yesterday I published a small piece of software to Small Wide World's git to very little fanfare. It was a generalization of a bad piece of software I wrote the day before. It uses NLTK to perform a simple task: parse a simple sentence which follows the form "subject verb object" with optional additional information starting with "because". Examples of this grammar include:

GnuPG is software
IRC is a protocol
software implements a protocol
Javantea is human
AI3 is software
Javantea wrote AI3
Javantea writes software
Javantea writes English
Javantea reads German
Javantea reads Japanese
Javantea reads Portuguese
Javantea reads Spanish

nlp1.py creates this graph of the relationships:

Natural Language Parsed graph

How does it parse? It uses NLTK to tag parts of speech, then splits the sentence at its verb: everything before the verb is the subject, the verb itself is the relationship, and everything after it is the object. First of all, let's be honest: NLTK's part of speech tagger gets a bunch of words wrong out of the box. If you want to make something that does something simple or complex, you will run into this, or you will not test it well enough to run into it. To work around the tagger's inaccuracies I chose to hardcode a few fixes. If a sentence has exactly three words, the script assumes subject verb object; in this grammar no other reading is possible, so this is accurate. If a sentence has more than three words and the tagger can't find a verb, the script looks for a verb it has seen in earlier sentences and, if one is present, splits on that. This doesn't always work, and it fails badly when a sentence uses "does not" with a verb that isn't detected correctly, but it works for all of the text I have given it so far (about 66 lines).
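
The heuristic described above can be sketched in a few lines. This is not the code from nlp1.py, only an illustration under some assumptions: the input is a list of (word, tag) pairs using Penn Treebank tags, which is the shape nltk.pos_tag returns, and the tags below are typed in by hand so the sketch runs without NLTK. The function name split_svo and the known_verbs parameter are made up for this example.

```python
# Sketch of the split-on-verb heuristic; an illustration, not the real nlp1.py.

def split_svo(tagged, known_verbs=()):
    """Split [(word, tag), ...] into (subject, verb, object, reason)."""
    words = [word for word, _ in tagged]
    reason = None
    # Optional trailing clause introduced by "because".
    lowered = [word.lower() for word in words]
    if "because" in lowered:
        cut = lowered.index("because")
        reason = " ".join(words[cut + 1:])
        tagged = tagged[:cut]
        words = words[:cut]
    # Hardcoded fix: three words can only be subject verb object.
    if len(words) == 3:
        return words[0], words[1], words[2], reason
    # Otherwise split on the first token tagged as a verb (VB, VBD, VBZ, ...).
    # Start at index 1 so the subject is never empty.
    verb_at = None
    for i in range(1, len(tagged)):
        if tagged[i][1].startswith("VB"):
            verb_at = i
            break
    # Fallback: the tagger found no verb, so look for a verb seen earlier.
    if verb_at is None:
        for i in range(1, len(words)):
            if words[i] in known_verbs:
                verb_at = i
                break
    if verb_at is None:
        raise ValueError("no verb found in: " + " ".join(words))
    subject = " ".join(words[:verb_at])
    obj = " ".join(words[verb_at + 1:])
    return subject, words[verb_at], obj, reason


# Hand-written tags; the second sentence simulates a tagger mistake
# where "owns" comes back as a plural noun (NNS).
good = [("Time", "NNP"), ("Warner", "NNP"), ("owns", "VBZ"),
        ("Cartoon", "NNP"), ("Network", "NNP")]
bad = [("Hearst", "NNP"), ("owns", "NNS"), ("Road", "NNP"),
       ("&", "CC"), ("Track", "NNP")]
print(split_svo(good))
print(split_svo(bad, known_verbs={"owns"}))
```

Passing the previously seen verbs in as a set keeps the fallback explicit: the caller decides which verbs count as known, and a sentence with no recognizable verb raises an error instead of being silently mis-split.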

So what practical use does this have? Let's say that there's a conflict between ten people. Let's say that conflict is really complex, for example someone has divorced someone else and married someone else. Then their friends became unruly and insulted someone random in the vicinity. So let's say that you don't actually know what is going on besides a few statements of fact that don't actually make much sense. Normally a person might completely avoid the conflict because they don't want to get involved unintentionally with something they don't understand. But graphing this might make it possible to navigate the minefield of social hypocrisy, misunderstanding, and delusion that exists in any conflict. As Lady Grantham put it, "however much the couple may strive to be honest, no one is ever in possession of the facts". Let's put a few rules down. Don't connect anyone who is disagreeing in the first graph. Don't put anyone who is agreeing into the second graph. The first graph is alliances and neutral parties, the second graph is active conflicts. Since we're not ready to take on Syria just yet, let's draw something a little simpler.

Comcast owns NBC
NBC owns SyFy
NBC owns MSNBC
Microsoft owned MSNBC
Microsoft divested MSNBC
Disney owns ABC
Disney owns ESPN
Disney owns A&E
Hearst owns A&E
Disney owns History Channel
Hearst owns History Channel
NBC owns USA
NBC owns Weather Channel
NBC owns Telemundo
Russian government owns RT
United Kingdom owns BBC
United Kingdom owns CBC
Rupert Murdoch owns Fox
CBS owns Showtime
CBS owns Viacom
CBS owns Westinghouse
CBS sold nuclear power plants to BNFL
United Kingdom owns [BNFL](https://en.wikipedia.org/wiki/British_Nuclear_Fuels_Ltd)
Viacom owns MTV
MTV owns Nickelodeon
MTV owns Comedy Central
MTV owns CMT
MTV owns VH1
MTV owns MTV2
Viacom owns BET
Rupert Murdoch owned News of the World
IBA founded Channel Four
United Kingdom operates IBA
Channel Four Television Corporation owns Channel Four
Scott Trust Limited owns The Guardian
Scott Trust Limited owns The Observer
Time Warner owns CNN
Time Warner owns TBS
Time Warner owns TNT
Time Warner owns HBO
Time Warner owns Cartoon Network
Time Warner owns Adult Swim
Time Warner merged AOL
Time Warner owns WB
Time Warner owns DC Comics
Time Warner owns New Line Cinema
Time Warner owns Time
Time owns Sports Illustrated
Time owns Travel + Leisure
Time owns Food & Wine
Time owns Fortune
Time owns People
Time owns InStyle
Time owns Life
Time owns Golf Magazine
Time owns Southern Living
Time owns Essence
Time owns Real Simple
Time owns Entertainment Weekly
Time owns Myspace
Hearst owns Popular Mechanics
Hearst owns Car and Driver
Hearst owns Cosmopolitan
Hearst owns Country Living
Hearst owns Dr. Oz
Hearst owns ELLE
Hearst owns Elle Decor
Hearst owns Esquire
Hearst owns Food Network Magazine
Hearst owns Good Housekeeping
Hearst owns Harper's Bazaar
Hearst owns House Beautiful
Hearst owns Marie Claire
Hearst owns Nat Mags
Hearst owns O
Hearst owns Red
Hearst owns Redbook
Hearst owns Road & Track
Hearst owns Seventeen
Hearst owns Town & Country
Hearst owns Veranda
Hearst owns Woman's Day
Hearst owns ESPN
Hearst owns Seattle Post-Intelligencer
Bonnier owns Popular Science
Media Ownership Consolidation
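
Once sentences like the ones listed above are split into (subject, verb, object) triples, turning them into a graph is mostly bookkeeping. Here is a minimal sketch, assuming the triples already exist as Python tuples. The function names are invented for this example, and Graphviz DOT is used only as a stand-in output format; it is not what nlp1.py or Small Wide World actually produce.

```python
from collections import defaultdict

def build_graph(triples):
    """Map each verb to a dict of subject -> list of objects."""
    graph = defaultdict(lambda: defaultdict(list))
    for subj, verb, obj in triples:
        graph[verb][subj].append(obj)
    return graph

def shared_objects(graph, verb):
    """Return the objects that more than one subject points at via verb."""
    owners = defaultdict(list)
    for subj, objs in graph[verb].items():
        for obj in objs:
            owners[obj].append(subj)
    return {obj: subs for obj, subs in owners.items() if len(subs) > 1}

def to_dot(graph, verb):
    """Render one verb's edges as an undirected Graphviz DOT graph."""
    # Assumes names contain no double quotes; real input would need escaping.
    lines = ["graph %s {" % verb]
    for subj, objs in graph[verb].items():
        for obj in objs:
            lines.append('    "%s" -- "%s";' % (subj, obj))
    lines.append("}")
    return "\n".join(lines)

# A handful of triples taken from the list above.
triples = [
    ("Disney", "owns", "ESPN"),
    ("Disney", "owns", "A&E"),
    ("Hearst", "owns", "A&E"),
    ("Hearst", "owns", "ESPN"),
    ("Microsoft", "owned", "MSNBC"),
    ("NBC", "owns", "MSNBC"),
]
graph = build_graph(triples)
print(shared_objects(graph, "owns"))
print(to_dot(graph, "owned"))
```

Keeping one set of edges per verb means "owns" and "owned" never get mixed into the same picture. The same separation would serve the two-graph rule from earlier, with agreeing verbs feeding one graph and conflicting verbs the other.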

Only one of the sentences was too complex for our script to parse: "CBS sold nuclear power plants to BNFL". In order to fix this, I changed it to two sentences: "CBS sold nuclear power plants" and "BNFL bought nuclear power plants". Clearly this is an oversimplification, but it shows the limitation of my simple 152-line Python script. If we wanted to handle complex grammars that involve a subject, an object, and a second object, we would need to incrementally add complex parsers. This isn't rocket science or brain surgery, but it is time consuming. I won't be publishing a generic English parser any sooner than I'll be publishing AI3. Funny that AI3 actually contains a considerable amount of English.

From my work, my guess is that a generic parser would take a person a few months of pretty intense work. Regular people could help in this task by coming up with reasonable sentences that they would want parsed and what information they would want parsed from each sentence. This may seem easy, but if you simply ask for all of the data in a sentence, you get no secondary information. Allow me to explain. If I parse "Hearst owns Popular Mechanics" into owns("Hearst", "Popular Mechanics"), I can graph all ownership on a map: Hearst -- Popular Mechanics. Okay, let's parse it in a different way: ["Hearst", "owns", "Popular Mechanics"]. This doesn't help the computer system nearly as much because it doesn't say whether "owns" is a noun, verb, preposition, adjective or adverb. Luckily the word "owns" can only be a verb, but what about the word "fights"? This is where NLTK fails. Let's look at WordNet's database entries for fights:

$ abs fights
Everything about 'fights' :
found battle, conflict, fight, engagement
    battle.n.01
    noun.act
    a hostile meeting of opposing military forces in the course of a war
    Examples:
    Grant won a decisive victory in the battle of Chickamauga
    he lost his romantic ideas about war when he got into a real engagement
   Hyper: military_action, action
   Hypo:  Armageddon
   Hypo:  assault
   Hypo:  combat, armed_combat
   Hypo:  dogfight
   Hypo:  naval_battle
   Hypo:  pitched_battle
found fight, fighting, combat, scrap
    fight.n.02
    noun.act
    the act of fighting; any contest or struggle
    Examples:
    a fight broke out at the hockey game
    there was fighting in the streets
    the unhappy couple got into a terrible scrap
   Hyper: conflict, struggle, battle
   Hypo:  affray, disturbance, fray, ruffle
   Hypo:  battering, banging
   Hypo:  beating, whipping
   Hypo:  brawl, free-for-all
   Hypo:  brush, clash, encounter, skirmish
   Hypo:  close-quarter_fighting
   Hypo:  dogfight
   Hypo:  duel, affaire_d'honneur
   Hypo:  fencing
   Hypo:  fistfight, fisticuffs, slugfest
   Hypo:  gunfight, gunplay, shootout
   Hypo:  hassle, scuffle, tussle, dogfight, rough-and-tumble
   Hypo:  in-fighting
   Hypo:  knife_fight, snickersnee, cut-and-thrust
   Hypo:  rumble, gang_fight
   Hypo:  set-to
   Hypo:  shock, impact
   Hypo:  single_combat
found competitiveness, fight
    competitiveness.n.01
    noun.attribute
    an aggressive willingness to compete
    Examples:
    the team was full of fight
   Hyper: aggressiveness
found fight
    fight.n.04
    noun.communication
    an intense verbal dispute
    Examples:
    a violent fight over the bill is expected in the Senate
   Hyper: controversy, contention, contestation, disputation, disceptation, tilt, argument, arguing
found fight
    fight.n.05
    noun.act
    a boxing or wrestling match
    Examples:
    the fight was on television last night
   Hyper: boxing, pugilism, fisticuffs
found contend, fight, struggle
    contend.v.06
    verb.competition
    be engaged in a fight; carry on a fight
    Examples:
    the tribesmen fought each other
    Siblings are always fighting
    Militant groups are contending for control of the country
   Hypo:  attack, assail
   Hypo:  bandy
   Hypo:  battle, combat
   Hypo:  bear_down
   Hypo:  box
   Hypo:  chicken-fight, chickenfight
   Hypo:  duel
   Hypo:  engage, wage
   Hypo:  fence
   Hypo:  feud
   Hypo:  fight, oppose, fight_back, fight_down, defend
   Hypo:  fight_back
   Hypo:  fistfight
   Hypo:  join_battle
   Hypo:  joust
   Hypo:  scuffle, tussle
   Hypo:  settle, get_back
   Hypo:  skirmish
   Hypo:  spar
   Hypo:  tourney
   Hypo:  tug
   Hypo:  war
   Hypo:  wrestle
found fight, oppose, fight_back, fight_down, defend
    fight.v.02
    verb.competition
    fight against or resist strongly
    Examples:
    The senator said he would oppose the bill
    Don't fight it!
   Hyper: contend, fight, struggle
   Hypo:  recalcitrate
   Hypo:  repel, repulse, fight_off, rebuff, drive_back
   Hypo:  resist, hold_out, withstand, stand_firm
   Hypo:  resist, stand, fend
found fight, struggle
    fight.v.03
    verb.social
    make a strenuous or labored effort
    Examples:
    She struggled for years to survive without welfare
    He fought for breath
   Hyper: try, seek, attempt, essay, assay
   Hypo:  flounder
   Hypo:  tug, labor, labour, push, drive
found crusade, fight, press, campaign, push, agitate
    crusade.v.01
    verb.social
    exert oneself continuously, vigorously, or obtrusively to gain an end or engage in a crusade for a certain cause or person; be an advocate for
    Examples:
    The liberal party pushed for reforms
    She is crusading for women's rights
    The Dean is pushing for his favorite candidate
   Hyper: advertise, advertize, promote, push

This is a bit too much information, but you can see that WordNet has an incredibly dense set of relationships for each word, as well as synonym information. Let's condense this to just the part of speech of each sense and how common it is.

Synset                Part of speech  Popularity
battle.n.01           noun            1
fight.n.02            noun            2
competitiveness.n.01  noun            1
fight.n.04            noun            4
fight.n.05            noun            5
contend.v.06          verb            6
fight.v.02            verb            2
fight.v.03            verb            3
crusade.v.01          verb            1
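
The part of speech column can be read straight off the synset names, which follow the pattern lemma.pos.number. A small sketch, assuming the nine names from the table are typed in by hand; with NLTK and its WordNet corpus installed, the same names should be obtainable by calling name() on each result of nltk.corpus.wordnet.synsets("fights"), but that step is left out here so the example runs on its own. The helper names are invented for this example.

```python
from collections import Counter

# WordNet's one-letter part of speech codes.
POS_NAMES = {"n": "noun", "v": "verb", "a": "adjective",
             "s": "adjective satellite", "r": "adverb"}

# Synset names copied from the table above.
SYNSETS = [
    "battle.n.01", "fight.n.02", "competitiveness.n.01",
    "fight.n.04", "fight.n.05", "contend.v.06",
    "fight.v.02", "fight.v.03", "crusade.v.01",
]

def pos_of(synset_name):
    """Return the part of speech encoded in a name like 'fight.n.02'."""
    # Split from the right in case the lemma itself contains a dot.
    _lemma, pos, _number = synset_name.rsplit(".", 2)
    return POS_NAMES[pos]

def count_parts_of_speech(names):
    """Count how many senses fall under each part of speech."""
    return Counter(pos_of(name) for name in names)

print(count_parts_of_speech(SYNSETS))
```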

As you can see, "fights" as a verb can mean 4 different things, and "fights" as a noun can mean 5. So when I say "USA fights ISIL in Syria", I'm being imprecise. If you put that sentence into your college essay, even with clarification or qualification, you should be marked down no matter how articulate the rest of the paper is. And if you're creating a catchy slogan for your social justice cause, don't use the word fight unless you're aiming for confusion. Let's give a good example: "Fight MS" is doubly confusing. Both words are easily confused unless the full context is given. Using this we can understand the historical significance of slogans, but are slogans really valuable? Who reading this remembers HOPE?

Unfortunately, language gives us a really poor method of precise communication, and communicating with computers is even more difficult because they lack common sense. A little bit of effort on our part can create machine parsable logic that is capable of communicating our ideas clearly and articulately. Those statements can also be parsed effectively by intelligent people, but for the most part people will be able to understand the drivel that we communicate to them if we are verbose enough. Here's a question for the reader: limiting yourself to 140 characters, can you express your feelings about this paragraph without reading the next paragraph? Please post your answer in the comment section before reading the next paragraph, but don't worry about the 140 character limit, I won't be grading.

But is this my encouragement for the reader to spend a day reducing their most important thoughts into machine parsable logic statements? Spending an hour working on this problem would be beneficial if you wish to become a more intelligent person, so I recommend it to everyone. But there's absolutely no reason for us to spend copious amounts of time communicating and parsing vast quantities of human text until we have a purpose. Who here has a good purpose for a huge quantity of parsed natural language? I wrote a relationship graph because I want people to use my graph layout software. What do you want to do with parsed natural language?

I believe that is all. If you'd like to play with nlp1.py, you can clone my git repository for Small Wide World.

git clone https://www.altsci.com/repo/smallwideworld.git

Javantea out.

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2

iQIcBAEBCgAGBQJXyKGnAAoJEDxoyNvLp4PvCKEP/2uQk8ltI7L1+Edf2W0+Z4xy
kEDQi9H5cSGstte7nrN/JXnimUZbjpgvIWLan/koqXWRXSSIc/22hq28Pn0uq5hU
3G14T2zJgPSIf18snIkE2wgFul85mz7AmtgEhkrYP7AqZDUALF0DrfDD6dQg9LeH
GcriodOFMuBxO9/k4g1iBBmwNBiu709qXFhzn0CrSTKxK6f7wD2F4bFVyusEgEf8
TrnPdvTA+w/NmEHtGFhLEdAwp+c+t+HhMftN3Pv/1Q9Fdr6kj1cInCifoFkvcEfn
4ITdasM/Nb7S4KNHLaGNEYxxk245Do1rJw4TEUEHSbzz0Gdvbmk6okKDk3qYKoHB
jjvIM3m20+esQS16LwOF5ECw3dF9R4crAFqHN9U+2pl2avjV/cU1iFN0ouIu9gy2
VfYQKXSceQIuGdJM6znWz8d9oCWYvMLGd6dDRqY8VqG1EyGIWGQXffItk4vnlEXl
xcLQSMWy7pm1M4+gV8mAKlsc81GRe8K68LaJCwMrDzlDnFem5mEfgBTyJZa29hbP
8Tc0lbjcODNE0JQFwZuyxVhZwp3m4yrbQrA+/y/S/0GwfEDXPcaYFXQN/iOVeJ+T
Hej7eTfiiOhOWcGwrmsEFDXiRiyh5OW/KqW2QMSnjpV4WLUSpSTeOikm/jn2E3rC
Kdb/MzsVszuK5MdVPyuK
=ZjkD
-----END PGP SIGNATURE-----
