AI3 Regular Blog

by Javantea
Jan 12, 2013

I've been blogging more than usual since I released AI3 on Christmas Eve. You should check it out. Of all the websites I have released, AI3 has the most potential and should get the most respect. I purchased a super-fast server (with an SSD especially for fast database lookups), leased a super-fast colo space for it, and am going to add to it regularly. As a feature of AI3, I will attempt to keep a regular blog here with insight into what I think about each feature of the website, and then I will make a page with that data on AI3 using a simple slug. I've already done a few if you want to look at the past few blog posts.

The feature that I'm going to discuss today is single-minded research of a single difficult topic. Searching for a common word in Google can be one of the most frustrating things in the world. What you really want is for someone to answer the question you are asking, not to learn every way your question can be misunderstood. Sometimes AI3 will fail; there's no doubt that Google is more in depth than anything I can create, even if I had all of Wikipedia. So let's get in depth on a question that is simple to state but is not one of the easy questions I've been dealing with. Let's ask: "Is the word 'We' used more positively or negatively?" By that, I mean "Is the sentence 'We plan to solve poverty by 2017' more common than 'We can not solve poverty by 2017'?" And not just that sentence, but every sentence of the positive form "We *verb*" versus the negative form "We *verb* not". This is a deviously difficult problem. Even with a huge corpus, definitive answers require statistical analysis of a ton of stuff.

Let's attempt it though. Start with We and we. All words in AI3 are case-sensitive, which is why there are links to all variants of we on the We word page. 1276 pages is too many unless we have a script. Let's try collocation of We. It's a slow process because We is such a common word. You can look below if you're impatient. While you're waiting, maybe try looking at a few sentences. The second sentence is:

`` We didn't want town work '', Jones said.

Eureka already? Yup. All we need to do is look up We together with every word that is in the negative. That's pretty easy, right? There are only four pages of words that contain n't and most of them are pretty uncommon. Note that there's a bug where two words joined by dashes are treated as one word. That's a problem with my parser, which should be more intelligent about whitespace. So manually or automatically, we can start searching for sentences that contain We didn't and so on. Since the related page doesn't have a count (due to slowness), we are stuck just trying a high page number and using a binary search from there.

If you don't know what a binary search is, let me explain. Let's say that there could be upwards of 100 pages of sentences. Simply skip to page 100. If it gives you an error, then there aren't that many pages. Go to half that number, page 50. Halve the number again and again until you come upon a valid page. Then pick a number halfway between the valid page and the invalid page, and keep narrowing the gap the same way. After a few tries, you will find that page 6 is the end of We didn't. In total, it should only take about 7 tries to find any number between 1 and 100 because 2^7 is 128, which is more than 100. If you don't understand the math, hopefully you'll understand the process.

Anyway, now we have a way of counting all the negative sentences. Then we simply need to count all the sentences that contain We. That can be found on the We word page. But let's say that you thought this algorithm through and have some skill with a database. How long would it take you to come up with the solution?
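Before getting to the database, here is a minimal sketch of the page hunt described above, in case the prose version is hard to follow. It assumes a hypothetical page_exists(n) check that returns True for a valid page and False for an error page; that function is not part of AI3, so you would have to write it against whatever page you are testing. The sketch brackets the answer first and then bisects, which is a slight variation on halving down from 100.

```python
from typing import Callable


def last_valid_page(page_exists: Callable[[int], bool], guess: int = 100) -> int:
    """Return the highest page number for which page_exists is True (0 if none).

    page_exists is a hypothetical callback; assume every page up to the
    last one is valid and every page after it is an error.
    """
    if not page_exists(1):
        return 0
    low = 1        # highest page known to be valid
    high = guess   # candidate for the lowest page known to be invalid
    # If the first guess is still valid, keep doubling until we overshoot.
    while page_exists(high):
        low = high
        high *= 2
    # Now low is valid and high is invalid: bisect the gap between them.
    while high - low > 1:
        middle = (low + high) // 2
        if page_exists(middle):
            low = middle
        else:
            high = middle
    return low


if __name__ == "__main__":
    # Stand-in for the We didn't listing, which ends at page 6.
    print(last_valid_page(lambda n: n <= 6))
```

Each probe halves the remaining gap, which is why the number of tries grows so slowly compared to the number of pages.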

select group_concat(id) from (select id from one_word where value like "%n't%" limit 40) x\G
*************************** 1. row ***************************
group_concat(id): 18,83,361,446,505,521,527,582,650,784,826,1042,1687,2102,2153,2758,3464,3664,4391,4880,4888,5661,5882,6002,6018,7292,8221,8791,18808,27077,28006,28196,31523,43118,43179,54124,54422,55162,94110,108086

select id from one_word where value = 'we';
select id from one_word where value = 'We';

select sw.sentence_id from one_sentence_words sw join one_sentence_words sw2 on sw.sentence_id = sw2.sentence_id where sw.word_id in (295,443) and sw2.word_id in (18,83,361,446,505,521,527,582,650,784,826,1042,1687,2102,2153,2758,3464,3664,4391,4880,4888,5661,5882,6002,6018,7292,8221,8791,18808,27077,28006,28196,31523,43118,43179,54124,54422,55162,94110,108086) group by sw.sentence_id limit 20;
| sentence_id |
|          67 |
|          78 |
|         150 |
|         168 |
|         169 |
|         315 |
|         633 |
|        1257 |
|        1494 |
|        1952 |
|        1980 |
|        2667 |
|        2693 |
|        3281 |
|        3295 |
|        3322 |
|        3379 |
|        3591 |
|        3597 |
|        3834 |
20 rows in set (0.32 sec)

select count(distinct sw.sentence_id) from one_sentence_words sw join one_sentence_words sw2 on sw.sentence_id = sw2.sentence_id where sw.word_id in (295,443) and sw2.word_id in (18,83,361,446,505,521,527,582,650,784,826,1042,1687,2102,2153,2758,3464,3664,4391,4880,4888,5661,5882,6002,6018,7292,8221,8791,18808,27077,28006,28196,31523,43118,43179,54124,54422,55162,94110,108086);
| count(distinct sw.sentence_id) |
|                           4087 |
1 row in set (0.91 sec)

select id from one_word where value = 'not';
select id from one_word where value = 'Not';

select count(distinct sw.sentence_id) from one_sentence_words sw join one_sentence_words sw2 on sw.sentence_id = sw2.sentence_id where sw.word_id in (295,443) and sw2.word_id in (50, 603);
| count(distinct sw.sentence_id) |
|                           9505 |
1 row in set (53.16 sec)

select count(distinct sw.sentence_id) from one_sentence_words sw where sw.word_id in (295,443);
| count(distinct sw.sentence_id) |
|                          86542 |
1 row in set (29.20 sec)
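
The three counts above can be combined into a rough share of We sentences that also carry a negative. This is only a sketch of the arithmetic: it assumes the two negative counts are simply added, which double-counts any sentence containing both an n't word and not, so the sum is an upper bound. Keep in mind that the n't list was limited to the first 40 matching words.

```python
# Counts reported by the queries above.
we_with_nt = 4087    # sentences with We/we and one of the 40 n't words
we_with_not = 9505   # sentences with We/we and not/Not
we_total = 86542     # all sentences with We/we

# The true number of negative sentences is the size of the union of the two
# sets, which lies between the larger count and the sum of both counts.
lower = max(we_with_nt, we_with_not) / we_total
upper = (we_with_nt + we_with_not) / we_total

print(f"between {lower:.1%} and {upper:.1%}")  # between 11.0% and 15.7%
```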

The results show that We combined with a negative is fairly rare: the two negative counts (4087 with n't words and 9505 with not) add up to about 16% of the 86542 sentences that contain We (plus or minus a few percent due to errors in the system). I didn't count the use of the word no with the word We because of possible conflicts. We could add a fraction of these sentences if we wanted to. It's pretty clear that direct database access is by far the cheaper and easier way to get at this data. As I improve the website, it may be possible for me to give more direct database access to superusers who are interested in improving the website. Django's powerful (yet fairly obtuse) QuerySet API makes it possible to give users much more control over the database than any of my string-based database query models have done. Of course, with abstraction comes difficulty of use, so it's a tradeoff.
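
If you want to get a feel for the self-join without access to the real database, here is a toy version using Python's built-in sqlite3 module. The table and column names (one_word, one_sentence_words, sentence_id, word_id, value) come from the queries above; the rows, the word ids and the count_with helper are made up for illustration, and sqlite3 is only a stand-in for the real server.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table one_word (id integer primary key, value text);
    create table one_sentence_words (sentence_id integer, word_id integer);
""")

# Hypothetical word ids; the real ids in AI3 are different.
words = {1: "We", 2: "we", 3: "didn't", 4: "not", 5: "want", 6: "can"}
conn.executemany("insert into one_word values (?, ?)", list(words.items()))

# Hypothetical sentences, stored as lists of word ids.
sentences = {
    10: [1, 3, 5],  # We didn't want
    11: [1, 6],     # We can
    12: [2, 6, 4],  # we can not
    13: [5, 4],     # want not (no We at all)
}
rows = [(sid, wid) for sid, wids in sentences.items() for wid in wids]
conn.executemany("insert into one_sentence_words values (?, ?)", rows)


def count_with(we_ids, other_ids):
    """Count sentences containing a word from we_ids and a word from other_ids."""
    we_marks = ",".join("?" * len(we_ids))
    other_marks = ",".join("?" * len(other_ids))
    sql = f"""
        select count(distinct sw.sentence_id)
        from one_sentence_words sw
        join one_sentence_words sw2 on sw.sentence_id = sw2.sentence_id
        where sw.word_id in ({we_marks}) and sw2.word_id in ({other_marks})
    """
    return conn.execute(sql, [*we_ids, *other_ids]).fetchone()[0]


we_ids = [1, 2]
nt_ids = [r[0] for r in conn.execute(
    "select id from one_word where value like ?", ("%n't%",))]
negative = count_with(we_ids, nt_ids + [4])
total = conn.execute(
    "select count(distinct sentence_id) from one_sentence_words "
    "where word_id in (1, 2)").fetchone()[0]
print(negative, total)
```

Passing the n't ids and the not id in one list counts each sentence once even if it contains both kinds of negative, which avoids the double counting you get from adding two separate counts.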

Collocation of We

We are (2448 uses)
We are thirsty and hungry ; ;
We have (1871 uses)
`` We have now a national character to establish '', Washington wrote in 1783.
We Are (1331 uses)
A second collection of nursery rhymes, Now We Are Six, was published in 1927.
We can (1151 uses)
He swung round to the other men -- `` We can catch him easy!!
We were (888 uses)
We were coming to an intersection, turning right, chuffing to a stop.
We will (725 uses)
We will recall that the still confident liberals of the Truman administration gathered with other Western utopians in San Francisco to set up the legal framework, finally and at last, to rationalize war -- to rationalize want and fear -- out of the world: the United Nations.
We had (508 uses)
`` We had to do something ''.
We must (486 uses)
We must believe we have the ability to affect our own destinies: otherwise why try anything??
We do (435 uses)
We do not defeat the good ones with this cruelty, but we add to their burden, while expecting them to bestow saintliness upon us in return for ostentatious church attendance and a few bucks a week, American cash.
We believe (417 uses)
We believe that autism, like so many other conditions of defect and deviation, is to a large extent inborn.
We know (407 uses)
`` We know Penny spent some -- and Carmer must have dropped a few dollars getting that load on ''.
We Were (372 uses)
His son was born in August 1920 and in 1924 Milne produced a collection of children's poems When We Were Very Young, which were illustrated by Punch staff cartoonist E. H. Shepard.
We want (355 uses)
We want him back there or we want him dead ''.
We don't (331 uses)
We don't want Barton's Night Riders loose again ''.
We Can (315 uses)
David Lloyd George adopted a programme at the 1929 general election entitled We Can Conquer Unemployment !, although by this stage the Liberals had declined to third-party status.
We shall (309 uses)
We shall return to these statements and deal with them more fully as the evidence for them accumulates.
We Will (307 uses)
and We Will Rock You.
We also (300 uses)
We also know that the Saxon Shore as reflected in the Notitia was created as a part of the Theodosian reorganization of Britain ( post A.D. 369 ).
We should (288 uses)
We should not become confused or let our public become confused over irrelevant questions of number or even of geography.
We need (284 uses)
We need a doctrine of imitation to save us from the solipsism and futility of pure formalism.
We ’ (279 uses)
We re not here to save the culture.
We Go (275 uses)
Satirising life in a British prison, meanwhile, the Bowie-penned " Over the Wall We Go " became a 1967 single for Oscar ; another Bowie composition, " Silly Boy Blue ", was released by Billy Fury the following year.
We may (262 uses)
We may take her with us -- to California.
We Love (242 uses)
* 1958: The Lucy-Desi Comedy Hour ( 1 episode, 1958 ) ... aka " We Love Lucy " – USA ( syndication title ) – Lucy Wins a Race Horse ( 1958 ) TV episode ( performer: " The Bayamo ")
We Know (239 uses)
* " Conspiracy ", a song by the pop-punk band Paramore from their album All We Know Is Falling
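
For the curious, a list like the one above can be approximated by counting which word follows We in each sentence. This is a minimal sketch that assumes collocation means "the next whitespace-separated token"; AI3's actual parser and ranking may well differ, and the collocations function is a name I made up for the example. The sample sentences are taken from the list above.

```python
from collections import Counter


def collocations(sentences, word):
    """Count (word, next_token) pairs, case-sensitively, across sentences."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.split()
        # Pair every token with the one that follows it.
        for current, following in zip(tokens, tokens[1:]):
            if current == word:
                counts[(current, following)] += 1
    return counts


sample = [
    "We are thirsty and hungry ; ;",
    "`` We have now a national character to establish '', Washington wrote in 1783.",
    "We must believe we have the ability to affect our own destinies: otherwise why try anything??",
    "A second collection of nursery rhymes, Now We Are Six, was published in 1927.",
]

counts = collocations(sample, "We")
for (first, second), uses in counts.most_common():
    print(f"{first} {second} ({uses} uses)")
```

Because the match is case-sensitive, the lowercase we have in the third sentence is not counted under We, which mirrors how AI3 treats We and we as separate words.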

If you're interested in becoming a superuser of AI3, let me know. In case you're wondering about bait and switch, the site will remain free, with no advertising, for as long as it's up. Also, I will not sell or give user data to anyone. The database itself is created out of free sources and will always be available under the Creative Commons license.

There's something to be said about a regular blog. Sometimes it's a burden and sometimes people don't read, but it's fun to be able to say: I wrote a blog about that very topic 3 years ago. Some people get much more exposure than this blog by posting on Facebook, Reddit, Slashdot, Hacker News and Twitter. But those news sources rotate content so fast that hardly anything sticks. And most of the posts are either images (1 second max impression), posts (short or tl;dr), news articles (skipped) or blogs (just like this one but more relevant). So I see blogs as being the medium for the monologue, the oratory. What other service would be fit to print the above?

That said, I'm headed to Brazil on Tuesday (1/15) and I'll be blogging on my AltSci Brasil website about that as well as this website until February 11. I won't be checking my phone's voicemail very often, so if you want to contact me, I recommend e-mail. I will use VoIP for phone so that my Brazilian friends can call me locally, so you can try your luck with that by calling (206) 787-9322. It has voicemail that will actually go to my e-mail so I will get it sooner than a voicemail to my cell phone if that makes sense.

Javantea out.

