Today I'm releasing a simple Sunday script that builds on code from the distant past. How do you sort a list of words in a language you're learning? The easiest order is the order you originally wrote them down in. The second easiest (and probably one of the best orderings) is random. A single shuffle removes the biases that a human put in, but it just fixes the list in another order. If you reshuffle your list each time you read from it, you remove that bias too.
This is how I ended up trying to learn kanji last year. It didn't go so well. What's wrong with random ordering of a list of words? It provides a test of one's memory, not a way to build it. As my memory is not great, the random ordering does not actually solve the problem of memorization. So how does one memorize words? I had quite a bit of success with WaniKani this winter. I learned 85 kanji (that I had already learned on my own) and 184 words based on those kanji in a few months. What does WaniKani do that I didn't do? 1) WaniKani is an SRS (spaced repetition system). 2) WaniKani has experts who put together a good set of data about kanji and vocabulary to cement the meanings and readings.
I decided to write my own SRS and it's working poorly. What did I do wrong? I think I need to tune it so that mistakes are punished. One of the things I didn't like about WaniKani was that it punished mistakes by slowing down the learning process. That made it impossible to move on with a score lower than 100% on the tests. Eventually one gets 100%, since questions answered correctly are removed and the process of memorizing the remaining vocabulary gets easier. Anyway, the results from my SRS show that removing the punishment part of SRS (on hard kanji that I did not remember on the first go-around) results in ~50% successful recall. Is 50% good enough? No.
But I'm not publishing my SRS. I'm not even publishing something that will work on the Japanese language. I'm publishing a script that sorts words by popularity. How? If we have a corpus of legitimate sentences from the target language, we can create a histogram of words. Then we can cross-reference it with our list of words to get a count for each word. An example makes this clearer.
Say you have a book written in Brazilian Portuguese. The histogram of words can be created with my script. The first 8 words in the histogram can be seen below:
139104 de
124121 que
115228 e
109142 o
64106 se
47148 do
46125 não
42648 um
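For anyone who would rather see the histogram step spelled out, here is a minimal Python sketch. It is not the script from this post: the function name `word_histogram`, the letters-only regex, and the sample sentence are all assumptions made up for illustration, and a real corpus would be read from a file instead.

```python
import re
from collections import Counter

# Assumption: a "word" is a run of letters. The pattern is Unicode-aware,
# so accented words such as "não" are kept whole.
WORD_RE = re.compile(r"[^\W\d_]+")

def word_histogram(text):
    """Count how often each lowercased word occurs in the text."""
    return Counter(WORD_RE.findall(text.lower()))

# Tiny stand-in for a real corpus (a made-up sentence, not from the book).
sample = "O que não se vê não se sente, e o que se vê se sente."

# Print the most frequent words in the same "count word" layout.
for word, count in word_histogram(sample).most_common(3):
    print(count, word)
```

Lowercasing before counting is a design choice here, so that "O" and "o" land in the same bucket; whether that is right for a given language or word list is a judgment call.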
So now we want to cross-reference this list with our short list of words we want to learn. My script (just grep) does the cross-referencing, moving the most popular words to the top of the list. The first 8 words in the list can be seen below:
3515 porém
2340 fosse
2170 tanto
2072 são
1508 dizia
1431 algumas
1394 força
1387 ar
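The cross-reference itself is small enough to sketch. The post's script uses grep; the Python below is a stand-in under stated assumptions, not that script. The name `sort_by_popularity` is hypothetical, the histogram is a plain dict of word to count, and the counts are copied from the lists in this post.

```python
def sort_by_popularity(words, histogram):
    """Return (count, word) pairs for words found in the histogram,
    most frequent first. Words with no count are left out."""
    found = [(histogram[w], w) for w in words if histogram.get(w, 0) > 0]
    # sorted() is stable, so words with equal counts keep their list order.
    return sorted(found, key=lambda pair: -pair[0])

# A few counts taken from the histograms in this post.
histogram = {"de": 139104, "porém": 3515, "fosse": 2340, "tanto": 2170}

# Hypothetical study list, in the order it was written down.
study_list = ["tanto", "porém", "básica", "fosse"]

for count, word in sort_by_popularity(study_list, histogram):
    print(count, word)
```

Here "básica" drops out because it has no entry in the histogram, and "de" is ignored because it is not on the study list.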
Does focusing on the most common words in a corpus improve language learning? Maybe. Does it make sense to focus on the bottom of the list? Actually the bottom of the list is more interesting than the top of the list. There are plenty of reasons for a word to not show up in our corpus. Let's take a look at words not found in the corpus:
básica recem paraceu chateado sujeiras for expressou enjoo propugnando submetidas lesivas avaliação esquema cientistas nucleo regionais telefonema tênis planejar
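Finding the words that never show up is the same lookup turned around. A minimal sketch, again in Python as a stand-in for grep, with a hypothetical function name and a plain dict of word to count as the assumed histogram format:

```python
def missing_words(words, histogram):
    """Return the words that have no count in the histogram,
    in their original list order."""
    return [w for w in words if histogram.get(w, 0) == 0]

# Counts copied from the sorted list in this post.
histogram = {"porém": 3515, "tanto": 2170}

# Hypothetical study list mixing found and not-found words.
study_list = ["porém", "básica", "recem", "tanto"]

print(missing_words(study_list, histogram))
```

The lookup is an exact string match, so anything on the list that is spelled differently from the corpus will also land here.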
Interesting, right? What is more interesting to me is how few of these words can be found in my English-Portuguese dictionary. That's right, we're getting into the nitty-gritty of language learning. Compiling resources that make it possible to remember words is the smart way to build a structure by which you can learn. But that is not sufficient in itself. You need motivation, interaction, and a native speaker. Whether that native speaker is speaking to you in lessons, on YouTube, or in person, native speakers are critical in language learning. If you're looking for a new corpus of native speakers, I recommend Mozilla's Common Voice.
So portuguese_words is a project I wrote on a Sunday afternoon, based on code I wrote 10 years ago. It uses a large Portuguese corpus to sort a list of words I want to learn by popularity.