Portuguese Words

pt_words-0.1.tgz [sig]
git clone https://www.altsci.com/repo/pt_words.git

Today I release a simple Sunday script building on the distant past. How do you sort a list of words in a language you're learning? The easiest order is the order you originally wrote them down in. The second easiest (and probably one of the best orderings) is random. Random ordering removes the biases that a human put into the list, and if you reshuffle the list each time you read from it, no single ordering gets the chance to become a new bias.

This is how I ended up trying to learn kanji last year. It didn't go so well. What's wrong with random ordering of a list of words? It tests one's memory but does nothing to build it. As my memory is not great, random ordering does not actually solve the problem of memorization. So how does one memorize words? I had quite a bit of success with WaniKani this winter. I learned 85 kanji (that I had already learned on my own) and 184 words based on those kanji in a few months. What does WaniKani do that I didn't do? 1) WaniKani is an SRS (spaced repetition system). 2) WaniKani has experts who come up with a good set of data about kanji and vocabulary to cement the meanings and readings.

I decided to write my own SRS and it's working poorly. What did I do wrong? I think I need to tune it so that mistakes are punished. One of the things I didn't like about WaniKani was that it punished mistakes by slowing down the learning process. That made it impossible to move on with a score lower than 100% on the tests. Eventually one gets 100%, since questions answered correctly are removed and memorizing the remaining vocabulary gets easier. Anyway, the results from my SRS show that removing the punishment part of SRS (on hard kanji that I did not remember on the first go-around) results in ~50% successful recall. Is 50% good enough? No.
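To make "punishing mistakes" concrete, here is a minimal sketch of a Leitner-style box scheme in Python. This is not my SRS and it is not WaniKani's algorithm; the function name `review`, the five boxes, and the intervals are all assumptions chosen for illustration.

```python
# Hypothetical review intervals in days for boxes 1 through 5.
INTERVALS = {1: 1, 2: 2, 3: 4, 4: 8, 5: 16}


def review(box, correct, max_box=5):
    """Return the new box for a card after one review.

    A correct answer promotes the card one box (longer interval).
    A mistake is "punished" by sending the card back to box 1.
    """
    if correct:
        return min(box + 1, max_box)
    return 1


# A card in box 3 that is answered wrongly drops back to box 1.
box = review(3, correct=False)
print("new box:", box, "review again in", INTERVALS[box], "day(s)")
```

A gentler variant would demote by one box instead of resetting to box 1; which one works better is exactly the kind of tuning I'm talking about.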

But I'm not publishing my SRS. I'm not even publishing something that will work on the Japanese language. I'm publishing a script that sorts words by popularity. How? If we have a corpus of legitimate sentences from the target language, we can create a histogram of words, that is, a count of how often each word appears. Then we can cross-reference it with our list of words to get a count for each word. An example makes this clearer.

Say you have a book written in Brazilian Portuguese. The histogram of words can be created with my script. The first 8 words in the histogram can be seen below:

139104 de
124121 que
115228 e
109142 o
64106 se
47148 do
46125 não
42648 um
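For readers who want to see the idea rather than the script, the histogram step could look roughly like this in Python. This is a sketch under assumptions, not the code in pt_words: the function name `word_histogram` and the sample sentence are made up for illustration, and a real corpus would be read from a file.

```python
import re
from collections import Counter


def word_histogram(text):
    """Count how often each word occurs in the text, ignoring case."""
    # [^\W\d_]+ matches runs of Unicode letters, so accented words
    # such as "não" stay whole. Hyphenated forms are split at the hyphen.
    words = re.findall(r"[^\W\d_]+", text.lower())
    return Counter(words)


# Made-up sample text standing in for a whole book.
sample = "O que é que o gato viu? O gato não viu nada."
hist = word_histogram(sample)

# Print the most common words as "count word", like the list above.
for word, count in hist.most_common(3):
    print(count, word)
```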

So now we want to cross-reference this list with our short list of words we want to learn. My script (just grep) cross-references the two lists, bringing the most popular words to the top of the list. The first 8 words in the list can be seen below:

3515 porém
2340 fosse
2170 tanto
2072 são
1508 dizia
1431 algumas
1394 força
1387 ar
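The cross-reference can also be sketched in Python. Again, this is not what my script does (it uses grep); `rank_by_frequency` is a hypothetical name, the three counts are copied from the list above, and `palavra_ausente` is a made-up placeholder for a word that is absent from the corpus.

```python
def rank_by_frequency(study_words, histogram):
    """Return (ranked, missing) for a list of words to learn.

    ranked:  (count, word) pairs, most frequent in the corpus first.
    missing: words that never appear in the corpus.
    """
    ranked = []
    missing = []
    for word in study_words:
        count = histogram.get(word.lower(), 0)
        if count > 0:
            ranked.append((count, word))
        else:
            missing.append(word)
    # Sort by count only; the sort is stable, so ties keep list order.
    ranked.sort(key=lambda pair: pair[0], reverse=True)
    return ranked, missing


# Counts taken from the output above; the last study word is a placeholder.
histogram = {"porém": 3515, "fosse": 2340, "tanto": 2170}
study_words = ["tanto", "palavra_ausente", "porém", "fosse"]

ranked, missing = rank_by_frequency(study_words, histogram)
for count, word in ranked:
    print(count, word)
print("not in corpus:", missing)
```

Keeping the missing words in their own list matters, because the words the corpus has never seen are a separate question from the ones it has seen rarely.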

Does focusing on the most common words in a corpus improve language learning? Maybe. Does it make sense to focus on the bottom of the list? Actually the bottom of the list is more interesting than the top of the list. There are plenty of reasons for a word to not show up in our corpus. Let's take a look at words not found in the corpus:


Interesting, right? What is more interesting to me is how few of these words can be found in my English-Portuguese dictionary. That's right, we're getting into the nitty-gritty of language learning. Compiling resources that make it possible to remember words is the smart way to build a structure by which you can learn. But that is not sufficient in itself. You need motivation, interaction, and a native speaker. Whether that native speaker is speaking to you in lessons, on YouTube, or in person, native speakers are critical in language learning. If you're looking for a new corpus of native speakers, I recommend Mozilla Common Voice.

So portuguese_words is a project I wrote on a Sunday afternoon, based on code I wrote 10 years ago. It uses a large Portuguese corpus to sort a list of words I want to learn by popularity.

Javantea out.

