-----BEGIN PGP SIGNED MESSAGE----- Hash: SHA512
Mar 21, 2017
This post is a continuation of this blog post, which is a pretty good mix of the technical and creative projects that this blog was originally written for. If you look at some of the way-way-back posts, you'll see some of the things I've thought about and tried to work on from this perspective.
Aside: Learning about limitations
The past year has been a pretty amazing bit of learning for me. I worked on and finished Coursera's Machine Learning course last June. It taught me a lot about the ways that machine learning can improve software. That reminded me that a lot of the things I thought were limited by the code I could write are really limited by the code I allow myself to write.
I limit myself in many ways when I write code. When I make a website, I decide whether I can accept using the jQuery library. Most web developers would scoff, since basically all their websites use jQuery or something similar. jQuery may be a free tool with no direct cost, but adding it to a website still costs some bandwidth and CPU time.
When I write a piece of Python software, I consider: How long does it take to load? How stable is the software? How likely is someone to require a sandbox for my code because it imports this library? Small Wide World uses scipy and numpy, two important libraries. The benefit of using them greatly outweighs their downside, and they are standard tools. I use NLTK in AI3, which has caused a lot of problems, but with those costs come many benefits.
AltSci Kanji Classifier is a Japanese kanji handwriting classifier: you draw a kanji and it digitizes it into a character that you can copy and paste into a text editor. It is a brand new project (started Feb 2017) that I am releasing today as a beta. Why do you need a handwriting kanji classifier? If you're reading a book in Japanese and don't know the language, there's no easy way to find out what the words mean because you can't just type them into a dictionary. That's why a handwriting classifier is useful. Other methods of classifying kanji exist and I use them, but handwriting is actually a pretty fast way to go about it.
When I started writing AltSci Kanji Classifier, I thought about the likely problems that would arise: it's too complex, kanji are too similar, there isn't a huge kanji handwriting dataset, and I'm not dedicated to the task. But I just started with the first three kanji I knew I could classify (一 二 三) and continued to improve the software until it could classify more and more. By the time it was able to do 20 properly, it could do all 164 of the kanji in the original list I wrote last year. This is the right way to go about writing software: can I do a tenth of a percent of what I want? Can I do one percent? Can I do five percent? Can I do fifty percent? Once I got to 164 kanji I knew I was ready to release the software. There is a huge bug in that it can't classify a kanji whose strokes have been drawn in the incorrect order, but that is on the to-do list. Note that kanji are taught to be drawn in the correct order, so this only affects people who don't know the correct order to draw kanji in (which I know is a major audience for this software). To make sure the classifier uses the correct stroke order, I use the excellent Kanji stroke order font. Hopefully I have drawn them correctly.
There are two classifiers in the Kanji Classifier: a binary classifier (is it correct or is it not?) and a probabilistic classifier (how incorrect is it?). The binary classifier takes each stroke and compares its shape and position. You might think that would be slow, but since a drawing is only compared against kanji that have the same number of strokes (or one fewer), this is actually quite fast. As more kanji are added it will get slower, but not by a lot. The probabilistic classifier does the same shape and position computation but lowers the probability of a successful classification the more differences there are. A 90% result means that 9 out of 10 properties were the same and 1 property was not. Only the highest ranked kanji are displayed to the user, and it's up to them to decide which one is the correct kanji.
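The two classifiers described above can be sketched roughly as follows. This is only an illustration and not the project's actual code: the `Stroke` fields, the shape and position labels, the tiny `TEMPLATES` table and all function names are assumptions made up for the example. It also assumes that "one fewer" refers to templates with one stroke fewer than the drawing, and it counts the properties of any unpaired stroke as mismatches.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stroke:
    shape: str     # hypothetical label, e.g. "horizontal"
    position: str  # hypothetical coarse position, e.g. "top"


# Hypothetical reference data: strokes listed in the taught stroke order.
TEMPLATES = {
    "一": [Stroke("horizontal", "middle")],
    "二": [Stroke("horizontal", "top"), Stroke("horizontal", "bottom")],
    "三": [Stroke("horizontal", "top"), Stroke("horizontal", "middle"),
           Stroke("horizontal", "bottom")],
}


def candidates(drawn, templates):
    """Keep only templates with the same stroke count as the drawing, or one fewer."""
    n = len(drawn)
    return {k: t for k, t in templates.items() if len(t) in (n, n - 1)}


def count_properties(drawn, template):
    """Return (matching, total) over the shape and position of each stroke."""
    matching = 0
    for d, t in zip(drawn, template):
        matching += (d.shape == t.shape) + (d.position == t.position)
    # Two properties per stroke; strokes without a partner count as mismatches.
    total = 2 * max(len(drawn), len(template))
    return matching, total


def binary_match(drawn, template):
    """Binary classifier: every property has to agree."""
    matching, total = count_properties(drawn, template)
    return total > 0 and matching == total


def score(drawn, template):
    """Probabilistic classifier: fraction of properties that agree."""
    matching, total = count_properties(drawn, template)
    return matching / total if total else 0.0


def classify(drawn, templates, top=3):
    """Rank the candidate kanji by score, best first."""
    ranked = [(score(drawn, t), k) for k, t in candidates(drawn, templates).items()]
    ranked.sort(key=lambda pair: pair[0], reverse=True)
    return ranked[:top]
```

Calling `classify` with two horizontal strokes (one at the top, one at the bottom) only looks at 一 and 二, because 三 has too many strokes, and ranks 二 first. A result of 0.9 from `score` would correspond to the "9 out of 10 properties" case above.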
If you're having trouble classifying a kanji you're drawing, click Save JSON, ctrl-click the link that is added to the debug log, and send the file to me by e-mail.
One of the obvious problems was taking a large number of inputs from a mouse, trackpad, or tablet and turning them into data that can be used. While this is mostly a quality issue, it also affects the speed of recognition. The method I chose to solve this problem was checking how far a point deviates from the two points around it. I used a point-line distance calculation: I measure the distance from the previous position to the line between the current position and the position before the previous one. While this has plenty of drawbacks, it produces good results so far and can be improved with a bit of effort.
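A minimal sketch of that point filtering, under some assumptions: points are `(x, y)` tuples, a previous point that lies close enough to the line is simply dropped, and the names `point_line_distance`, `simplify_stroke` and the `tolerance` value are invented for this example rather than taken from the project.

```python
import math


def point_line_distance(p, a, b):
    """Distance from point p to the infinite line through a and b."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    length = math.hypot(dx, dy)
    if length == 0:
        # a and b are the same point, so fall back to the plain distance.
        return math.hypot(px - ax, py - ay)
    return abs(dx * (py - ay) - dy * (px - ax)) / length


def simplify_stroke(points, tolerance=2.0):
    """Drop points that barely deviate from the line between their neighbors."""
    kept = []
    for cur in points:
        # kept[-1] is the previous position, kept[-2] the one before it.
        if len(kept) >= 2 and point_line_distance(kept[-1], kept[-2], cur) < tolerance:
            kept[-1] = cur  # the previous point adds almost nothing; replace it
        else:
            kept.append(cur)
    return kept
```

For example, `simplify_stroke([(0, 0), (1, 0), (2, 0.1), (3, 0), (3, 5), (3, 10)], tolerance=0.5)` keeps only the start, the corner and the end of an L-shaped stroke. One drawback of this kind of filter is that a gentle curve can be flattened, because each point is only checked against its immediate neighbors.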
The Google Translate App on Android has an amazing handwriting classifier that works really well on a phone.
Zinnia is an open source handwriting classifier that works well for many people. Interfaces to Zinnia include Mozc, which is an open source version of Google's Android Japanese input method, and ibus, which is a popular input method for Linux.
How difficult a problem is handwriting recognition? Not very difficult. It took me a few days to write this project. With a few improvements, it should be quite good. There are plenty of cases where it doesn't work well now, but just a year ago a lot of the stuff I use worked poorly. If you'd like to contribute to this project, it is open source and free. Contact me about submitting patches.
-----BEGIN PGP SIGNATURE----- iQIzBAEBCgAdFiEEuYE3Yh0wygXiwc1/PGjI28ung+8FAljR4PEACgkQPGjI28un g+8snQ/+I3YiFF/2jOPwPlui2oPDrfk7HMecXO9nmY754x8hnlmDjGNi5XpyKSPO QUZMPVjnXlmnxjWMmbLAEaoACMbSt/hbuSR0ihyIJB8VDVMmlGbcFRk5YhmyeFmi bY2cmBlUTyUreX2QD2C4H9FVwHGOU9pTkBaG11Xg3zY9ZqlQL8tDTxs0DT7fvM0t 5tWURL0KEqkmIfgQKvDPtH0F6+rfPJBAV8+XeJ5TEnXbd+SEDILFnGPVgtwgFHNZ u+ziT1BBTLEzBTIov1bRsR+RyIcEljhPpbwqMT6uIe0+g0u8pB0tifi92MoTLkZL iocGisbXGnwlEY/NrM87h2/p+vGG/aQV2Gji4acZ7iT40m6g3VdzIyOUC9BCdGa4 dupbqUz1f77bTRARC0hFQ/V7n1jkQU1CDIEppXhg/gHDRCOFTFYQ6588pHotcLI1 5w6gePESr4igAoNxCgXBytV+0gsozClFpUYulLIUYId5ojnkDoGm+gS2whFkB7vs adjTfACMiykJUQDhCaSCQu8soE28bGH/NVlIUuTpq/ati5QDloLqJFu6T9Mpa95R idMaDnTSx1sINeV+TyUg+Qj2WTTnRzfzs5zzsFMu5lWn9r9QDgSi0d0HmJ5Cka0p vLAdviHVcNi/J4Jvd3qlMZNwsCHSro87oIG1EcWlRbM8c1T7+Wo= =6xJv -----END PGP SIGNATURE-----