AltSci Kanji Classifier

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA512

by Javantea
Mar 21, 2017

learn_kanji-0.4.tar.xz [sig]

This blog is a continuation of this blog post, which is a pretty good mix of the technical and creative projects that this blog was originally written for. If you look at some of the way-way-back posts, you'll see some of the things I've thought about and tried to work on from this perspective.

Aside: Learning about limitations

The past year has been a pretty amazing bit of learning for me. I worked on and finished Coursera's Machine Learning course last June. It taught me a lot about the ways that machine learning can improve software, which reminded me that a lot of the things I thought were limited by the code I could write are actually limited by the code I allow myself to write. I limit myself in many ways when I write code. When I make a website, I decide whether I can accept using the jQuery library. Most web developers would scoff, since basically all their websites use jQuery or something similar. Even if we think of jQuery as a free tool with no direct cost, adding it to a website still costs some bandwidth and CPU. When I write a piece of Python software, I consider: How long does it take to load? How stable is the software? How likely is someone to require a sandbox for my code because it imports this library? Small Wide World uses scipy and numpy, two important libraries. The benefit of using them greatly outweighs their downside, and they are standard tools. I use NLTK in AI3, which has caused a lot of problems, but with those costs come many benefits.

The limit on the code I can write is significantly higher when I add incredibly powerful libraries than when I write code from scratch. This explains why a lot of my projects have been simple and fit a very small niche: the ones that survive are limited to the libraries I am willing to use. The ones that don't survive are much larger in scope, and the lack of a suitable library, or my unwillingness to use one, makes them impossible to finish. As time goes on, some of my projects will incorporate powerful libraries to solve the problem of big programs. Sakuracon Unofficial Wiki uses a major modification of Realms Wiki, which depends on a huge number of libraries, many of which are very unstable. How did I go about dealing with this monster of a project? I vetted every line of Python (server-side integrity) and fixed the CSRF vulnerability in Realms Wiki (which remained unfixed the last time I checked). Then I made a list of all the JavaScript libraries included and checked whether I could remove them. There's no way I could vet all the JavaScript libraries, but I could make sure that I got them from a reputable source and that they were doing what they were supposed to. While this isn't good enough for a company, it will have to do until I get time to fix this problem.

Intro

AltSci Kanji Classifier is a Japanese kanji handwriting classifier: you draw a kanji and it turns the drawing into a character that you can copy and paste into a text editor. It is a brand new project (started Feb 2017) that I am releasing today in beta. Why do you need a handwriting kanji classifier? If you're reading a book in Japanese and don't know the language, there's no easy way to find out what the words mean because you can't just type them into a dictionary. That's why a handwriting classifier is useful. Other methods of classifying kanji exist and I use them, but handwriting is actually a pretty fast way to go about it.

When I started writing AltSci Kanji Classifier, I thought about the likely problems that would arise: too complex, kanji are too similar, there isn't a huge kanji handwriting dataset, and I'm not dedicated to the task. But I just started with the first three kanji I knew I could classify -- 一 二 三 -- and continued to improve the software until it could classify more and more. By the time it was able to do 20 properly, it could do all 164 of the kanji in the original list I wrote last year. This is the right way to go about writing software: can I do one percent of what I want? Can I do a tenth of a percent? Can I do five percent? Can I do fifty percent? Once I got to 164 kanji I knew I was ready to release the software. There is a huge bug in that it can't classify a kanji that has been drawn in the incorrect stroke order, but that is on the to-do list. Note that kanji are taught to be drawn in the correct order, so this only affects people who don't know the correct order to draw kanji in (which I know is a major audience for this software). To create classifiers with the correct stroke order, I use the excellent Kanji stroke order font. Hopefully I have drawn them correctly.

Method

There are two classifiers in the Kanji Classifier: a binary classifier (is it correct or is it not?) and a probabilistic classifier (how incorrect is it?). The binary classifier takes each stroke and compares it. The comparisons it makes are shape and position. You might think that would be slow, but since you only compare kanji that have the same number of strokes (or one fewer), this is actually quite fast. As more kanji are added it will get slower, but not by a lot. The probabilistic classifier does the same shape and position computation but reduces the probability of successful classification the more differences there are. A 90% result means that 9 out of 10 properties were the same and 1 property was not. Only the highest-ranked kanji are displayed to the user, and it's up to them to decide which is the correct kanji.
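To make the scoring idea concrete, here is a minimal sketch in JavaScript. It is not the code from learn_kanji: the class format, the `shape` and `pos` labels, and the function names are all made up for illustration, and it only handles candidates with exactly the same stroke count (the "one fewer" case is left out). Each stroke contributes two properties, and the score is the fraction of properties that match.

```javascript
// Hypothetical class data: each stroke is reduced to a shape label and a
// coarse position label. The real kclasses.js format may look different.
const exampleClasses = [
  { kanji: '一', strokes: [{ shape: 'h', pos: 'middle' }] },
  { kanji: '二', strokes: [{ shape: 'h', pos: 'top' }, { shape: 'h', pos: 'bottom' }] },
  { kanji: '十', strokes: [{ shape: 'h', pos: 'middle' }, { shape: 'v', pos: 'middle' }] },
];

// Fraction of properties (shape and position of every stroke) that match.
// Strokes are compared in drawing order, so a wrong stroke order lowers the score.
function scoreKanji(drawnStrokes, candidate) {
  let same = 0;
  let total = 0;
  candidate.strokes.forEach((stroke, i) => {
    total += 2;
    if (stroke.shape === drawnStrokes[i].shape) same += 1;
    if (stroke.pos === drawnStrokes[i].pos) same += 1;
  });
  return same / total;
}

// Probabilistic classifier: rank the candidates with the same stroke count.
function classifyKanji(drawnStrokes, classList, limit = 3) {
  return classList
    .filter((c) => c.strokes.length === drawnStrokes.length)
    .map((c) => ({ kanji: c.kanji, probability: scoreKanji(drawnStrokes, c) }))
    .sort((a, b) => b.probability - a.probability)
    .slice(0, limit);
}

// Binary classifier: only candidates where every property matches.
function exactMatches(drawnStrokes, classList) {
  return classifyKanji(drawnStrokes, classList, classList.length)
    .filter((r) => r.probability === 1)
    .map((r) => r.kanji);
}

const exampleDrawing = [{ shape: 'h', pos: 'top' }, { shape: 'h', pos: 'bottom' }];
console.log(classifyKanji(exampleDrawing, exampleClasses));
```

With this made-up data, the two-stroke drawing is only compared against 二 and 十, which is why filtering on stroke count keeps the comparison cheap even as the class list grows.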

Creating classifiers is surprisingly easy. The user draws the kanji they want to add, clicks Save Kanji, then types the most common romaji for the kanji and the kanji itself. When the user clicks save, a JSON object is made available for future use and a JSON object with a classifier is saved to memory. When the user is finished with a set of kanji, they click output and the webpage prints out the JavaScript that should then be added to kclasses.js so that it is included in the webpage. Adding the classes to the list at the bottom of the function makes them available to be classified. To test the classifier, the user reloads the page, draws the kanji, and clicks Classify. If the kanji is classified correctly, it will be shown to the right of the canvas. Debug information is printed below so that the user can learn about how the software is working. While most users won't need to read the debug information, it's useful enough during this beta phase.

If you're having trouble classifying a kanji you're drawing, click Save JSON and ctrl-click the link added to the debug log and send it to me by e-mail.

One of the obvious problems was the issue of taking a large number of inputs from a mouse, trackpad, or tablet and turning them into data that can be used. While this is mostly a quality issue, it also affects speed of recognition. The method I chose to solve this problem was checking how far a point deviated from the two points before. I used a point-line distance calculation: I measure the distance from the previous position to the line between the current position and the position before the previous. While this has plenty of drawbacks, it produces good results so far and can be improved with a bit of effort.
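A rough sketch of that kind of point reduction might look like the following. This is an illustration and not the code shipped in learn_kanji: the function names, the `{x, y}` point format, the tolerance value, and the decision to drop the previous point when it lies close to the line are all assumptions.

```javascript
// Perpendicular distance from point p to the infinite line through a and b.
function pointLineDistance(p, a, b) {
  const dx = b.x - a.x;
  const dy = b.y - a.y;
  const len = Math.hypot(dx, dy);
  if (len === 0) {
    // a and b coincide, so fall back to plain point-to-point distance.
    return Math.hypot(p.x - a.x, p.y - a.y);
  }
  return Math.abs(dx * (a.y - p.y) - dy * (a.x - p.x)) / len;
}

// Reduce a stream of input points to the ones that change the stroke's shape.
// The tolerance (in pixels) is an assumed value, not one taken from the project.
function simplifyStroke(points, tolerance = 2) {
  const kept = [];
  for (const current of points) {
    if (kept.length >= 2) {
      const previous = kept[kept.length - 1];
      const beforePrevious = kept[kept.length - 2];
      // If the previous point sits nearly on the line from the point before
      // it to the current point, it adds no information, so drop it.
      if (pointLineDistance(previous, beforePrevious, current) < tolerance) {
        kept.pop();
      }
    }
    kept.push(current);
  }
  return kept;
}

// An L-shaped stroke sampled at five points keeps only its start, corner, and end.
const sampledStroke = [
  { x: 0, y: 0 }, { x: 5, y: 0 }, { x: 10, y: 0 }, { x: 10, y: 5 }, { x: 10, y: 10 },
];
console.log(simplifyStroke(sampledStroke));
```

One drawback of checking only three points at a time is that a long, gentle curve can be flattened bit by bit, since each individual step stays under the tolerance.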

Data

電:
[Image: Den 電 Kanji]

四:
[Image: Yon 四 Kanji]

Competitors

The Google Translate App on Android has an amazing handwriting classifier that works really well on a phone.

Zinnia is an open source handwriting classifier that works well for many people. Interfaces to Zinnia include Mozc, which is an open source version of Google's Android Japanese input method, and ibus, which is a popular input method for Linux.

Both of these are excellent competitors, but they are written in native code while my code is JavaScript. While their quality is currently much better, I have high expectations that improvements to my classifier will make it an excellent option for users.

Conclusion

How difficult a problem is handwriting recognition? Not very difficult. It took me a few days to write this project. With a few improvements, it should be quite good. There are plenty of cases where it doesn't work well now, but just a year ago a lot of the stuff I use worked poorly. If you'd like to contribute to this project, it is open source and free. Contact me about submitting patches.

Javantea out.

-----BEGIN PGP SIGNATURE-----

iQIzBAEBCgAdFiEEuYE3Yh0wygXiwc1/PGjI28ung+8FAljR4PEACgkQPGjI28un
g+8snQ/+I3YiFF/2jOPwPlui2oPDrfk7HMecXO9nmY754x8hnlmDjGNi5XpyKSPO
QUZMPVjnXlmnxjWMmbLAEaoACMbSt/hbuSR0ihyIJB8VDVMmlGbcFRk5YhmyeFmi
bY2cmBlUTyUreX2QD2C4H9FVwHGOU9pTkBaG11Xg3zY9ZqlQL8tDTxs0DT7fvM0t
5tWURL0KEqkmIfgQKvDPtH0F6+rfPJBAV8+XeJ5TEnXbd+SEDILFnGPVgtwgFHNZ
u+ziT1BBTLEzBTIov1bRsR+RyIcEljhPpbwqMT6uIe0+g0u8pB0tifi92MoTLkZL
iocGisbXGnwlEY/NrM87h2/p+vGG/aQV2Gji4acZ7iT40m6g3VdzIyOUC9BCdGa4
dupbqUz1f77bTRARC0hFQ/V7n1jkQU1CDIEppXhg/gHDRCOFTFYQ6588pHotcLI1
5w6gePESr4igAoNxCgXBytV+0gsozClFpUYulLIUYId5ojnkDoGm+gS2whFkB7vs
adjTfACMiykJUQDhCaSCQu8soE28bGH/NVlIUuTpq/ati5QDloLqJFu6T9Mpa95R
idMaDnTSx1sINeV+TyUg+Qj2WTTnRzfzs5zzsFMu5lWn9r9QDgSi0d0HmJ5Cka0p
vLAdviHVcNi/J4Jvd3qlMZNwsCHSro87oIG1EcWlRbM8c1T7+Wo=
=6xJv
-----END PGP SIGNATURE-----
