Discuss: Accent Folding for Auto-Complete
by Carlos Bueno
- Editorial Comments
2 Sort results
I think it is important for anyone implementing this to remember to sort by relevance. What I mean is that if you search for “Jø”, you should list all results with the exact Unicode match FIRST. Anything else (results matching “jo”, for example) should come after that.
Otherwise it is a very interesting approach to a common problem.
posted at 10:21 am on February 23, 2010 by hoffmann
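The ordering hoffmann describes could be sketched roughly as follows, in Python for brevity. The names ACCENT_MAP, fold, and ranked_matches are hypothetical, and the tiny map only stands in for a much larger one like the article's accentMap; treat it as an illustration under those assumptions, not the article's code.

```python
# Hypothetical, deliberately tiny accent map; a real one would cover far more characters.
ACCENT_MAP = str.maketrans(
    {"ø": "o", "ó": "o", "ö": "o", "å": "a", "ä": "a", "é": "e"}
)


def fold(text):
    """Lowercase the text and replace mapped accented letters with base letters."""
    return text.lower().translate(ACCENT_MAP)


def ranked_matches(query, names):
    """Return prefix matches for query, exact matches first, folded matches after."""
    exact_key = query.lower()
    folded_key = fold(query)
    exact, folded = [], []
    for name in names:
        if name.lower().startswith(exact_key):
            # The name matches the query as typed, accents included.
            exact.append(name)
        elif fold(name).startswith(folded_key):
            # The name matches only after accents are folded away.
            folded.append(name)
    return exact + folded


names = ["John", "Jørgen", "Joanna", "Jón", "Maria"]
print(ranked_matches("Jø", names))  # ['Jørgen', 'John', 'Joanna', 'Jón']
```

Within each group the original list order is preserved, so any existing relevance ordering survives the split.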
3 Typo in first code example
accent_map should be accentMap
posted at 11:54 am on February 23, 2010 by fiveminuteargument
4 stringprep solved this problem already
The commonly accepted solution to “accent folding” in Unicode is called stringprep and is documented in RFC 3454. It handles accents, uppercase/lowercase, and many other nasty details of various character sets.
Various open standards make use of stringprep: SASL, XMPP, IDN, …
I don’t know of any JavaScript stringprep implementations but creating one shouldn’t prove too difficult. Examples exist in many other languages.
posted at 12:19 pm on February 23, 2010 by cbas
5 Semantics
Isn’t what you call accent folding (first time I hear that expression) what is referred to as Unicode normalization?
posted at 12:45 pm on February 23, 2010 by Ned Baldessin
6
cbas, Ned: This technique has the same goal as Unicode normalization, but is not anywhere near as correct. :D It has the virtue of being fairly easy to understand and implement.
You are right though — I should have talked about normalization a bit and pointed to some of the libraries like Python’s unicodedata:
http://docs.python.org/library/unicodedata.html
posted at 01:50 pm on February 23, 2010 by Carlos Bueno
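For readers curious what the unicodedata route might look like, here is a minimal sketch. The function name fold_accents is hypothetical, and the approach (decompose, then drop combining marks) is one common approximation of accent folding rather than anything the article prescribes.

```python
import unicodedata


def fold_accents(text):
    """Decompose characters (NFKD), then drop the combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(fold_accents("Åsa"))     # Asa
print(fold_accents("résumé"))  # resume
print(fold_accents("Jø"))      # Jø (ø has no decomposition, so it is left alone)
```

Letters such as ø that Unicode does not decompose pass through unchanged, so a hand-written map is still needed for those.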
7 Keep in mind
I think it is worth pointing out that “accented characters” are not, in fact, only guides to pronunciation (or ornamentations for some other reason). Some of them are actually characters in their own right in one language or another.
Therefore any technique like this is really only applicable to English and possibly a few other languages unless the code gets a lot more complicated.
I can only speak with any degree of knowledge about my own native language, Swedish, where our accented åäö are not a's and o's in the same way as an é is an e. Our alphabet does not end with z. It goes on to include these three extra characters.
We have special keys on our keyboards for them and generally have no desire to have a and ä treated the same. We do not, however, have an é key and would probably be thankful if that character WAS treated as equal to an e. Multiply this by the number of countries typing on Latin keyboards and… See the complexity looming here? Each language and/or country is likely to need its own set of rules as to which characters to “fold” and which not.
Any application expecting multiple languages should take these things seriously… even if that turns out to mostly be Facebook, Twitter and Google. :)
posted at 02:07 pm on February 23, 2010 by MartinWestin
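One way to picture the per-language rules Martin describes is a list of protected letters for each locale. The sketch below is in Python; the PROTECTED table, the locale codes, and fold_for_locale are all hypothetical, and the Swedish entry reflects only the å, ä, ö example from this comment.

```python
import unicodedata

# Hypothetical per-locale sets of letters that must NOT be folded.
PROTECTED = {
    "sv": set("åäöÅÄÖ"),  # letters in their own right in Swedish
    "en": set(),          # assumption: English folds everything
}


def fold_for_locale(text, locale):
    """Fold accents except on letters the given locale treats as distinct."""
    keep = PROTECTED.get(locale, set())
    out = []
    # NFC first, so each accented letter is a single code point we can look up.
    for ch in unicodedata.normalize("NFC", text):
        if ch in keep:
            out.append(ch)
        else:
            decomposed = unicodedata.normalize("NFKD", ch)
            out.append(
                "".join(c for c in decomposed if not unicodedata.combining(c))
            )
    return "".join(out)


print(fold_for_locale("Åsa Lindén", "sv"))  # Åsa Linden
print(fold_for_locale("Åsa Lindén", "en"))  # Asa Linden
```

Both the indexed names and the query would need to be folded with the same locale for matching to stay consistent, which is part of the complexity the comment points to.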
8 A functioning version of the example
If you’d like to explore a functioning version of the simple example in Carlos’s article, I’ve put one up here.
posted at 02:36 pm on February 23, 2010 by miraglia
9
TeebZ: That’s a tricky one. A good compromise would be to place exact matches above accent-folded / normalized ones.
Martin: You are of course correct. A real system should probably take into account the user’s locale, but paying attention to this complicates the implementation enormously.
posted at 02:37 pm on February 23, 2010 by Carlos Bueno
10 Unicode normalization in PHP
There’s a lot more on Unicode normalization here: http://unicode.org/reports/tr15/
And it’s also worth mentioning that PHP has a Normalizer class as part of the intl extension (built into 5.3, or available as a PECL extension for >=5.2.4: http://pecl.php.net/package/intl)
http://www.php.net/manual/en/class.normalizer.php (the NFKC form is probably most relevant to accent folding)
posted at 04:03 pm on February 23, 2010 by jwheare
1
It seems like Facebook is actually implementing this in their search function, although I hadn’t noticed it before.
If I try to find my friend named “Åsa”, I have to write her full (albeit short) first name or else I won’t find her; the list populates with all names that begin with “A”. This is not a good way to solve the “problem” (exactly what is the problem anyway?).
Where would you need to implement a solution such as this? Won’t the application just become less international/multi-lingual?
Interesting concept nonetheless.
posted at 10:07 am on February 23, 2010 by TeebZ