How to deal with strings in different locales?

There are many languages around the world. Every single one is a little bit different and has its own rules. For example, if we look at these two words, “chladny” and “hladny”, most of us will say that they are in the right order. But not our Slovak colleagues. For them, “ch” counts as a separate letter that follows “h”, so every word that starts with “ch” comes AFTER words that start with “h” in alphabetical order. Another example: we want to call an UPPER CASE function on the word “Rexamine”. The result we expect is “REXAMINE”, but in Turkish the capital letter “i” keeps its dot (“İ”), so the correct result is “REXAMİNE”. Currently R does not have any proper tools to deal with this that work the same way on every platform. But this is about to change. Our package stringi is almost ready and should be available on CRAN soon. Stay tuned for more news!

And here is some code for the first example:
> stri_order(c("chladny", "hladny"), FALSE, stri_opts_collator(locale="en_EN"))
[1] 1 2
> stri_order(c("chladny", "hladny"), FALSE, stri_opts_collator(locale="pl_PL"))
[1] 1 2
> stri_order(c("chladny", "hladny"), FALSE, stri_opts_collator(locale="sk_SK"))
[1] 2 1
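
The second example, upper-casing “Rexamine”, can be sketched in a similar way. This is only an illustration: it assumes that stringi is installed and that it exposes a `stri_trans_toupper()` function with a `locale` argument, as the released versions do. The variable names and the locale identifiers `en_US` and `tr_TR` are chosen here for illustration.

```r
library(stringi)

word <- "Rexamine"

# English rules: "i" is upper-cased to a plain, dotless "I"
upper_en <- stri_trans_toupper(word, locale = "en_US")

# Turkish rules: "i" is upper-cased to a dotted capital "İ" (U+0130)
upper_tr <- stri_trans_toupper(word, locale = "tr_TR")

print(upper_en)
print(upper_tr)
```

Only the letter “i” should differ between the two results; the other letters are upper-cased identically under both locales.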
