Exotic regular expression unicode scripts

2015/03/09

After spending some time digging into regular expressions while working on Polyglot, I learned about Unicode scripts: regular expression tokens such as \p{Hiragana} that match any character belonging to a specific writing system. They are currently supported in the JGsoft engine, Perl, PCRE, PHP, Ruby 1.9, Delphi, and XRegExp.

For example, a regular expression that matches any single character commonly used in Japanese text (kana, kanji, and Latin letters):

/[\p{Hiragana}\p{Katakana}\p{Han}\p{Latin}]/

See the full list of Unicode scripts on Regular-Expressions.info.

Here is a piece of Ruby code I came up with to separate words for every language (hopefully), including scripts written in scriptio continua (that is, without spaces between words):

str.scan(/[\p{Arabic}\p{Armenian}\p{Bengali}\p{Bopomofo}\p{Buhid}\p{Canadian_Aboriginal}\p{Devanagari}\p{Ethiopic}\p{Han}\p{Hangul}\p{Hanunoo}\p{Hiragana}\p{Katakana}\p{Khmer}\p{Lao}\p{Runic}\p{Tagbanwa}\p{Thai}\p{Tibetan}\p{Yi}]|\b[^\d ,.\/<>?;'\\:"\|\[\]\{\}§!@#$%^&*()_+-=\s][\p{Common}\p{Braille}\p{Cherokee}\p{Cyrillic}\p{Georgian}\p{Greek}\p{Gujarati}\p{Gurmukhi}\p{Hebrew}\p{Inherited}\p{Kannada}\p{Latin}\p{Limbu}\p{Malayalam}\p{Mongolian}\p{Myanmar}\p{Ogham}\p{Oriya}\p{Sinhala}\p{Syriac}\p{Tagalog}\p{TaiLe}\p{Tamil}\p{Telugu}\p{Thaana}]+?\b/i)
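The idea behind that one-liner is easier to see in a cut-down form: scripts in the first group produce one token per character, and everything else is grouped into runs. The sketch below is a simplified variant, not the exact expression above. It uses only a handful of scripts, swaps the \b-bounded lazy match for plain greedy runs of letters, and the names PER_CHAR, PER_WORD, TOKEN, and split_words are made up for illustration.

```ruby
# Scripts treated character by character (an assumed subset of the first group).
PER_CHAR = /[\p{Han}\p{Hiragana}\p{Katakana}\p{Thai}]/

# Space-delimited scripts, matched as whole runs of letters
# (an assumed subset of the second group).
PER_WORD = /[\p{Latin}\p{Cyrillic}\p{Greek}]+/

# Try the per-character class first, then fall back to letter runs.
TOKEN = Regexp.union(PER_CHAR, PER_WORD)

# Hypothetical helper: split +str+ into word-like tokens.
def split_words(str)
  str.scan(TOKEN)
end

p split_words("Hello мир 你好")
# => ["Hello", "мир", "你", "好"]
```

Because this variant only matches letters from the listed scripts, digits and punctuation fall out without the long negated character class. Any script that is not listed is silently skipped, so the lists would need extending for real use.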

Ogham, an Early Medieval alphabet used primarily to write the early Irish language, resonates with me the most for its simplicity, strangely exotic appeal, and binary appearance.

“Polyglot” in Ogham characters:

[Image: Polyglot in Ogham characters]

The Ogham Transliterator is pretty awesome.