Thijs/wiki.techinc.nl

Author	SHA1	Message	Date
tjones	669d1ed192	(y)etsin fixes, test refactoring, and misc fixes * Fix etsin/етсин/этсин as noted in If933fc67845ac994d9ddfdf8349aff445ec9b13a ** only convert tsin to тсин and let the other rules sort out the e * Refactor most tests to be word-specific, which uncovered a couple of bugs in corner cases ** rol/üst prefix matches should match whole words (original [^ü] regex assumed word could not be end of string) * Fixed incidental bugs I noticed while looking into the items above ** куркчи => kürkçi was in the wrong section ** cönk => джонк was in the right section, but reversed * Added additional test cases for all of the above. Change-Id: Ia96be488a7b41c3ddba623b5c9262703b1c82687	2018-05-29 14:30:04 -04:00
tjones	cbb07cdc33	Crimean Tatar/crh transliteration odds and ends * refactor '\b' into WB const to make it easy to update in the future * add new ц-related exceptions Bug: T193764 Change-Id: Ib707136f8f2598d1f8ec995bf129b436dfb53cd9	2018-05-22 14:59:55 -04:00
C. Scott Ananian	685eba4360	Minor fixes to CRH language conversion. * Move a many-to-one mapping from the L2C to the C2L table where it belongs. * Fix some regular expression patterns which ended up with misnumbered replacement strings. * All regular expressions should have the `u` (unicode) flag set. * Typo/spelling fixes in comments Change-Id: If933fc67845ac994d9ddfdf8349aff445ec9b13a	2018-05-12 14:37:09 -04:00
tjones	14f8dc35db	CRH Transliteration Pattern Matching Fixes Refactor to match exceptions as patterns, not words - break exception list to C2L and L2C pattern sets - change main loop to break only on Roman numerals and transliterate everything else, rather than tokenizing on single-script words (this fixes the km² problem, too) - update word anchors from ^ and $ to \b - only process Roman numerals for L2C translit - add exception for single "Roman" character followed by a period which looks like an initial - consolidate multi-step transliteration into regsConverter() - remove regex support from main exception list to support strtr() - re-organize some prefix/suffix/whole word patterns to the right place - add tests for recently fixed use cases - add support for many-to-one mappings in both directions - update character classes, exception lists, and regexes based on speaker feedback and example texts Misc other fixes: - fix some character class errors - remove unneeded character classes - add tests for Roman numerals and quotes - add tests for affixes and regexes Bug: T188321 Bug: T189512 Change-Id: I056d36ff2b8f63b3998a5d3a442d8d539c15488d	2018-04-27 19:17:51 -04:00
Thiemo Mättig	409da2d8b3	Remove leading backslashes from "use \…" tags Change-Id: I494b029de089a07e3b946ee78293a12d5036f63e	2017-12-28 16:30:05 +01:00
tjones	a0b511319c	Crimean Tatar Transliteration This is a first pass at Latin/Cyrillic transliteration for Crimean Tatar (crh). Includes transliteration tables, prefix/suffix mappings, regex mappings, and exception lists for words and abbreviations. Regularize CRH language name in messages/* files. Fix "varient" typos in qqq.json. Add unit tests for CRH transliteration. Bug: T23582 Change-Id: I424703f99adf837f6217872b882d1ea26bfdd068	2017-11-20 16:56:38 -05:00

6 commits