Commit graph

6 commits

Author SHA1 Message Date
tjones
669d1ed192 (y)etsin fixes, test refactoring, and misc fixes
* Fix etsin/етсин/этсин as noted in If933fc67845ac994d9ddfdf8349aff445ec9b13a
** only convert tsin to тсин and let the other rules sort out the e

* Refactor most tests to be word-specific, which uncovered a couple of
bugs in corner cases
** rol/üst prefix matches should match whole words (original [^ü] regex
assumed word could not be end of string

* Fixed incidental bugs I noticed while looking into the items above
** куркчи => kürkçi was in the wrong section
** cönk => джонк was in the right section, but reversed

* Added additional tests cases for all of the above.

Change-Id: Ia96be488a7b41c3ddba623b5c9262703b1c82687
2018-05-29 14:30:04 -04:00
tjones
cbb07cdc33 Crimean Tatar/crh transliteration odds and ends
* refactor '\b' into WB const to make it easy to update in the future
* add new ц-related exceptions

Bug: T193764
Change-Id: Ib707136f8f2598d1f8ec995bf129b436dfb53cd9
2018-05-22 14:59:55 -04:00
C. Scott Ananian
685eba4360 Minor fixes to CRH language conversion.
* Move a many-to-one mapping from the L2C to the C2L table where it
  belongs.

* Fix some regular expression patterns which ended up with misnumbered
  replacement strings.

* All regular expressions should have the `u` (unicode) flag set.

* Typo/spelling fixes in comments

Change-Id: If933fc67845ac994d9ddfdf8349aff445ec9b13a
2018-05-12 14:37:09 -04:00
tjones
14f8dc35db CRH Transliteration Pattern Matching Fixes
Refactor to match exceptions as patterns, not words
- break exception list to C2L and L2C pattern sets
- change main loop to break only on Roman numerals and transliterate
  everything else, rather than tokenizing on single-script words
  (this fixes the km² problem, too)
  - update word anchors from ^ and $ to \b
  - only process Roman numerals for L2C translit
  - add exception for single "Roman" character followed by a period
    which looks like an initial
- consolidate multi-step transliteration into regsConverter()
- remove regex support from main exception list to support strtr()
- re-organize some prefix/suffix/whole word patterns to the right place
- add tests for recently fixed use cases
- add support for many-to-one mappings in both directions
- update character classes, exception lists,  and regexes based on
  speaker feedback and example texts

Misc other fixes:
- fix some character classes errors
- remove unneeded character classes
- add tests for Roman numerals and quotes
- add tests for affixes and regexes

Bug: T188321
Bug: T189512
Change-Id: I056d36ff2b8f63b3998a5d3a442d8d539c15488d
2018-04-27 19:17:51 -04:00
Thiemo Mättig
409da2d8b3 Remove leading backslashes from "use \…" tags
Change-Id: I494b029de089a07e3b946ee78293a12d5036f63e
2017-12-28 16:30:05 +01:00
tjones
a0b511319c Crimean Tatar Transliteration
This is a first pass at Latin/Cyrillic translitertion for Crimean
Tatar (crh).

Includes transliteration tables, prefix/suffix mappings, regex
mappings, and exceptions lists for words and abbreviations.

Regularize CRH language name in messages/* files.

Fix "varient" typos in qqq.json.

Add unit tests for CRH transliteration.

Bug: T23582
Change-Id: I424703f99adf837f6217872b882d1ea26bfdd068
2017-11-20 16:56:38 -05:00