Commit graph

6 commits

Author SHA1 Message Date
tjones
669d1ed192 (y)etsin fixes, test refactoring, and misc fixes
* Fix etsin/етсин/этсин as noted in If933fc67845ac994d9ddfdf8349aff445ec9b13a
** only convert tsin to тсин and let the other rules sort out the e

* Refactor most tests to be word-specific, which uncovered a couple of
bugs in corner cases
** rol/üst prefix matches should match whole words (original [^ü] regex
assumed word could not be end of string

* Fixed incidental bugs I noticed while looking into the items above
** куркчи => kürkçi was in the wrong section
** cönk => джонк was in the right section, but reversed

* Added additional tests cases for all of the above.

Change-Id: Ia96be488a7b41c3ddba623b5c9262703b1c82687
2018-05-29 14:30:04 -04:00
tjones
cbb07cdc33 Crimean Tatar/crh transliteration odds and ends
* refactor '\b' into WB const to make it easy to update in the future
* add new ц-related exceptions

Bug: T193764
Change-Id: Ib707136f8f2598d1f8ec995bf129b436dfb53cd9
2018-05-22 14:59:55 -04:00
tjones
14f8dc35db CRH Transliteration Pattern Matching Fixes
Refactor to match exceptions as patterns, not words
- break exception list to C2L and L2C pattern sets
- change main loop to break only on Roman numerals and transliterate
  everything else, rather than tokenizing on single-script words
  (this fixes the km² problem, too)
  - update word anchors from ^ and $ to \b
  - only process Roman numerals for L2C translit
  - add exception for single "Roman" character followed by a period
    which looks like an initial
- consolidate multi-step transliteration into regsConverter()
- remove regex support from main exception list to support strtr()
- re-organize some prefix/suffix/whole word patterns to the right place
- add tests for recently fixed use cases
- add support for many-to-one mappings in both directions
- update character classes, exception lists,  and regexes based on
  speaker feedback and example texts

Misc other fixes:
- fix some character classes errors
- remove unneeded character classes
- add tests for Roman numerals and quotes
- add tests for affixes and regexes

Bug: T188321
Bug: T189512
Change-Id: I056d36ff2b8f63b3998a5d3a442d8d539c15488d
2018-04-27 19:17:51 -04:00
tjones
70dede013c Fix table loading bug for CRH transliteration
In production, the regex and exception tables were not being loaded,
resulting in very poor transliteration. The loading has been moved to
the contructor, similar to the implementation of the Kazakh
transliteration.

Also, a bug in the mappings for Ö/ö -> Ё/ё and Ü/ü -> Ю/ю has been
fixed.

Test cases for specific additional examples have been added. (Though
it is worth noting that the regex and exception tables did load
properly during unit testing, so the problem wasn't caught there.)

Bug: T186727
Change-Id: I6bacee7d9de6f4a870a8a9ef1f04b819ad489c02
2018-02-26 13:22:04 -05:00
Kunal Mehta
fc23633035 Add @covers tags to languages tests
I removed comments that merely repeated the location of the class being
tested. There are other tests in this directory that don't have a
corresponding class and need further investigation.

Change-Id: Ic16f0887b5030ac53fab4382cfaedfb5426cdb08
2017-12-28 08:52:56 +00:00
tjones
a0b511319c Crimean Tatar Transliteration
This is a first pass at Latin/Cyrillic translitertion for Crimean
Tatar (crh).

Includes transliteration tables, prefix/suffix mappings, regex
mappings, and exceptions lists for words and abbreviations.

Regularize CRH language name in messages/* files.

Fix "varient" typos in qqq.json.

Add unit tests for CRH transliteration.

Bug: T23582
Change-Id: I424703f99adf837f6217872b882d1ea26bfdd068
2017-11-20 16:56:38 -05:00