In cases where we're operating on text data (and not binary data),
use e.g. "\u{00A0}" to refer directly to the Unicode character
'NO-BREAK SPACE' instead of "\xc2\xa0" to specify the bytes C2h A0h
(which correspond to the UTF-8 encoding of that character). This
makes it easier to look up those mysterious sequences, as not all
are as recognizable as the no-break space.
This is not enforced by PHP, but I think we should write those in
uppercase and zero-padded to at least four characters, like the
Unicode standard does.
Note that not all "\xNN" escapes can be automatically replaced:
* We can't use Unicode escapes for binary data that is not UTF-8
(e.g. in code converting from legacy encodings or testing the
handling of invalid UTF-8 byte sequences).
* '\xNN' escapes in regular expressions in single-quoted strings
are actually handled by PCRE and have to be dealt with carefully
(those regexps should probably be changed to use the /u modifier).
* "\xNN" referring to ASCII characters ("\x7F" and lower) should
probably be left as-is.
The replacements in this commit were done semi-manually by piping
the existing "\xNN" escapes through the following terrible Ruby
script I devised:
chars = eval('"' + ARGV[0] + '"').force_encoding('utf-8')
puts chars.split('').map{|char|
'\\u{' + char.ord.to_s(16).upcase.rjust(4, '0') + '}'
}.join('')
Change-Id: Idc3dee3a7fb5ebfaef395754d8859b18f1f8769a
This workaround was needed when ICU in production was broken
but after T189295 this is not needed anymore and we switched off
this collation from all Persian Wikis already
Bug: T139110
Change-Id: Ifad89555b6ac96a3eb36ca24b55e1f8ee57a1f05
These comments do not add anything. I argue they are worse than having
no comments, because I have to read them first to understand they
actually don't explain anything. Removing them makes room for actual
improvements in the future (if needed).
Change-Id: Iee70aad681b3385e9af282d5581c10addbb91ac4
* Adding new class AbkhazUppercaseCollation, mapped to 'uppercase-ab'.
* Extended CustomUppercaseCollation with support for sorting digraphs
and for alphabets larger than 64 letters (up to 4096).
Bug: T183430
Change-Id: I16d44568e44d7ef5b39c38b1a6257b9fe10a34d4
This is based on a numeric uppercase collation. Bashkir characters
will be remapped to the private use area for the purpose of sorting.
Bug: T162823
Change-Id: I65f1af0b57ff6ded7d464e39efd401f178a3519e