For consistency with other data files. Also, like the other data files:
* For automated fetching of the Unicode files,
move the steps from Makefile to a bash script.
* Switch to a static array file format.
Change-Id: If07487950a270283b8eaeda9a507e723ed2d89c4
In cases where we're operating on text data (and not binary data),
use e.g. "\u{00A0}" to refer directly to the Unicode character
'NO-BREAK SPACE' instead of "\xc2\xa0" to specify the bytes C2h A0h
(which correspond to the UTF-8 encoding of that character). This
makes it easier to look up those mysterious sequences, as not all
are as recognizable as the no-break space.
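The equivalence of the two spellings can be checked outside PHP as well. The following is only a sketch in Ruby (the language of the conversion script in this message), whose double-quoted strings happen to accept the same two escape forms; the variable names are made up for illustration.

```ruby
# Both literals denote U+00A0 NO-BREAK SPACE; only the spelling differs.
by_code_point = "\u{00A0}"   # names the character directly
by_utf8_bytes = "\xC2\xA0"   # spells out its UTF-8 encoding byte by byte

# Compare the raw bytes and format the code point the way Unicode charts do.
same_bytes = by_code_point.bytes == by_utf8_bytes.bytes
code_point = format('U+%04X', by_code_point.ord)

puts same_bytes
puts code_point
```

The second form only makes sense if the reader already knows the bytes are UTF-8, which is the readability problem described above.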
PHP does not enforce this, but I think we should write the code points
in uppercase and zero-padded to at least four digits, like the
Unicode standard does.
Note that not all "\xNN" escapes can be automatically replaced:
* We can't use Unicode escapes for binary data that is not UTF-8
(e.g. in code converting from legacy encodings or testing the
handling of invalid UTF-8 byte sequences).
* '\xNN' escapes in regular expressions in single-quoted strings
are actually handled by PCRE and have to be dealt with carefully
(those regexps should probably be changed to use the /u modifier).
* "\xNN" referring to ASCII characters ("\x7F" and lower) should
probably be left as-is.
The replacements in this commit were done semi-manually by piping
the existing "\xNN" escapes through the following terrible Ruby
script I devised:
    chars = eval('"' + ARGV[0] + '"').force_encoding('utf-8')
    puts chars.split('').map{|char|
        '\\u{' + char.ord.to_s(16).upcase.rjust(4, '0') + '}'
    }.join('')
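For reference, the same conversion can be sketched as a function that takes the raw bytes directly instead of calling `eval` on a command-line argument. This is not the script that was actually used; the name `to_unicode_escapes` is hypothetical, and the input is assumed to be valid UTF-8 (`ord` raises an error otherwise).

```ruby
# Turn a string of UTF-8 bytes into a sequence of "\u{XXXX}" escapes,
# uppercase and zero-padded to at least four hex digits.
def to_unicode_escapes(bytes)
  bytes.dup.force_encoding('utf-8').each_char.map { |char|
    format('\\u{%04X}', char.ord)
  }.join
end

puts to_unicode_escapes("\xC2\xA0".b)
```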
Change-Id: Idc3dee3a7fb5ebfaef395754d8859b18f1f8769a
`$a <=> $b` returns `-1` if `$a` is lesser, `1` if `$b` is lesser,
and `0` if they are equal, which are exactly the values that sorting
callbacks (e.g. for 'usort()') are supposed to return.
It also enables the neat idiom `$a[x] <=> $b[x] ?: $a[y] <=> $b[y]`
to sort arrays of objects first by 'x', and by 'y' if they are equal.
* Replace a common pattern like `return $a < $b ? -1 : 1` with the
new operator (and similar patterns with the variables, the numbers
or the comparison inverted). Some of the uses were previously not
correctly handling the variables being equal; this is now
automatically fixed.
* Also replace `return $a - $b`, which is equivalent to
`return $a <=> $b` for sorting purposes if both variables are integers,
but less intuitive.
* (Do not replace `return strcmp( $a, $b )`. It is also equivalent
when both variables are strings, but if any of the variables is not,
'strcmp()' converts it to a string before comparison, which could
give different results than '<=>', so changing this would require
careful review and isn't worth it.)
* Also replace `return $a > $b`, which presumably sort of works most
of the time (returns `1` if `$b` is lesser, and `0` if they are
equal or `$a` is lesser) but is erroneous.
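The two-key idiom can be tried out in Ruby, whose `<=>` returns the same -1/0/1 values. This is a sketch of the idea, not the PHP code changed in this commit: `Page` and `compare_pages` are hypothetical names, and because `0` is truthy in Ruby, `.nonzero? ||` stands in for PHP's `?:`.

```ruby
Page = Struct.new(:x, :y)

# Compare by x first; fall back to y only when the x values are equal.
def compare_pages(a, b)
  (a.x <=> b.x).nonzero? || (a.y <=> b.y)
end

pages = [Page.new(2, 'a'), Page.new(1, 'b'), Page.new(1, 'a')]
sorted = pages.sort { |a, b| compare_pages(a, b) }

sorted.each { |page| puts "#{page.x} #{page.y}" }
```

Unlike `return $a < $b ? -1 : 1`, this comparison returns 0 for equal elements, which is the case some of the replaced callbacks got wrong.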
Change-Id: I19a3d2fc8fcdb208c10330bd7a42c4e05d7f5cf3
This prevents unexpected cuneiform digits from acting as headings for
2 and 3 on category pages.
Bug: T187645
Change-Id: I0424a24769899cb23b28704f97e1002fa44999fd
And auto-fix all errors.
The `<exclude-pattern>` stanzas are now included in the default ruleset
and don't need to be repeated.
Change-Id: I928af549dc88ac2c6cb82058f64c7c7f3111598a
They should be Ș, Ț (comma below), but were Ş, Ţ (cedilla) instead.
Same as for Romanian (ro) in 486f64f283.
Both of these languages are unsupported by libicu and so the collations
are unlikely to have been used in practice.
Bug: T171043
Bug: T171044
Change-Id: Idd0d593e73cd784fbef7b75e8985f988f5555e26
Can't just clear the cache on production, as this
now uses a per-server APC instance.
Follow-up 486f64f283
Change-Id: I88df6d5a91c86ef687543d1a6988e0ec050bbfce
It's unreasonable to expect newbies to know that "bug 12345" means "Task T14345"
except where it doesn't, so let's just standardise on the real numbers.
Change-Id: I6f59febaf8fc96e80f8cfc11f4356283f461142a
Instances of subclasses of IcuCollation with customizations for
specific languages probably shouldn't share this cache with instances
of IcuCollation with the same language.
Change-Id: I06d66d199c99448a3375381baef0366c4d99c8c4
This is based solely on looking at the bn.txt collation data
file. It has not been tested by native speakers.
Bug: T148885
Change-Id: Ide926bc5ee8752269ef6a1bfe972e19b7188d193
At this point I think it's safe to assume that these mostly work well,
and the split makes maintenance of the alphabetical list more difficult
(some entries were already in wrong order). We've been enabling these
collations for more and more Wikimedia wikis and not hearing about any
problems. Mistakes, if any are present, should be treated like any
other bug.
Also made some comments consistent.
Change-Id: I4b5fbcf4dbbdd4dc194ed821341296171fa64bb0
Based on CLDR 29 data files.
This covers the relatively easy languages in CLDR 29 (which is most
of them). I skipped languages with complicated tailoring files.
Change-Id: I8367604f7d3a1cdef9cb4e15813893c8cbfff1ff
A few more languages marked as "Verified by native speakers",
based on which collations we've been using in production
on Wikimedia wikis.
(I'm not sure if this makes sense now that we're fairly confident
that these are good in general, but since it's already here...)
Change-Id: I8e1f31fa61509eca8c76a2df4e18638005e68b77
To use, add '-u-kn' to the end of a collation name and set it as
the value for $wgCategoryCollation.
Bug: T8948
Change-Id: Ica7908daf80624fa2648127114d01665e96234c0
Small optimization to IcuCollation::fetchFirstLetterData().
This used to suppress and restore warnings once for every letter of
every alphabet. The workaround involving string casting and error
suppression is no longer needed as of PHP 5.3, in which the
bug was fixed.
Change-Id: Idd41a509858c0887df4f632b480b387bd74027b2
* Factor out fetchFirstLetterData() as a separate method.
* Move 'version' into the key instead of checking afterwards.
* Use getWithSetCallback() for the cache handling.
(Depends on version being in the key).
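Why the get-or-compute style depends on the version being in the key can be sketched in Ruby. This is an illustration only, not the MediaWiki API: `fetch_or_compute`, `first_letters_key`, the plain Hash used as a cache, and the wiki name and version strings are all hypothetical.

```ruby
# Hypothetical stand-in for a get-or-compute cache call: return the cached
# value for the key, or run the block once and store its result.
def fetch_or_compute(cache, key)
  cache.fetch(key) { cache[key] = yield }
end

# With the version in the key, data from an older version is simply never
# looked up, so no check of the stored value is needed after fetching.
def first_letters_key(wiki, locale, version)
  [wiki, 'first-letters', locale, version].join(':')
end

cache = {}
computations = 0

2.times do
  fetch_or_compute(cache, first_letters_key('examplewiki', 'fr', '1.0')) do
    computations += 1
    %w[A B C]
  end
end

# A version bump produces a new key, which forces a recomputation.
fetch_or_compute(cache, first_letters_key('examplewiki', 'fr', '2.0')) do
  computations += 1
  %w[A B C]
end

puts computations
```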
Change-Id: I15bddf5d1dabcdcef47a938447ba59436bd8a294
I noticed that `frwiki:first-letters:fr:fr:4.8.1.1` was at the very top of keys
sorted by bandwidth (that is, reqs/sec * size) on one of the memcache servers
on WMF prod.
The data takes ~60-80 ms to compute, in case of a cache miss. That's not
enough to justify using a tiered cache abstraction here, IMO.
Change-Id: If81ce8f86f2c378565f1f6a0dd2c04dee825c4e9