It was just a wrapper for ArrayUtils::findLowerBound(). It's not used
in core, and it shouldn't be used anywhere outside of core (but I
haven't checked).
Change-Id: I53b0aca6bb642bdf6c972098170579fa13746554
Variants included 'in <version>', 'as of <version>' and just the
version number.
Some @deprecated annotations do not have the version number at all,
I want to hunt them down separately.
Change-Id: I8208c6097098f4735d4f51bc42254675f1f27f6d
Swapped some "$var type" to "type $var" or added missing types
before the $var. Changed some other types to match the more common
spelling. Capitalized the beginning of some text.
Change-Id: I64e8cfe478cb0ba438f40b0631d6e9049cdab567
Also removed true as second parameter to it from CloneDatabase.php
since it is the default value of that parameter.
Change-Id: I727ebae2bd4df0e26019985ce8c7ce73381c5642
ICU does not currently support Sorani Kurdish / Central Kurdish
language ('ckb'). CollationCkb uses the same collation rules as
Persian / Farsi ('fa'), but different characters for digits.
For use at ckb.wikipedia, which currently has 'uca-fa' collation
deployed as a workaround.
Added the MW language used for transforming digits to the cache key for
first-letters data, in addition to the ICU locale.
Bug: 55630
Change-Id: I7d7f007592ede952859c5c9556b9ea5084b90e89
Previously both '1' and '۱' ("DIGIT ONE" and "EXTENDED ARABIC-INDIC
DIGIT ONE") were sorted under '1' heading, regardless of collation
locale.
Now they will both be sorted under the localised heading name (transformed
using Language#formatNum), for example '1' for 'uca-en' collation or
'۱' for 'uca-fa' collation.
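
A minimal sketch of the heading choice described above, in Python rather
than PHP. The table LOCAL_DIGITS and the helper digit_heading are
hypothetical names for illustration only; the real code transforms the
digit with Language#formatNum.

```python
import unicodedata

# Hypothetical digit tables: index = numeric value, entry = localised digit.
LOCAL_DIGITS = {
    "en": "0123456789",
    # EXTENDED ARABIC-INDIC DIGIT ZERO..NINE (U+06F0..U+06F9)
    "fa": "".join(chr(0x06F0 + d) for d in range(10)),
}


def digit_heading(first_char, lang):
    """Return the localised heading for a leading digit, or None for non-digits."""
    value = unicodedata.decimal(first_char, None)
    if value is None:
        return None
    return LOCAL_DIGITS[lang][value]


# Both spellings of "one" end up under the same heading for a given language.
ascii_one = "1"
persian_one = chr(0x06F1)
print(digit_heading(ascii_one, "fa") == digit_heading(persian_one, "fa"))
```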
Bug: 55630
Change-Id: I0b745d955a6e72f53873c95648087aa5f90a8852
The line continuation coding conventions prefer the closing parenthesis
on the same line as the opening curly brace. This is done for ifs
and functions.
Also moved some boolean operators from the end of a line to the beginning
and changed some indentation to make the conditions hopefully more
readable.
Change-Id: Id0437b06bde86eb5a75bc59eefa19e7edb624426
The "standard" collation for Swedish sorts V and W
only as secondary differences. Compare this to
the "reformed" collation which sorts them
as separate letters. Which collation is the default
for sv seems to vary by ICU version, but for ICU 4.8
(which WMF uses) reformed is the default. svwikisource
wants to use the "standard" collation.
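
A toy model of the difference, in Python. It does not use ICU; the two
key functions are made-up stand-ins that only imitate "W differs from V
at the secondary level" versus "W is a separate letter".

```python
def standard_key(word):
    # Primary level: treat w as v. Secondary level: the original spelling
    # breaks ties, so a "w" word sorts right after the matching "v" word.
    lowered = word.lower()
    return (lowered.replace("w", "v"), lowered)


def reformed_key(word):
    # V and W are separate letters; plain code point order is enough
    # for this ASCII-only toy.
    return word.lower()


words = ["wa", "vb", "va"]
standard = sorted(words, key=standard_key)  # "wa" lands between "va" and "vb"
reformed = sorted(words, key=reformed_key)  # all "v" words come before "wa"
print(standard, reformed)
```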
Change-Id: I051590cf687ddea2e2cd84203d6e8eed3a6efd99
And added/removed spaces around some other tokens,
like +, -, *, /, <, >, =, !
Fixed Windows newline style
Change-Id: I0b9c8c408f3f6bfc0d685a074d7ec468fb848fc8
First letters are supposed to be primary collation elements.
However, we do not want expansions to be considered
first letters. For example, thorn "þ" expands to "th", which isn't
the same as any other first letter (since "t" !== "th").
If þ were a first letter, the word "the" and,
even worse, the word "too" would be sorted under it, which
is wrong.
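
A toy illustration of one possible pruning rule, in Python. It assumes
sort keys are byte strings and that an expansion's key begins with the
key of another first letter. The weights and the helper name
prune_expansions are made up; this is not necessarily how the patch
detects expansions.

```python
def prune_expansions(sort_keys):
    """Drop letters whose sort key extends another letter's sort key."""
    kept = {}
    for letter, key in sort_keys.items():
        is_expansion = any(
            other != letter
            and len(other_key) < len(key)
            and key.startswith(other_key)
            for other, other_key in sort_keys.items()
        )
        if not is_expansion:
            kept[letter] = key
    return kept


# Made-up one-byte primary weights, for illustration only.
TOY_KEYS = {
    "h": b"\x30",
    "t": b"\x50",
    "u": b"\x51",
    "\u00fe": b"\x50\x30",  # thorn, which sorts like "th"
}

pruned = prune_expansions(TOY_KEYS)
print(sorted(pruned))
```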
Looking for feedback on whether this all sounds sane. I have tested
it: it got rid of the contractions while at the same time
not removing any letter it wasn't supposed to.
Once this is merged, we could get rid of all the
-<langcode> entries. The other firstLetter array
entries for tailorings could be merged into
generateCollationData.php too, since incorrect
things would get pruned automatically, which
would probably make the logic in Collation.php
simpler.
Bug: 43740
Change-Id: I4bd3d39ec2938a53e2c6728adc48ee6cf9778d74
* Removed spaces around array index
* Removed double spaces or added spaces at the beginning or end of function
calls, method signatures, conditions or foreachs
* Added braces to one-line ifs
* Changed multi line conditions to one line conditions
* Realigned some arrays
Change-Id: Ia04d2a99d663b07101013c2d53b3b2e872fd9cc3
Four non-ASCII letters - Ǥ, Ŋ, Ŧ, Ʒ - are sorted the same as their
unaccented base ASCII versions - G, N, T, Z - causing unexpected
output on category pages.
Bug: 46330
Change-Id: I976dedfdc651fcc96a2291934924aa40b27f4c2f
Just a class constant for now, but that should suffice to deal with the
current emergency. Proper dependency tracking via the CacheDependency
hierarchy would be pretty cool in the long term.
Change-Id: Ibbe7fa2814434d4869aba20f628bd43269e611fa
This is necessary for Swedish, where 'Þ' ("thorn") - considered a
separate letter by default in the first-letters-root.ser file - is
sorted as 'th', causing unexpected output on category pages - words
starting with 'th'..'u' were placed under a heading with the thorn.
There were three obvious ways to do this:
* somehow include information that this letter is to be removed in the
string itself, as in 'sv' => array( "Å", "Ä", "Ö", "-Þ" ) - could
potentially clash with valid uses
* create a separate array other than $tailoringFirstLetters to store
this information - would cause the data to be fragmented all over
the file
* include information about letters to be removed in a separate key
"linked" to the regular one, as in '-sv' => array( "Þ" ) - I see no
obvious downsides, so this is what I ended up doing
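
The chosen option can be sketched as follows, in Python. The dictionary
mirrors the '-sv' example above; the helper first_letters_for and the
sample root list are hypothetical.

```python
# Additions live under the language code, removals under "-" + code.
TAILORING_FIRST_LETTERS = {
    "sv": ["Å", "Ä", "Ö"],
    "-sv": ["Þ"],
}


def first_letters_for(lang, root_letters):
    """Apply a language's additions and removals to the root first letters."""
    additions = TAILORING_FIRST_LETTERS.get(lang, [])
    removals = set(TAILORING_FIRST_LETTERS.get("-" + lang, []))
    merged = list(root_letters)
    merged += [letter for letter in additions if letter not in merged]
    return [letter for letter in merged if letter not in removals]


swedish = first_letters_for("sv", ["A", "T", "Þ", "U"])
print(swedish)
```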
Bug: 45446
Change-Id: I57e07a2027c391c5baa767a68f4409b9de7b4618
Doxygen expects parameter types to come before the
parameter name in @param tags. Used a quick regex
to switch everything around where possible. This
only fixes cases where a primitive variable (or a
primitive followed by other types) is the variable
type. Other cases will need to be fixed manually.
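
A sketch of what such a regex could look like, in Python. This is not
the regex that was actually used; the list of primitive type names and
the helper fix_param_order are assumptions for illustration.

```python
import re

# Assumed set of "primitive" type names; the real list may differ.
PRIMITIVES = r"(?:string|integer|int|boolean|bool|float|array|mixed)"

# "@param $var type" where type starts with a primitive, optionally
# followed by more |-separated types.
SWAPPED_PARAM = re.compile(
    r"(@param\s+)(\$\w+)\s+(" + PRIMITIVES + r"(?:\|[\w\\]+)*)(?![\w|\\])"
)


def fix_param_order(line):
    """Rewrite "@param $var type" as "@param type $var" where possible."""
    return SWAPPED_PARAM.sub(r"\1\3 \2", line)


print(fix_param_order(" * @param $name string The name"))
```

Lines whose type is a class name are left alone, matching the note that
such cases need manual fixing.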
Change-Id: Ic59fd20856eb0489d70f3469a56ebce0efb3db13
* en: obviously
* it: per Nemo_bis in comments on change I97273c52
* hu: per Tisza Gergő in comments on bug 45596
* pt: 'uca-default' collation is deployed on pt.wiki, 'uca-pt' is the same thing
* uk: per Dmytro Dziuma in comments on bug 45444
Change-Id: Ia7568a9ad40ef991b73059b5269e6236f52681f1
This is *the* valid way for Hungarian (per bug 45596 comment 10), and
it's likely more appropriate for other languages as well.
I should have done it this way in the first place; the original data
source includes these forms along with the all-uppercase ones (I
checked them all), so they're certainly at least not wrong. Just an
oversight on my part.
Change-Id: Ie0ca297a082ddba8d757beb85655f86b3ee70b02
Added/removed spaces around logical/arithmetic operator
Reduced multiple empty lines to one empty line
Removed wrong tabs before comments at end of line
Removed superfluous spaces in assignments
Change-Id: I2bba4e72f9b5f88c53324d7b70e6042f1aad8f6b
More information about what actually sits in that array.
Summary of modifications to the Mimer data so far:
* removed data for "traditional" variants of de (German) and es (Spanish)
* used code 'tl' instead of 'fil' for Tagalog/Filipino
* added be-tarask (Belarusian Taraškievica)
Change-Id: I97273c52599a5eda3f63366d697b077d6b17ba81
This allows one to *finally* get articles to be correctly sorted on
category pages for 67 languages using the Latin, Greek and Cyrillic
alphabets.
Fixes bug 29788, bug 41040, and bug 42412 (implementing collations for
Swedish, Polish, Ukrainian).
Full list of language codes this adds support for: af, ast, az, be,
bg, br, bs, ca, co, cs, cy, da, de, dsb, el, en, eo, es, et, eu, fi,
fo, fr, fur, fy, ga, gd, gl, hr, hsb, hu, is, it, kk, kl, ku, ky, la,
lb, lt, lv, mk, mo, mt, nl, no, oc, pl, pt, rm, ro, ru, rup, sco, sk,
sl, smn, sq, sr, sv, tk, tl, tr, tt, uk, uz, vi.
* Include data about first-letter characters for 67 language
tailorings. This data was generated based on
http://developer.mimer.com/charts/tailorings.htm by a Ruby script
(https://www.mediawiki.org/wiki/User:Matma_Rex/generateCollationTailoringData.rb),
then adjusted by hand (removed duplicate definitions for Spanish and
German, changed code fil -> tl (Filipino -> Tagalog)).
* Mark languages verified by native speakers (currently only pl
(Polish), which I verified myself, and fi (Finnish), checked by Niklas).
* Allow for collations named like 'uca-<langcode>', mapping them to
IcuCollation with the appropriate parameter. The code doesn't check whether
we actually have data for the given language, as that is checked after the
IcuCollation class instance is constructed.
* Add the tailoring data to the default first-letter file (for root
collation) before it's cached for given locale.
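
The 'uca-<langcode>' mapping can be sketched like this, in Python. The
classes are bare stand-ins for the PHP ones, and mapping 'uca-default'
to the "root" locale is an assumption made for the sake of the example.

```python
class IcuCollation:
    """Stand-in for the PHP class; only remembers the requested locale."""

    def __init__(self, locale):
        self.locale = locale


class UppercaseCollation:
    """Stand-in for the non-ICU uppercase collation."""


def collation_factory(name):
    if name == "uppercase":
        return UppercaseCollation()
    if name == "uca-default":
        # Assumption: the default UCA collation uses the root locale.
        return IcuCollation("root")
    if name.startswith("uca-"):
        # No check here that tailoring data exists for the language;
        # that would happen after the instance is constructed.
        return IcuCollation(name[len("uca-"):])
    raise ValueError("Unknown collation: " + name)


print(collation_factory("uca-sv").locale)
```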
Change-Id: I838484b9aaf23945fe7880fef2e3da5f5c06877f
It will be necessary to be able to use the correct version of the Unicode
data files.
The constant INTL_ICU_VERSION this getter returns isn't really
documented. It is available since PHP 5.3.7 (see PHP bug 54561);
the getter will fail gracefully on older PHPs. It should be possible to
determine the ICU version on these by grepping the output of phpinfo(),
but I don't think such a minor improvement is worth such a huge hack.
Change-Id: I85353559439bfddee7c5ba90894d30dd8ef0e0e8
See I4542f57a. Meant as a temporary measure until such time as
generic tailoring code is implemented for UCA. This patch
is mostly Lejonel's code, with the class renamed.
Change-Id: Id39406c37a5277d9e7a9216544de2140411c2b01
Per the PSR-2 PHP standard, files should end with exactly one newline.
Some of our files had two or more and some others were missing a newline.
Fix almost all occurrences of the CodeSniffer sniff:
PSR2.Files.EndFileNewline.TooMany
I have not fixed the selenium files; I believe we will drop them.
Change-Id: I89fca8c1786fee94855b7b77bb0f364001ee84b6
It will be necessary to be able to use the correct version of the Unicode
data files.
The constant INTL_ICU_VERSION this getter returns isn't really
documented. It is available since PHP 5.3.7 (see PHP bug 54561);
the getter will fail gracefully on older PHPs. It should be possible to
determine the ICU version on these by grepping the output of phpinfo(),
but I don't think such a minor improvement is worth such a huge hack.
Change-Id: Iee4b8380406ae71c980dfdd7b9fdd0b58ecb9cd0
Fix the following edge cases which were previously broken:
* Zero-length input array
* Target value before the start
* Target value past the end
They didn't really matter for my original application, but Liangent
wants to use this function for something else.
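
For reference, the three edge cases can be exercised with a small
Python sketch. It assumes the function is meant to return the index of
the largest item that is less than or equal to the target, and a
"not found" value (None here) when there is no such item; the name
find_lower_bound and the list-based signature are illustrative, not the
PHP API.

```python
from bisect import bisect_right


def find_lower_bound(values, target):
    """Index of the largest item <= target in sorted values, or None."""
    index = bisect_right(values, target) - 1
    return index if index >= 0 else None


print(find_lower_bound([], 5))            # zero-length input array
print(find_lower_bound([10, 20, 30], 5))  # target before the start
print(find_lower_bound([10, 20, 30], 35)) # target past the end
```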
Change-Id: Ia5f5ed4ab3cb6c463177a4812fd3ce96c6d37b33
can define new Collation classes.
(I plan to commit such an extension shortly)
Wasn't sure if it would be better to make an array mapping collation names => class names
instead. However, that seemed to be needlessly complicated, so I went with
letting that variable take class names.
* Added a maintenance script which generates a list of first letters. Unified Han are omitted for performance, and because they shouldn't be used as headings anyway. A future collation specific to Chinese would provide the KangXi radicals as "first letters".
* Provided a precomputed list of first letters. Used Unicode 6.0.0 data and ICU 4.2.
* Moved collation functionality from Language to a Collation class hierarchy with factory function. Removed the recently-added methods from Language and updated all callers.
* Changed Title::getCategorySortkey() to separate its parts with a line break instead of a null character. All collations supported by the intl extension ignore the null character, i.e. "ab" == "a\0b". It would have required a lot of hacking to make it work.
* Fixed the uppercase collation to handle non-ASCII characters, redundantly with r80436. I don't think it's necessary to change the collation name as was done there, so I reverted that in the course of my conflict merge. A --force option to updateCollation.php might be nice though.