58 commits

Author SHA1 Message Date
Fomafix
0f1858321c Use PHP 7 '??' operator instead of if-then-else
Change-Id: I790b86e2e9e3e41386144637659516a4bfca1cfe
2018-06-12 23:14:18 +02:00
Bartosz Dziewoński
0313128b10 Use PHP 7 "\u{NNNN}" Unicode codepoint escapes in string literals
In cases where we're operating on text data (and not binary data),
use e.g. "\u{00A0}" to refer directly to the Unicode character
'NO-BREAK SPACE' instead of "\xc2\xa0" to specify the bytes C2h A0h
(which correspond to the UTF-8 encoding of that character). This
makes it easier to look up those mysterious sequences, as not all
are as recognizable as the no-break space.

This is not enforced by PHP, but I think we should write those in
uppercase and zero-padded to at least four characters, like the
Unicode standard does.

Note that not all "\xNN" escapes can be automatically replaced:
* We can't use Unicode escapes for binary data that is not UTF-8
  (e.g. in code converting from legacy encodings or testing the
  handling of invalid UTF-8 byte sequences).
* '\xNN' escapes in regular expressions in single-quoted strings
  are actually handled by PCRE and have to be dealt with carefully
  (those regexps should probably be changed to use the /u modifier).
* "\xNN" referring to ASCII characters ("\x7F" and lower) should
  probably be left as-is.

The replacements in this commit were done semi-manually by piping
the existing "\xNN" escapes through the following terrible Ruby
script I devised:

  chars = eval('"' + ARGV[0] + '"').force_encoding('utf-8')
  puts chars.split('').map{|char|
    '\\u{' + char.ord.to_s(16).upcase.rjust(4, '0') + '}'
  }.join('')
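
A minimal sketch, not taken from the commit, of the same conversion
written without eval: it takes the raw bytes directly and rejects input
that is not valid UTF-8 (the "binary data" case noted above). The helper
name to_php_unicode_escapes is hypothetical.

```ruby
# Hypothetical helper: turn a UTF-8 byte string into PHP 7 "\u{NNNN}" escapes,
# uppercase and zero-padded to at least four hex digits.
def to_php_unicode_escapes(bytes)
  text = bytes.dup.force_encoding('UTF-8')
  # Binary data that is not UTF-8 cannot be expressed as Unicode escapes.
  raise ArgumentError, 'input is not valid UTF-8' unless text.valid_encoding?

  text.each_char.map { |char| format('\\u{%04X}', char.ord) }.join
end

puts to_php_unicode_escapes("\xC2\xA0") # prints \u{00A0}
```

Passing the bytes in as a string avoids evaluating arbitrary input, at
the cost of having to unescape the "\xNN" sequences beforehand.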

Change-Id: Idc3dee3a7fb5ebfaef395754d8859b18f1f8769a
2018-06-04 16:20:13 +00:00
Bartosz Dziewoński
b191e5e860 Use PHP 7 '<=>' operator in 'sort()' callbacks
`$a <=> $b` returns `-1` if `$a` is lesser, `1` if `$b` is lesser,
and `0` if they are equal, which are exactly the values 'sort()'
callbacks are supposed to return.

It also enables the neat idiom `$a[x] <=> $b[x] ?: $a[y] <=> $b[y]`
to sort arrays of objects first by 'x', and by 'y' if they are equal.

* Replace a common pattern like `return $a < $b ? -1 : 1` with the
  new operator (and similar patterns with the variables, the numbers
  or the comparison inverted). Some of the uses were previously not
  correctly handling the variables being equal; this is now
  automatically fixed.
* Also replace `return $a - $b`, which is equivalent to `return
  $a <=> $b` if both variables are integers but less intuitive.
* (Do not replace `return strcmp( $a, $b )`. It is also equivalent
  when both variables are strings, but if any of the variables is not,
  'strcmp()' converts it to a string before comparison, which could
  give different results than '<=>', so changing this would require
  careful review and isn't worth it.)
* Also replace `return $a > $b`, which presumably sort of works most
  of the time (returns `1` if `$b` is lesser, and `0` if they are
  equal or `$a` is lesser) but is erroneous.
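
To illustrate the points above, here is a small sketch in Ruby, whose
`<=>` returns the same -1/0/1 values as the PHP operator. The records
and method names are hypothetical and are not taken from the changed
code.

```ruby
# Hypothetical records standing in for the "arrays of objects" mentioned above.
ROWS = [{ x: 2, y: 1 }, { x: 1, y: 2 }, { x: 1, y: 1 }].freeze

# Sort by :x, then by :y when the :x values are equal. Because 0 is truthy in
# Ruby, nonzero? (which returns nil for 0) stands in for PHP's `?:` fallback.
def by_x_then_y(a, b)
  (a[:x] <=> b[:x]).nonzero? || (a[:y] <=> b[:y])
end

# The `return $a > $b` pattern: "less than" and "equal" both collapse to 0.
def greater_only(a, b)
  a > b ? 1 : 0
end

SORTED = ROWS.sort { |a, b| by_x_then_y(a, b) }
p SORTED
p greater_only(1, 2) # 0, although 1 is less than 2
p 1 <=> 2            # -1
```

A comparator that reports 0 for "less than" tells the sort that the two
values are interchangeable, which is why the last pattern is called
erroneous.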

Change-Id: I19a3d2fc8fcdb208c10330bd7a42c4e05d7f5cf3
2018-05-30 18:05:20 -07:00
James D. Forrester
70c711a6bc Follow-up If8dfdaf1: Hard-deprecate, drop two uses, other pre-5.3 back-compat code
Change-Id: I1c5eee3fe30d6687d88e07011a3d40b6770d0daf
2018-05-24 17:01:02 -07:00
jenkins-bot
60ee1e8110 Merge "Add unicode mapping for ICU 60 and 61" 2018-05-24 21:46:32 +00:00
Reedy
fdb8724e7f Add unicode mapping for ICU 60 and 61
Change-Id: Ifbbc8d7ecc788bc2c6b07a8ebba46a9648545786
2018-05-24 22:28:19 +01:00
James D. Forrester
a6c4d473de IcuCollation: Deprecate getICUVersion(), no need for PHP53 back-compat
Change-Id: If8dfdaf187b32b7b9a2c09a240416b9f481593f1
2018-05-24 21:23:18 +00:00
Amir Sarabadani
5a21de8abb Remove everything related to CollationFa
This workaround was needed while ICU in production was broken.
After T189295 it is no longer needed, and we have already switched
this collation off on all Persian wikis.

Bug: T139110
Change-Id: Ifad89555b6ac96a3eb36ca24b55e1f8ee57a1f05
2018-05-18 18:33:25 +02:00
Bartosz Dziewoński
390ff7fca1 IcuCollation: Use codepoint as tiebreaker when getting first-letters
This prevents unexpected cuneiform digits from acting as headings for
2 and 3 on category pages.

Bug: T187645
Change-Id: I0424a24769899cb23b28704f97e1002fa44999fd
2018-05-11 06:36:24 +00:00
jenkins-bot
1a21a63d52 Merge "Add collation for Abkhaz (ab)" 2018-01-23 18:42:29 +00:00
Umherirrender
23ef520a1c Improve some parameter docs
Change-Id: I31e983d7ac287158101b18ad95779d83537302a2
2018-01-07 11:39:08 +01:00
Bartosz Dziewoński
e94587dfbb Add collation for Abkhaz (ab)
* Adding new class AbkhazUppercaseCollation, mapped to 'uppercase-ab'.
* Extended CustomUppercaseCollation with support for sorting digraphs
  and for alphabets larger than 64 letters (up to 4096).

Bug: T183430
Change-Id: I16d44568e44d7ef5b39c38b1a6257b9fe10a34d4
2017-12-25 14:37:14 +00:00
jhsoby
660caf9b88 Add custom collation for Northern Sami
This commit adds a custom collation order for
Northern Sami ('se'). Northern Sami exists in ICU,
but the version of ICU that Wikimedia uses is a
few years old and does *not* include Northern
Sami. It could be years before Wikimedia's production
servers use a version that includes Northern Sami (see
bug), so this is a temporary workaround.

Bug: T181503
Change-Id: Ib8a48b8db99bef8ec4b05144aace5dbdcacfeded
2017-12-07 21:32:11 +00:00
Reedy
7b3add76b1 Add Unicode to ICU mappings for versions 58 and 59
Change-Id: I87a5e6ce3a44a2be1e6bf8adf2f98cd0a4745574
2017-10-25 23:42:28 +01:00
Umherirrender
14dfc3dbc5 Fix typo in 'language'
Change-Id: I3c4d090640892ae07d3da33dcfe3ace397a40808
2017-10-07 18:53:04 +02:00
Umherirrender
f739a8f368 Improve some parameter docs
Add missing @return and @param to function docs and fix some @param

Change-Id: I810727961057cfdcc274428b239af5975c57468d
2017-09-10 20:32:31 +02:00
jenkins-bot
1d7a1bf8bd Merge "Move around "ا" to after "آ" and not before" 2017-09-06 13:12:13 +00:00
Amir Sarabadani
2ceba3b145 Move around "ا" to after "آ" and not before
Bug: T173601
Change-Id: I0f6b3ecc2800180a2c6a8217803411862a299e04
2017-08-31 08:02:08 +00:00
Umherirrender
3f1a52805e Use short type bool/int in param documentation
Enable the phpcs sniffs for this and run phpcbf

Change-Id: Iaa36687154ddd2bf663b9dd519f5c99409d37925
2017-08-20 13:20:59 +02:00
Umherirrender
5544cef16b Add missing type to @param documentation
Change-Id: I6b2c9c7af9a281fe457099cc3a336a60a25e74aa
2017-08-11 20:37:35 +02:00
Umherirrender
ace44e2064 Use correct variable name in @param documentation
For some varargs, a variable name is added with the suffix ,... as is
done for many other varargs

Some @param are swapped because they were in the wrong order

Enable Sniff MediaWiki.Commenting.FunctionComment.ParamNameNoMatch

Change-Id: I60fec6025bce824d5c67563ab7b65ad6cd628ad8
2017-08-11 19:27:19 +02:00
Kunal Mehta
d1cf48a397 build: Update mediawiki/mediawiki-codesniffer to 0.10.1
And auto-fix all errors.

The `<exclude-pattern>` stanzas are now included in the default ruleset
and don't need to be repeated.

Change-Id: I928af549dc88ac2c6cb82058f64c7c7f3111598a
2017-07-22 18:24:09 -07:00
jenkins-bot
e72303c9f3 Merge "Remove auto-generated "Constructor" documentation on constructors" 2017-07-21 13:19:44 +00:00
Thiemo Mättig
91a920fd85 Remove auto-generated "Constructor" documentation on constructors
Having such comments is worse than not having them. They add zero
information. But you must read the text to understand there is
nothing you don't already know from the class and the method name.

This is similar to I994d11e. Even more trivial, because this here is
about comments that don't say anything but "constructor".

Change-Id: I474dcdb5997bea3aafd11c0760ee072dfaff124c
2017-07-21 12:19:30 +02:00
Bartosz Dziewoński
98627d4cab IcuCollation: Fix diacritic characters for Aromanian (rup) and Moldovan (mo) headings
They should be Ș, Ț (comma-below) but were cedilla-below (Ş, Ţ) instead.
Same as for Romanian (ro) in 486f64f283.

Both of these languages are unsupported by libicu and so the collations
are unlikely to have been used in practice.

Bug: T171043
Bug: T171044
Change-Id: Idd0d593e73cd784fbef7b75e8985f988f5555e26
2017-07-19 21:49:27 +02:00
Brian Wolff
22cb66c175 Update FIRST_LETTER_VERSION for rowiki changes
Can't just clear the cache on production, as this
now uses a per-server APC instance.

Follow-up 486f64f283

Change-Id: I88df6d5a91c86ef687543d1a6988e0ec050bbfce
2017-07-19 17:56:38 +00:00
Bartosz Dziewoński
486f64f283 IcuCollation: Fix diacritic characters for Romanian (ro) headings
They should be Ș, Ț (comma-below) but were cedilla-below (Ş, Ţ) instead.

Bug: T168711
Change-Id: I6dc873c3ce93bca3e425439f70d0fb30aecc9533
2017-07-19 16:28:02 +02:00
Bartosz Dziewoński
b3caa05a38 CollationFa: Avoid PHP 7 Unicode escape syntax
We still support PHP 5.5.

Change-Id: I587cb794cded95afe7ad493614a6090a108efe6c
2017-06-22 16:22:49 +02:00
Brian Wolff
0bfcbd7240 Hack around icu breakage for fa sorting
Bug: T139110
Change-Id: I35bcdaf309f595258289f01bbe5713ce6d1ffad1
2017-05-19 22:14:43 +00:00
Brian Wolff
73f5937047 Add collation for Bashkir (ba)
This is based on a numeric uppercase collation. Bashkir characters
will be remapped to the private use area for the purpose of sorting.
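
The remapping idea can be sketched as follows. This is only an
illustration under stated assumptions: the four-letter alphabet, the
PUA_BASE offset and the sort_key helper are all hypothetical, and the
sketch leaves out the numeric handling of the real collation.

```ruby
# Toy alphabet in its desired order (an assumption, not the Bashkir alphabet).
ALPHABET = %w[A Ä B C].freeze
# Hypothetical start of the remapping range inside the private use area.
PUA_BASE = 0xF000

# Map each letter to a private-use codepoint whose order matches the alphabet.
REMAP = ALPHABET.each_with_index.map do |letter, index|
  [letter, [PUA_BASE + index].pack('U')]
end.to_h.freeze

# Uppercase the text, then replace known letters with their remapped codepoints.
def sort_key(text)
  text.upcase.each_char.map { |char| REMAP.fetch(char, char) }.join
end

WORDS = %w[Bar Äbc Abd Cab].freeze
p WORDS.sort                          # codepoint order puts "Äbc" last
p WORDS.sort_by { |w| sort_key(w) }   # alphabet order puts "Äbc" after "Abd"
```

Because the remapped codepoints are consecutive, a plain codepoint
comparison of the keys yields the alphabet's own order.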

Bug: T162823
Change-Id: I65f1af0b57ff6ded7d464e39efd401f178a3519e
2017-05-10 04:17:46 +00:00
Timo Tijhof
3a2a707546 Clean up remaining get_class() uses
* get_class()        -> __CLASS__ (same as self::class)
* get_called_class() -> static::class
* get_class($this)   -> static::class

Change-Id: I1888a1897ecf4548a2e5a67a942e5c080dd7e3d3
2017-03-07 22:03:47 +00:00
jenkins-bot
17eda64357 Merge "includes: Replace implicit Bugzilla bug numbers with Phab ones" 2017-02-28 00:51:57 +00:00
Bartosz Dziewoński
267efadac7 Collation: Allow uppercase letters in UCA collations' names
We have several such collations defined in IcuCollation:

* bs-Cyrl
* de-AT@collation=phonebook
* fr-CA
* sr-Latn

They couldn't actually be used.

Change-Id: I3a62073583c49d3e90910aa8240fe9fcc0682386
2017-02-22 21:17:54 +01:00
James D. Forrester
9635dda73a includes: Replace implicit Bugzilla bug numbers with Phab ones
It's unreasonable to expect newbies to know that "bug 12345" means "Task T14345"
except where it doesn't, so let's just standardise on the real numbers.

Change-Id: I6f59febaf8fc96e80f8cfc11f4356283f461142a
2017-02-21 18:13:24 +00:00
Bartosz Dziewoński
afc6e7cd15 CollationFa: Third time's the charm
We have to use a tertiary sortkey for everything with the primary
sortkey of 2627. Otherwise, the "Remove duplicate prefixes" logic
in IcuCollation would remove them.

The following characters will now be considered separate letters in
the 'xx-uca-fa' collation for the purpose of displaying the headings
on category pages: ء ئ ا و ٲ ٳ

Bug: T139110
Change-Id: Ibbea5d76348e4cdc38b74cba44286910b2ed592f
2017-01-05 15:54:00 +01:00
Bartosz Dziewoński
611801a38d IcuCollation: Add the current class name to 'first-letters' cache key
Instances of subclasses of IcuCollation with customizations for
specific languages probably shouldn't share this cache with instances
of IcuCollation with the same language.

Change-Id: I06d66d199c99448a3375381baef0366c4d99c8c4
2016-12-15 15:17:56 +01:00
jenkins-bot
ce079cf6ad Merge "Add CollationFa" 2016-12-15 13:37:56 +00:00
Amir Sarabadani
708c02281e Add CollationFa
Bug: T139110
Change-Id: Ie15a2ee1c22ff4a1d2b721ed137227fe83dd12ea
2016-12-15 13:25:56 +00:00
jenkins-bot
ea42d90053 Merge "Make NumericUppercaseCollation use localized digit transforms" 2016-11-16 02:46:31 +00:00
Brian Wolff
779aa4ce5a Add first letter data for bn collation (Standard and Traditional)
This is based solely on looking at the bn.txt collation data
file. It has not been tested by native speakers.

Bug: T148885
Change-Id: Ide926bc5ee8752269ef6a1bfe972e19b7188d193
2016-11-15 16:09:45 -08:00
Bartosz Dziewoński
37b1fc9456 IcuCollation: Do not split $tailoringFirstLetters into verified/not verified
At this point I think it's safe to assume that these mostly work well,
and the split makes maintenance of the alphabetical list more difficult
(some entries were already in wrong order). We've been enabling these
collations for more and more Wikimedia wikis and not hearing about any
problems. Mistakes, if any are present, should be treated like any
other bug.

Also made some comments consistent.

Change-Id: I4b5fbcf4dbbdd4dc194ed821341296171fa64bb0
2016-10-31 16:48:13 +01:00
Brian Wolff
95c299e67f Add firstLetter data for ~50 additional languages
Based on CLDR 29 data files.

This covers the relatively easy languages in CLDR 29 (which is most
of them). I skipped languages with complicated tailoring files.

Change-Id: I8367604f7d3a1cdef9cb4e15813893c8cbfff1ff
2016-10-29 12:10:52 +00:00
Brian Wolff
e7464f3481 Make NumericUppercaseCollation use localized digit transforms
This will cause the numeric collation to sort localized digits
for the current content language the same way as 0-9.

This only deals with localized digits; commas and other number
formatting are still not handled. Weird "numerical" Unicode
characters are also not handled.

I was unsure whether to make a "family" of numeric collations
where you specify numeric-<lang code>, or whether it should
just use $wgContLang. Given that $wgContLang effectively
never changes, and also affects all other digit handling,
I opted to just use $wgContLang.

Any wikis currently using the 'numeric' collation will
have to have updateCollation.php --force run after this
change is deployed. At the moment that includes:
bnwiki, bnwikisource and hewiki

Bug: T148873
Change-Id: I9eda52a8a9752a91134d1118546b0a80d3980ccf
2016-10-29 08:38:39 +00:00
Kaldari
3c8490b1e3 Fixing numeric sorting for numbers with leading zeros
Bug: T148774
Change-Id: I34aa330645d9d82b6c4e57542e891dd2b36e42ad
2016-10-20 11:58:38 -07:00
Bartosz Dziewoński
cf13e01f38 IcuCollation: Update comments on $tailoringFirstLetters
A few more languages marked as "Verified by native speakers",
based on which collations we've been using in production
on Wikimedia wikis.

(I'm not sure if this makes sense now that we're fairly confident
that these are good in general, but since it's already here...)

Change-Id: I8e1f31fa61509eca8c76a2df4e18638005e68b77
2016-09-22 21:02:15 +00:00
Bartosz Dziewoński
3b84eb02c2 Implement NumericUppercaseCollation
This collation orders text with numbers "naturally", so that
'Foo 1' < 'Foo 2' < 'Foo 12'.

Note that this only works in terms of sequences of digits, and the
behavior for decimal fractions or pretty-formatted numbers may be
unexpected.

This is only expected to work mostly correctly for English-language
text. Consider it a proof of concept. You probably want to use
a UCA collation with the '-u-kn' suffix rather than this.
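
A minimal sketch of "natural" ordering by digit sequences, to make the
caveat about decimal fractions concrete. It is an assumption-laden
illustration in Ruby, not the code of NumericUppercaseCollation; the
natural_key helper is hypothetical.

```ruby
# Hypothetical sort key: uppercase the text, split it into digit and non-digit
# runs, and compare digit runs as integers (so leading zeros are ignored).
def natural_key(text)
  text.upcase.scan(/\d+|\D+/).map do |part|
    # Tag each run so that numbers and text never get compared with each other.
    part.match?(/\A\d/) ? [0, part.to_i, ''] : [1, 0, part]
  end
end

TITLES = ['Foo 12', 'Foo 2', 'Foo 1'].freeze
p TITLES.sort                             # ["Foo 1", "Foo 12", "Foo 2"]
p TITLES.sort_by { |t| natural_key(t) }   # ["Foo 1", "Foo 2", "Foo 12"]

# Decimal fractions are treated as separate digit runs, so 1.5 sorts before 1.25.
p ['1.25', '1.5'].sort_by { |t| natural_key(t) } # ["1.5", "1.25"]
```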

Bug: T8948
Change-Id: Ie268f2d92c5c75d0aaecf54ede2bdda1af3b309d
2016-08-23 18:41:01 +00:00
Kaldari
deaf4ff495 Updating $tailoringFirstLetters for Macedonian
Per https://ssl.icu-project.org/trac/browser/icu/trunk/source/data/coll/mk.txt

Bug: T26953
Change-Id: I45938402923a109cfc80f59555af5cede584fc3b
2016-08-08 13:41:28 -07:00
Kaldari
52c1b00dc0 Adding support for numeric collation when using UCA collations
To use, add '-u-kn' to the end of a collation name and set it as
the value for $wgCategoryCollation.

Bug: T8948
Change-Id: Ica7908daf80624fa2648127114d01665e96234c0
2016-07-26 17:29:41 -07:00
jenkins-bot
68978670d2 Merge "Add Unicode to ICU mappings for versions 51-57" 2016-07-21 05:23:22 +00:00
Reedy
997c071301 Add Unicode to ICU mappings for versions 51-57
Change-Id: I35c2cdd2c56b491229f1f6d8b69b1de21af23aab
2016-07-20 20:47:50 +01:00