Commit graph

37 commits

Author SHA1 Message Date
jenkins-bot
dca2c238b8 Merge "Allow uca-sv@collation=standard to be a collation name." 2013-06-27 18:48:21 +00:00
Brian Wolff
ac88b636b8 Allow uca-sv@collation=standard to be a collation name.
The "standard" collation for Swedish sorts V and W
only as secondary differences. Compare this to
the "reformed" collation, which sorts them
as separate letters. Which collation is the default
for sv seems to vary by ICU version, but for ICU 4.8
(which WMF uses) reformed is the default. svwikisource
wants to use the "standard" collation.

Change-Id: I051590cf687ddea2e2cd84203d6e8eed3a6efd99
2013-06-27 20:37:14 +02:00
Brian Wolff
a075f0de28 Add fa to collation list.
Based on http://collation-charts.org/icu442/icu442-fa.html
Should be verified by a native speaker.

Bug: 30287
Change-Id: I3c30824f7d133cf615ec7c2c39d31f27c39f89fe
2013-05-16 22:59:46 -03:00
umherirrender
da39005596 Removed space after isset
While at it, added/removed some other spaces in the same files

Change-Id: Iabb23a448f6f53eb6020155f9c744f74f8b11786
2013-04-26 14:18:06 +02:00
umherirrender
ef2f507d23 Fixed spacing in files directly in the includes folder
Added spaces before if, foreach
Added some braces for one line statements

Change-Id: Ibb8dd102db045522d12ff939075ba7420d95ab6b
2013-04-21 06:38:49 +00:00
umherirrender
15abcf71ca Added/Removed spaces around string concatenation
And added/removed spaces around some other tokens,
like +, -, *, /, <, >, =, !

Fixed windows newline style

Change-Id: I0b9c8c408f3f6bfc0d685a074d7ec468fb848fc8
2013-04-13 13:36:24 +02:00
Brian Wolff
3d70637a42 Remove first letters that have an overlapping prefix.
First letters are supposed to be primary collation elements.
However, we do not want expansions to be considered
first letters. For example, thorn "þ" expands to "th", which isn't
the same as any other first letter (since "t" !== "th").
If þ were a first letter, the word "the" and,
even worse, the word "too" would be sorted under it, which
is wrong.

Looking for feedback on whether this all sounds sane. I have tested
it: it got rid of the contractions while at the same time
not removing any letter it wasn't supposed to.

Once this is merged, we could get rid of all the
-<langcode> entries. The other firstLetter array
entries for tailorings could be merged into
generateCollationData.php too, since incorrect
things would get pruned automatically, which
would probably make the logic in Collation.php
simpler.

Bug: 43740
Change-Id: I4bd3d39ec2938a53e2c6728adc48ee6cf9778d74
2013-04-08 22:52:40 +00:00
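
The pruning idea described in commit 3d70637a42 can be sketched roughly as below. This is a minimal Python illustration, not the MediaWiki code: the function name `prune_overlapping` and the hand-written sort keys are hypothetical stand-ins for the keys ICU would produce. A candidate is dropped when its key starts with the key of another candidate, which is how an expansion such as þ -> "th" shows up.

```python
def prune_overlapping(keys):
    """Drop letters whose sort key begins with another letter's sort key.

    `keys` maps each candidate first letter to a (hypothetical) primary
    sort key, given here as a plain string.
    """
    kept = []
    for letter, key in keys.items():
        # A letter overlaps if some other letter's key is a proper prefix
        # of its own key.
        overlaps = any(
            other != letter and other_key != key and key.startswith(other_key)
            for other, other_key in keys.items()
        )
        if not overlaps:
            kept.append(letter)
    return kept


# Toy keys (assumed): thorn expands to "th", so its key starts with the key of "T".
toy_keys = {"S": "s", "T": "t", "Þ": "th", "U": "u"}
print(prune_overlapping(toy_keys))  # ['S', 'T', 'U']
```

With thorn pruned, words such as "the" and "too" fall under "T" in this toy model rather than under a thorn heading.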
umherirrender
6c278b6d7e fix some spacing
* Removed spaces around array index
* Removed double spaces or added spaces to begin or end of function
  calls, method signature, conditions or foreachs
* Added braces to one-line ifs
* Changed multi line conditions to one line conditions
* Realigned some arrays

Change-Id: Ia04d2a99d663b07101013c2d53b3b2e872fd9cc3
2013-03-25 22:22:46 +00:00
Brian Wolff
6662199c53 Merge "IcuCollation::$tailoringFirstLetters: letter removal rules for Finnish" 2013-03-23 20:53:00 +00:00
MatmaRex
3d7966d28c IcuCollation::$tailoringFirstLetters: letter removal rules for Finnish
Four non-ASCII letters - Ǥ, Ŋ, Ŧ, Ʒ - are sorted the same as their
unaccented base ASCII versions - G, N, T, Z - causing unexpected
output on category pages.

Bug: 46330
Change-Id: I976dedfdc651fcc96a2291934924aa40b27f4c2f
2013-03-21 00:12:00 +01:00
MatmaRex
9c6655adb2 IcuCollation::$tailoringFirstLetters: 'sv', 'vi' verified
* sv: per Lejonel in comments on bug 45446
* vi: per Minh Nguyễn in comments on bug 45979

Change-Id: I96bbcd73e75f9fc85a5c0b402eae87e5cda2259e
2013-03-18 13:24:25 +01:00
Tim Starling
029dcc9953 Allow first letter data to be invalidated
Just a class constant for now, but that should suffice to deal with the
current emergency. Proper dependency tracking via the CacheDependency
hierarchy would be pretty cool in the long term.

Change-Id: Ibbe7fa2814434d4869aba20f628bd43269e611fa
2013-03-13 14:53:20 +11:00
MatmaRex
ae38b340dc IcuCollation::$tailoringFirstLetters: implement letter removal
This is necessary for Swedish, where 'Þ' ("thorn") - considered a
separate letter by default in the first-letters-root.ser file - is
sorted as 'th', causing unexpected output on category pages - words
starting with 'th'..'u' were placed under a heading with the thorn.

There were three obvious ways to do this:
* somehow include information that this letter is to be removed in the
  string itself, as in 'sv' => array( "Å", "Ä", "Ö", "-Þ" ) - could
  potentially clash with valid uses
* create a separate array other than $tailoringFirstLetters to store
  this information - would cause the data to be fragmented all over
  the file
* include information about letters to be removed in a separate key
  "linked" to the regular one, as in '-sv' => array( "Þ" ) - I see no
  obvious downsides, so this is what I ended up doing

Bug: 45446
Change-Id: I57e07a2027c391c5baa767a68f4409b9de7b4618
2013-03-11 22:24:30 +01:00
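
The third option chosen in commit ae38b340dc, a "-<langcode>" key linked to the regular one, might look like the following. This is a Python sketch under stated assumptions: `apply_tailoring` and the short root list are hypothetical, and the real code would also re-sort the result with the collator, which is omitted here.

```python
def apply_tailoring(root_letters, tailoring, lang):
    """Return the root letters plus additions for `lang`, minus its removals.

    Additions live under the key `lang`; removals live under "-" + lang.
    """
    additions = tailoring.get(lang, [])
    removals = set(tailoring.get("-" + lang, []))
    letters = list(root_letters)
    for letter in additions:
        if letter not in letters:
            letters.append(letter)
    # Removals are applied last so they win over both root and additions.
    return [letter for letter in letters if letter not in removals]


tailoring = {
    "sv": ["Å", "Ä", "Ö"],  # letters added for Swedish
    "-sv": ["Þ"],           # letters removed for Swedish
}
root = ["A", "T", "Þ", "U"]  # hypothetical excerpt of the root first-letter list
print(apply_tailoring(root, tailoring, "sv"))  # ['A', 'T', 'U', 'Å', 'Ä', 'Ö']
```

Keeping removals in a separate key leaves the letter strings themselves free of marker characters, which is the clash the first option risked.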
Tyler Anthony Romeo
4dcc7961df Fixed @param tags to conform with Doxygen format.
Doxygen expects parameter types to come before the
parameter name in @param tags. Used a quick regex
to switch everything around where possible. This
only fixes cases where a primitive variable (or a
primitive followed by other types) is the variable
type. Other cases will need to be fixed manually.

Change-Id: Ic59fd20856eb0489d70f3469a56ebce0efb3db13
2013-03-11 13:15:01 -04:00
MatmaRex
453ed1818e IcuCollation::$tailoringFirstLetters: 'en', 'it', 'hu', 'pt', 'uk' verified
* en: obviously
* it: per Nemo_bis in comments on change I97273c52
* hu: per Tisza Gergő in comments on bug 45596
* pt: 'uca-default' collation is deployed on pt.wiki, 'uca-pt' is the same thing
* uk: per Dmytro Dziuma in comments on bug 45444

Change-Id: Ia7568a9ad40ef991b73059b5269e6236f52681f1
2013-03-11 05:23:18 +00:00
MatmaRex
c95cf323ff lowercase second character in digraph letters in IcuCollation tailorings
This is *the* valid way for Hungarian (per bug 45596 comment 10), and
it's likely more appropriate for other languages as well.

I should have done it this way in the first place; the original data
source includes these forms along with the all-uppercase ones (I
checked them all), so they're certainly at least not wrong. Just an
oversight on my part.

Change-Id: Ie0ca297a082ddba8d757beb85655f86b3ee70b02
2013-03-11 05:18:29 +00:00
umherirrender
d63121016d fix some spacing
Added/removed spaces around logical/arithmetic operator
Reduced multiple empty lines to one empty line
Removed wrong tabs before comments at end of line
Removed too many spaces in assignments

Change-Id: I2bba4e72f9b5f88c53324d7b70e6042f1aad8f6b
2013-03-07 17:53:21 +01:00
MatmaRex
d01cbb4148 adjusted comments for IcuCollation::$tailoringFirstLetters
More information about what actually sits in that array.

Summary of modifications to the Mimer data so far:
* removed data for "traditional" variants of de (German) and es (Spanish)
* used code 'tl' instead of 'fil' for Tagalog/Filipino
* added be-tarask (Belarusian Taraškievica)

Change-Id: I97273c52599a5eda3f63366d697b077d6b17ba81
2013-03-05 13:45:15 +01:00
Pavel Selitskas
afec7906ad language-specific collations: be-tarask added; be, be-tarask, ru verified
Change-Id: I560d766f9b9e9a4ff79e35aa4eec79be875c84c7
2013-02-27 23:55:48 +00:00
MatmaRex
0c28ca1422 Revert "(bug 29788) Swedish Collation (uppercase-sv). Swaps Ä and Æ"
This workaround is unnecessary now that I838484b9 was merged.

This reverts commit 13dc8ff88f.

Change-Id: I2cd22ad87eb7a56c5742b20c6089a4b8607e5614
2013-02-26 22:18:36 +00:00
MatmaRex
9143494912 (bug 43799) create language-specific collations for category sorting
This allows one to *finally* get articles to be correctly sorted on
category pages for 67 languages based on the Latin, Greek and Cyrillic
alphabets.

Fixes bug 29788, bug 41040, and bug 42412 (implementing collations for
Swedish, Polish, Ukrainian).

Full list of language codes this adds support for: af, ast, az, be,
bg, br, bs, ca, co, cs, cy, da, de, dsb, el, en, eo, es, et, eu, fi,
fo, fr, fur, fy, ga, gd, gl, hr, hsb, hu, is, it, kk, kl, ku, ky, la,
lb, lt, lv, mk, mo, mt, nl, no, oc, pl, pt, rm, ro, ru, rup, sco, sk,
sl, smn, sq, sr, sv, tk, tl, tr, tt, uk, uz, vi.

* Include data about first-letter characters for 67 language
  tailorings. This data was generated from
  http://developer.mimer.com/charts/tailorings.htm by a Ruby script
  (https://www.mediawiki.org/wiki/User:Matma_Rex/generateCollationTailoringData.rb),
  then adjusted by hand (removed duplicate definitions for Spanish and
  German, changed code fil -> tl (Filipino -> Tagalog)).

* Mark languages verified by native speakers (currently only pl
  (Polish), which I verified myself, and fi (Finnish), checked by Niklas).

* Allow for collations named like 'uca-<langcode>', mapping them to
  IcuCollation with appropriate parameter. The code doesn't check if
  we actually have data for given language, as it's checked after the
  IcuCollation class instance is constructed.

* Add the tailoring data to the default first-letter file (for root
  collation) before it's cached for given locale.

Change-Id: I838484b9aaf23945fe7880fef2e3da5f5c06877f
2013-02-26 20:58:55 +01:00
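
The 'uca-<langcode>' naming in commit 9143494912 can be illustrated with a small parser. This is a Python sketch only; `parse_collation_name` is a hypothetical name and not part of MediaWiki. Like the commit describes, it does not check whether tailoring data exists for the language, leaving that to whatever constructs the collation afterwards.

```python
def parse_collation_name(name):
    """Return the part after 'uca-', or None if the name has another form."""
    prefix = "uca-"
    if not name.startswith(prefix) or len(name) == len(prefix):
        return None
    # No check here that data exists for this language code.
    return name[len(prefix):]


print(parse_collation_name("uca-sv"))     # sv
print(parse_collation_name("uppercase"))  # None
```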
MatmaRex
e8c0c2ad46 (bug 43801) add a getter for ICU version to ICUCollation
It will be necessary to be able to use correct version of Unicode
data files.

The constant INTL_ICU_VERSION this getter returns isn't really
documented. It is available since PHP 5.3.7 (see PHP bug 54561);
the getter will fail gracefully on older PHPs. It should be possible to
determine the ICU version on these by grepping the output of phpinfo(),
but I don't think such a minor improvement is worth such a huge hack.

Change-Id: I85353559439bfddee7c5ba90894d30dd8ef0e0e8
2013-02-08 16:57:08 -04:00
jenkins-bot
f8daed077a Merge "(bug 43801) add a getter for ICU version to ICUCollation" 2013-02-06 19:35:36 +00:00
Brian Wolff
13dc8ff88f (bug 29788) Swedish Collation (uppercase-sv). Swaps Ä and Æ
See I4542f57a. Meant as a temporary measure until such a time as
generic tailoring code is implemented for uca. This patch
is mostly Lejonel's code, with the class renamed.

Change-Id: Id39406c37a5277d9e7a9216544de2140411c2b01
2013-02-05 22:21:50 +00:00
Antoine Musso
f6b92231fd style: normalize end of files
By PSR2 PHP Standard, files should end with exactly one newline.
Some of our files have 2 or more and some others were missing a newline.

Fix almost all occurrences of CodeSniffer sniff:
PSR2.Files.EndFileNewline.TooMany

I have not fixed the selenium files, I believe we will drop them.

Change-Id: I89fca8c1786fee94855b7b77bb0f364001ee84b6
2013-02-03 15:04:39 +01:00
MatmaRex
1bcba60f80 (bug 43801) add a getter for ICU version to ICUCollation
It will be necessary to be able to use correct version of Unicode
data files.

The constant INTL_ICU_VERSION this getter returns isn't really
documented. It is available since PHP 5.3.7 (see PHP bug 54561);
the getter will fail gracefully on older PHPs. It should be possible to
determine the ICU version on these by grepping the output of phpinfo(),
but I don't think such a minor improvement is worth such a huge hack.

Change-Id: Iee4b8380406ae71c980dfdd7b9fdd0b58ecb9cd0
2013-01-30 19:46:25 +01:00
Tim Starling
1eca50c383 Fix various boundary cases in IcuCollation::findLowerBound()
Fix the following edge cases which were previously broken:

* Zero-length input array
* Target value before the start
* Target value past the end

They didn't really matter for my original application, but Liangent
wants to use this function for something else.

Change-Id: Ia5f5ed4ab3cb6c463177a4812fd3ce96c6d37b33
2012-10-17 14:30:49 +11:00
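
The three boundary cases listed in commit 1eca50c383 can be shown with a small binary search. This is a Python sketch, not the PHP implementation: it assumes "lower bound" means the index of the last item that sorts less than or equal to the target, and it uses None for "no such item", both of which are assumptions made for illustration.

```python
def find_lower_bound(items, target):
    """Return the index of the last item <= target, or None if there is none.

    `items` must be sorted in ascending order.
    """
    if not items:
        return None  # zero-length input array
    lo, hi = 0, len(items)
    # Invariant: everything before lo is <= target, everything from hi on is > target.
    while lo < hi:
        mid = (lo + hi) // 2
        if items[mid] <= target:
            lo = mid + 1
        else:
            hi = mid
    # lo == 0 means the target sorts before the start.
    return lo - 1 if lo > 0 else None


print(find_lower_bound([], 5))             # None
print(find_lower_bound([10, 20, 30], 5))   # None
print(find_lower_bound([10, 20, 30], 35))  # 2
print(find_lower_bound([10, 20, 30], 25))  # 1
```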
umherirrender
85d8ee1f87 Remove a bunch of trailing spaces and unneeded newlines
Change-Id: I00f369641320acd7f087427ef031f3ee7efa0997
2012-10-10 20:14:40 +02:00
Alexandre Emsenhuber
1082c71e9b Added missing GPLv2 headers in some places.
Also made file/class documentation more consistent.

Change-Id: Ibe7815124d6915792dcbb150d01df21d9b22b0b0
2012-05-21 21:56:39 +02:00
Sam Reed
7b25f8231f Fixing some of the "@return true" or "@return false" tags, which need to be "@return bool"; the description can then say true if foo, false if bar
Other documentation improvements
2012-02-09 19:30:01 +00:00
Brian Wolff
a658eee7fc (bug 30722) Add an identity collation that sorts things based on what the unicode code point is (aka pre-1.17 behaviour).
I'm tagging this 1.18 because the original bug was for iswiktionary wanting it, so it'd be nice to get it in 1.18.
2011-09-11 01:13:08 +00:00
Brian Wolff
f980458a9b (Follow-up r90759 per CR) Use a hook to register new Collations instead of just taking the collation name as a class name 2011-07-05 05:30:04 +00:00
Brian Wolff
99ee7b7cf6 Let $wgCategoryCollation take a class name as a value so that extensions
can define new Collation classes.

(I plan to commit such an extension shortly)

Wasn't sure if it would be better to make an array mapping collation names => class names
instead. However, that seemed to be needlessly complicated so I went with
letting that variable take class names.
2011-06-25 07:21:29 +00:00
Chad Horohoe
783d4e0862 Remove @static from all over the place. That's what the static keyword is for, this being PHP5 and all 2011-04-21 00:07:09 +00:00
Sam Reed
ca7ea0b1ad More function documentation 2011-04-15 17:44:19 +00:00
Platonides
82eab17c16 Update comments to take into account r80443 and r80614 changes, per CR. 2011-01-28 22:27:52 +00:00
Tim Starling
eaeea84b44 * Introduced a non-dummy collation for $wgCategoryCollation, namely UCA with default tables.
* Added a maintenance script which generates a list of first letters. Unified Han are omitted for performance, and because they shouldn't be used as headings anyway. A future collation specific to Chinese would provide the KangXi radicals as "first letters".
* Provided a precomputed list of first letters. Used Unicode 6.0.0 data and ICU 4.2. 
* Moved collation functionality from Language to a Collation class hierarchy with factory function. Removed the recently-added methods from Language and updated all callers.
* Changed Title::getCategorySortkey() to separate its parts with a line break instead of a null character. All collations supported by the intl extension ignore the null character, i.e. "ab" == "a\0b". It would have required a lot of hacking to make it work.
* Fixed the uppercase collation to handle non-ASCII characters, redundantly with r80436. I don't think it's necessary to change the collation name as was done there, so I reverted that in the course of my conflict merge. A --force option to updateCollation.php might be nice though.
2011-01-17 14:02:22 +00:00