205 commits

Author SHA1 Message Date
Kunal Mehta
4acb7ed51c Add @coversNothing to tests that don't cover specific PHP classes
Change-Id: Idbd364561bc28547e9fac20d7a80b9a44edf14a9
2018-06-12 13:27:40 -07:00
jenkins-bot
e602b197ab Merge "(y)etsin fixes, test refactoring, and misc fixes" 2018-06-08 20:46:12 +00:00
Bartosz Dziewoński
0313128b10 Use PHP 7 "\u{NNNN}" Unicode codepoint escapes in string literals
In cases where we're operating on text data (and not binary data),
use e.g. "\u{00A0}" to refer directly to the Unicode character
'NO-BREAK SPACE' instead of "\xc2\xa0" to specify the bytes C2h A0h
(which correspond to the UTF-8 encoding of that character). This
makes it easier to look up those mysterious sequences, as not all
are as recognizable as the no-break space.

This is not enforced by PHP, but I think we should write those in
uppercase and zero-padded to at least four characters, like the
Unicode standard does.

Note that not all "\xNN" escapes can be automatically replaced:
* We can't use Unicode escapes for binary data that is not UTF-8
  (e.g. in code converting from legacy encodings or testing the
  handling of invalid UTF-8 byte sequences).
* '\xNN' escapes in regular expressions in single-quoted strings
  are actually handled by PCRE and have to be dealt with carefully
  (those regexps should probably be changed to use the /u modifier).
* "\xNN" referring to ASCII characters ("\x7F" and lower) should
  probably be left as-is.

The replacements in this commit were done semi-manually by piping
the existing "\xNN" escapes through the following terrible Ruby
script I devised:

  chars = eval('"' + ARGV[0] + '"').force_encoding('utf-8')
  puts chars.split('').map{|char|
    '\\u{' + char.ord.to_s(16).upcase.rjust(4, '0') + '}'
  }.join('')
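
For readers without Ruby at hand, the same transformation can be sketched in Python. This is an illustration only and not part of the commit; the function name `to_unicode_escapes` is hypothetical. It decodes UTF-8 bytes and prints each code point in uppercase hex, zero-padded to at least four digits, as the message recommends.

```python
def to_unicode_escapes(raw: bytes) -> str:
    """Render UTF-8 bytes as PHP 7 style "\\u{NNNN}" escapes."""
    # Decoding fails loudly for binary data that is not valid UTF-8,
    # which is the case the commit message says must not be converted.
    text = raw.decode("utf-8")
    # Uppercase hex, zero-padded to at least four digits.
    return "".join("\\u{%04X}" % ord(ch) for ch in text)


print(to_unicode_escapes(b"\xc2\xa0"))  # \u{00A0}
```

Unlike the Ruby one-liner, this sketch takes the bytes directly instead of calling eval on an escaped string.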

Change-Id: Idc3dee3a7fb5ebfaef395754d8859b18f1f8769a
2018-06-04 16:20:13 +00:00
Bartosz Dziewoński
4fd27f006f Use PHP 5.6 '**' operator instead of 'pow()' function
Change-Id: Ieb22e1dbfcffaa4e7b3dcfabbcc999e5dd59a4bf
2018-05-30 18:05:19 -07:00
tjones
669d1ed192 (y)etsin fixes, test refactoring, and misc fixes
* Fix etsin/етсин/этсин as noted in If933fc67845ac994d9ddfdf8349aff445ec9b13a
** only convert tsin to тсин and let the other rules sort out the e

* Refactor most tests to be word-specific, which uncovered a couple of
bugs in corner cases
** rol/üst prefix matches should match whole words (original [^ü] regex
assumed word could not be at end of string)

* Fixed incidental bugs I noticed while looking into the items above
** куркчи => kürkçi was in the wrong section
** cönk => джонк was in the right section, but reversed

* Added additional test cases for all of the above.
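
The end-of-string problem mentioned for the rol/üst prefixes can be reproduced in isolation. The sketch below uses Python and two hypothetical patterns (the real CRH rules are not reproduced here): a negated character class must consume one more character, so it cannot match when the word is the last thing in the string, while a word boundary can.

```python
import re

# Hypothetical patterns for illustration only.
NEGATED_CLASS = re.compile(r"üst[^ü]")  # needs one more character after "üst"
WORD_BOUNDARY = re.compile(r"üst\b")    # also succeeds at the end of the string


def matches(pattern, text):
    """Return True if the pattern is found anywhere in the text."""
    return pattern.search(text) is not None


print(matches(NEGATED_CLASS, "üst"))  # False
print(matches(WORD_BOUNDARY, "üst"))  # True
```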

Change-Id: Ia96be488a7b41c3ddba623b5c9262703b1c82687
2018-05-29 14:30:04 -04:00
tjones
cbb07cdc33 Crimean Tatar/crh transliteration odds and ends
* refactor '\b' into WB const to make it easy to update in the future
* add new ц-related exceptions

Bug: T193764
Change-Id: Ib707136f8f2598d1f8ec995bf129b436dfb53cd9
2018-05-22 14:59:55 -04:00
C. Scott Ananian
685eba4360 Minor fixes to CRH language conversion.
* Move a many-to-one mapping from the L2C to the C2L table where it
  belongs.

* Fix some regular expression patterns which ended up with misnumbered
  replacement strings.

* All regular expressions should have the `u` (unicode) flag set.

* Typo/spelling fixes in comments

Change-Id: If933fc67845ac994d9ddfdf8349aff445ec9b13a
2018-05-12 14:37:09 -04:00
tjones
14f8dc35db CRH Transliteration Pattern Matching Fixes
Refactor to match exceptions as patterns, not words
- break exception list to C2L and L2C pattern sets
- change main loop to break only on Roman numerals and transliterate
  everything else, rather than tokenizing on single-script words
  (this fixes the km² problem, too)
  - update word anchors from ^ and $ to \b
  - only process Roman numerals for L2C translit
  - add exception for single "Roman" character followed by a period
    which looks like an initial
- consolidate multi-step transliteration into regsConverter()
- remove regex support from main exception list to support strtr()
- re-organize some prefix/suffix/whole word patterns to the right place
- add tests for recently fixed use cases
- add support for many-to-one mappings in both directions
- update character classes, exception lists, and regexes based on
  speaker feedback and example texts

Misc other fixes:
- fix some character class errors
- remove unneeded character classes
- add tests for Roman numerals and quotes
- add tests for affixes and regexes

Bug: T188321
Bug: T189512
Change-Id: I056d36ff2b8f63b3998a5d3a442d8d539c15488d
2018-04-27 19:17:51 -04:00
jenkins-bot
a6abe2ad7a Merge "Add Russian grammar forms to support Wikiversity" 2018-03-14 08:37:27 +00:00
jenkins-bot
3c198b9dc8 Merge "Fix table loading bug for CRH transliteration" 2018-02-28 21:09:01 +00:00
tjones
70dede013c Fix table loading bug for CRH transliteration
In production, the regex and exception tables were not being loaded,
resulting in very poor transliteration. The loading has been moved to
the constructor, similar to the implementation of the Kazakh
transliteration.

Also, a bug in the mappings for Ö/ö -> Ё/ё and Ü/ü -> Ю/ю has been
fixed.

Test cases for specific additional examples have been added. (Though
it is worth noting that the regex and exception tables did load
properly during unit testing, so the problem wasn't caught there.)

Bug: T186727
Change-Id: I6bacee7d9de6f4a870a8a9ef1f04b819ad489c02
2018-02-26 13:22:04 -05:00
Amire80
398e2a7c9d Add Russian grammar forms to support Wikiversity
Change-Id: I70fcb03db62307116ec96d4c242e6796534b57a1
2018-02-26 14:18:01 +02:00
Fomafix
7855ec8385 SpecialPageAliasTest: Fix arguments of Language::fetchLanguageNames
Language::fetchLanguageNames( 'mwfile' ) means all languages with the
default filter 'mw' and names in the language 'mwfile'.

Language::fetchLanguageNames( null, 'mwfile' ) means all
languages with the filter 'mwfile' and names in the default language.

This change removes the test for the language codes:
* aa
* als
* bat-smg
* be-x-old
* cho
* fiu-vro
* ho
* hz
* kj
* kr
* mh
* mus
* ng
* no
* rn
* roa-rup
* shi-latn
* shi-tfng
* simple
* tum
* uz-cyrl
* uz-latn
* zh-classical
* zh-min-nan
* zh-yue

Change-Id: I7266a67e37862daf863d1565d84cfeebaf5cb680
2018-02-25 13:31:43 +01:00
jenkins-bot
e46d0694ac Merge "Truncate tag filter descriptions" 2018-02-21 12:52:23 +00:00
Umherirrender
63d96c15fd build: Updating mediawiki/mediawiki-codesniffer to 16.0.0
Change-Id: I59b59f79bbf3ce4feff3b3a20c1c31bc16370531
2018-02-17 13:29:13 +01:00
petarpetkovic
2d2575852c Truncate tag filter descriptions
Introduce truncateInternal() method in Language class, based on
existing truncate() method. New method abstracts string truncation,
allowing users to specify callable functions for text length measurement
and string truncation.

New method, truncateInternal(), is used to provide two options for
text truncation:
* For DB usage: truncateForDatabase() method is truncating text by
number of bytes.
* For UI usage: truncateForVisual() method is truncating text by number
of characters, using multibyte string PHP methods.

Old truncate() method is deprecated and just returns the results of
truncateForDatabase() method.

Newly introduced truncateForVisual() method is used for
truncation of long tag descriptions in RCFilters menu.
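
The split described above, one generic routine parameterised by a length function and a cutting function, might look roughly like this in Python. All names, the "..." ellipsis, and the signatures are assumptions made for the sketch; the PHP methods in the Language class are not reproduced here.

```python
def truncate_internal(text, length, measure, cut, ellipsis="..."):
    """Shorten text to `length` units as counted by `measure`.

    Assumes `length` is at least the size of the ellipsis.
    """
    if measure(text) <= length:
        return text
    budget = max(length - measure(ellipsis), 0)
    return cut(text, budget) + ellipsis


def byte_length(s):
    return len(s.encode("utf-8"))


def cut_bytes(s, n):
    # Cut on bytes, then drop any partial multibyte character at the end.
    return s.encode("utf-8")[:n].decode("utf-8", errors="ignore")


def truncate_for_database(text, length):
    """Limit by UTF-8 byte count (storage limits)."""
    return truncate_internal(text, length, byte_length, cut_bytes)


def truncate_for_visual(text, length):
    """Limit by character count (what the reader sees)."""
    return truncate_internal(text, length, len, lambda s, n: s[:n])
```

With the five-character, ten-byte string "ééééé", `truncate_for_visual(text, 4)` and `truncate_for_database(text, 6)` both return "é...", but for different reasons: the first counts characters, the second counts bytes.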

Bug: T179626
Change-Id: Ib01a8c303304064dde3ce983b817d93a88a5affd
2018-02-09 22:45:20 +01:00
Timo Tijhof
bee9f4db96 Remove various redundant '@license' tags in file headers
Redundant given this is the project-wide license already,
especially in file headers that already include the GPL license
header.

This and other minor fixups based on feedback from Ie0cea0ef5027c7e5.

* Add @file where missing.
* Move @ingroup and @deprecated from file to class doc where needed.

Change-Id: I7067abb7abee1f0c238cb2536e16192e946d8daa
2018-01-12 18:15:11 +00:00
Bartosz Dziewoński
eb6bb6b7b9 Generalize non-digit-grouping of four-digit numbers
In some languages it's conventional not to insert a thousands
separator in numbers that are four digits long (1000-9999).
Rather than copy-paste the custom code to do this between 13 files,
introduce another option and have the base Language class handle it.

This also fixes an issue in several languages where this logic
previously would not work for negative or fractional numbers.

To implement this, a new option is added to MessagesXx.php files,
`$minimumGroupingDigits = 2;`, with the meaning as defined in
<http://unicode.org/reports/tr35/tr35-numbers.html>. It is a little
roundabout, but it could allow us to migrate the number formatting
(currently all custom code) to some generic library easily.
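
As a rough illustration of the minimumGroupingDigits rule from the linked Unicode report (a separator is used only when at least that many digits would stand before the first separator), here is a Python sketch. The function name and the string-based input are assumptions; this is not the Language class code.

```python
def format_grouped(number, minimum_grouping_digits=1, sep=","):
    """Insert thousands separators into a plain decimal string such as "-1234.5"."""
    sign = "-" if number.startswith("-") else ""
    integer, dot, fraction = number.lstrip("-").partition(".")
    # Group only if the leading group would hold at least
    # `minimum_grouping_digits` digits (groups of three assumed).
    if len(integer) >= 3 + minimum_grouping_digits:
        groups = []
        while len(integer) > 3:
            groups.insert(0, integer[-3:])
            integer = integer[:-3]
        groups.insert(0, integer)
        integer = sep.join(groups)
    # Sign and fraction are handled outside the grouping step, so negative
    # and fractional numbers follow the same rule.
    return sign + integer + dot + fraction


print(format_grouped("1234", 2))     # 1234
print(format_grouped("12345", 2))    # 12,345
print(format_grouped("-1234.5", 2))  # -1234.5
```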

Bug: T177846
Change-Id: Iedd8de5648cf2de1c94044918626de2f96365d48
2018-01-02 11:17:25 +01:00
Umherirrender
255d76f2a1 build: Updating mediawiki/mediawiki-codesniffer to 15.0.0
Clean up use of @codingStandardsIgnore
- @codingStandardsIgnoreFile -> phpcs:ignoreFile
- @codingStandardsIgnoreLine -> phpcs:ignore
- @codingStandardsIgnoreStart -> phpcs:disable
- @codingStandardsIgnoreEnd -> phpcs:enable

For phpcs:disable, the necessary sniffs are always provided.
Some start/end pairs are changed to line ignore

Change-Id: I92ef235849bcc349c69e53504e664a155dd162c8
2018-01-01 14:10:16 +01:00
Kunal Mehta
75160bdd3b Use MediaWikiCoversValidator for tests that don't use MediaWikiTestCase
Change-Id: I8c4de7e9c72c9969088666007b54c6fd23f6cc13
2018-01-01 08:28:02 +00:00
Kunal Mehta
fc23633035 Add @covers tags to languages tests
I removed comments that merely repeated the location of the class being
tested. There are other tests in this directory that don't have a
corresponding class and need further investigation.

Change-Id: Ic16f0887b5030ac53fab4382cfaedfb5426cdb08
2017-12-28 08:52:56 +00:00
Sam Wilson
313675320f Always return a string from Language::formatNum()
It says it returns a string, and so it should.

Bug: T182277
Change-Id: Ic68c65c634c2557a1d07281623cd6c971b000323
2017-12-07 13:59:56 +08:00
tjones
a0b511319c Crimean Tatar Transliteration
This is a first pass at Latin/Cyrillic transliteration for Crimean
Tatar (crh).

Includes transliteration tables, prefix/suffix mappings, regex
mappings, and exceptions lists for words and abbreviations.

Regularize CRH language name in messages/* files.

Fix "varient" typos in qqq.json.

Add unit tests for CRH transliteration.

Bug: T23582
Change-Id: I424703f99adf837f6217872b882d1ea26bfdd068
2017-11-20 16:56:38 -05:00
Reedy
f600b4ede9 Fix phpcs issues from LanguageConverter patches
Change-Id: I34e57c90ffd40fbd9f8afe3c57dd73fa7f655841
2017-11-15 03:37:27 +00:00
Brian Wolff
fbe78cfa09 SECURITY: XSS in langconverter when regex hits pcre.backtrack_limit
Adjust regexes for what not to convert to avoid backtracking by
preferring possessive quantifiers

Add check that we really have matched to the end of the string, and
log error if the regex hits some sort of error preventing the
entire string from being matched. Should the regex not match to the
end, then language conversion is disabled for the string.
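
The second half of the fix, refusing to convert unless the scan really reached the end of the string, can be sketched in Python as follows. The chunk pattern and function name are hypothetical stand-ins, not the LanguageConverter regexes, and the possessive quantifiers from the first half are not shown.

```python
import logging
import re

# Hypothetical chunk scanner: a run of plain text, or one complete <tag>.
CHUNK = re.compile(r"[^<]+|<[^>]*>")


def convert_text(text, convert):
    """Apply `convert` to plain-text chunks only if the scan reaches the end."""
    pos = 0
    chunks = []
    while pos < len(text):
        match = CHUNK.match(text, pos)
        if match is None:
            # The scanner stalled before the end: log it and disable
            # conversion for this string by returning it untouched.
            logging.error("scan stopped at offset %d of %d", pos, len(text))
            return text
        chunks.append(match.group(0))
        pos = match.end()
    return "".join(c if c.startswith("<") else convert(c) for c in chunks)
```

For example, `convert_text("a<b>c", str.upper)` gives "A<b>C", while the unterminated input "a<b" is returned unchanged.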

Bug: T124404
Change-Id: I4f0c171c7da804e9c1508ef1f59556665a318f6a
2017-11-15 03:33:03 +00:00
Thiemo Mättig
1f2ff32cca Family name of Thiemo changed
Change-Id: I5477d02111e53790e858624c4b7c4f09dbc418fa
2017-11-14 13:59:15 +01:00
zoranzoki21
f0828ff475 Removed Toki Pona localization files
Bug: T132899
Bug: T178730
Change-Id: I4c61b3ef42cdc24fee74587965240ca08242867e
2017-10-24 21:27:47 +00:00
Bartosz Dziewoński
3f62813c51 Add test cases for digit grouping (commafy) in Polish
According to the typographical convention, a thousands separator
should not be inserted in numbers that are four digits long (between
1000 and 9999), unlike in English where it's usually acceptable.
This logic is currently implemented in LanguagePl::commafy().

Bug: T177846
Change-Id: I6dbd8febcf59000067cdd7d3c11111f2f77f4e66
2017-10-10 22:52:11 +02:00
Fomafix
ea0bd74a94 Refactor global function wfBCP47 to static function LanguageCode::bcp47
Deprecate global function wfBCP47.

Change-Id: Ie6bb061b5d6ca67289bb18bc468a87421f38fc94
2017-10-05 09:54:45 +02:00
Fomafix
55ecf3e215 Add new static function LanguageCode::replaceDeprecatedCodes
Refactor the deprecatedLanguageCodeMapping to a private variable.

Change-Id: I5f8e601e53de183e6268c9ef601eef8390b725cd
2017-08-10 15:21:59 -04:00
Liangent
d8375bee24 New language variant 'en-x-piglatin' for easier variant testing
Guarded by the $wgUsePigLatinVariant variable, off by default.

Pig Latin is a language game where words in English are altered
according to the following rules:

* Words starting with a vowel have a '-way' suffix appended.
* Words starting with a consonant have the initial consonants (or 'qu'
  group) moved to the end and an '-ay' suffix appended.

https://en.wikipedia.org/wiki/Pig_Latin

* Added 'en-x-piglatin' as a language name.
* Added 'en' to LanguageConverter::$languagesWithVariants.
* Added LanguageEn class and its corresponding EnConverter which
  provides one-way translation from English to Pig Latin.
* Some minor internal changes in code that assumed that English
  doesn't have a language class or converter.
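
The two Pig Latin rules quoted above are small enough to sketch directly. The Python below is an illustration under stated assumptions, not the EnConverter code: it treats "y" as a consonant, ignores capitalisation, and reads the hyphen in '-way' and '-ay' as notation rather than as output.

```python
import re

VOWELS = "aeiouAEIOU"
# Leading consonants, where a "qu" group moves as one unit.
LEADING = re.compile(r"((?:qu|[bcdfghjklmnpqrstvwxyz])+)(.*)", re.IGNORECASE)


def pig_latin_word(word):
    """Convert one alphabetic word."""
    if word[0] in VOWELS:
        return word + "way"
    match = LEADING.match(word)
    head, rest = match.group(1), match.group(2)
    return rest + head + "ay"


def to_pig_latin(text):
    """Convert every run of ASCII letters and leave everything else alone."""
    return re.sub(r"[A-Za-z]+", lambda m: pig_latin_word(m.group(0)), text)


print(to_pig_latin("the quick fox eats apples"))
# ethay ickquay oxfay eatsway applesway
```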

Bug: T45547
Depends-On: I1d9691c784032669979f8109c9a5f65cbf4122c9
Change-Id: I7fa2d85d6364958c5138366e8b4504a2697a8731
2017-06-12 16:59:57 -04:00
jenkins-bot
bdfa96eb72 Merge "Break up $wgDummyLanguageCodes" 2017-03-08 20:46:47 +00:00
This, that and the other
48ab87d0a3 Break up $wgDummyLanguageCodes
$wgDummyLanguageCodes is a set and mapping of different language codes:

* Renamed language codes: ['als' => 'gsw', 'bat-smg' => 'sgs',
                           'be-x-old' => 'be-tarask', 'fiu-vro' => 'vro',
                           'roa-rup' => 'rup', 'zh-classical' => 'lzh',
                           'zh-min-nan' => 'nan', 'zh-yue' => 'yue'].
  The old language codes are deprecated because they are invalid but
  should be supported for compatibility reasons for a while.
* Language codes of macro languages, which get mapped to the main
  language: ['bh' => 'bho', 'no' => 'nb'].
* Language variants which get mapped to main language:
  ['simple' => 'en'].
* Internal language codes of the private-use-area which get mapped to
  itself: ['qqq' => 'qqq', 'qqx' => 'qqx']

This is a very strange conglomeration which should get differentiated,
and was split up in the following ways:

* Renamed language codes are available from
  LanguageCode::getDeprecatedCodeMapping().
* Language codes of macro languages and the variants that are mapped to
  the main language are available as $wgExtraLanguageCodes and are set
  in DefaultSettings.php.
* Internal language codes are set in $wgDummyLanguageCodes in Setup.php.
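
To make the three-way split concrete, here is a Python sketch that uses a subset of the mappings listed above. The dictionary names and the `resolve` helper are hypothetical; in MediaWiki the data lives in LanguageCode::getDeprecatedCodeMapping(), $wgExtraLanguageCodes and $wgDummyLanguageCodes.

```python
# Subsets of the mappings listed in the commit message.
DEPRECATED_CODES = {"als": "gsw", "bat-smg": "sgs", "zh-yue": "yue"}
EXTRA_CODES = {"bh": "bho", "no": "nb", "simple": "en"}
DUMMY_CODES = {"qqq": "qqq", "qqx": "qqx"}


def resolve(code):
    """Return (category, target code) for a language code."""
    for category, table in (
        ("deprecated", DEPRECATED_CODES),
        ("extra", EXTRA_CODES),
        ("dummy", DUMMY_CODES),
    ):
        if code in table:
            return category, table[code]
    return "regular", code


print(resolve("als"))  # ('deprecated', 'gsw')
print(resolve("no"))   # ('extra', 'nb')
```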

Change-Id: If73c74ee87d8235381449cab7dcd9f46b0f23590
2017-03-08 12:11:30 -08:00
James D. Forrester
1e9c361960 tests: Replace implicit Bugzilla bug numbers with Phab ones
It's unreasonable to expect newbies to know that "bug 12345" means "Task T14345"
except where it doesn't, so let's just standardise on the real numbers.

Change-Id: I46261416f7603558dceb76ebe695a5cac274e417
2017-02-21 02:14:34 +00:00
Zhuyifei1999
0effd172ce translateBlockExpiry: Duration is block expiry minus current time
For relative timestamps in $str, strtotime( $str, $now ) returns an
absolute Unix timestamp $str since $now, and this timestamp is given
to $time. However, Language::formatDuration expects a time duration,
not an absolute timestamp. We obtain this duration from the difference
between $time, the absolute timestamp of block expiry, and $now, the
absolute timestamp of the time in which the block action happened.
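
The arithmetic is simple but easy to get wrong, so here it is spelled out in Python with hypothetical helper names. `absolute_expiry` stands in for PHP's strtotime( $str, $now ) with a relative $str; the duration is that result minus $now.

```python
def absolute_expiry(now, relative_seconds):
    """Stand-in for strtotime( $str, $now ): an absolute Unix timestamp."""
    return now + relative_seconds


def block_duration(now, relative_seconds):
    """Duration to hand to a duration formatter: expiry minus current time."""
    time = absolute_expiry(now, relative_seconds)
    return time - now


now = 1485587520  # an arbitrary Unix timestamp
print(absolute_expiry(now, 86400))  # 1485673920, an absolute time, not a duration
print(block_duration(now, 86400))   # 86400, i.e. one day
```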

Tests have been added to test both this patch and 01936fa, the patch
that caused this regression.

Bug: T156453
Change-Id: I6fd8c02dc3c6456067fe25cb9f33f5b4c78332aa
2017-01-28 07:22:00 +00:00
Amir E. Aharoni
6b03e2e88e Make the code for grammar data processing common
This makes the code for processing JSON files with
grammar transformations reusable by different languages
and applies the same logic to Russian and Hebrew.
It will be done to other languages in further patches.

This patch is not supposed to change any functionality,
and the tests are intact (except a comment in the test
for Hebrew - the class doesn't exist any longer).

PHP:
* Move the JSON grammar transformation data processing logic
  from LanguageRu.php to convertGrammar() in Language.php.
  By default all these data files are supposed to be
  processed identically, so the code should be common.
  If there is no JSON data file, nothing new happens.
* LanguageRu's own convertGrammar() method is removed.
* The LanguageHe class is removed, now that all its functionality
  is handled by generic JSON data processing in the Language class.
  LanguageHe.php file is removed from the repo and from autoloading.

JavaScript:
* Move the JSON grammar transformation data processing logic
  from ru.js to mediawiki.language.js.
* JavaScript grammar code files he.js and ru.js are removed
  from the repo and from Resources.php, because all the data
  is in JSON, and the default logic in mediawiki.language.js
  works for both languages.

Bug: T115217
Change-Id: I5e75467121c3d791bb84f9e6fdfcf07c1840f81a
2016-12-16 15:52:14 +02:00
Fomafix
7de07e8991 Update weblinks in comments from HTTP to HTTPS
Use HTTPS instead of HTTP where the HTTP link is a redirect to the HTTPS link.

Change-Id: I06d9e043730accc4ae71b927e0f8229f0fc3b340
2016-10-11 17:25:10 +00:00
Marius Hoch
9ca0f6c620 Only attempt to calculate the TTL in Language::sprintfDate if needed
Change-Id: Ifd24c9206be05bb4fd2277efc574c9d1018e1957
2016-06-23 12:36:25 +02:00
daniel
bbd518baff add LanguageTest::testEquals for Id7ed6a21c
Change-Id: I99ea4c51bfc5245eab0bcca73870c56a6fab2c43
2016-05-23 16:45:06 +02:00
Reedy
83fb19cb13 Swap the rest of array() -> []
Change-Id: I76a7259ed952a0673a1941f08b39b545211fba07
2016-03-30 22:04:58 +00:00
Reedy
b5656b6953 Many more function case mismatches
Change-Id: I5d3a5eb8adea1ecbf136415bb9fd7a162633ccca
2016-03-19 00:20:58 +00:00
Timo Tijhof
46b04ec7ae Use static::class instead of get_called_class()
Available as of PHP 5.5 and more idiomatic. Foo::class (explicit),
self::class (defined), and static::class (late bound).

Change-Id: I66937f32095a4e4ecde94ca20a935a3c3efc9cee
2016-02-29 22:43:58 +00:00
Kunal Mehta
6e9b4f0e9c Convert all array() syntax to []
Per wikitech-l consensus:
 https://lists.wikimedia.org/pipermail/wikitech-l/2016-February/084821.html

Notes:
* Disabled CallTimePassByReference due to false positives (T127163)

Change-Id: I2c8ce713ce6600a0bb7bf67537c87044c7a45c4b
2016-02-17 01:33:00 -08:00
Tim Starling
f0ba7a69a1 Add tests for LanguageConverter classes that didn't have them
Some of them don't have many test cases, or have test cases that don't
represent the ideal transliteration and so are subject to change. But
this is better than nothing.

Change-Id: I4aae693bd77d9ff365f48113923ed7f9fed8d668
2016-02-08 09:19:25 +11:00
Timo Tijhof
3b35719e74 tests: Remove unused $wgMemc resets
If we really need this we can do it in MediaWikiTestCase, next
to the setting of wgMainCacheType. But from what I can see the
code being tested here already doesn't use the old $wgMemc.

Change-Id: I9e4b2109b2f3c18d8d5551bbadae5711c1d4c0a6
2015-12-06 18:06:08 +00:00
Roan Kattouw
e4d6238c00 Language::truncate(): don't chop up multibyte characters when input contains newlines
To detect whether the truncation had chopped up a multibyte
character after the first byte, a regex was used. But in this
regex, the dot (.) didn't match newlines, so it failed to
detect chopped multibyte characters (after the first byte)
if there was a newline preceding the chopped character.
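
The effect is easy to reproduce outside PHP. The Python sketch below uses a hypothetical pattern, not the one in Language::truncate(): it looks for a UTF-8 lead byte left dangling at the end of a byte string, and only finds it past a newline when the dot is allowed to match newlines (re.DOTALL here, the /s modifier in PCRE).

```python
import re

# Hypothetical pattern: anything, then a lone UTF-8 lead byte at the end.
DANGLING = rb".*[\xc0-\xff]"


def has_dangling_lead_byte(data, dot_matches_newline):
    """Check whether the byte string ends in a chopped multibyte character."""
    flags = re.DOTALL if dot_matches_newline else 0
    return re.fullmatch(DANGLING, data, flags) is not None


# "é" is two bytes in UTF-8; dropping the last byte leaves a lone lead byte.
chopped = "line one\ncaf\u00e9".encode("utf-8")[:-1]

print(has_dangling_lead_byte(chopped, dot_matches_newline=False))  # False
print(has_dangling_lead_byte(chopped, dot_matches_newline=True))   # True
```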

Bug: T116693
Change-Id: I66e4fd451acac0a1019da7060d5a37d70963a15a
2015-10-26 20:17:37 -07:00
jenkins-bot
88081365b3 Merge "Add new grammar forms for language names in Russian" 2015-09-28 13:41:33 +00:00
Amir E. Aharoni
8b0c0b49ce Add new grammar forms for language names in Russian
CLDR provides translated language names. They are useful for showing
names by themselves in menus and lists, but it's often problematic to add them
to Russian sentences, because they need to be declined, so a message like
"This page is not available in the $1 language" is hard to localize.

This patch adds new cases for Russian -
"languagegen", "languageprep" and "languageadverb".
(The last one, as its name says, is not actually a grammatical case,
but a transformation to an adverbial expression.)
This covers most of the needs for language names that MediaWiki supports.

Change-Id: Ib6a0afa5c3736f8b9b2e121cd752c53ee50fad75
2015-09-28 15:51:24 +03:00
Amir E. Aharoni
b175f585db Update Ukrainian grammar rules and tests
* Fix the '-ти' rule to match the name of Wikiquote.
* Add tests for '-ти' and '-ник' rules.
* Remove the '-ь' and '-ка' rules, which were copied from Russian
  and are not used in Ukrainian, and remove their tests as well.
* Remove non-implemented ("stub") cases.
* Clean up the code of commafy().

Change-Id: I98647ceb8806d845f3c8150b92a5d9f7fe5866f2
2015-09-27 15:21:49 +03:00
Amir E. Aharoni
5ccbaf2c48 Update grammar rules and test for Ukrainian
The grammar rules for Ukrainian have several mistakes.
This is the first in a series of commits that fix this.

* Add grammar tests for PHP. There weren't any tests at all,
  and now there are some. No tests are added for rules that
  are wrong and irrelevant and will be removed in subsequent commits.
* Add tests for JavaScript, and update a grammar rule that was
  incorrectly copied from Russian.

Change-Id: I6de4581e2908eba39b33a13b07d048a34a3bd803
2015-09-27 11:49:07 +03:00