Thijs/wiki.techinc.nl

Author	SHA1	Message	Date
C. Scott Ananian	fcbde8ae4e	Make Language::hasVariant() more strict In `d59f27aeab` we made LanguageConverter::validateVariant() try harder to convert a variant into an acceptable MediaWiki-internal form, looking at deprecated codes and BCP 47 aliases. However, this misled Language::hasVariant() into thinking that bogus names (like all-uppercase strings) were acceptable variant names, which then led to exceptions when they were passed to the various conversion methods. This is a belt-and-suspenders patch for T207433 -- in that case we shouldn't have created a Language object with code 'sr-cyrl' in the first place, but once one was created we shouldn't have tried to ask LanguageSr to convert texts to 'sr-cyrl'. The latter problem is fixed by this patch. Bug: T207433 Change-Id: Id993bc7989144b5031a551662e8e492bd23f698a	2018-10-22 16:35:26 -04:00
C. Scott Ananian	103a4f76dc	Deprecate $wgFixArabicUnicode / $wgFixMalayalamUnicode These were introduced in MW 1.17 and are always true in production. They were useful to allow folks to defer title conversion, but it's been a long time now. We don't need to make this optional any more. Change-Id: I65dcfe80dc3e1dfeb4d63924a8928655e012a20c	2018-10-21 21:55:39 -04:00
jenkins-bot	690f563edc	Merge "Accept BCP 47 codes as aliases for nonstandard variants"	2018-10-11 20:46:42 +00:00
jenkins-bot	64ef09d6a8	Merge "Ensure LanguageCode::bcp47() returns a valid BCP 47 language code"	2018-10-11 20:46:35 +00:00
C. Scott Ananian	d59f27aeab	Accept BCP 47 codes as aliases for nonstandard variants The browser Accept-Language header uses BCP 47 codes, which don't precisely match our internal mediawiki variant names in a number of places. Allow proper BCP 47 codes to alias our internal variants for: Accept-Language parsing, URL parsing, user preferences, and explicit enumeration of codes in LanguageConverter rules. This is a replay of an earlier merged patch, `0818070c59`, which had to be reverted because it was based on `8380f0173e` which caused regressions in the Babel extension (T199941). Change-Id: Ica89d9547c58967747ab0fa15d4e83be5378796d	2018-10-11 02:23:20 -04:00
C. Scott Ananian	21ead7a98d	Ensure LanguageCode::bcp47() returns a valid BCP 47 language code MediaWiki uses a number of nonstandard codes which do not validate according to the IANA language subtag registry. Some of them have the wrong semantics entirely: MediaWiki's `sr-ec` variant maps to BCP 47 `sr-EC` which is "Serbian as used in Ethiopia" (!). Extend LanguageCode::bcp47() to map our nonstandard codes to valid BCP 47 language codes. Export the mapping so that it can be used in JavaScript's corresponding mw.language.bcp47() implementation as well, and return the standard BCP 47 codes in the siteinfo API. Thanks to TheDJ (I10b4473c7e53f027812bbccf26bb47aec15fddfd) and Fomafix (I93efc190714ba76247d30ba49fc21ae872fc3555) for previous attempts at this! Also removed a fixme for the name of 'Twi', dating back to 2004 (`f59c3be23b`) -- checking tw.wikipedia.org it certainly appears that the autonym of 'Twi' is correctly 'Twi'. Tracking bugs for invalid language codes are T125073 and T145535. Discussion of zh-XX => zh-HanX-XX mapping is at T198419. This is a replay of an earlier merged patch, `8380f0173e`, which had to be reverted because it caused regressions in the Babel extension (T199941). Bug: T34483 Bug: T106367 Bug: T120847 Depends-On: I27a5b8e45b34c6b57c1b612b11548001c88cd483 Change-Id: Iebbc604af21d7f2af9c1f1ab2574cb5f309bf6ed	2018-10-11 01:53:54 -04:00
Kunal Mehta	a4e8bea57d	tests: Add helper function for ini_set with automatic cleanup Some tests need to change the value of an ini setting, and typically implement cleanup handling themselves, usually imperfectly. Provide a helper function, $this->setIniSetting(), which will take care of teardown in the same way that $this->setMwGlobals() does. Change-Id: I7be4198592f0aaf73a28d3c60acb307a918b1a1f	2018-10-10 22:31:37 -07:00
Fomafix	5632815976	Write Latin and other scripts with capital letter Change-Id: I16c660e54191b63cd6eb3407cb00504665930c4e	2018-10-05 18:49:08 +02:00
Fomafix	50944a1410	Deprecate Language::setCode as public method setCode changes the language code for the Language object but it also replaces the whole language codes for all Language objects. > $lang = Language::factory( 'fr' ) > $lang2 = Language::factory( 'fr' ) > $lang->setCode( 'it' ) > print $lang2->getCode() it > $lang3 = Language::factory( 'fr' ) > print $lang3->getCode() it Better assign a new Language object. Also add more tests for Language::equals. Depends-On: I61439bac82021344c3f9a6056cccd937b3450af2 Depends-On: I2d9e551d6eb33f28f42aeaf48160eba21b83881f Change-Id: I201b479f58e63c9c40fb8a3ec9575a551fb35235	2018-10-02 23:48:53 -07:00
Timo Tijhof	dbe89abb9e	languages: Add coverage for 'ar' and 'ml' normalize() * Exclude the data files from PHPUnit coverage. * Add tests covering the normalize() implementations. * Fix a small todo about using data providers. * Set explicit visibility. Change-Id: Ib104cc3215a36901cff853ad5969d92a6e0cf6a0	2018-08-14 23:19:35 +00:00
Aryeh Gregor	90d4f56fe4	Mass conversion of $wgContLang to service Brought to you by vim macros. Bug: T200246 Change-Id: I79e919f4553e3bd3eb714073fed7a43051b4fb2a	2018-08-11 22:44:29 -06:00
Aryeh Gregor	63d7f2ad13	Automatically reset namespace caches when needed This avoids error-prone code written separately in every test. In addition to no existing tests resetting the TitleFormatter (more services probably need to be reset as well), they mostly reset only the namespace cache on $wgContLang, which wouldn't help for any other language. The parser test runner still doesn't do this, but maybe it should. Change-Id: I44b7a1aec48f14b0950907fa14bd0df80f674296	2018-08-01 16:30:08 +03:00
Aryeh Gregor	355e21590a	Use setContentLang() instead of setMwGlobals() This changes behavior in some tests by making them set $wgLanguageCode as well as $wgContLang, but that seems like a good thing. Bug: T200246 Change-Id: I936888f46ff9fefe2707efba837e2ce3a7ca5e3f	2018-07-26 11:35:58 +00:00
Greg Grossmeier	b302b0cd1c	Revert "Ensure LanguageCode::bcp47() returns a valid BCP 47 language code" This reverts commit `8380f0173e`. Reason for revert: Caused T199941 Bug: T199941 Change-Id: I93af756a2d70d6bc91f828fe6ac19bf10ca8788f	2018-07-23 17:27:23 +00:00
Greg Grossmeier	dc282a46d7	Revert "Accept BCP 47 codes as aliases for nonstandard variants" This reverts commit `0818070c59`. Reason for revert: Caused T199941 Bug: T199941 Change-Id: I24c178eb33890477de79cbb3122861c140578011	2018-07-23 16:44:55 +00:00
C. Scott Ananian	0818070c59	Accept BCP 47 codes as aliases for nonstandard variants The browser Accept-Language header uses BCP 47 codes, which don't precisely match our internal mediawiki variant names in a number of places. Allow proper BCP 47 codes to alias our internal variants for: Accept-Language parsing, URL parsing, user preferences, and explicit enumeration of codes in LanguageConverter rules. Change-Id: I8468a56d5b88f5786abd0a17b67bda2f1687fd0c	2018-07-13 17:43:20 -04:00
C. Scott Ananian	8380f0173e	Ensure LanguageCode::bcp47() returns a valid BCP 47 language code MediaWiki uses a number of nonstandard codes which do not validate according to the IANA language subtag registry. Some of them have the wrong semantics entirely: MediaWiki's `sr-ec` variant maps to BCP 47 `sr-EC` which is "Serbian as used in Ethiopia" (!). Extend LanguageCode::bcp47() to map our nonstandard codes to valid BCP 47 language codes. Export the mapping so that it can be used in JavaScript's corresponding mw.language.bcp47() implementation as well. Thanks to TheDJ (I10b4473c7e53f027812bbccf26bb47aec15fddfd) and Fomafix (I93efc190714ba76247d30ba49fc21ae872fc3555) for previous attempts at this! Also removed a fixme for the name of 'Twi', dating back to 2004 (`f59c3be23b`) -- checking tw.wikipedia.org it certainly appears that the autonym of 'Twi' is correctly 'Twi'. Tracking bugs for invalid language codes are T125073 and T145535. Discussion of zh-XX => zh-HanX-XX mapping is at T198419. Bug: T34483 Bug: T106367 Bug: T120847 Change-Id: I807dd55d49e9bd19443329231326a5b0d3e6c453	2018-07-13 14:56:18 -04:00
jenkins-bot	8c96aec32c	Merge "Fix the bug for dates between 1912 and 1941 in Thai language"	2018-07-10 08:55:56 +00:00
Kunal Mehta	4acb7ed51c	Add @coversNothing to tests that don't cover specific PHP classes Change-Id: Idbd364561bc28547e9fac20d7a80b9a44edf14a9	2018-06-12 13:27:40 -07:00
jenkins-bot	e602b197ab	Merge "(y)etsin fixes, test refactoring, and misc fixes"	2018-06-08 20:46:12 +00:00
Bartosz Dziewoński	0313128b10	Use PHP 7 "\u{NNNN}" Unicode codepoint escapes in string literals In cases where we're operating on text data (and not binary data), use e.g. "\u{00A0}" to refer directly to the Unicode character 'NO-BREAK SPACE' instead of "\xc2\xa0" to specify the bytes C2h A0h (which correspond to the UTF-8 encoding of that character). This makes it easier to look up those mysterious sequences, as not all are as recognizable as the no-break space. This is not enforced by PHP, but I think we should write those in uppercase and zero-padded to at least four characters, like the Unicode standard does. Note that not all "\xNN" escapes can be automatically replaced: * We can't use Unicode escapes for binary data that is not UTF-8 (e.g. in code converting from legacy encodings or testing the handling of invalid UTF-8 byte sequences). * '\xNN' escapes in regular expressions in single-quoted strings are actually handled by PCRE and have to be dealt with carefully (those regexps should probably be changed to use the /u modifier). * "\xNN" referring to ASCII characters ("\x7F" and lower) should probably be left as-is. The replacements in this commit were done semi-manually by piping the existing "\xNN" escapes through the following terrible Ruby script I devised: chars = eval('"' + ARGV[0] + '"').force_encoding('utf-8') puts chars.split('').map{\|char\| '\\u{' + char.ord.to_s(16).upcase.rjust(4, '0') + '}' }.join('') Change-Id: Idc3dee3a7fb5ebfaef395754d8859b18f1f8769a	2018-06-04 16:20:13 +00:00
Bartosz Dziewoński	4fd27f006f	Use PHP 5.6 '**' operator instead of 'pow()' function Change-Id: Ieb22e1dbfcffaa4e7b3dcfabbcc999e5dd59a4bf	2018-05-30 18:05:19 -07:00
tjones	669d1ed192	(y)etsin fixes, test refactoring, and misc fixes * Fix etsin/етсин/этсин as noted in If933fc67845ac994d9ddfdf8349aff445ec9b13a ** only convert tsin to тсин and let the other rules sort out the e * Refactor most tests to be word-specific, which uncovered a couple of bugs in corner cases ** rol/üst prefix matches should match whole words (original [^ü] regex assumed word could not be end of string * Fixed incidental bugs I noticed while looking into the items above куркчи => kürkçi was in the wrong section cönk => джонк was in the right section, but reversed * Added additional tests cases for all of the above. Change-Id: Ia96be488a7b41c3ddba623b5c9262703b1c82687	2018-05-29 14:30:04 -04:00
tjones	cbb07cdc33	Crimean Tatar/crh transliteration odds and ends * refactor '\b' into WB const to make it easy to update in the future * add new ц-related exceptions Bug: T193764 Change-Id: Ib707136f8f2598d1f8ec995bf129b436dfb53cd9	2018-05-22 14:59:55 -04:00
C. Scott Ananian	685eba4360	Minor fixes to CRH language conversion. * Move a many-to-one mapping from the L2C to the C2L table where it belongs. * Fix some regular expression patterns which ended up with misnumbered replacement strings. * All regular expressions should have the `u` (unicode) flag set. * Typo/spelling fixes in comments Change-Id: If933fc67845ac994d9ddfdf8349aff445ec9b13a	2018-05-12 14:37:09 -04:00
superyetkin	3aaa2367b2	Fix the bug for dates between 1912 and 1941 in Thai language Added an if-else block to see if the parameters passed to the function designate a year between 1912 and 1941 or not. Resulting month values are also adjusted. Added a unit test for the related formatting. Bug: T68648 Change-Id: Ic676b5c140de8878971a786a1a1811770a848016	2018-05-12 15:10:13 +00:00
tjones	14f8dc35db	CRH Transliteration Pattern Matching Fixes Refactor to match exceptions as patterns, not words - break exception list to C2L and L2C pattern sets - change main loop to break only on Roman numerals and transliterate everything else, rather than tokenizing on single-script words (this fixes the km² problem, too) - update word anchors from ^ and $ to \b - only process Roman numerals for L2C translit - add exception for single "Roman" character followed by a period which looks like an initial - consolidate multi-step transliteration into regsConverter() - remove regex support from main exception list to support strtr() - re-organize some prefix/suffix/whole word patterns to the right place - add tests for recently fixed use cases - add support for many-to-one mappings in both directions - update character classes, exception lists, and regexes based on speaker feedback and example texts Misc other fixes: - fix some character classes errors - remove unneeded character classes - add tests for Roman numerals and quotes - add tests for affixes and regexes Bug: T188321 Bug: T189512 Change-Id: I056d36ff2b8f63b3998a5d3a442d8d539c15488d	2018-04-27 19:17:51 -04:00
jenkins-bot	a6abe2ad7a	Merge "Add Russian grammar forms to support Wikiversity"	2018-03-14 08:37:27 +00:00
jenkins-bot	3c198b9dc8	Merge "Fix table loading bug for CRH transliteration"	2018-02-28 21:09:01 +00:00
tjones	70dede013c	Fix table loading bug for CRH transliteration In production, the regex and exception tables were not being loaded, resulting in very poor transliteration. The loading has been moved to the constructor, similar to the implementation of the Kazakh transliteration. Also, a bug in the mappings for Ö/ö -> Ё/ё and Ü/ü -> Ю/ю has been fixed. Test cases for specific additional examples have been added. (Though it is worth noting that the regex and exception tables did load properly during unit testing, so the problem wasn't caught there.) Bug: T186727 Change-Id: I6bacee7d9de6f4a870a8a9ef1f04b819ad489c02	2018-02-26 13:22:04 -05:00
Amire80	398e2a7c9d	Add Russian grammar forms to support Wikiversity Change-Id: I70fcb03db62307116ec96d4c242e6796534b57a1	2018-02-26 14:18:01 +02:00
Fomafix	7855ec8385	SpecialPageAliasTest: Fix arguments of Language::fetchLanguageNames Language::fetchLanguageNames( 'mwfile' ) means all languages with the default filter 'mw' and names in the language 'mwfile'. Language::fetchLanguageNames( null, 'mwfile' ) means all languages with the filter 'mwfile' and names in the default language. This change removes the test for the language codes: * aa * als * bat-smg * be-x-old * cho * fiu-vro * ho * hz * kj * kr * mh * mus * ng * no * rn * roa-rup * shi-latn * shi-tfng * simple * tum * uz-cyrl * uz-latn * zh-classical * zh-min-nan * zh-yue Change-Id: I7266a67e37862daf863d1565d84cfeebaf5cb680	2018-02-25 13:31:43 +01:00
jenkins-bot	e46d0694ac	Merge "Truncate tag filter descriptions"	2018-02-21 12:52:23 +00:00
Umherirrender	63d96c15fd	build: Updating mediawiki/mediawiki-codesniffer to 16.0.0 Change-Id: I59b59f79bbf3ce4feff3b3a20c1c31bc16370531	2018-02-17 13:29:13 +01:00
petarpetkovic	2d2575852c	Truncate tag filter descriptions Introduce truncateInternal() method in Language class, based on existing truncate() method. New method abstracts string truncation, allowing users to specify callable functions for text length measurement and string truncation. New method, truncateInternal(), is used to provide two options for text truncation: * For DB usage: truncateForDatabase() method is truncating text by number of bytes. * For UI usage: truncateForVisual() method is truncating text by number of characters, using multibyte string PHP methods. Old truncate() method is deprecated and just returns the results of truncateForDatabase() method. Newly introduced truncateForVisual() method is used for truncation of long tag descriptions in RCFilters menu. Bug: T179626 Change-Id: Ib01a8c303304064dde3ce983b817d93a88a5affd	2018-02-09 22:45:20 +01:00
Timo Tijhof	bee9f4db96	Remove various redundant '@license' tags in file headers Redundant given this is the project-wide license already, especially in file headers that already include the GPL license header. This and other minor fixups based on feedback from Ie0cea0ef5027c7e5. * Add @file where missing. * Move @ingroup and @deprecated from file to class doc where needed. Change-Id: I7067abb7abee1f0c238cb2536e16192e946d8daa	2018-01-12 18:15:11 +00:00
Bartosz Dziewoński	eb6bb6b7b9	Generalize non-digit-grouping of four-digit numbers In some languages it's conventional not to insert a thousands separator in numbers that are four digits long (1000-9999). Rather than copy-paste the custom code to do this between 13 files, introduce another option and have the base Language class handle it. This also fixes an issue in several languages where this logic previously would not work for negative or fractional numbers. To implement this, a new option is added to MessagesXx.php files, `$minimumGroupingDigits = 2;`, with the meaning as defined in <http://unicode.org/reports/tr35/tr35-numbers.html>. It is a little roundabout, but it could allow us to migrate the number formatting (currently all custom code) to some generic library easily. Bug: T177846 Change-Id: Iedd8de5648cf2de1c94044918626de2f96365d48	2018-01-02 11:17:25 +01:00
Umherirrender	255d76f2a1	build: Updating mediawiki/mediawiki-codesniffer to 15.0.0 Clean up use of @codingStandardsIgnore - @codingStandardsIgnoreFile -> phpcs:ignoreFile - @codingStandardsIgnoreLine -> phpcs:ignore - @codingStandardsIgnoreStart -> phpcs:disable - @codingStandardsIgnoreEnd -> phpcs:enable For phpcs:disable always the necessary sniffs are provided. Some start/end pairs are changed to line ignore Change-Id: I92ef235849bcc349c69e53504e664a155dd162c8	2018-01-01 14:10:16 +01:00
Kunal Mehta	75160bdd3b	Use MediaWikiCoversValidator for tests that don't use MediaWikiTestCase Change-Id: I8c4de7e9c72c9969088666007b54c6fd23f6cc13	2018-01-01 08:28:02 +00:00
Kunal Mehta	fc23633035	Add @covers tags to languages tests I removed comments that merely repeated the location of the class being tested. There are other tests in this directory that don't have a corresponding class and need further investigation. Change-Id: Ic16f0887b5030ac53fab4382cfaedfb5426cdb08	2017-12-28 08:52:56 +00:00
Sam Wilson	313675320f	Always return a string from Language::formatNum() It says it returns a string, and so it should. Bug: T182277 Change-Id: Ic68c65c634c2557a1d07281623cd6c971b000323	2017-12-07 13:59:56 +08:00
tjones	a0b511319c	Crimean Tatar Transliteration This is a first pass at Latin/Cyrillic transliteration for Crimean Tatar (crh). Includes transliteration tables, prefix/suffix mappings, regex mappings, and exceptions lists for words and abbreviations. Regularize CRH language name in messages/* files. Fix "varient" typos in qqq.json. Add unit tests for CRH transliteration. Bug: T23582 Change-Id: I424703f99adf837f6217872b882d1ea26bfdd068	2017-11-20 16:56:38 -05:00
Reedy	f600b4ede9	Fix phpcs issues from LanguageConverter patches Change-Id: I34e57c90ffd40fbd9f8afe3c57dd73fa7f655841	2017-11-15 03:37:27 +00:00
Brian Wolff	fbe78cfa09	SECURITY: XSS in langconverter when regex hits pcre.backtrack_limit Adjust regexes for what not to convert to avoid backtracking by preferring possessive quantifiers. Add check that we really have matched to the end of the string, and log error if the regex hits some sort of error preventing the entire string from being matched. Should the regex not match to the end, then language conversion is disabled for the string. Bug: T124404 Change-Id: I4f0c171c7da804e9c1508ef1f59556665a318f6a	2017-11-15 03:33:03 +00:00
Thiemo Mättig	1f2ff32cca	Family name of Thiemo changed Change-Id: I5477d02111e53790e858624c4b7c4f09dbc418fa	2017-11-14 13:59:15 +01:00
zoranzoki21	f0828ff475	Removed Toki Pona localization files Bug: T132899 Bug: T178730 Change-Id: I4c61b3ef42cdc24fee74587965240ca08242867e	2017-10-24 21:27:47 +00:00
Bartosz Dziewoński	3f62813c51	Add test cases for digit grouping (commafy) in Polish According to the typographical convention, a thousands separator should not be inserted in numbers that are four digits long (between 1000 and 9999), unlike in English where it's usually acceptable. This logic is currently implemented in LanguagePl::commafy(). Bug: T177846 Change-Id: I6dbd8febcf59000067cdd7d3c11111f2f77f4e66	2017-10-10 22:52:11 +02:00
Fomafix	ea0bd74a94	Refactor global function wfBCP47 to static function LanguageCode::bcp47 Deprecate global function wfBCP47. Change-Id: Ie6bb061b5d6ca67289bb18bc468a87421f38fc94	2017-10-05 09:54:45 +02:00
Fomafix	55ecf3e215	Add new static function LanguageCode::replaceDeprecatedCodes Refactor the deprecatedLanguageCodeMapping to a private variable. Change-Id: I5f8e601e53de183e6268c9ef601eef8390b725cd	2017-08-10 15:21:59 -04:00
Liangent	d8375bee24	New language variant 'en-x-piglatin' for easier variant testing Guarded by the $wgUsePigLatinVariant variable, off by default. Pig Latin is a language game where words in English are altered according to the following rules: * Words starting with a vowel have a '-way' suffix appended. * Words starting with a consonant have the initial consonants (or 'qu' group) moved to the end and an '-ay' suffix appended. https://en.wikipedia.org/wiki/Pig_Latin * Added 'en-x-piglatin' as a language name. * Added 'en' to LanguageConverter::$languagesWithVariants. * Added LanguageEn class and its corresponding EnConverter which provides one-way translation from English to Pig Latin. * Some minor internal changes in code that assumed that English doesn't have a language class or converter. Bug: T45547 Depends-On: I1d9691c784032669979f8109c9a5f65cbf4122c9 Change-Id: I7fa2d85d6364958c5138366e8b4504a2697a8731	2017-06-12 16:59:57 -04:00
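
The two Pig Latin rules listed in commit `d8375bee24` can be sketched as follows. This is a minimal Python illustration of the rules as the commit message states them, not the PHP `EnConverter` from that commit. The function names are hypothetical, the suffixes are appended without a literal hyphen, and words are lowercased for simplicity; all of these are assumptions.

```python
import re

# Leading consonant cluster, with "qu" treated as a single unit (assumption:
# this is how the "'qu' group" in the commit message is meant to behave).
_ONSET = re.compile(r"^((?:qu|[^aeiou])+)(.*)$")
_WORD = re.compile(r"[A-Za-z]+")


def pig_latin_word(word: str) -> str:
    """Convert one alphabetic word; hypothetical helper, not a MediaWiki API."""
    w = word.lower()
    if w[0] in "aeiou":
        # Rule 1: words starting with a vowel get a "way" suffix.
        return w + "way"
    # Rule 2: move the initial consonants (or "qu" group) to the end, add "ay".
    onset, rest = _ONSET.match(w).groups()
    return rest + onset + "ay"


def to_pig_latin(text: str) -> str:
    """Apply the word rule to every run of ASCII letters, leaving the rest as-is."""
    return _WORD.sub(lambda m: pig_latin_word(m.group(0)), text)
```

For example, `to_pig_latin("hello apple")` applies the consonant rule to the first word and the vowel rule to the second. The conversion is one-way, matching the commit's description of a one-way translation from English to Pig Latin.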

224 commits