In d59f27aeab we made
LanguageConverter::validateVariant() try harder to convert a variant
into an acceptable MediaWiki-internal form, looking at deprecated
codes and BCP 47 aliases. However, this misled Language::hasVariant()
into thinking that bogus names (like all-uppercase strings) were
acceptable variant names, which then led to exceptions when they were
passed to the various conversion methods.
This is a belt-and-suspenders patch for T207433 -- in that case we
shouldn't have created a Language object with code 'sr-cyrl' in the
first place, but once one was created we shouldn't have tried to
ask LanguageSr to convert texts to 'sr-cyrl'. The latter problem
is fixed by this patch.
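A minimal sketch of the intended behaviour, using the Serbian variants as
an example:

    $sr = Language::factory( 'sr' );
    $sr->hasVariant( 'sr-el' );  // true: a canonical internal variant name
    $sr->hasVariant( 'SR-EL' );  // false with this patch; previously such
                                 // bogus spellings slipped through and the
                                 // conversion methods then threw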
Bug: T207433
Change-Id: Id993bc7989144b5031a551662e8e492bd23f698a
Facilitate a gradual migration away from non-standard MediaWiki language
codes. This will ensure that (a) rules can be written with standard
BCP 47 codes, and (b) rules written with existing nonstandard codes will
continue to work once these are added to
LanguageCode::$deprecatedLanguageCodeMapping.
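For illustration, entries of the kind meant here look like this (the
authoritative list lives in LanguageCode.php):

    // nonstandard MediaWiki code => standard code
    'zh-classical' => 'lzh', // Classical Chinese
    'zh-min-nan'   => 'nan', // Min Nan Chinese
    'zh-yue'       => 'yue', // Cantonese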
Change-Id: I3ba96faafaf40bd47fb5919621f7035f0431a698
The browser Accept-Language header uses BCP 47 codes, which don't
precisely match our internal MediaWiki variant names in a number of
places. Allow proper BCP 47 codes to alias our internal variants
for: Accept-Language parsing, URL parsing, user preferences, and
explicit enumeration of codes in LanguageConverter rules.
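For example, with the Serbian variants as an illustration, both of these
should now select the Cyrillic variant:

    ?variant=sr-ec      (legacy internal name)
    ?variant=sr-Cyrl    (BCP 47 alias)

and a conversion rule may enumerate the BCP 47 spellings directly:

    -{sr-Cyrl:ћирилица; sr-Latn:latinica}-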
This is a replay of an earlier merged patch,
0818070c59, which had to be reverted
because it was based on 8380f0173e which
caused regressions in the Babel extension (T199941).
Change-Id: Ica89d9547c58967747ab0fa15d4e83be5378796d
If you feed this method unescaped data, it can cause later calls
to be an XSS, which is something I think deserves a warning.
Bug: T202571
Change-Id: I34cb3da9232a22defffb80466263c2f2233822ef
"continue" statements are equivalent to "break". In PHP 7.3, will generate a warning.
Bug: T200595
Change-Id: I244ecb2e1ce5a76295f014fb1becd8d263196846
The browser Accept-Language header uses BCP 47 codes, which don't
precisely match our internal MediaWiki variant names in a number of
places. Allow proper BCP 47 codes to alias our internal variants
for: Accept-Language parsing, URL parsing, user preferences, and
explicit enumeration of codes in LanguageConverter rules.
Change-Id: I8468a56d5b88f5786abd0a17b67bda2f1687fd0c
Clean up use of @codingStandardsIgnore
- @codingStandardsIgnoreFile -> phpcs:ignoreFile
- @codingStandardsIgnoreLine -> phpcs:ignore
- @codingStandardsIgnoreStart -> phpcs:disable
- @codingStandardsIgnoreEnd -> phpcs:enable
For phpcs:disable, the necessary sniffs are always provided.
Some start/end pairs are changed to a single-line phpcs:ignore.
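For example (the sniff names below are only illustrative):

    // Before:
    // @codingStandardsIgnoreStart
    // ...
    // @codingStandardsIgnoreEnd

    // After, with the relevant sniff named explicitly:
    // phpcs:disable Generic.Files.LineLength
    // ...
    // phpcs:enable Generic.Files.LineLength

    // Or, for a single line:
    // phpcs:ignore Generic.Files.LineLength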
Change-Id: I92ef235849bcc349c69e53504e664a155dd162c8
This is a first pass at Latin/Cyrillic transliteration for Crimean
Tatar (crh).
Includes transliteration tables, prefix/suffix mappings, regex
mappings, and exceptions lists for words and abbreviations.
Regularize CRH language name in messages/* files.
Fix "varient" typos in qqq.json.
Add unit tests for CRH transliteration.
Bug: T23582
Change-Id: I424703f99adf837f6217872b882d1ea26bfdd068
This fixes an issue in f21f3942 where, if there was an HTML
element with an alt or title attribute containing an <
entity, an ASCII EOT control character (0x04) could be
inserted into the text if language converter was enabled.
Due to a really old bug in language converter, self-closed tags
got turned into non-self-closed tags. However, due to a different
bug which was fixed in f21f3942, this code path was rarely taken,
so nobody noticed until now.
Follow-up Idbc45cac12
Bug: T180552
Change-Id: I077d30c50fcb419837fef937d27caca307153d2d
Previously, if one had an attribute with the contents
"-{}-foo-{}-", foo would get replaced by language converter as if
it wasn't in an attribute. This led to an XSS vulnerability.
This breaks doing manual conversions in URL hrefs (or any
other attribute that goes through an escaping method
other than Sanitizer's). e.g. http://{sr-el:foo';sr-ec:bar}.com
won't work anymore. See also T87332
Bug: T119158
Change-Id: Idbc45cac12c309b0ccb4adeff6474fa527b48edb
Adjust the regexes for what not to convert to avoid backtracking,
by preferring possessive quantifiers.
Add a check that we really have matched to the end of the string,
and log an error if the regex hits some sort of error preventing
the entire string from being matched. Should the regex not match
to the end, language conversion is disabled for that string.
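Illustrative only (not the actual pattern from this patch): on an
unterminated quoted span the greedy form retries every shorter match
before failing, while the possessive form gives nothing back and fails
immediately.

    preg_match( '/"[^"]*"/',  $input );  // greedy: backtracks char by char
    preg_match( '/"[^"]*+"/', $input );  // possessive: no backtracking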
Bug: T124404
Change-Id: I4f0c171c7da804e9c1508ef1f59556665a318f6a
Example implementation using this hook: wikiHow's ChineseVariantSelector
extension, installed on zh.wikihow.com, which uses cookies to store the
preferred language variant, allowing anonymous users to change the
language variant without registering/logging in.
Change-Id: I5295a26578b45a8d51f2b7550938088fec18404f
Make the LanguageConverter::reloadTables method actually private,
and use the TestingAccessWrapper to call it when running parser tests.
Follow-up to I65736520cd04bfe8949b29ade07338a6e1b88a4d.
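The test-side pattern is roughly this (a sketch; the exact plumbing in the
parser test runner differs):

    $wrapper = TestingAccessWrapper::newFromObject( $wgContLang->mConverter );
    $wrapper->reloadTables();  // callable here even though the method is private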
Change-Id: I43b81b8fef6441ad50b858ff7757732ecb5eef91
Conversion rules defined in a previous test case were leaking into
subsequent test cases. Existing tests had worked around this by defining
non-overlapping rules, but it's better to just fix the problem at the
source.
Change-Id: I65736520cd04bfe8949b29ade07338a6e1b88a4d
Guarded by the $wgUsePigLatinVariant variable, off by default.
Pig Latin is a language game where words in English are altered
according to the following rules:
* Words starting with a vowel have a '-way' suffix appended.
* Words starting with a consonant have the initial consonants (or 'qu'
group) moved to the end and an '-ay' suffix appended.
https://en.wikipedia.org/wiki/Pig_Latin
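A minimal sketch of the word-level transform described above (not the
actual EnConverter implementation, which also handles capitalisation and
markup):

    function pigLatinWord( $word ) {
        if ( preg_match( '/^[aeiou]/i', $word ) ) {
            return $word . 'way';
        }
        // Move the leading consonants (or 'qu' group) to the end, add 'ay'.
        return preg_replace( '/^([^aeiou]*qu|[^aeiou]+)(.*)$/i', '$2$1', $word ) . 'ay';
    }
    // e.g. 'apple' => 'appleway', 'duck' => 'uckday', 'quick' => 'ickquay'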
* Added 'en-x-piglatin' as a language name.
* Added 'en' to LanguageConverter::$languagesWithVariants.
* Added LanguageEn class and its corresponding EnConverter which
provides one-way translation from English to Pig Latin.
* Some minor internal changes in code that assumed that English
doesn't have a language class or converter.
Bug: T45547
Depends-On: I1d9691c784032669979f8109c9a5f65cbf4122c9
Change-Id: I7fa2d85d6364958c5138366e8b4504a2697a8731
U+0000 is not allowed in HTML5, so there's no reason to allow it in
wikitext. It simplifies our code if we can just strip such characters at
the start. Strip them in PST as well so they don't sneak into our
database either.
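The stripping itself is trivial; a sketch of the idea:

    $text = str_replace( "\x00", '', $text );  // drop any U+0000 outright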
Tweaked the EXT_LINK URLs to account for the fact that invalid characters
get transformed into U+FFFD when using Preprocessor_DOM. See 73649741ed
(r65967) for context on that change.
Bug: T159174
Change-Id: I3f67e92b61aacc87a40c3662085c84d1dac08bfb
It's unreasonable to expect newbies to know that "bug 12345" means "Task T14345"
except where it doesn't, so let's just standardise on the real numbers.
Change-Id: Id2f9d229d17b8eee66b2ca4e3927f3f66ac62988
I was bored. What? Don't look at me that way.
I mostly targeted mixed tabs and spaces, but others were not spared.
Note that some of the whitespace changes are inside HTML output,
extended regexps or SQL snippets.
Change-Id: Ie206cc946459f6befcfc2d520e35ad3ea3c0f1e0
A "remove HTML tags to avoid disrupting the layout" block is removed
(previously added in f16d1e4ed7).
This is a follow-up to I9b099273203482ffb570a5654d8ba50c833e526d.
Bug: T54192
Change-Id: I565fac58b3b0da7bfaedf64f5001c364f52e2244
Ideally LanguageConverter shouldn't be relying on global state at all.
But as a first step let's make it not try to use the global state when
that global state isn't even there.
Bug: T127233
Change-Id: I391cef3ec211d648b078fc509e0139daa58eb875
I searched for /\$(\S+) = (.+?\(.*?\);)\n.*?\$\1\[/, ignored
everything involving isset(), unset() or array assignments, then
skimmed through the remaining results and changed things where they
made sense. These changes were not automated, so please review them.
Change-Id: Ib37b4c66fc57648470f151ad412210b3629c2538
Previously various language objects would install a hook to update the
shared conversion table cache when the object was constructed. This is
not a good idea since language objects may be constructed even when they
are not the content language, but only the content language is
associated with variant conversion and the conversion cache.
Instead, have WikiPage call a method on $wgContLang directly. I put this
alongside the message cache update, since the logic is almost identical.
Change-Id: Ief9c0ef993e39645e74a6e158cb4e6e2139ce91d
* This can avoid MessageCache::load() calls on another
language due to variants. The convertNamespace() method
takes up a significant amount of time for 404 pages.
Change-Id: I4551d5b8e5b5a0bc01d02702b80f93591fc19440
Generating one-time, unique strip markers hurts us in multiple ways:
* The strip marker regexes don't benefit from JIT compilation, so they are
slower to execute than they could be.
* Although the regexes don't benefit from JIT compilation, they are still
compiled, because HHVM bets on regexes getting reused. This extra work is
fairly costly (1-2% of CPU usage on the app servers) and doesn't pay off.
* The size of the PCRE JIT cache is finite, and caching one-off regexes
displaces regexes that are in fact reused.
Tim's preferred solution (per his review comment on
https://gerrit.wikimedia.org/r/167530/) is to use fixed strip markers.
So:
* Replace usage of $parser->mUniqPrefix with Parser::MARKER_PREFIX, which
complements the existing Parser::MARKER_SUFFIX (see the sketch after this
list).
* Deprecate Parser::mUniqPrefix and its accessor, Parser::uniqPrefix().
* Deprecate Parser::getRandomString(), since it is no longer useful.
* In Preprocessor_*::preprocessToObj() and Parser::fetchTemplateAndTitle(),
replace any occurrences of \x7f with '?', to prevent strip marker forgery.
\x7f is not valid input anyway.
* Deprecate the $prefix parameter for StripState::__construct, since a custom
prefix may no longer be specified.
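For illustration (the $id below is hypothetical), a marker is now built
from the fixed constants, and any \x7f in the input is neutralised:

    $marker = Parser::MARKER_PREFIX . $id . Parser::MARKER_SUFFIX;
    $text = strtr( $text, "\x7f", '?' );  // prevent strip marker forgery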
Change-Id: I31d4556bbb07acb72c33fda335fa5a230379a03f