And also update approximated counts, which for the most part are lower
than reported (hooray!)
Bug: T231636
Depends-On: Ica50297ec7c71a81ba2204f9763499da925067bd
Change-Id: I78354bf5f0c831108c8f606e50c87cf6bc00d8bd
The regular expression used by LanguageConverter::autoConvert() is a
constant, but it is being created on-the-fly by every invocation.
This causes an expensive full-string comparison when the compiled
regular expression is fetched from the cache -- since the regex is 332
bytes long, the time taken for this comparison can add up quickly: on a
page with a lot of tags, the regexp cache may spend more time looking
up the regexp than it takes to execute it.
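
As a rough illustration of the idea (not the actual PHP change), the sketch below builds a pattern once and reuses the compiled object on every call. The pattern and the function name are hypothetical stand-ins; the real 332-byte regex is not reproduced here.

```python
import re

# Hypothetical stand-in pattern, compiled once at import time instead of
# being rebuilt (and looked up in a cache by its full text) on every call.
_SKIP_RE = re.compile(r"<[^>]*>|&[a-zA-Z#][a-z0-9]*;")


def count_skipped_spans(text):
    """Count the tags and entities matched by the precompiled pattern."""
    return sum(1 for _ in _SKIP_RE.finditer(text))
```

Holding the compiled object in a constant means each call skips both the string assembly and the cache lookup keyed on the pattern text.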
Bug: T223969
Change-Id: I53c3e631e47a791cf3f0844dd79d4357605c59e3
We were concatenating a single character to the end of the wikitext
source (which copies the entire string) every time through an inner
loop; when the page was large and the loop count was large this took
an excessive amount of time.
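
A minimal sketch of the cost pattern described above, assuming a hypothetical terminator character and helper names that are not taken from the MediaWiki source. The slow version copies the whole string on every iteration; the fast version appends once, outside the loop.

```python
SENTINEL = "\n"  # hypothetical terminator, chosen only for illustration


def scan_slow(source, positions):
    """Append the sentinel inside the loop: one full copy per iteration."""
    hits = []
    for pos in positions:
        padded = source + SENTINEL  # copies the entire source each time
        hits.append(padded[pos])
    return hits


def scan_fast(source, positions):
    """Append the sentinel once, then reuse the padded string."""
    padded = source + SENTINEL  # single copy
    return [padded[pos] for pos in positions]
```

Both functions return the same result; only the number of full-string copies differs, which is what matters when both the page and the loop count are large.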
Bug: T223969
Change-Id: Ib80306b0bc6c73b750d492764f0e2dfd3a7a5450
* Title: phan false positive
* McrUndoAction: fixed improper use of @param
* UploadSourceAdapter: fixed wrong type
* XmlTypeCheck: Use null so phan doesn't think we're trying to call the
function ''
* Database: phan false positive
* SpecialBlock: Use phan's advanced type documentation so phan knows
specifically what's being returned
* ChangesListSpecialPage: phan false positive
* BatchRowUpdate: Have default callback take a parameter so phan doesn't
think too many arguments are being passed
* MimeAnalyzer: left FIXME for relying on PHP 7.1 unpack() signature
* LanguageConverter: Specify types for $mTables since phan couldn't
determine it automatically
* preprocessorFuzzTest: Implement User::load() method signature
Change-Id: I08080ab636c5fe67ea6a4e14b2212d7523606e21
Function Content::getNativeData() was deprecated. Replace with
calls to new function TextContent::getText() in most places.
Bug: T155582
Change-Id: I2bd508c72aac4faf474ba45ab1f92e2e8d2eb9be
In d59f27aeab we made
LanguageConverter::validateVariant() try harder to convert a variant
into an acceptable MediaWiki-internal form, looking at deprecated
codes and BCP 47 aliases. However, this misled Language::hasVariant()
into thinking that bogus names (like all-uppercase strings) were
acceptable variant names, which then led to exceptions when they were
passed to the various conversion methods.
This is a belt-and-suspenders patch for T207433 -- in that case we
shouldn't have created a Language object with code 'sr-cyrl' in the
first place, but once one was created we shouldn't have tried to
ask LanguageSr to convert texts to 'sr-cyrl'. The latter problem
is fixed by this patch.
Bug: T207433
Change-Id: Id993bc7989144b5031a551662e8e492bd23f698a
Facilitate a gradual migration away from non-standard MediaWiki language
codes. This will ensure that (a) rules can be written with standard
BCP 47 codes, and (b) rules written with existing nonstandard codes will
continue to work once these are added to
LanguageCode::$deprecatedLanguageCodeMapping.
Change-Id: I3ba96faafaf40bd47fb5919621f7035f0431a698
The browser Accept-Language header uses BCP 47 codes, which don't
precisely match our internal mediawiki variant names in a number of
places. Allow proper BCP 47 codes to alias our internal variants
for: Accept-Language parsing, URL parsing, user preferences, and
explicit enumeration of codes in LanguageConverter rules.
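
The aliasing can be pictured as a lookup that falls back from internal variant names to a BCP 47 alias table. The sketch below is only an illustration: the table contents, the variant set, and the function name are assumptions, not MediaWiki's actual mapping data or API.

```python
# Illustrative data only (an assumption, not MediaWiki's real tables).
INTERNAL_VARIANTS = {"sr", "sr-ec", "sr-el"}
BCP47_ALIASES = {
    "sr-cyrl": "sr-ec",
    "sr-latn": "sr-el",
}


def resolve_variant(code):
    """Map an incoming code to an internal variant name, or None."""
    normalized = code.strip().lower()  # BCP 47 tags are case-insensitive
    if normalized in INTERNAL_VARIANTS:
        return normalized
    return BCP47_ALIASES.get(normalized)
```

For example, `resolve_variant("sr-Cyrl")` yields `"sr-ec"` under these assumed tables, while an unknown code yields `None`.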
This is a replay of an earlier merged patch,
0818070c59, which had to be reverted
because it was based on 8380f0173e which
caused regressions in the Babel extension (T199941).
Change-Id: Ica89d9547c58967747ab0fa15d4e83be5378796d
If you feed this method unescaped data, later calls can become an XSS
vector, which I think deserves a warning.
Bug: T202571
Change-Id: I34cb3da9232a22defffb80466263c2f2233822ef
Inside a switch, "continue" statements are equivalent to "break".
In PHP 7.3, this will generate a warning.
Bug: T200595
Change-Id: I244ecb2e1ce5a76295f014fb1becd8d263196846
The browser Accept-Language header uses BCP 47 codes, which don't
precisely match our internal mediawiki variant names in a number of
places. Allow proper BCP 47 codes to alias our internal variants
for: Accept-Language parsing, URL parsing, user preferences, and
explicit enumeration of codes in LanguageConverter rules.
Change-Id: I8468a56d5b88f5786abd0a17b67bda2f1687fd0c
Clean up use of @codingStandardsIgnore
- @codingStandardsIgnoreFile -> phpcs:ignoreFile
- @codingStandardsIgnoreLine -> phpcs:ignore
- @codingStandardsIgnoreStart -> phpcs:disable
- @codingStandardsIgnoreEnd -> phpcs:enable
For phpcs:disable, the necessary sniffs are always provided.
Some start/end pairs are changed to line ignores.
Change-Id: I92ef235849bcc349c69e53504e664a155dd162c8
This is a first pass at Latin/Cyrillic transliteration for Crimean
Tatar (crh).
Includes transliteration tables, prefix/suffix mappings, regex
mappings, and exceptions lists for words and abbreviations.
Regularize CRH language name in messages/* files.
Fix "varient" typos in qqq.json.
Add unit tests for CRH transliteration.
Bug: T23582
Change-Id: I424703f99adf837f6217872b882d1ea26bfdd068
This fixes an issue in f21f3942 where, if an HTML element had an
alt or title attribute containing an &lt; entity, an ASCII EOT
control character (0x04) could be inserted into the text when
the language converter was enabled.
Due to a really old bug in the language converter, self-closed tags
got turned into non-self-closed tags. However, due to a different
bug which was fixed in f21f3942, this code path was rarely taken,
so nobody noticed until now.
Follow-up Idbc45cac12
Bug: T180552
Change-Id: I077d30c50fcb419837fef937d27caca307153d2d
Previously, if one had an attribute with the contents
"-{}-foo-{}-", foo would get replaced by language converter as if
it wasn't in an attribute. This led to an XSS attack.
This breaks doing manual conversions in URL hrefs (or any
other attribute that goes through an escaping method
other than Sanitizer's), e.g. http://{sr-el:foo';sr-ec:bar}.com
won't work anymore. See also T87332.
Bug: T119158
Change-Id: Idbc45cac12c309b0ccb4adeff6474fa527b48edb
Adjust regexes for what not to convert to avoid backtracking by
preferring possessive quantifiers.
Add check that we really have matched to the end of the string, and
log error if the regex hits some sort of error preventing the
entire string from being matched. Should the regex not match to the
end, then language conversion is disabled for the string.
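
The end-of-string check can be sketched as follows. This is an assumption-laden illustration, not the PHP implementation: the pattern and function name are hypothetical, and possessive quantifiers are left out because Python's `re` module only supports them from version 3.11.

```python
import logging
import re

# Hypothetical chunk pattern: a complete HTML tag, or a run of non-tag text.
_CHUNK_RE = re.compile(r"<[^<>]*>|[^<]+")


def split_for_conversion(text):
    """Return (chunks, ok); ok is False if the scan stopped before the end."""
    chunks = []
    pos = 0
    while pos < len(text):
        match = _CHUNK_RE.match(text, pos)
        if match is None:
            # The regex could not consume the rest of the string: log it
            # and signal that conversion should be skipped for this text.
            logging.getLogger(__name__).error(
                "scan stopped at offset %d of %d", pos, len(text)
            )
            return [text], False
        chunks.append(match.group(0))
        pos = match.end()
    return chunks, True
```

A caller would convert only the text chunks when `ok` is true and otherwise pass the original string through unchanged.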
Bug: T124404
Change-Id: I4f0c171c7da804e9c1508ef1f59556665a318f6a
Example implementation using this hook: wikiHow's ChineseVariantSelector
extension, installed on zh.wikihow.com, which uses cookies to store the
preferred language variant, allowing anonymous users to change the
language variant without registering/logging in.
Change-Id: I5295a26578b45a8d51f2b7550938088fec18404f
Make the LanguageConverter::reloadTables method actually private,
and use the TestingAccessWrapper to call it when running parser tests.
Follow-up to I65736520cd04bfe8949b29ade07338a6e1b88a4d.
Change-Id: I43b81b8fef6441ad50b858ff7757732ecb5eef91
Conversion rules defined in a previous test case were leaking into
subsequent test cases. Existing tests had worked around this by defining
non-overlapping rules, but it's better to just fix the problem at the
source.
Change-Id: I65736520cd04bfe8949b29ade07338a6e1b88a4d
Guarded by the $wgUsePigLatinVariant variable, off by default.
Pig Latin is a language game where words in English are altered
according to the following rules:
* Words starting with a vowel have a '-way' suffix appended.
* Words starting with a consonant have the initial consonants (or 'qu'
group) moved to the end and an '-ay' suffix appended.
https://en.wikipedia.org/wiki/Pig_Latin
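
The two rules above can be sketched in a few lines. This is a simplified illustration rather than the EnConverter code: the function names are hypothetical, capitalization is not adjusted, and "y" is always treated as a consonant.

```python
import re

_WORD_RE = re.compile(r"[A-Za-z]+")
_VOWELS = "aeiou"


def pig_latin_word(word):
    """Apply the two Pig Latin rules to a single alphabetic word."""
    lower = word.lower()
    if lower[0] in _VOWELS:
        return word + "way"
    # Find the end of the leading consonant cluster, keeping "qu" together.
    i = 0
    while i < len(lower) and lower[i] not in _VOWELS:
        i += 2 if lower.startswith("qu", i) else 1
    return word[i:] + word[:i] + "ay"


def to_pig_latin(text):
    """Convert every alphabetic word in text, leaving other characters."""
    return _WORD_RE.sub(lambda m: pig_latin_word(m.group(0)), text)
```

Under these simplifications, "apple" gains the "way" suffix and "quick" moves its "qu" group to the end before "ay" is appended.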
* Added 'en-x-piglatin' as a language name.
* Added 'en' to LanguageConverter::$languagesWithVariants.
* Added LanguageEn class and its corresponding EnConverter which
provides one-way translation from English to Pig Latin.
* Some minor internal changes in code that assumed that English
doesn't have a language class or converter.
Bug: T45547
Depends-On: I1d9691c784032669979f8109c9a5f65cbf4122c9
Change-Id: I7fa2d85d6364958c5138366e8b4504a2697a8731
U+0000 is not allowed in HTML5, there's no reason to allow it in wikitext.
It simplifies our code if we can just strip them at the start. Strip in
PST as well so they don't sneak into our database either.
Tweaked the EXT_LINK URLs to account for the fact that invalid characters
get transformed into U+FFFD when using Preprocessor_DOM. See 73649741ed
(r65967) for context on that change.
Bug: T159174
Change-Id: I3f67e92b61aacc87a40c3662085c84d1dac08bfb
It's unreasonable to expect newbies to know that "bug 12345" means "Task T14345"
except where it doesn't, so let's just standardise on the real numbers.
Change-Id: Id2f9d229d17b8eee66b2ca4e3927f3f66ac62988
I was bored. What? Don't look at me that way.
I mostly targeted mixed tabs and spaces, but others were not spared.
Note that some of the whitespace changes are inside HTML output,
extended regexps or SQL snippets.
Change-Id: Ie206cc946459f6befcfc2d520e35ad3ea3c0f1e0
A "remove HTML tags to avoid disrupting the layout" block is removed
(previously added in f16d1e4ed7).
This is a follow-up to I9b099273203482ffb570a5654d8ba50c833e526d.
Bug: T54192
Change-Id: I565fac58b3b0da7bfaedf64f5001c364f52e2244