Add the same no-arg options for language code that
{{#dir}} and {{#bcp47}} have, for consistency:
* `{{#language}}` will return the name of the *target language*
(for articles, the content language; for messages, the user language).
The default value for the "in language" argument should be the autonym.
This was working previously but only via a baroque code flow path for
invalid language codes. Make this a bit clearer and add tests.
Since non-autonym language code translations are added via the
[[Extension:CLDR]] in production, hook LanguageGetTranslatedLanguageNames
in the ParserTestRunner to ensure that we can test this.
Followup-To: Ice1c671c5b3cc077d2bb80ea5dc25c5eabbfeb36
Followup-To: I19c3e91a924e080f37dc95a0d4e61493583b533e
Change-Id: Ibf6e7f194cc056eadb48a5ad8e6d01a761d9351c
Follows-up I301f471f86ba2.
For ease of navigation, move Converter subclasses to a group called
"Languages", which for documentation purposes is a subgroup of
"Language". The next commit does the same for Messages* files,
and Language subclasses (done separately for ease of review).
Change-Id: If1cef9aa15f536ebaedd4477ad7453426e7f3b85
Use @phpcs-require-sorted-array from new codesniffer release 32.0.0
Similar to the special page aliases in
I827d1f5010d000609324ec398beeb142d9bac299
Bug: T255826
Change-Id: I7b7cbf0c03714001609437af68fe16e06930cc33
In commit 5940aa6344 in 2008 (SVN r35745),
these were accidentally changed from U+F011 (a private use area character)
to U+0011 (a control character). The control characters are certainly
incorrect, as they can't even be used in MediaWiki page titles. Having
a private use area character here is surprising to me, but it was
apparently used since 2006 as a substitute for U+A657, which was only
introduced in Unicode 5.1 in April 2008, hence the aliases.
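
A minimal sketch of how such characters can be spotted, assuming a plain
Python script rather than anything in MediaWiki itself. The function name
suspicious_chars is hypothetical; it flags control characters (category Cc)
and private use characters (category Co), the two kinds discussed above:

```python
import unicodedata

def suspicious_chars(text):
    """Return (index, code point, category) for control and private-use characters."""
    hits = []
    for i, ch in enumerate(text):
        cat = unicodedata.category(ch)
        # Cc = control character, Co = private use; tab, newline and CR are tolerated.
        if cat in ("Cc", "Co") and ch not in "\t\n\r":
            hits.append((i, "U+%04X" % ord(ch), cat))
    return hits

# U+0011 is the accidental control character, U+F011 the private use one.
sample = "a\u0011b\uf011c"
for hit in suspicious_chars(sample):
    print(hit)
```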
I noticed this problem when I was doing some work analyzing MediaWiki's
date formats, and my editor warned me that this is a "binary file" :)
Change-Id: I5df29b0072d6dec8dcb8c7492d8b9623c93fdbf3
* Interface strings are now elsewhere
* MessagesQQQ no longer exists
* Prefer https for translatewiki.net
Change-Id: I76652ea94cca80441cd5d978029e4707ee41c4fd
The existing "linkprefix" message is unlikely to be accurately
customized by message translators (as shown by the fact that, of the 10
distinct customizations prior to Iaa7eaa44 (which made them even more
complicated), 3 were broken or entirely ineffective, 1 was half
ineffective, and 2 more seem to have included the Latin-1 Supplement by
accident) or by local wiki admins. So, like linktrail before it, let's
move it out of the system messages and into a separate language
variable.
At the same time, let's make it a simple character set (like
$wgLegalTitleChars) rather than a complicated regular expression. The
complicated regex now lives in the parser.
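
As a rough illustration of the idea (not the parser's actual code), the
sketch below builds a splitting regex from a character-set string. It is
written in Python and works on UTF-8 bytes so that \x80-\xff means the same
thing as in the byte-oriented PCRE patterns; build_prefix_regex and
split_link_prefix are hypothetical names, and the simple lazy form of the
regex is used for brevity:

```python
import re

def build_prefix_regex(charset):
    """Build a splitting regex from a character-class body given as bytes."""
    # Group 1: everything before the prefix; group 2: the prefix itself.
    return re.compile(rb"\A(.*?)([" + charset + rb"]+)\Z", re.S)

def split_link_prefix(text, charset=rb"a-zA-Z\x80-\xff"):
    """Split text into (rest, prefix); prefix is '' when nothing matches."""
    m = build_prefix_regex(charset).match(text.encode("utf-8"))
    if m is None:
        return text, ""
    return m.group(1).decode("utf-8"), m.group(2).decode("utf-8")

print(split_link_prefix("see pre"))
print(split_link_prefix("end. "))
```

Keeping only the character set in the language file means a translator
cannot break the surrounding regex structure.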
This also adjusts the output of the API's action=query&meta=siteinfo and
adds an accessor parallel to the linkTrail accessor to Language.
Note the following changes that are not simply extracting the existing
charset from the linkprefix message for $linkPrefixCharset:
* The En message matched all non-ASCII UTF-8 characters by matching the
component bytes (\\x80-\\xff). The new character set is equivalent.
* Various languages were identical to En and so have no $linkPrefixCharset
set. These are: Ary Az Ce Ga Id Ka Kiu Km Ltg Mk Ms Ne Nn Ro Roa_tara Sc Si
Sr_ec Sr_el Tl Tt_cyrl Tt_latn Ug_arab War
* Cu, Uk, and Udm are changed to match any number of „ or « in the prefix.
* Cv tried to include "«" that was redundant to the range \\x80-\\xff
(see En comment). This was removed.
* Diq was entirely bogus, and so was removed.
* Gu included many additional UTF-8 characters that are redundant to the
range \\x80-\\xff (see En comment). These were removed, and the
resulting character set is equivalent to En.
* Mt has been broken since it was introduced in r37242. The charset used is
equivalent to the broken regex.
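
The redundancy noted for Cv and Gu follows from how UTF-8 works: every byte
of a non-ASCII character is 0x80 or above. A small Python check, with a
hypothetical helper name, shows this for a few of the characters mentioned:

```python
import re

HIGH_BYTES = re.compile(rb"[\x80-\xff]+")

def is_covered_by_high_byte_range(ch):
    """True if every UTF-8 byte of ch lies in the range 0x80-0xff."""
    return HIGH_BYTES.fullmatch(ch.encode("utf-8")) is not None

for ch in ["\u00ab", "\u201e", "\u044f", "a"]:
    print(repr(ch), ch.encode("utf-8"), is_covered_by_high_byte_range(ch))
```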
Bug: 56031
Change-Id: I3369851b33113fc118a1bace38f3ac310cdd9725
The regular expression in the linkprefix message is run against the
entire page up to each wikilink, and is expected to capture one group
having everything except the prefix and another having only the prefix.
For long pages this winds up being a lot of text, so inefficient regular
expressions are going to cause problems.
The current regex is this:
/^(.*?)([a-zA-Z\\x80-\\xff]+)$/sD
This is not efficient: it will scan through the string trying to match
against every run of one or more letters/non-ASCII characters,
backtracking at every one except possibly the last. The only reason this
hasn't been a huge problem everywhere is that only a few languages
have this feature enabled.
This change replaces it with the following regex:
/^((?>.*(?<![a-zA-Z\\x80-\\xff])))(.+)$/sD
This is rather more efficient: it will grab the whole string (which is
actually fast even for huge strings), then back off character by
character until it finds one that isn't a letter/non-ASCII.
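
The back-off behaviour described above can be imitated without a regex
engine. The following Python sketch is only an illustration of the scan
order, not the code used by the parser; is_prefix_byte and split_from_end
are hypothetical names, and the byte test mirrors the En character class:

```python
def is_prefix_byte(b):
    # Mirrors the class [a-zA-Z\x80-\xff] applied to UTF-8 bytes.
    return (0x41 <= b <= 0x5A) or (0x61 <= b <= 0x7A) or b >= 0x80

def split_from_end(text):
    """Return (rest, prefix), or None when the text does not end in a prefix."""
    data = text.encode("utf-8")
    i = len(data)
    # Start at the end and step back while the preceding byte is a prefix byte.
    while i > 0 and is_prefix_byte(data[i - 1]):
        i -= 1
    if i == len(data):
        return None  # nothing left for the prefix group, so no match
    return data[:i].decode("utf-8"), data[i:].decode("utf-8")

print(split_from_end("some text pre"))
```

The work done is proportional to the length of the prefix, not to the
length of the page text before it.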
Note that the above could be simplified somewhat:
/^((?>.*[^a-zA-Z\\x80-\\xff]|))(.+)$/sD
The performance improvement here is minor, and Gujarati, Church Slavic,
Udmurt, and Ukrainian would still need the other style for their current
implementations.
For Gujarati, we also use another regex trick: a look-behind assertion
in PCRE must be fixed length, so something like (?<!a|bb) won't work.
But that regex fragment is equivalent to (?<!a)(?<!bb) which is allowed,
so we use that instead.
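
The rewrite can be tried out in Python's re module, which is a different
engine from PCRE but also restricts look-behind width. This is a sketch
under that assumption; the sample string and function name are made up for
the illustration:

```python
import re

# Two fixed-length negative look-behinds in sequence: the position must not
# be preceded by "a" and must not be preceded by "bb".
SPLIT_FORM = re.compile(r"(?<!a)(?<!bb)1")

def match_positions(text):
    """Return the start offsets of every '1' not preceded by 'a' or 'bb'."""
    return [m.start() for m in SPLIT_FORM.finditer(text)]

print(match_positions("a1 bb1 c1 b1"))

# Whether the combined form compiles depends on the engine and its version.
try:
    re.compile(r"(?<!a|bb)1")
    print("combined look-behind accepted by this engine")
except re.error as exc:
    print("combined look-behind rejected:", exc)
```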
Bug: 52865
Change-Id: Iaa7eaa446b3f045a9ce970affcb2a889f44bdefd
Update the order of parts in messages files. Not done for all files. The order is now:
fallback, encoding, namespace related, special pages, magic words, other (no fixed
ordering after magic words).
Change-Id: Ide5ec747ba62a8c2bca8040a14d0aeea8e6c79b9