Follows-up I301f471f86ba2.
For ease of navigation, move Converter subclasses to a group called
"Languages", which for documentation purposes is a subgroup of
"Language". The next commit does the same for Messages* files,
and Language subclasses (done separately for ease of review).
Change-Id: If1cef9aa15f536ebaedd4477ad7453426e7f3b85
The existing "linkprefix" message is unlikely to be accurately
customized by message translators (as shown by the fact that, of the 10
distinct customizations prior to Iaa7eaa44 (which made them even more
complicated), 3 were broken or entirely ineffective, 1 was half
ineffective, and 2 more seem to have included the Latin-1 Supplement by
accident) or by local wiki admins. So, like linktrail before it, let's
move it out of the system messages and into a separate language
variable.
At the same time, let's make it a simple character set (like
$wgLegalTitleChars) rather than a complicated regular expression. The
complicated regex now lives in the parser.
This also adjusts the output of the API's action=query&meta=siteinfo
and adds an accessor to Language parallel to the existing linkTrail
accessor.
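For illustration, here is a rough sketch of how the pieces could fit
together; the property value, accessor body, and parser pattern shown
are assumptions for the example, not necessarily the exact code:

  // MessagesXx.php: a plain character set, as with $wgLegalTitleChars
  $linkPrefixCharset = 'a-zA-Z\x80-\xff';

  // Language.php: accessor parallel to linkTrail()
  public function linkPrefixCharset() {
      return self::$dataCache->getItem( $this->mCode, 'linkPrefixCharset' );
  }

  // Parser: the complicated regex is assembled here from the charset
  $charset = $wgContLang->linkPrefixCharset();
  $prefixPattern = "/^((?>.*(?<![$charset])))(.+)$/sD";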
Note the following changes, where $linkPrefixCharset is not simply the
existing charset extracted from the linkprefix message:
* The En message matched all non-ASCII UTF-8 characters by matching the
component bytes (\x80-\xff). The new character set is equivalent.
* Various languages were identical to En and so have no $linkPrefixCharset
set. These are: Ary Az Ce Ga Id Ka Kiu Km Ltg Mk Ms Ne Nn Ro Roa_tara Sc Si
Sr_ec Sr_el Tl Tt_cyrl Tt_latn Ug_arab War
* Cu, Uk, and Udm are changed to match any number of „ or « in the prefix.
* Cv tried to include "«" that was redundant to the range \x80-\xff
(see En comment). This was removed.
* Diq was entirely bogus, and so was removed.
* Gu included many additional UTF-8 characters that are redundant to the
range \x80-\xff (see En comment). These were removed, and the
resulting character set is equivalent to En.
* Mt has been broken since it was introduced in r37242. The charset used is
equivalent to the broken regex.
Bug: 56031
Change-Id: I3369851b33113fc118a1bace38f3ac310cdd9725
The regular expression in the linkprefix message is run against the
entire page text up to each wikilink, and is expected to capture one
group containing everything except the prefix and another containing
only the prefix.
For long pages this winds up being a lot of text, so inefficient regular
expressions are going to cause problems.
The current regex is this:
/^(.*?)([a-zA-Z\x80-\xff]+)$/sD
This is not efficient: it will scan through the string trying to match
against every run of one or more letters/non-ASCII characters,
backtracking at every one except possibly the last. The only reason
this hasn't been a huge problem everywhere is that only a few
languages have this feature enabled.
This change replaces it with the following regex:
/^((?>.*(?<![a-zA-Z\x80-\xff])))(.+)$/sD
This is rather more efficient: it will grab the whole string (which is
actually fast even for huge strings), then back off character by
character until it finds one that isn't a letter/non-ASCII.
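As a rough illustration (the sample text and variable names are
invented for this sketch), both patterns capture the same groups, but
the new one backs off from the end instead of probing every run of
letters along the way:

  $text = str_repeat( 'some ordinary wiki text ', 1000 ) . 'préfix';

  $old = '/^(.*?)([a-zA-Z\x80-\xff]+)$/sD';
  $new = '/^((?>.*(?<![a-zA-Z\x80-\xff])))(.+)$/sD';

  preg_match( $old, $text, $m1 ); // probes every letter run, backtracking
  preg_match( $new, $text, $m2 ); // grabs everything, backs off a few bytes

  var_dump( $m1[2] === $m2[2] ); // bool(true): both capture "préfix"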
Note that the new regex could be simplified somewhat:
/^((?>.*[^a-zA-Z\x80-\xff]|))(.+)$/sD
The performance difference between the two is minor, though, and
Gujarati, Church Slavic, Udmurt, and Ukrainian would still need the
look-behind style for their current implementations, so that style is
used throughout.
For Gujarati, we also use another regex trick: a look-behind assertion
in PCRE must be fixed-length, so something like (?<!a|bb) won't work.
But that regex fragment is equivalent to (?<!a)(?<!bb), which is
allowed, so we use that instead.
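A minimal sketch of that equivalence (the letters and sample strings
are invented): chaining the two negative look-behinds means "not
preceded by 'a' and not preceded by 'bb'", which is what the single
alternation was meant to express:

  var_dump( preg_match( '/(?<!a)(?<!bb)X/', 'ccX' ) ); // int(1): neither applies
  var_dump( preg_match( '/(?<!a)(?<!bb)X/', 'aX' ) );  // int(0): blocked by 'a'
  var_dump( preg_match( '/(?<!a)(?<!bb)X/', 'bbX' ) ); // int(0): blocked by 'bb'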
Bug: 52865
Change-Id: Iaa7eaa446b3f045a9ce970affcb2a889f44bdefd
Format the three messages in the header as one paragraph with three
sentences, instead of a paragraph plus two separate one-item unordered
lists with inconsistent full stops.
Message changes: in 'wlheader-enotif' and 'wlheader-showupdated',
remove the initial bullet point if present and add a final full stop
if missing. The regexes below were applied first; each language file
was then checked and the messages adjusted manually where applicable
(e.g. Thai does not use full stops at all, East Asian languages use
'。', Devanagari-script languages use '।', etc.).
Find: ('wlheader-[^']+'\s*=>\s*)(['"])(?:\*\s*)?([\s\S]+?)\.?\2,\n
Replace with: $1$2$3.$2,\n
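For example, a typical affected line in a MessagesXx.php file would
change roughly like this (the message text is invented for the
illustration):

  // before
  'wlheader-enotif' => '* Email notification is enabled',
  // after
  'wlheader-enotif' => 'Email notification is enabled.',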
Bug: 48615
Change-Id: I856f71f36d7f4b4baff5e968d88e4d3f7aeecce2