As noted on the comments, this needed a markup that work better
in bidi scenarios and as a part of replacing bidi control codes
with HTML markup I was able to test different bidi scenarios
using <bdi> HTML tags.
Bug: T375975
Change-Id: If2af751fc9f78869acf7b7e93199fa927de2cc19
Move deduplication of language links out of Parser.php and into the
ParserOutput in order to be compatible with alternate Parsers (Parsoid).
Clean up various inconsistencies: ensure deduplication also happens in
OutputPage when multiple ParserOutputs are merged into the final output,
and ensure that the deduplication in LinksUpdate is done in the same
order (first link prevails) as in Parser/ParserOutput/OutputPage.
Deprecate OutputPage::setLanguageLinks() (the matching
ParserOutput::setLanguageLinks() was deprecated in 1.42).
As a breaking change, return an array, not an array *reference*, from
ParserOutput::getLanguageLinks(). This allows us to safely modify the
internal representation of language links. As far as I can tell, no one
used the returned reference to sneakily modify the list of language
links, and there not a good way to have deprecated this before making
the breaking change.
While we're at it, we've added tests to ensure that language link
fragments are preserved.
Bug: T26502
Bug: T358950
Bug: T375005
Change-Id: I82a05a51d94782ebb9fa87ff889ca0f633b3e15c
New options added: `iwl`, `links`, `special`, `extlinks`, and `templates`,
and handling of existing `ill` option tweaked to be consistent.
Added some tests to exercise these options, focusing on the handling
of title fragments. Attempted to make the output formatting consistent
among options; a future unification (I32df68714ffdf2f0745b974f47bc3ccceef1f41c)
should help DRY these out further.
Bug: T310512
Change-Id: Ic9c766ae4362969de124ad9d66eb47cfa68395c6
Depending on configuration, this returns either the interface language
code of the current user or the current page language.
Bug: T4085
Change-Id: Iab7fda272ec81af88c74612727ff6bed014d4a81
Add the same no-arg options for language code that
{{#dir}} and {{#bcp47}} have, for consistency:
* `{{#language}}` will return the name of the *target language*
(for articles, the content language; for messages, the user language)
The default value for the "in language" argument should be the autonym.
This was working previously but only via a baroque code flow path for
invalid language codes. Make this a bit clearer and add tests.
Since non-autonym language code translations are added via the
[[Extension:CLDR]] in production, hook LanguageGetTranslatedLanguageNames
in the ParserTestRunner to ensure that we can test this.
Followup-To: Ice1c671c5b3cc077d2bb80ea5dc25c5eabbfeb36
Followup-To: I19c3e91a924e080f37dc95a0d4e61493583b533e
Change-Id: Ibf6e7f194cc056eadb48a5ad8e6d01a761d9351c
Template:Bcp47 is one of the most used templates in Wikimedia Commons.
Providing its functionality as a parser function, tied to MediaWiki's
language-handling code, reduces code duplication and will allow us to
reduce template usage on commons.
As with the {{#dir}} parser function, support one special case:
* `{{#bcp47}}` will return the BCP-47 code of the *target language*
(for articles, the content language; for messages, the user language)
Note the following slight differences from [[Template:BCP47]] on Commons,
documented in an added parser test:
* 'simple' maps to 'en-simple' (not just 'en')
* 'roa-tara' maps to 'nap-x-tara' (not 'it-x-tara')
Bug: T366623
Change-Id: Ice1c671c5b3cc077d2bb80ea5dc25c5eabbfeb36
Template:Dir is one of the most used templates in Wikimedia Commons,
this tries to provide parts of its functionality in hope we can
perhaps simplify or get rid of the template eventually for clarity and
performance reasons.
As a convenience, `{{#dir}}` and `{{#dir:}}` are synonyms for
`{{#dir:{{PAGELANGUAGE}}}}`: they return the direction of the target
language. For articles, the target language is the content language;
for messages, the target language is the user language.
In addition, to avoid confusion between BCP-47 language codes and
MediaWiki-internal language codes, an optional second parameter can be
supplied. If the second parameter is the (localizable) string
'bcp47', the language code given in the first parameter will be
treated as a BCP-47 code. For example: `{{#dir:sr-Cyrl|bcp47}}`.
(See LanguageCode::bcp47ToInternal() for a description of the
differences and overlaps between MediaWiki internal and BCP-47
codes. These overlaps *so far* don't result in any case where
encouraging editors to be precise about which set of enumerated
string values they are using for consistency with other
language-related functions, and because MediaWiki internally
differentiates between BCP-47 codes and internal codes.)
Bug: T359761
Change-Id: I19c3e91a924e080f37dc95a0d4e61493583b533e
* Add wgLocaltimezone to the list of global variables which may be set
in parser test options.
* Add userLanguage option, which is passed through to ParserOptions.
Bug: T223772
Change-Id: I8498527c276288feae854868a8f4b1f3205a49e8
Legacy parser can now output headings using a more accessible markup,
which is also identical to the markup used by the Parsoid parser.
Changes to client-side JS and CSS necessary to support the new markup
have already been merged in earlier commits.
includes/skins/Skin.php
includes/ServiceWiring.php
* Define a new skin option, 'supportsMwHeading', which can be used
to toggle the new markup per-skin.
* Update the built-in fallback skin to enable it. This affects the
output in parser tests.
docs/config-schema.yaml
includes/config-schema.php
includes/config-vars.php
includes/MainConfigNames.php
includes/MainConfigSchema.php
* Add a new configuration setting, 'ParserEnableLegacyHeadingDOM',
which can be used to toggle the new markup per-site.
includes/OutputTransform/Stages/HandleSectionLinks.php
* Output new heading HTML for skins that enabled the option.
tests/*
* Duplicate parser tests that cover heading generation to cover both
new and old markup. Update other parser tests to use new markup.
* Add some unit and integration tests for the behavior of the skin
option and some parser tests for edge cases of the new markup.
Bug: T13555
Change-Id: I1180169a8e83af834c2984ba16089e6277f2a8dd
This follows up on I5e87b33a956e296cdaf671fa99c9555944b73479 and makes
(invisible) language links consistent with how we handle (invisible)
category links.
Bug: T359886
Followup-To: I5e87b33a956e296cdaf671fa99c9555944b73479
Change-Id: I3e5567a91b47e0b04da928450644f3f475aaf51b
This follows up on a long series of tweaks to whitespace handling around
[[Category]] links (T2087, T87753, T174639) which aimed to simplify and
make intelligible the whitespace handling around category links without
allowing categories to break lists or paragraphs in which they are found.
Removing newlines but not other whitespace on the left-hand side of
category links should preserve the valuable features of T2087 et al
while still ensuring that the following all render equivalently:
ABC [[Category:Foo]]DEF
ABC[[Category:Foo]] DEF
ABC [[Category:Foo]] DEF
Added parser test to document the new behavior; it's worth noting
that although there were plenty of tests documenting the expected
interaction of category links and newlines, there were previously
no tests covering the interaction of non-newline whitespace and
category links; the one test which needed to be altered added
non-semantic whitespace (ie, extra whitespace to the test output
which did not affect the way the HTML would display).
This patch brings the legacy parser into parity which Parsoid parsing
of category links.
Bug: T359886
Change-Id: I5e87b33a956e296cdaf671fa99c9555944b73479
This ensures uniform treatment of all places that call `addCategory`
without duplicating the `defaultsort` code; it also ensures that the
effect of the {{DEFAULTSORT}} parser function is independent of page
position.
Bug: T40435
Bug: T353530
Change-Id: I4480a6d59e766fa4eddc9ec9117c58b66771bb47
I'm not sure how this ever happened, but I'm sure it's a mistake.
The following test scenario should make it very obvious:
* {{#formatdate:-0002-12-31|mdy}}
* {{#formatdate:-0001-12-31|mdy}}
* {{#formatdate:0000-12-31|mdy}}
* {{#formatdate:0001-12-31|mdy}}
* {{#formatdate:0002-12-31|mdy}}
Expected output: 3 BC, 2 BC, 1 BC, 1, 2, …
Current output: 3 BC, 2 BC, 0 (?), 1, 2, …
Note how "1 BC" is skipped and shown as "0" instead. Everything else
is correct, e.g. the ISO year -1 is already displayed as "2 BC".
It's really only this single outlier.
In case you don't know: There is no year 0 when the BC specifier is
used. There is either year 1 after or year 1 before Christ. This is
different in ISO, mostly to make calculations easier. That's why the
DateFormater already does an extra `- 1` and `+ 1` in the two
makeIsoYear and makeNormalYear methods.
The problematic line of code was originally written in 2003, see
https://phabricator.wikimedia.org/rMW98fc03e6
The core parser function exists since 2009, see
https://phabricator.wikimedia.org/rMWb9ffb5a7
Change-Id: Iaeb7a954579a409fefd87dab4e2a15778ab39fb4
This now aligns with Parsoid commit 51baccc8741108a9e3f763f2c19c6ce6eda55ac4
Three tests needed to be disabled because they had dependencies on features
not included in core's CI:
* {{#if}} used in tests added by I71c38b42ac9bfb7137f2e34df70bdfa139abced7
but only provided by the ParserFunctions extension
* <poem> used in tests added by I5a6356a82251881a5f841b36a7f26879fc611138
but only provided by the Poem extension
In addition, the "multiline" part of the "Expansion of multi-line..."
parser tests seems to have been lost at some point. My best guess is
that the definition of `Template:1x` initially included an extra
newline which was lost, maybe during an unrelated stripping of
leading/trailing whitespace in `!! article` clauses. In any case,
these tests are no longer testing the thing they say they are.
These will be fixed in a follow up.
Change-Id: Ia9144634625f176fbea11f3d2ef4b21a5492e99b
This reverts commit 82da9cf14b.
Passing through Remex seems to have unexpected consequences to be
investigated but, for the sake of unbreaking the UBN, let's revert this
first.
Bug: T353920
Change-Id: Iaac7942aa77aee5ab525852ac5b41dd516ff13c9
The previous implementation was using an ad-hoc regular expression which
was matching inside the data-mw attribute of Parsoid output, eg:
<sup about="#mwt42" [...] typeof="mw:Extension/ref mw:Error" data-mw="{"name":"ref","attrs":{"name":"infobox_stats_ref_rail"},"body":{"html":"<style data-mw-deduplicate=\"TemplateStyles:r1133582631\" typeof=\"...">
After substitution, the <link> element inserted contained " instead of
" and so broke out of the attribute.
Instead use a proper HTML tokenizer (via wikimedia/remex-html) so that
we don't allow bogus matches inside attribute values.
To fix up tests:
* Don't deduplicate styles when parsing UX messages (also helps performance)
* Don't deduplicate styles in ContentHandler integration tests
* Don't deduplicate styles by default in parser tests
(unless explicit option is set)
Depends-On: Id9801a9ff540bd818a32bc6fa35c48a9cff12d3a
Depends-On: I5111f1fdb7140948b82113adbc774af286174ab3
Followup-To: Ic0b17e361bf6eb0e71c498abc17f5f67f82318f8
Change-Id: I32d3d1772243c3819e1e1486351d16871b6e21c4
Two messages were added to wgRawHtmlMessages instead of just
fixing the way they were parsed so they can't contain raw
HTML. This fixes that.
In order to avoid breakage on-wiki for old customized messages
that took advantage of them being parsed as raw HTML, rename
the messages too. Also rename a few other messages from the
same set to stay consistent.
Note: These messages are suppressed in favour of Echo's messages
when Echo is enabled, and Echo is enabled on all Wikimedia wikis,
so the existing customized messages on Wikimedia wikis are basically
no-ops.
Bug: T353316
Change-Id: Ib0d1c79247fe091f2806b7c23ffb2fe22cc4df4a
Allows editors to identify a pseudo-heading as a heading of a given
hierarchical level to assistive technologies. Also allows levels 7
and deeper.
<div role="heading" aria-level="2">Example</div>
See also https://www.w3.org/WAI/GL/wiki/Using_role%3Dheading
Change-Id: Ia465a076db334d08cd1f548f2363a0f7cafe7690
The purpose of which is to improve the performance of the css selectors
targeting the output, as analyzed in T270150#8524965, as well as
eliminate some of the brittleness in depending on direct descendents
and first-child, which can be seen in T320285 and T304010.
mw-file-element is targeted to apply margins, borders, and
vertical-alignment to that element. The current css rules have wildcard
selectors in the rightmost position, which, since css is parsed
right-to-left, can be quite slow on a wiki page. The legacy parser has
an equivalent class, thumbimage, when rendering thumbs but here we apply
the class more broadly.
A follow-up patch in I70c61493fe492445702f036e5b24ef87fc3bdf43 will
remove the redundant wildcard selectors once parsercache has turned
over.
Bug: T270150
Bug: T314097
Depends-On: Ie85ee7048273023a2c51f42a333a9c1493360404
Depends-On: Ie0ec018ac6c2c42c05610b342d7ef87493dfdc42
Depends-On: Ifc17fdf530af515b066de706ca5e69e118fd1c5b
Depends-On: Ib60edacdae2ff41a0de2b2b584718fd9ce925f97
Change-Id: Ifd4001e312a5fa4b7beaad63ba8c4e79e3201b9b
This provides a bit of isolation from the actual layout and names
of properties in the object, as well as being a touch more readable
when debugging test failures.
Change-Id: I5ddca850f577b2ac24e237a2518f03983e79a51d
The LanguageConverter::convert()/::convertTo() methods clear the
converted title and reset other (less important) bits of
LanguageConverter state. Add an optional parameter in order
to skip this reset.
(The LanguageConverter::translate() methods are available which
don't reset LanguageConverter state, but they also don't process
embedded language converter markup. Since headings can contain
embedded markup, the ::translate() methods aren't appropriate.)
Bug: T306862
Bug: T331316
Change-Id: Ifb2745e45974755ba5a6068c13e84be6c4e3f329
If a ParserTest mixes HTML output and metadata properties, it can
complicate HTML normalization and other test processes, especially
for Parsoid-mode bidirectional tests.
Support splitting metadata output into a separate section, named
`!! metadata`, with the standard options for legacy and parsoid
variants, like `!! metadata/php` and `!! metadata/parsoid` and
`!! metadata/parsoid+integrated` etc.
For compatibility, if the metadata flags are present on the test
and the new section is not present, we'll continue to handle the
metadata output as we have before, aka append or prepend the metadata
to the HTML.
Code search for uses of these options (uses in parsoid and core can
be ignored; uses of 'pst' are harmless when they are not combined
with another option):
https://codesearch.wmcloud.org/search/?q=%28%5E%7C%20%29%28%28showtitle%7Cshowindicators%7Cill%7Ccat%7Cpst%7Cshowflags%29%28%20%7C%24%29%7C%28extension%3D%7Cproperty%3D%29%29&i=nope&files=%5Etests%2Fparser%2F.*%5C.txt&excludeFiles=&repos=
Change-Id: I845694d4f2109a8b9125410e8533ca69bbea50fa
In 24949480eb (Oct 2021) injection of
the Table of Contents was moved from Parser to
ParserOutput::getText(); that is, from parse time to "postprocess text
possibly fetched from the cache" time. Unfortunately, this meant that
language conversion wasn't done on the table of contents (!), for
either traditional skins or the vector-2022 skin. This was fixed for
traditional skins by 059e62cde6 (Nov
2021), later amended by 0955046ca5 (Mar
2022), which added explicit language conversion to the TOC injection
process in ParserOptions::getText(). This fix was still not complete,
however, since editor-defined custom language-conversion rules defined
in the article body were no longer available to the language converter
when conversion was done in ParserOutput::getText(); the ToC title was
also being double-converted. Further, neither of these short-term
fixes addressed the output of ParserOutput::getSections() (now
ParserOutput::getTOCData()) which was used by vector-2022 to generate
the ToC in the sidebar and which remained entirely unconverted.
With 439656e019 (Jan 2023), we started
using the ::getSections()/::getTOCData() output for main article text
as well, but we kept the previous hack which post-converted the
generated HTML. This kept old skins at parity with the post-Oct-2021
status, but also didn't address the conversion issue for vector-2022.
The solution here is to perform language conversion on the ToC lines
at parse time along with the rest of the language conversion, and
store *converted* headings in TOCData. This has a number of side
effects:
1. The ToC information array available via the action API
is now language converted. This is *probably* what you wanted in the
first place, but could potentially be disruptive.
2. The ToC is consistently converted with the full set of
editor-defined custom conversion rules. Before Oct 2021, the ToC was
converted using the set of custom conversion rules *active at the
point at which the ToC was inserted* (which was usually near the
beginning of the article). When all conversion rules appear at the
very top of the article (best practice!):
-{en:Foo; en-x-piglatin:Bar;}
Lead section text
== Introduction ==
== Foo ==
There should be no difference before pre-Oct 2021 behavior and the
behavior after this patch: in both cases the rule defined in the
article body will be applied both to the heading and to the TOC, and
they will be consistent. (After Oct 2021 and before this patch, Foo
would be converted in the heading but not in the table of contents.)
But in cases where conversion rules are defined after the
TOC insertion point, the section heading as it appears in the body
text could appear different from the section heading as it appears in
the ToC. For example, if you defined a conversion rule just before
using a term in a heading:
== Introduction ==
-{en:Foo; en-x-piglatin:Bar;}-
== Foo ==
Before Oct 2021, this rule would be applied to the heading, but not to
the TOC (because the TOC insertion point was before the rule
definition). This would also be the behavior before this patch (since
rules defined in the article body are currently not applied at all).
After this patch, the rule will be applied to both the heading and the
TOC (because the rule application location is effectively "at the very
end of the article"). In the rare cases when rules are not defined in
glossaries at the top of the article, this type of usage (definition
immediately preceding first use) is expected to be the most common
and the behavior after this patch is more correct.
But alternatively, if you defined a conversion rule *after* using
the term in a heading:
== Introduction ==
== Foo ==
-{en:Foo; en-x-piglatin:Bar;}-
Before Oct 2021, this rule wouldn't be applied to the heading *or* the
TOC. Before this patch, this would also be the case (because rules
defined in the article body are not applied at all). After this
patch, the rule will be applied to the ToC but not the heading, since
the application point for the TOC is effectively at the end of the
article. This inconsistency is probably not desirable, but this case
is expected to be rare, and (assuming the editor intended 'Foo' to be
unconverted) the editor can work around the inconsistency by
explicitly protecting 'Foo' from conversion:
== -{Foo}- ==
-{en:Foo; en-x-piglatin:Bar;}-
And if the editor /intended/ Foo to be converted, the rule definition
should be moved earlier in the article. Again, putting all rules at
the top of the article is the preferred style, and works better with
the glossary style used by the zhwiki community (see also
https://www.mediawiki.org/wiki/Requests_for_comment/Scoped_language_converter
).
Bug: T306862
Depends-On: I0c9c9fec920f7cb028d935e552a8f11475a23ba7
Change-Id: I321cd31dae64bbf845d53282e5d28a55bc4ec319