Commit graph

855 commits

Author SHA1 Message Date
Ebrahim Byagowi
efda4cae32 Use a better bidi aware markup in CommentParser
As noted in the comments, this needed markup that works better
in bidi scenarios and as a part of replacing bidi control codes
with HTML markup I was able to test different bidi scenarios
using <bdi> HTML tags.

Bug: T375975
Change-Id: If2af751fc9f78869acf7b7e93199fa927de2cc19
2024-10-04 10:50:02 +03:30
C. Scott Ananian
714a7146d6 Sync up core repo with Parsoid
This now aligns with Parsoid commit b19f73d7beadedcb6991640aac7eb7d6e7aec8f5

Change-Id: Ief91b25769f777169af65c9720faa767850f6239
2024-10-02 10:43:47 -04:00
C. Scott Ananian
7495f9bc15 Deduplicate language links in ParserOutput and OutputPage
Move deduplication of language links out of Parser.php and into the
ParserOutput in order to be compatible with alternate Parsers (Parsoid).
Clean up various inconsistencies: ensure deduplication also happens in
OutputPage when multiple ParserOutputs are merged into the final output,
and ensure that the deduplication in LinksUpdate is done in the same
order (first link prevails) as in Parser/ParserOutput/OutputPage.

Deprecate OutputPage::setLanguageLinks() (the matching
ParserOutput::setLanguageLinks() was deprecated in 1.42).

As a breaking change, return an array, not an array *reference*, from
ParserOutput::getLanguageLinks().  This allows us to safely modify the
internal representation of language links. As far as I can tell, no one
used the returned reference to sneakily modify the list of language
links, and there was not a good way to deprecate this before making
the breaking change.

While we're at it, we've added tests to ensure that language link
fragments are preserved.

Bug: T26502
Bug: T358950
Bug: T375005
Change-Id: I82a05a51d94782ebb9fa87ff889ca0f633b3e15c
2024-09-26 15:28:49 -04:00
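The "first link prevails" rule from commit 7495f9bc15 above can be illustrated with a minimal sketch. This is not MediaWiki's PHP code: the function names are hypothetical, and it assumes links are plain `prefix:Title` strings that are deduplicated by language prefix.

```python
def dedupe_language_links(links):
    """Keep only the first link seen for each language prefix."""
    first_by_prefix = {}
    for link in links:
        prefix, _, _title = link.partition(":")
        # setdefault leaves an existing entry alone, so the first link prevails
        first_by_prefix.setdefault(prefix, link)
    # dicts preserve insertion order, so the original link order is kept
    return list(first_by_prefix.values())


def merge_outputs(*link_lists):
    """Hypothetical stand-in for merging several parser outputs into one page."""
    merged = []
    for links in link_lists:
        merged.extend(links)
    return dedupe_language_links(merged)
```

Because the same helper runs both on a single list and on merged lists, the order of precedence is identical in both places, which is the consistency the commit message asks for. Fragments such as `#frag` are part of the stored string and survive untouched.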
C. Scott Ananian
25b27ce309 Sync up core repo with Parsoid
This now aligns with Parsoid commit fc9ab0949952d5e784acb012096860f5c8663fc7

Change-Id: I5d72f551c75de80b0834ea98d8a1d3cb5852e866
2024-09-26 13:04:36 -04:00
C. Scott Ananian
ec4e4648dd Sync up core repo with Parsoid
This now aligns with Parsoid commit dea42dd799d9c40fb7fedb42122ec264d6ef6ded

Change-Id: I4b2614ce3a83bfea0af53927464e7fbde6a92df9
2024-09-24 12:36:03 -04:00
C. Scott Ananian
25da911334 Parser tests: add additional options to test ParserOutput metadata
New options added: `iwl`, `links`, `special`, `extlinks`, and `templates`,
and handling of existing `ill` option tweaked to be consistent.

Added some tests to exercise these options, focusing on the handling
of title fragments.  Attempted to make the output formatting consistent
among options; a future unification (I32df68714ffdf2f0745b974f47bc3ccceef1f41c)
should help DRY these out further.

Bug: T310512
Change-Id: Ic9c766ae4362969de124ad9d66eb47cfa68395c6
2024-09-13 14:42:27 -04:00
Yiannis Giannelos
0509dbebad Sync up core repo with Parsoid
This now aligns with Parsoid commit 80bc41a395b19221e7f26b36dfbe0ab15a025819

Change-Id: Iec571f78e7a55991aea69ede2519803b84c05936
2024-09-12 18:58:43 +03:00
C. Scott Ananian
7249c4c982 parserTests.txt: Update documentation about cat/ill options
Parsoid does support these options now.

Change-Id: I9caedd10b8f7229602ad4f963275b62777aca104
2024-09-10 19:30:07 +00:00
dvorapa
10ab0e40a9 parser: Add a new {{USERLANGUAGE}} magic word for use in wikitext
Depending on configuration, this returns either the interface language
code of the current user or the current page language.

Bug: T4085
Change-Id: Iab7fda272ec81af88c74612727ff6bed014d4a81
2024-09-07 19:16:32 +00:00
jenkins-bot
512c78b8ea Merge "Make {{#language}} consistent with {{#dir}} and {{#bcp47}}"
2024-07-31 11:42:16 +00:00
jenkins-bot
52a10a36b1 Merge "Add {{#bcp47}} parser function"
2024-07-31 11:42:08 +00:00
jenkins-bot
f338ac3295 Merge "Add {{#dir}} parser function"
2024-07-30 20:34:27 +00:00
C. Scott Ananian
450fe7fcd8 Make {{#language}} consistent with {{#dir}} and {{#bcp47}}
Add the same no-arg options for language code that
{{#dir}} and {{#bcp47}} have, for consistency:
* `{{#language}}` will return the name of the *target language*
  (for articles, the content language; for messages, the user language)

The default value for the "in language" argument should be the autonym.
This was working previously but only via a baroque code flow path for
invalid language codes.  Make this a bit clearer and add tests.

Since non-autonym language code translations are added via the
[[Extension:CLDR]] in production, hook LanguageGetTranslatedLanguageNames
in the ParserTestRunner to ensure that we can test this.

Followup-To: Ice1c671c5b3cc077d2bb80ea5dc25c5eabbfeb36
Followup-To: I19c3e91a924e080f37dc95a0d4e61493583b533e
Change-Id: Ibf6e7f194cc056eadb48a5ad8e6d01a761d9351c
2024-07-30 20:27:17 +00:00
C. Scott Ananian
416c33bb6a Add {{#bcp47}} parser function
Template:Bcp47 is one of the most used templates in Wikimedia Commons.
Providing its functionality as a parser function, tied to MediaWiki's
language-handling code, reduces code duplication and will allow us to
reduce template usage on Commons.

As with the {{#dir}} parser function, support one special case:

* `{{#bcp47}}` will return the BCP-47 code of the *target language*
  (for articles, the content language; for messages, the user language)

Note the following slight differences from [[Template:BCP47]] on Commons,
documented in an added parser test:

* 'simple' maps to 'en-simple' (not just 'en')
* 'roa-tara' maps to 'nap-x-tara' (not 'it-x-tara')

Bug: T366623
Change-Id: Ice1c671c5b3cc077d2bb80ea5dc25c5eabbfeb36
2024-07-30 20:27:03 +00:00
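The two special cases listed in commit 416c33bb6a above can be captured in a small lookup sketch. Only the two mappings come from the commit message. The function name and the pass-through fallback are simplifying assumptions; the real parser function delegates to MediaWiki's language-handling code, which does more normalization than this.

```python
# The only two mappings taken from the commit message.
SPECIAL_CASES = {
    "simple": "en-simple",
    "roa-tara": "nap-x-tara",
}


def to_bcp47(internal_code):
    """Map a MediaWiki-internal language code to a BCP-47 code (toy version)."""
    key = internal_code.lower()
    if key in SPECIAL_CASES:
        return SPECIAL_CASES[key]
    # Assumption for illustration only: all other codes pass through unchanged.
    return internal_code
```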
Ebrahim Byagowi
e1385d3bdf Add {{#dir}} parser function
Template:Dir is one of the most used templates in Wikimedia Commons.
This tries to provide parts of its functionality in the hope that we can
simplify or get rid of the template eventually, for clarity and
performance reasons.

As a convenience, `{{#dir}}` and `{{#dir:}}` are synonyms for
`{{#dir:{{PAGELANGUAGE}}}}`: they return the direction of the target
language.  For articles, the target language is the content language;
for messages, the target language is the user language.

In addition, to avoid confusion between BCP-47 language codes and
MediaWiki-internal language codes, an optional second parameter can be
supplied.  If the second parameter is the (localizable) string
'bcp47', the language code given in the first parameter will be
treated as a BCP-47 code.  For example: `{{#dir:sr-Cyrl|bcp47}}`.

(See LanguageCode::bcp47ToInternal() for a description of the
differences and overlaps between MediaWiki internal and BCP-47
codes.  These overlaps *so far* don't result in any case where
the two interpretations give different results, but we are
encouraging editors to be precise about which set of enumerated
string values they are using, for consistency with other
language-related functions, and because MediaWiki internally
differentiates between BCP-47 codes and internal codes.)

Bug: T359761
Change-Id: I19c3e91a924e080f37dc95a0d4e61493583b533e
2024-07-19 16:57:48 -04:00
Tim Starling
ebf3c9be86 ParserTestRunner: add timezone and user language options
* Add wgLocaltimezone to the list of global variables which may be set
  in parser test options.
* Add userLanguage option, which is passed through to ParserOptions.

Bug: T223772
Change-Id: I8498527c276288feae854868a8f4b1f3205a49e8
2024-07-12 11:35:33 +10:00
C. Scott Ananian
c8e77a3707 Sync up core repo with Parsoid
This now aligns with Parsoid commit 2508e24a2aeb54b55eb54f7f65bedc4d477fc9cf

Change-Id: Ibb9f1c6287c6ec3e982f0fa3ddf908b01484973a
2024-06-10 23:29:02 -04:00
Bartosz Dziewoński
f0c7fa9234 Move section edit links outside headings (new heading HTML)
Legacy parser can now output headings using a more accessible markup,
which is also identical to the markup used by the Parsoid parser.

Changes to client-side JS and CSS necessary to support the new markup
have already been merged in earlier commits.

includes/skins/Skin.php
includes/ServiceWiring.php
* Define a new skin option, 'supportsMwHeading', which can be used
  to toggle the new markup per-skin.
* Update the built-in fallback skin to enable it. This affects the
  output in parser tests.

docs/config-schema.yaml
includes/config-schema.php
includes/config-vars.php
includes/MainConfigNames.php
includes/MainConfigSchema.php
* Add a new configuration setting, 'ParserEnableLegacyHeadingDOM',
  which can be used to toggle the new markup per-site.

includes/OutputTransform/Stages/HandleSectionLinks.php
* Output new heading HTML for skins that enabled the option.

tests/*
* Duplicate parser tests that cover heading generation to cover both
  new and old markup. Update other parser tests to use new markup.
* Add some unit and integration tests for the behavior of the skin
  option and some parser tests for edge cases of the new markup.

Bug: T13555
Change-Id: I1180169a8e83af834c2984ba16089e6277f2a8dd
2024-05-06 12:25:33 -04:00
Subramanya Sastry
33f2164096 Sync up core repo with Parsoid
This now aligns with Parsoid commit 902eb345ed701b635b98f03557276aa48b564cc2

Change-Id: I91c663a4f2ca00157fbd9337d1d0c72a98452591
2024-04-26 14:57:58 +05:30
Arlo Breault
de01ef7d20 Sync up core repo with Parsoid
This now aligns with Parsoid commit c296dca4af9a1d47200a3699e12d9884acc43150

Change-Id: I5a0e246171e9b58d77b2be945b802f381c1f40b2
2024-04-11 12:59:32 -04:00
jenkins-bot
2472cd9247 Merge "Substitute category default sort key when filling links table, not at parse time"
2024-04-11 14:59:33 +00:00
jenkins-bot
71b809f9c2 Merge "Don't strip non-newline whitespace from left side of language links"
2024-04-04 16:56:28 +00:00
jenkins-bot
31a686f9da Merge "Sync up core repo with Parsoid"
2024-04-01 04:00:01 +00:00
Subramanya Sastry
0cd8ecf2a5 Sync up core repo with Parsoid
This now aligns with Parsoid commit 16e27722c6c50618c78230952c1ad27948fc3a0b

Change-Id: I21067c1b22a494422184abf7c4bb50424b4fad56
2024-04-01 08:16:27 +05:30
C. Scott Ananian
63293370e5 Don't strip non-newline whitespace from left side of language links
This follows up on I5e87b33a956e296cdaf671fa99c9555944b73479 and makes
(invisible) language links consistent with how we handle (invisible)
category links.

Bug: T359886
Followup-To: I5e87b33a956e296cdaf671fa99c9555944b73479
Change-Id: I3e5567a91b47e0b04da928450644f3f475aaf51b
2024-03-29 18:46:16 -04:00
C. Scott Ananian
bf7120f80e Don't strip non-newline whitespace from left side of [[Category]] links
This follows up on a long series of tweaks to whitespace handling around
[[Category]] links (T2087, T87753, T174639) which aimed to simplify and
make intelligible the whitespace handling around category links without
allowing categories to break lists or paragraphs in which they are found.

Removing newlines but not other whitespace on the left-hand side of
category links should preserve the valuable features of T2087 et al
while still ensuring that the following all render equivalently:

  ABC [[Category:Foo]]DEF
  ABC[[Category:Foo]] DEF
  ABC [[Category:Foo]] DEF

Added parser test to document the new behavior; it's worth noting
that although there were plenty of tests documenting the expected
interaction of category links and newlines, there were previously
no tests covering the interaction of non-newline whitespace and
category links; the one test which needed to be altered added
non-semantic whitespace (ie, extra whitespace to the test output
which did not affect the way the HTML would display).

This patch brings the legacy parser into parity with Parsoid's parsing
of category links.

Bug: T359886
Change-Id: I5e87b33a956e296cdaf671fa99c9555944b73479
2024-03-29 22:30:59 +00:00
C. Scott Ananian
c2df535b9c Substitute category default sort key when filling links table, not at parse time
This ensures uniform treatment of all places that call `addCategory`
without duplicating the `defaultsort` code; it also ensures that the
effect of the {{DEFAULTSORT}} parser function is independent of page
position.

Bug: T40435
Bug: T353530
Change-Id: I4480a6d59e766fa4eddc9ec9117c58b66771bb47
2024-03-29 18:30:02 -04:00
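The idea behind commit c2df535b9c above (defer the default sort key until the links table is filled) can be sketched as a toy model. The event tuples, the function names, and the fallback to the page title are all assumptions made for illustration, not MediaWiki's actual data structures.

```python
def parse(events):
    """Collect categories and the default sort key without combining them.

    events: ("category", name, explicit_key_or_None) or ("defaultsort", key).
    """
    categories = []       # (name, explicit sort key or None)
    default_sort = None
    for event in events:
        if event[0] == "defaultsort":
            default_sort = event[1]
        else:
            categories.append((event[1], event[2]))
    return categories, default_sort


def fill_links_table(categories, default_sort, page_title):
    """Substitute the default sort key only now, after the whole page is parsed."""
    fallback = default_sort if default_sort is not None else page_title
    return {
        name: (key if key is not None else fallback)
        for name, key in categories
    }
```

Since substitution happens after parsing has finished, a `{{DEFAULTSORT}}` at the top of the page and one at the bottom yield the same rows, which is the position independence the commit message describes.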
thiemowmde
a15b6d516f parser: Fix formatdate parser function for ISO year 0 = 1 BC
I'm not sure how this ever happened, but I'm sure it's a mistake.
The following test scenario should make it very obvious:

* {{#formatdate:-0002-12-31|mdy}}
* {{#formatdate:-0001-12-31|mdy}}
* {{#formatdate:0000-12-31|mdy}}
* {{#formatdate:0001-12-31|mdy}}
* {{#formatdate:0002-12-31|mdy}}

Expected output: 3 BC, 2 BC, 1 BC, 1, 2, …
Current output: 3 BC, 2 BC, 0 (?), 1, 2, …

Note how "1 BC" is skipped and shown as "0" instead. Everything else
is correct, e.g. the ISO year -1 is already displayed as "2 BC".
It's really only this single outlier.

In case you don't know: There is no year 0 when the BC specifier is
used. There is either year 1 after or year 1 before Christ. This is
different in ISO, mostly to make calculations easier. That's why the
DateFormatter already does an extra `- 1` and `+ 1` in the two
makeIsoYear and makeNormalYear methods.

The problematic line of code was originally written in 2003, see
https://phabricator.wikimedia.org/rMW98fc03e6
The core parser function exists since 2009, see
https://phabricator.wikimedia.org/rMWb9ffb5a7

Change-Id: Iaeb7a954579a409fefd87dab4e2a15778ab39fb4
2024-02-27 17:17:36 +01:00
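The year arithmetic described in commit a15b6d516f above can be checked with a short sketch. The function name is hypothetical and the output format is simplified to a bare year string; it only models the ISO-year-to-BC offset, not the full date formatter.

```python
def display_year(iso_year):
    """Convert an ISO 8601 year number to a human-readable year.

    ISO numbering has a year 0, which corresponds to 1 BC, so every
    non-positive ISO year is shifted by one when shown with "BC".
    """
    if iso_year <= 0:
        return f"{1 - iso_year} BC"
    return str(iso_year)
```

Feeding it the ISO years -2, -1, 0, 1 and 2 gives 3 BC, 2 BC, 1 BC, 1 and 2, which matches the expected output listed in the commit message.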
C. Scott Ananian
3cebc721bb Sync up core repo with Parsoid
This now aligns with Parsoid commit 51baccc8741108a9e3f763f2c19c6ce6eda55ac4

Three tests needed to be disabled because they had dependencies on features
not included in core's CI:

* {{#if}} used in tests added by I71c38b42ac9bfb7137f2e34df70bdfa139abced7
  but only provided by the ParserFunctions extension
* <poem> used in tests added by I5a6356a82251881a5f841b36a7f26879fc611138
  but only provided by the Poem extension

In addition, the "multiline" part of the "Expansion of multi-line..."
parser tests seems to have been lost at some point.  My best guess is
that the definition of `Template:1x` initially included an extra
newline which was lost, maybe during an unrelated stripping of
leading/trailing whitespace in `!! article` clauses.  In any case,
these tests are no longer testing the thing they say they are.

These will be fixed in a follow up.

Change-Id: Ia9144634625f176fbea11f3d2ef4b21a5492e99b
2024-02-21 15:04:08 -05:00
Reedy
2295da3004 Fix more incorrect casing of MediaWiki
Change-Id: I331e5636823a0beae8d804148f648cfaffd6a1f8
2024-02-19 14:35:34 +00:00
Isabelle Hurbain-Palatin
7f63d5250e Revert "Use Remex for DeduplicateStyles transform"
This reverts commit 82da9cf14b.

Passing through Remex seems to have unexpected consequences to be
investigated but, for the sake of unbreaking the UBN, let's revert this
first.

Bug: T353920
Change-Id: Iaac7942aa77aee5ab525852ac5b41dd516ff13c9
2023-12-22 11:26:09 +01:00
jenkins-bot
132a7955ae Merge "Make two messages not raw HTML"
2023-12-18 18:59:57 +00:00
C. Scott Ananian
82da9cf14b Use Remex for DeduplicateStyles transform
The previous implementation was using an ad-hoc regular expression which
was matching inside the data-mw attribute of Parsoid output, eg:

 <sup about="#mwt42" [...] typeof="mw:Extension/ref mw:Error" data-mw="{&quot;name&quot;:&quot;ref&quot;,&quot;attrs&quot;:{&quot;name&quot;:&quot;infobox_stats_ref_rail&quot;},&quot;body&quot;:{&quot;html&quot;:&quot;<style data-mw-deduplicate=\&quot;TemplateStyles:r1133582631\&quot; typeof=\&quot;...">

After substitution, the <link> element inserted contained " instead of
&quot; and so broke out of the attribute.

Instead use a proper HTML tokenizer (via wikimedia/remex-html) so that
we don't allow bogus matches inside attribute values.

To fix up tests:
* Don't deduplicate styles when parsing UX messages (also helps performance)
* Don't deduplicate styles in ContentHandler integration tests
* Don't deduplicate styles by default in parser tests
  (unless explicit option is set)

Depends-On: Id9801a9ff540bd818a32bc6fa35c48a9cff12d3a
Depends-On: I5111f1fdb7140948b82113adbc774af286174ab3
Followup-To: Ic0b17e361bf6eb0e71c498abc17f5f67f82318f8
Change-Id: I32d3d1772243c3819e1e1486351d16871b6e21c4
2023-12-15 17:49:21 +01:00
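The failure mode described in commit 82da9cf14b above (a regular expression matching `<style>` inside an attribute value) can be reproduced in miniature. This sketch uses Python's standard `html.parser` as a stand-in tokenizer; it is not Remex, and the sample markup is a shortened, made-up variant of the snippet in the commit message.

```python
import re
from html.parser import HTMLParser

# A real <style> tag preceded by a <sup> whose data-mw attribute
# contains the text "<style data-mw-deduplicate=..." as escaped JSON.
SAMPLE = (
    '<sup data-mw="{&quot;body&quot;:{&quot;html&quot;:&quot;'
    '<style data-mw-deduplicate=\\&quot;TemplateStyles:r1\\&quot;>&quot;}}">[1]</sup>'
    '<style data-mw-deduplicate="TemplateStyles:r1">.a{}</style>'
)


def count_with_regex(html):
    """Naive approach: also matches text inside attribute values."""
    return len(re.findall(r"<style\b[^>]*\bdata-mw-deduplicate=", html))


class StyleCounter(HTMLParser):
    """Counts only genuine <style data-mw-deduplicate> start tags."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        if tag == "style" and any(name == "data-mw-deduplicate" for name, _ in attrs):
            self.count += 1


def count_with_tokenizer(html):
    parser = StyleCounter()
    parser.feed(html)
    parser.close()
    return parser.count
```

On `SAMPLE`, the regular expression reports two matches while the tokenizer reports one, because the tokenizer treats the quoted attribute value as opaque text.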
Jon Harald Søby
0e8a92d9ff Make two messages not raw HTML
Two messages were added to wgRawHtmlMessages instead of just
fixing the way they were parsed so they can't contain raw
HTML. This fixes that.

In order to avoid breakage on-wiki for old customized messages
that took advantage of them being parsed as raw HTML, rename
the messages too. Also rename a few other messages from the
same set to stay consistent.

Note: These messages are suppressed in favour of Echo's messages
when Echo is enabled, and Echo is enabled on all Wikimedia wikis,
so the existing customized messages on Wikimedia wikis are basically
no-ops.

Bug: T353316
Change-Id: Ib0d1c79247fe091f2806b7c23ffb2fe22cc4df4a
2023-12-15 11:10:37 +01:00
Subramanya Sastry
f1772fc150 Sync up core repo with Parsoid
This now aligns with Parsoid commit f73c9f0f665a57f5c0247ad1973a4f33f165f96b

Change-Id: Ibd531ddb1d545c1286e3cd3c3c6c08536f954768
2023-11-07 13:11:04 -06:00
Bartosz Dziewoński
e9a281ef4c Add parser test for escaped wikitext in section heading
Change-Id: I4f0c2107541b668f6ddd093dadcb6f391724d57f
2023-10-09 14:46:51 +00:00
Isabelle Hurbain-Palatin
8706602346 Sync up core repo with Parsoid
This now aligns with Parsoid commit 273c783374efdb148f26d7a0f3d590eb6ae66551

Change-Id: I742825115730b5697a1da47ce5d135adcdef1f8c
2023-07-13 18:15:47 +02:00
Arlo Breault
498c00ab25 Sync up core repo with Parsoid
This now aligns with Parsoid commit 0dc439dd46b5db02bd515d642caa15f9e081270d

Change-Id: I513703b4c1f002c75afd7d4792d47aa3cca0e726
2023-05-26 17:05:07 -04:00
jenkins-bot
a4cb5e6519 Merge "Sync up core repo with Parsoid"
2023-05-25 18:18:29 +00:00
Arlo Breault
a9ea70bf6c Sync up core repo with Parsoid
This now aligns with Parsoid commit db0772cd77d89ea166bf6ea162f9d223264a6f50

Change-Id: I988d8e3bd4953fdf8e71ca0ed72f2f0755e4948c
2023-05-25 13:45:34 -04:00
Matt Fitzpatrick
a7e4d70d45 Sanitizer: Permit the aria-level HTML attribute in wikitext
Allows editors to identify a pseudo-heading as a heading of a given
hierarchical level to assistive technologies. Also allows levels 7
and deeper.

<div role="heading" aria-level="2">Example</div>

See also https://www.w3.org/WAI/GL/wiki/Using_role%3Dheading

Change-Id: Ia465a076db334d08cd1f548f2363a0f7cafe7690
2023-05-21 12:57:53 +03:00
C. Scott Ananian
85a3cc74c4 Sync up core repo with Parsoid
This now aligns with Parsoid commit ede7e1c0afab3dea5c02033b9ad4e9a064e27717

Change-Id: Ib8ec513f3cef75c071b6d08913a18515a15ec82a
2023-05-11 14:49:23 -04:00
Arlo Breault
cd0d6aeba0 Sync up core repo with Parsoid
This now aligns with Parsoid commit eb7a6ce7afac292b7e8a43c622fea6ac65791fc1

Change-Id: Ie704588c71bff4525632e6aa918ae6d0bd3364fb
2023-04-26 14:11:39 -04:00
Arlo Breault
30b8fe564b Add classes on elements inside the media structure
The purpose is to improve the performance of the CSS selectors
targeting the output, as analyzed in T270150#8524965, as well as to
eliminate some of the brittleness in depending on direct descendants
and first-child, which can be seen in T320285 and T304010.

mw-file-element is targeted to apply margins, borders, and
vertical-alignment to that element.  The current css rules have wildcard
selectors in the rightmost position, which, since css is parsed
right-to-left, can be quite slow on a wiki page.  The legacy parser has
an equivalent class, thumbimage, when rendering thumbs but here we apply
the class more broadly.

A follow-up patch in I70c61493fe492445702f036e5b24ef87fc3bdf43 will
remove the redundant wildcard selectors once parsercache has turned
over.

Bug: T270150
Bug: T314097
Depends-On: Ie85ee7048273023a2c51f42a333a9c1493360404
Depends-On: Ie0ec018ac6c2c42c05610b342d7ef87493dfdc42
Depends-On: Ifc17fdf530af515b066de706ca5e69e118fd1c5b
Depends-On: Ib60edacdae2ff41a0de2b2b584718fd9ce925f97
Change-Id: Ifd4001e312a5fa4b7beaad63ba8c4e79e3201b9b
2023-04-26 12:39:25 -04:00
C. Scott Ananian
fe40b55f7d ParserTestRunner: use TOCData::prettyPrint() for 'showtocdata'
This provides a bit of isolation from the actual layout and names
of properties in the object, as well as being a touch more readable
when debugging test failures.

Change-Id: I5ddca850f577b2ac24e237a2518f03983e79a51d
2023-03-10 16:41:49 -05:00
C. Scott Ananian
4e4008c976 Don't clear LanguageConverter display title when converting ToC
The LanguageConverter::convert()/::convertTo() methods clear the
converted title and reset other (less important) bits of
LanguageConverter state.  Add an optional parameter in order
to skip this reset.

(The LanguageConverter::translate() methods are available which
don't reset LanguageConverter state, but they also don't process
embedded language converter markup.  Since headings can contain
embedded markup, the ::translate() methods aren't appropriate.)

Bug: T306862
Bug: T331316
Change-Id: Ifb2745e45974755ba5a6068c13e84be6c4e3f329
2023-03-09 13:08:01 -05:00
C. Scott Ananian
93073d4632 ParserTestRunner: handle metadata output as separate section
If a ParserTest mixes HTML output and metadata properties, it can
complicate HTML normalization and other test processes, especially
for Parsoid-mode bidirectional tests.

Support splitting metadata output into a separate section, named
`!! metadata`, with the standard options for legacy and parsoid
variants, like `!! metadata/php` and `!! metadata/parsoid` and
`!! metadata/parsoid+integrated` etc.

For compatibility, if the metadata flags are present on the test
and the new section is not present, we'll continue to handle the
metadata output as we have before, aka append or prepend the metadata
to the HTML.

Code search for uses of these options (uses in parsoid and core can
be ignored; uses of 'pst' are harmless when they are not combined
with another option):

  https://codesearch.wmcloud.org/search/?q=%28%5E%7C%20%29%28%28showtitle%7Cshowindicators%7Cill%7Ccat%7Cpst%7Cshowflags%29%28%20%7C%24%29%7C%28extension%3D%7Cproperty%3D%29%29&i=nope&files=%5Etests%2Fparser%2F.*%5C.txt&excludeFiles=&repos=

Change-Id: I845694d4f2109a8b9125410e8533ca69bbea50fa
2023-02-28 17:26:08 -05:00
C. Scott Ananian
e7a762fd59 Language-convert Table of Contents at parse time
In 24949480eb (Oct 2021) injection of
the Table of Contents was moved from Parser to
ParserOutput::getText(); that is, from parse time to "postprocess text
possibly fetched from the cache" time.  Unfortunately, this meant that
language conversion wasn't done on the table of contents (!), for
either traditional skins or the vector-2022 skin.  This was fixed for
traditional skins by 059e62cde6 (Nov
2021), later amended by 0955046ca5 (Mar
2022), which added explicit language conversion to the TOC injection
process in ParserOutput::getText().  This fix was still not complete,
however, since editor-defined custom language-conversion rules defined
in the article body were no longer available to the language converter
when conversion was done in ParserOutput::getText(); the ToC title was
also being double-converted.  Further, neither of these short-term
fixes addressed the output of ParserOutput::getSections() (now
ParserOutput::getTOCData()) which was used by vector-2022 to generate
the ToC in the sidebar and which remained entirely unconverted.

With 439656e019 (Jan 2023), we started
using the ::getSections()/::getTOCData() output for main article text
as well, but we kept the previous hack which post-converted the
generated HTML. This kept old skins at parity with the post-Oct-2021
status, but also didn't address the conversion issue for vector-2022.

The solution here is to perform language conversion on the ToC lines
at parse time along with the rest of the language conversion, and
store *converted* headings in TOCData.  This has a number of side
effects:

1. The ToC information array available via the action API
is now language converted.  This is *probably* what you wanted in the
first place, but could potentially be disruptive.

2. The ToC is consistently converted with the full set of
editor-defined custom conversion rules.  Before Oct 2021, the ToC was
converted using the set of custom conversion rules *active at the
point at which the ToC was inserted* (which was usually near the
beginning of the article).  When all conversion rules appear at the
very top of the article (best practice!):

 -{en:Foo; en-x-piglatin:Bar;}-
 Lead section text
 == Introduction ==
 == Foo ==

There should be no difference between pre-Oct 2021 behavior and the
behavior after this patch: in both cases the rule defined in the
article body will be applied both to the heading and to the TOC, and
they will be consistent.  (After Oct 2021 and before this patch, Foo
would be converted in the heading but not in the table of contents.)

But in cases where conversion rules are defined after the
TOC insertion point, the section heading as it appears in the body
text could appear different from the section heading as it appears in
the ToC.  For example, if you defined a conversion rule just before
using a term in a heading:

 == Introduction ==
 -{en:Foo; en-x-piglatin:Bar;}-
 == Foo ==

Before Oct 2021, this rule would be applied to the heading, but not to
the TOC (because the TOC insertion point was before the rule
definition).  This would also be the behavior before this patch (since
rules defined in the article body are currently not applied at all).
After this patch, the rule will be applied to both the heading and the
TOC (because the rule application location is effectively "at the very
end of the article").  In the rare cases when rules are not defined in
glossaries at the top of the article, this type of usage (definition
immediately preceding first use) is expected to be the most common
and the behavior after this patch is more correct.

But alternatively, if you defined a conversion rule *after* using
the term in a heading:

 == Introduction ==
 == Foo ==
 -{en:Foo; en-x-piglatin:Bar;}-

Before Oct 2021, this rule wouldn't be applied to the heading *or* the
TOC.  Before this patch, this would also be the case (because rules
defined in the article body are not applied at all).  After this
patch, the rule will be applied to the ToC but not the heading, since
the application point for the TOC is effectively at the end of the
article.  This inconsistency is probably not desirable, but this case
is expected to be rare, and (assuming the editor intended 'Foo' to be
unconverted) the editor can work around the inconsistency by
explicitly protecting 'Foo' from conversion:

  == -{Foo}- ==
  -{en:Foo; en-x-piglatin:Bar;}-

And if the editor /intended/ Foo to be converted, the rule definition
should be moved earlier in the article.  Again, putting all rules at
the top of the article is the preferred style, and works better with
the glossary style used by the zhwiki community (see also
https://www.mediawiki.org/wiki/Requests_for_comment/Scoped_language_converter
).

Bug: T306862
Depends-On: I0c9c9fec920f7cb028d935e552a8f11475a23ba7
Change-Id: I321cd31dae64bbf845d53282e5d28a55bc4ec319
2023-02-24 10:09:53 -05:00
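The three rule-placement scenarios discussed in commit e7a762fd59 above can be replayed with a toy model. It assumes, as the commit message puts it, that body headings see only the rules defined before them while the ToC is converted as if "at the very end of the article". The tuple format and function names are hypothetical, and conversion is reduced to plain string replacement.

```python
def convert(text, rules):
    """Apply every active conversion rule as a plain string replacement."""
    for source, target in rules.items():
        text = text.replace(source, target)
    return text


def render(items):
    """items: ("rule", source, target) or ("heading", text), in page order.

    Returns (headings as shown in the body, headings as shown in the ToC).
    """
    active = {}
    raw_headings = []
    body_headings = []
    for item in items:
        if item[0] == "rule":
            active[item[1]] = item[2]
        else:
            raw_headings.append(item[1])
            # Body headings only see rules defined earlier in the page.
            body_headings.append(convert(item[1], active))
    # ToC lines are converted with the full rule set, as if at end of page.
    toc = [convert(heading, active) for heading in raw_headings]
    return body_headings, toc
```

With the rule at the top of the page, or just before its first use, heading and ToC agree. With the rule after the heading, the body keeps `Foo` while the ToC shows `Bar`, which is the inconsistency the commit message flags as rare but possible.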
C. Scott Ananian
e73aef2d97 Sync up core repo with Parsoid
This now aligns with Parsoid commit 90bc541138035d4ff6b62efa0050bd03161bc43b

Change-Id: I9f16f71996da5e5baf1e0506129342a25c2ece75
2023-02-23 10:30:18 -05:00
jenkins-bot
9200e04403 Merge "ParserTestRunner: Move 'showflags' handling inside ::addParserOutputInfo()"
2023-02-23 03:57:48 +00:00