wiki.techinc.nl/includes/OutputTransform
Máté Szabó acb403ccfd DeduplicateStyles: Only transform possible style nodes
Why:

- DeduplicateStyles runs as a default post-cache output transformation
  for every backend pageview. It tokenizes the article HTML via Remex to
  deduplicate style nodes within.
- This is expensive for large pages. On the Barack Obama page, the
  transform takes 350+ ms on a parser cache hit.
- Some other transforms, like HandleSectionLinks, already use regexes to
  only run Remex-driven transforms on relevant elements to avoid a
  potentially expensive tokenization of the whole page.

What:

- Use a regular expression to limit this transform so that it only
  tokenizes potential <style> nodes. This takes ~2ms to execute on a
  large page[1], compared to ~166ms currently.
- Restrict this optimization to legacy parser output transformations,
  since the naïve regex used might otherwise match encoded style tags
  within data-parsoid attribute values, as described in
  I32d3d1772243c3819e1e1486351d16871b6e21c4.
  Add a test for this.

[1] https://en.m.wikipedia.org/wiki/Democratic_Party_(United_States)?action=render

Bug: T394059
Change-Id: I33ebcc2da7685b4b6dafdad3ed3ef2a9edea9a00
(cherry picked from commit 02f69d5dc99a964981c57b597eedffa1f253a14c)
2025-10-03 23:19:17 +00:00
Stages DeduplicateStyles: Only transform possible style nodes 2025-10-03 23:19:17 +00:00
ContentDOMTransformStage.php Namespace all remaining classes in includes/parser 2024-10-15 23:54:32 +01:00
ContentTextTransformStage.php Namespace all remaining classes in includes/parser 2024-10-15 23:54:32 +01:00
DefaultOutputPipelineFactory.php SECURITY: Ensure emitted HTML is safe against Unicode NFC normalization 2025-04-10 15:56:06 +01:00
OutputTransformPipeline.php Namespace all remaining classes in includes/parser 2024-10-15 23:54:32 +01:00
OutputTransformStage.php Namespace all remaining classes in includes/parser 2024-10-15 23:54:32 +01:00
README.md

Output transformation pipelines for wikitext

The classes in the Stages/ subdirectory contain HTML and DOM transforms for use in output-processing pipelines, i.e. postprocessors for ParserOutput objects that either result directly from a parse or are fetched from ParserCache.

The default pipeline is created by DefaultOutputPipelineFactory; it corresponds to what was previously contained in ParserOutput::getText. Each stage's shouldRun method uses defaults that indicate whether the stage runs in the default OutputTransformPipeline.
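
The way a pipeline consults shouldRun before applying each stage can be sketched roughly as follows. This is a minimal illustration, not the code in this directory: the `Sketch*` interface and classes, their simplified signatures (plain HTML strings and an options array instead of ParserOutput objects), and the `wrap` and `upper` option names are all hypothetical.

```php
<?php
declare( strict_types=1 );

// Hypothetical, simplified stand-in for a stage; not the real interface.
interface SketchStage {
	/** Whether this stage should run for the given options. */
	public function shouldRun( array $options ): bool;

	/** Apply the transform and return the new HTML. */
	public function transform( string $html, array $options ): string;
}

// Runs by default; callers can opt out with [ 'wrap' => false ].
final class SketchWrapperStage implements SketchStage {
	public function shouldRun( array $options ): bool {
		return (bool)( $options['wrap'] ?? true );
	}

	public function transform( string $html, array $options ): string {
		return '<div class="sketch-output">' . $html . '</div>';
	}
}

// Skipped by default; callers can opt in with [ 'upper' => true ].
final class SketchUppercaseStage implements SketchStage {
	public function shouldRun( array $options ): bool {
		return (bool)( $options['upper'] ?? false );
	}

	public function transform( string $html, array $options ): string {
		return strtoupper( $html );
	}
}

// Applies each registered stage in order, skipping those whose
// shouldRun() returns false for the given options.
final class SketchPipeline {
	/** @var SketchStage[] */
	private array $stages = [];

	public function addStage( SketchStage $stage ): self {
		$this->stages[] = $stage;
		return $this;
	}

	public function run( string $html, array $options = [] ): string {
		foreach ( $this->stages as $stage ) {
			if ( $stage->shouldRun( $options ) ) {
				$html = $stage->transform( $html, $options );
			}
		}
		return $html;
	}
}

$pipeline = ( new SketchPipeline() )
	->addStage( new SketchUppercaseStage() )
	->addStage( new SketchWrapperStage() );

// With no options, only the wrapper stage runs.
echo $pipeline->run( '<p>hi</p>' ), "\n";
// Opting in to the uppercase stage and out of the wrapper stage.
echo $pipeline->run( '<p>hi</p>', [ 'upper' => true, 'wrap' => false ] ), "\n";
```

Keeping the run/skip decision inside each stage means the pipeline itself stays a plain loop, and a caller changes behaviour only by passing options rather than by rebuilding the list of stages.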