Why:
- DeduplicateStyles runs as a default post-cache output transformation for every backend pageview. It tokenizes the article HTML via Remex to deduplicate the style nodes within it.
- This is expensive for large pages. On the Barack Obama page, the transform takes 350+ ms on a parser cache hit.
- Some other transforms, like HandleSectionLinks, already use regexes to run Remex-driven transforms only on relevant elements, avoiding a potentially expensive tokenization of the whole page.

What:
- Use a regular expression to limit this transform so that it only tokenizes potential <style> nodes. This takes ~2ms to execute on a large page[1], compared to ~166ms currently.
- Restrict this optimization to legacy parser output transformations, since the naïve regex used might otherwise match encoded style tags within data-parsoid attribute values, as described in I32d3d1772243c3819e1e1486351d16871b6e21c4.
- Add a test for this.

[1] https://en.m.wikipedia.org/wiki/Democratic_Party_(United_States)?action=render

Bug: T394059
Change-Id: I33ebcc2da7685b4b6dafdad3ed3ef2a9edea9a00
(cherry picked from commit 02f69d5dc99a964981c57b597eedffa1f253a14c)
| Name |
|---|
| Stages/ |
| ContentDOMTransformStage.php |
| ContentTextTransformStage.php |
| DefaultOutputPipelineFactory.php |
| OutputTransformPipeline.php |
| OutputTransformStage.php |
| README.md |
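The regex pre-filter described in the commit message can be illustrated with a small, language-neutral sketch. It is written in Python rather than MediaWiki's PHP and is not the actual DeduplicateStyles code: the function name `dedupe_styles`, the helper class `_StyleKeyReader`, and the exact replacement markup are assumptions made for illustration. The idea is that a cheap regular expression finds candidate `<style>` elements, and only those short fragments are handed to a real HTML tokenizer (here the standard-library `html.parser`, standing in for Remex).

```python
import html
import re
from html.parser import HTMLParser

# Naive pattern: it finds anything that looks like a <style> element.
# It cannot tell real elements from look-alike text elsewhere in the page.
STYLE_RE = re.compile(r"<style\b[^>]*>.*?</style>", re.DOTALL | re.IGNORECASE)


class _StyleKeyReader(HTMLParser):
    """Hypothetical helper: tokenizes one fragment and records its dedup key."""

    def __init__(self):
        super().__init__()
        self.key = None

    def handle_starttag(self, tag, attrs):
        if tag == "style":
            # Attribute name assumed for illustration.
            self.key = dict(attrs).get("data-mw-deduplicate")


def dedupe_styles(page_html):
    """Replace repeated keyed <style> elements with a short placeholder link."""
    seen = set()

    def replace(match):
        fragment = match.group(0)
        reader = _StyleKeyReader()
        reader.feed(fragment)  # tokenize only this fragment, not the page
        reader.close()
        key = reader.key
        if key is None:
            return fragment  # unkeyed styles are left untouched
        if key in seen:
            # Placeholder markup assumed for illustration.
            return '<link rel="mw-deduplicated-inline-style" href="mw-data:{}">'.format(
                html.escape(key, quote=True)
            )
        seen.add(key)
        return fragment

    return STYLE_RE.sub(replace, page_html)
```

Because the pattern is naive, it would also match style-like text that is not a real element, which is the reason the commit message gives for restricting the optimization to legacy parser output.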
Output transformation pipelines for wikitext
The classes in the Stages/ subdirectory contain HTML and DOM transforms for use in
output processing pipelines, i.e. postprocessors for ParserOutput objects that either
result directly from a parse or are fetched from ParserCache.
The default pipeline is created by DefaultOutputPipelineFactory; it corresponds to
what was previously contained in ParserOutput::getText. The shouldRun method in each
stage uses defaults that indicate whether the stage runs in the default
OutputTransformPipeline.
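
The relationship between stages, their shouldRun defaults, and the pipeline can be sketched as follows. This is a language-neutral illustration in Python, not the MediaWiki PHP API: the stage classes, option names, and method signatures are hypothetical, and only the general shape (each stage decides from the options whether it runs, and the pipeline applies the stages in order) is taken from the description above.

```python
import re


class UppercaseHeadings:
    """Hypothetical stage that runs by default unless switched off."""

    def should_run(self, options):
        return options.get("uppercase_headings", True)

    def transform(self, html, options):
        return re.sub(
            r"<h2>(.*?)</h2>",
            lambda m: "<h2>" + m.group(1).upper() + "</h2>",
            html,
        )


class AddWrapper:
    """Hypothetical stage that is skipped by default and runs only on request."""

    def should_run(self, options):
        return options.get("wrapper_class") is not None

    def transform(self, html, options):
        return '<div class="{}">{}</div>'.format(options["wrapper_class"], html)


class Pipeline:
    """Applies each stage in order, skipping stages whose should_run is false."""

    def __init__(self, stages):
        self.stages = list(stages)

    def run(self, html, options=None):
        options = options or {}
        for stage in self.stages:
            if stage.should_run(options):
                html = stage.transform(html, options)
        return html


def default_pipeline():
    # Plays the role of a factory that assembles the default stage list.
    return Pipeline([UppercaseHeadings(), AddWrapper()])
```

With no options, `default_pipeline().run("<h2>a</h2>")` applies only the first stage; passing `{"wrapper_class": "c"}` enables the second as well. Keeping the run/skip decision inside each stage means callers change behaviour through options rather than by rebuilding the pipeline.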