wiki.techinc.nl/includes/parser/RemexStripTagHandler.php


<?php
/*
 * Commit of 2022-01-21 22:03:26 +00:00:
 *
 * Add Sanitizer::removeSomeTags() which uses Remex to tokenize
 *
 * The existing Sanitizer::removeHTMLtags() method, in addition to having
 * dodgy capitalization, uses regular expressions to parse the HTML. That
 * produces corner cases like T298401 and T67747 and is not guaranteed to
 * yield balanced or well-formed HTML.
 *
 * Instead, introduce and use a new Sanitizer::removeSomeTags() method which
 * is guaranteed to always return balanced and well-formed HTML.
 *
 * Note that Sanitizer::removeHTMLtags()/::removeSomeTags() take a callback
 * argument which (as far as I can tell) is never used outside core. Mark
 * that argument as @internal, and clean up the version used by
 * ::removeSomeTags().
 *
 * Use the new ::removeSomeTags() method in the two places where DISPLAYTITLE
 * is handled (following up on T67747). The use by the legacy parser is more
 * difficult to replace (and would have a performance cost), so leave the old
 * ::removeHTMLtags() method in place for that call site for now: when the
 * legacy parser is replaced by Parsoid the need for the old
 * ::removeHTMLtags() will go away. In a follow-up patch we'll rename
 * ::removeHTMLtags() and mark it @internal so that we can deprecate
 * ::removeHTMLtags() for external use.
 *
 * Some benchmarking code added. On my machine, with PHP 7.4, the new method
 * tidies short 30-character title strings at a rate of about 6764/s while
 * the tidy-based method being replaced here managed 6384/s.
 * Sanitizer::removeHTMLtags blazes through short strings 20x faster
 * (120,915/s); some of this difference is due to the setup cost of creating
 * the tag whitelist and the Remex pipeline, so further optimizations could
 * doubtless be done if Sanitizer::removeSomeTags() is more widely used.
 *
 * Bug: T299722
 * Bug: T67747
 * Change-Id: Ic864c01471c292f11799c4fbdac4d7d30b8bc50f
 */

namespace MediaWiki\Parser;

use Wikimedia\RemexHtml\Tokenizer\Attributes;
use Wikimedia\RemexHtml\Tokenizer\NullTokenHandler;

/**
 * Helper class for Sanitizer::stripAllTags().
 * @internal
 */
class RemexStripTagHandler extends NullTokenHandler {
	/** @var bool Whether the tokenizer is currently inside a non-visible tag */
	private $insideNonVisibleTag = false;
	/** @var string Visible text accumulated so far */
	private $text = '';

	/**
	 * @return string The accumulated text, with all tags stripped
	 */
	public function getResult() {
		return $this->text;
	}

	public function characters( $text, $start, $length, $sourceStart, $sourceLength ) {
		// Skip character data found inside non-visible tags.
		if ( !$this->insideNonVisibleTag ) {
			$this->text .= substr( $text, $start, $length );
		}
	}

	public function startTag( $name, Attributes $attrs, $selfClose, $sourceStart, $sourceLength ) {
		if ( $this->isNonVisibleTag( $name ) ) {
			$this->insideNonVisibleTag = true;
		}
		// Inject whitespace for typical block-level tags to
		// prevent merging unrelated<br>words.
		if ( $this->isBlockLevelTag( $name ) ) {
			$this->text .= ' ';
		}
	}

	public function endTag( $name, $sourceStart, $sourceLength ) {
		if ( $this->isNonVisibleTag( $name ) ) {
			$this->insideNonVisibleTag = false;
		}
		// Inject whitespace for typical block-level tags to
		// prevent merging unrelated<br>words.
		if ( $this->isBlockLevelTag( $name ) ) {
			$this->text .= ' ';
		}
	}

	// Per https://developer.mozilla.org/en-US/docs/Web/HTML/Block-level_elements
	// retrieved on September 12, 2018. <br> is not block-level but was added anyway.
	// The following is a complete list of all HTML block-level elements
	// (although "block-level" is not technically defined for elements that are
	// new in HTML5).
	// Structured as tag => true to allow an O(1) membership test.
	private const BLOCK_LEVEL_TAGS = [
		'address' => true,
		'article' => true,
		'aside' => true,
		'blockquote' => true,
		'br' => true,
		'canvas' => true,
		'dd' => true,
		'div' => true,
		'dl' => true,
		'dt' => true,
		'fieldset' => true,
		'figcaption' => true,
		'figure' => true,
		'footer' => true,
		'form' => true,
		'h1' => true,
		'h2' => true,
		'h3' => true,
		'h4' => true,
		'h5' => true,
		'h6' => true,
		'header' => true,
		'hgroup' => true,
		'hr' => true,
		'li' => true,
		'main' => true,
		'nav' => true,
		'noscript' => true,
		'ol' => true,
		'output' => true,
		'p' => true,
		'pre' => true,
		'section' => true,
		'table' => true,
		'td' => true,
		'tfoot' => true,
		'th' => true,
		'tr' => true,
		'ul' => true,
		'video' => true,
	];

	/**
	 * Detect block-level tags. Of course CSS can make anything a
	 * block-level tag, but this is still better than nothing.
	 *
	 * @param string $tagName HTML tag name
	 * @return bool True when the tag is an HTML block-level element
	 */
	private function isBlockLevelTag( $tagName ) {
		$key = strtolower( trim( $tagName ) );
		return isset( self::BLOCK_LEVEL_TAGS[$key] );
	}

	// Structured as tag => true, like BLOCK_LEVEL_TAGS.
	private const NON_VISIBLE_TAGS = [
		'style' => true,
		'script' => true,
	];

	/**
	 * Detect tags whose content is non-visible by default.
	 * Of course CSS can make anything non-visible,
	 * but this is still better than nothing.
	 *
	 * We use this primarily to hide TemplateStyles
	 * from output in notifications, emails, etc.
	 *
	 * @param string $tagName HTML tag name
	 * @return bool True when the tag is an HTML element which should be filtered out
	 */
	private function isNonVisibleTag( $tagName ) {
		$key = strtolower( trim( $tagName ) );
		return isset( self::NON_VISIBLE_TAGS[$key] );
	}
}