Commit graph

13 commits

Author SHA1 Message Date
C. Scott Ananian
9f14fbd002 Add Sanitizer::removeSomeTags() which uses Remex to tokenize
The existing Sanitizer::removeHTMLtags() method, in addition to having
dodgy capitalization, uses regular expressions to parse the HTML.
That produces corner cases like T298401 and T67747 and is not guaranteed
to yield balanced or well-formed HTML.

Instead, introduce and use a new Sanitizer::removeSomeTags() method
which is guaranteed to always return balanced and well-formed HTML.

Note that Sanitizer::removeHTMLtags()/::removeSomeTags() take a callback
argument which (as far as I can tell) is never used outside core. Mark
that argument as @internal, and clean up the version used by
::removeSomeTags().

Use the new ::removeSomeTags() method in the two places where
DISPLAYTITLE is handled (following up on T67747).  The use by the
legacy parser is more difficult to replace (and would have a
performace cost), so leave the old ::removeHTMLtags() method in place
for that call site for now: when the legacy parser is replaced by
Parsoid the need for the old ::removeHTMLtags() will go away.  In a
follow-up patch we'll rename ::removeHTMLtags() and mark it @internal
so that we can deprecate ::removeHTMLtags() for external use.

Some benchmarking code added.  On my machine, with PHP 7.4, the new
method tidies short 30-character title strings at a rate of about
6764/s while the tidy-based method being replaced here managed 6384/s.
Sanitizer::removeHTMLtags blazes through short strings 20x faster
(120,915/s); some of this difference is due to the set up cost of
creating the tag whitelist and the Remex pipeline, so further
optimizations could doubtless be done if Sanitizer::removeSomeTags()
is more widely used.

Bug: T299722
Bug: T67747
Change-Id: Ic864c01471c292f11799c4fbdac4d7d30b8bc50f
2022-03-04 14:06:02 -05:00
Derk-Jan Hartman
8e06927190 Make Sanitizer::stripAllTags() strip css and js tag contents
We use Sanitizer::stripAllTags primarily to remove formatting from
html so that we can use it in places like notifications, emails,
search result blurbs etc etc.

It is very unlikely we want the raw contents of css and/or js tags
anywhere in those places, so lets surpress that content, to make it
more readable as template styles are showing up in more and more
places.

Bug: T228856
Change-Id: I7930361068ddcf3a6c2fdebd0177d142f025b64f
2021-12-22 23:26:17 +00:00
C. Scott Ananian
b1f53045d7 Bump wikimedia/remex-html to 2.3.2 and drop 2.3.1
This is a bug fix release of RemexHtml, required by the latest version
of Parsoid.

RemexHtml migrated to a new namespace in 2.3.2.  Since we don't
support aliases in our phan configuration in core, update all uses to
the new namespace to satisfy phan.

Depends-On: I30f01f4a2a5479bb82c9b952ffa68a478215828a
Depends-On: Iedf446635ee2112cfe637d8ebcf8092f0976bd17
Change-Id: I74fc929e4a66b28bfb1800ff0cd751c86e4a9f50
2021-08-08 18:07:29 -04:00
C. Scott Ananian
2fa79194ad Allow core to use remex-html 2.3.2
This is a bug fix release of RemexHtml, required by the latest version
of Parsoid.

RemexHtml migrated to a new namespace in 2.3.2 and uses aliases for
compatibility.  Once we upgrade mediawiki-vendor we can rename all
the uses in core and turn off aliases again.

Due to T287419, we need to suppress some phan issues because phan
ends up running against both remex 2.3.1 *and* 2.3.2 in different
CI jobs.  These suppressions are removed in the follow up
I74fc929e4a66b28bfb1800ff0cd751c86e4a9f50.

Change-Id: I42edd4fb8cd277ea20e331994fcbe56b52bf3f06
2021-08-08 17:55:15 -04:00
jenkins-bot
0ebeb72733 Merge "Simplify RemexStripTagHandler by extending NullTokenHandler" 2019-11-04 14:42:30 +00:00
Max Semenik
8a98dd9d59 Convert some private static arrays to constants
Remove @since for some private ones as we don't guarantee anything
about private class members.

Change-Id: Ifb898353c02082e9ef69d67f69339345c6cd154d
2019-10-16 01:30:54 +00:00
Tim Starling
ee80c3f3f4 Simplify RemexStripTagHandler by extending NullTokenHandler
Reduce code size and improve forwards compatibility.

Change-Id: I844d06923cbe965581e911afe8c9d91e8e61079c
2019-08-19 11:21:56 +10:00
Reedy
9f2ffdfbd4 Remove "Squiz.WhiteSpace.FunctionSpacing" from phpcs exclusions
Change-Id: I78b3315f26ab91b6b443f5b028a635552f82f5a3
2019-05-11 02:44:26 +01:00
Erik Bernhardson
aef02d516d Improve RemexStripTagHandler working with tables
HTML, generated by some infoboxes and perhaps other places, gets
stripped in a way that merges words together that should not be
merged. Add tr, th, and td to the list of tags that should force
word separation.

Bug: T218001
Change-Id: Ib374339628b1f543ea4e07f24aa3e3b76f3117b5
2019-03-14 13:11:59 -07:00
Kunal Mehta
cc5d9a92a2 build: Updating mediawiki/mediawiki-codesniffer to 24.0.0
Change-Id: I66b1775b7c1d36076d9ca78cbeb42787a743f2aa
2019-02-07 18:39:42 +00:00
Jakub Vrana
9f14c02e20 Remove duplicate keys from arrays
Found by PHPStan.

Change-Id: Ie0e0cfa33b3caa4a13f4dfb04c772c8a0284435a
2018-11-26 19:22:08 +01:00
Erik Bernhardson
0d779c1ac6 Preserve whitespace in search index text content
Certain html tags imply a word break, but our html stripping doesn't
understand that at all. Adjust the html stripping to inject whitespace
for all block level tags (per MDN) along with the <br> element.

Bug: T195389
Change-Id: I9fbfac765ea88628e4f9b2794fb54e1cd0060203
2018-09-14 11:10:35 -07:00
Roan Kattouw
ddb4913f53 Use Remex in Sanitizer::stripAllTags()
Using a real HTML tokenizer fixes bugs when < or > appear in attribute
values. The old implementation used delimiterReplace(), which didn't
handle this case:

    > print Sanitizer::stripAllTags( '<p data-foo="a&lt;b>c">Hello</p>' );
    c">Hello

We also can't use PHP's built-in strip_tags() because it doesn't handle
<?php and <? correctly:

    > print strip_tags('1<span class="<?php">2</span>3');
    1
    > print strip_tags('1<span class="<?">2</span>3');
    1

Bug: T179978
Change-Id: I53b98e6c877c00c03ff110914168b398559c9c3e
2017-11-15 17:31:31 -08:00