Thijs/wiki.techinc.nl

Author	SHA1	Message	Date
C. Scott Ananian	9f14fbd002	Add Sanitizer::removeSomeTags() which uses Remex to tokenize The existing Sanitizer::removeHTMLtags() method, in addition to having dodgy capitalization, uses regular expressions to parse the HTML. That produces corner cases like T298401 and T67747 and is not guaranteed to yield balanced or well-formed HTML. Instead, introduce and use a new Sanitizer::removeSomeTags() method which is guaranteed to always return balanced and well-formed HTML. Note that Sanitizer::removeHTMLtags()/::removeSomeTags() take a callback argument which (as far as I can tell) is never used outside core. Mark that argument as @internal, and clean up the version used by ::removeSomeTags(). Use the new ::removeSomeTags() method in the two places where DISPLAYTITLE is handled (following up on T67747). The use by the legacy parser is more difficult to replace (and would have a performance cost), so leave the old ::removeHTMLtags() method in place for that call site for now: when the legacy parser is replaced by Parsoid the need for the old ::removeHTMLtags() will go away. In a follow-up patch we'll rename ::removeHTMLtags() and mark it @internal so that we can deprecate ::removeHTMLtags() for external use. Some benchmarking code added. On my machine, with PHP 7.4, the new method tidies short 30-character title strings at a rate of about 6764/s while the tidy-based method being replaced here managed 6384/s. Sanitizer::removeHTMLtags blazes through short strings 20x faster (120,915/s); some of this difference is due to the setup cost of creating the tag whitelist and the Remex pipeline, so further optimizations could doubtless be done if Sanitizer::removeSomeTags() is more widely used. Bug: T299722 Bug: T67747 Change-Id: Ic864c01471c292f11799c4fbdac4d7d30b8bc50f	2022-03-04 14:06:02 -05:00
Derk-Jan Hartman	8e06927190	Make Sanitizer::stripAllTags() strip css and js tag contents We use Sanitizer::stripAllTags primarily to remove formatting from HTML so that we can use it in places like notifications, emails, search result blurbs, etc. It is very unlikely we want the raw contents of CSS and/or JS tags anywhere in those places, so let's suppress that content to make it more readable, as template styles are showing up in more and more places. Bug: T228856 Change-Id: I7930361068ddcf3a6c2fdebd0177d142f025b64f	2021-12-22 23:26:17 +00:00
C. Scott Ananian	b1f53045d7	Bump wikimedia/remex-html to 2.3.2 and drop 2.3.1 This is a bug fix release of RemexHtml, required by the latest version of Parsoid. RemexHtml migrated to a new namespace in 2.3.2. Since we don't support aliases in our phan configuration in core, update all uses to the new namespace to satisfy phan. Depends-On: I30f01f4a2a5479bb82c9b952ffa68a478215828a Depends-On: Iedf446635ee2112cfe637d8ebcf8092f0976bd17 Change-Id: I74fc929e4a66b28bfb1800ff0cd751c86e4a9f50	2021-08-08 18:07:29 -04:00
C. Scott Ananian	2fa79194ad	Allow core to use remex-html 2.3.2 This is a bug fix release of RemexHtml, required by the latest version of Parsoid. RemexHtml migrated to a new namespace in 2.3.2 and uses aliases for compatibility. Once we upgrade mediawiki-vendor we can rename all the uses in core and turn off aliases again. Due to T287419, we need to suppress some phan issues because phan ends up running against both remex 2.3.1 and 2.3.2 in different CI jobs. These suppressions are removed in the follow up I74fc929e4a66b28bfb1800ff0cd751c86e4a9f50. Change-Id: I42edd4fb8cd277ea20e331994fcbe56b52bf3f06	2021-08-08 17:55:15 -04:00
jenkins-bot	0ebeb72733	Merge "Simplify RemexStripTagHandler by extending NullTokenHandler"	2019-11-04 14:42:30 +00:00
Max Semenik	8a98dd9d59	Convert some private static arrays to constants Remove @since for some private ones as we don't guarantee anything about private class members. Change-Id: Ifb898353c02082e9ef69d67f69339345c6cd154d	2019-10-16 01:30:54 +00:00
Tim Starling	ee80c3f3f4	Simplify RemexStripTagHandler by extending NullTokenHandler Reduce code size and improve forwards compatibility. Change-Id: I844d06923cbe965581e911afe8c9d91e8e61079c	2019-08-19 11:21:56 +10:00
Reedy	9f2ffdfbd4	Remove "Squiz.WhiteSpace.FunctionSpacing" from phpcs exclusions Change-Id: I78b3315f26ab91b6b443f5b028a635552f82f5a3	2019-05-11 02:44:26 +01:00
Erik Bernhardson	aef02d516d	Improve RemexStripTagHandler working with tables HTML, generated by some infoboxes and perhaps other places, gets stripped in a way that merges words together that should not be merged. Add tr, th, and td to the list of tags that should force word separation. Bug: T218001 Change-Id: Ib374339628b1f543ea4e07f24aa3e3b76f3117b5	2019-03-14 13:11:59 -07:00
Kunal Mehta	cc5d9a92a2	build: Updating mediawiki/mediawiki-codesniffer to 24.0.0 Change-Id: I66b1775b7c1d36076d9ca78cbeb42787a743f2aa	2019-02-07 18:39:42 +00:00
Jakub Vrana	9f14c02e20	Remove duplicate keys from arrays Found by PHPStan. Change-Id: Ie0e0cfa33b3caa4a13f4dfb04c772c8a0284435a	2018-11-26 19:22:08 +01:00
Erik Bernhardson	0d779c1ac6	Preserve whitespace in search index text content Certain html tags imply a word break, but our html stripping doesn't understand that at all. Adjust the html stripping to inject whitespace for all block level tags (per MDN) along with the <br> element. Bug: T195389 Change-Id: I9fbfac765ea88628e4f9b2794fb54e1cd0060203	2018-09-14 11:10:35 -07:00
Roan Kattouw	ddb4913f53	Use Remex in Sanitizer::stripAllTags() Using a real HTML tokenizer fixes bugs when < or > appear in attribute values. The old implementation used delimiterReplace(), which didn't handle this case: > print Sanitizer::stripAllTags( '<p data-foo="a<b>c">Hello</p>' ); c">Hello We also can't use PHP's built-in strip_tags() because it doesn't handle <?php and <? correctly: > print strip_tags('1<span class="<?php">2</span>3'); 1 > print strip_tags('1<span class="<?">2</span>3'); 1 Bug: T179978 Change-Id: I53b98e6c877c00c03ff110914168b398559c9c3e	2017-11-15 17:31:31 -08:00

13 commits
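
Several of the commits above (ddb4913f53, 0d779c1ac6, aef02d516d, 8e06927190) describe the same idea: strip tags with a real HTML tokenizer rather than regular expressions, insert whitespace where a tag implies a word break, and drop the contents of style and script tags. The following is a minimal sketch of that idea, not the MediaWiki implementation. It is written in Python and uses the standard library's html.parser in place of Remex; the names strip_all_tags, BREAKING_TAGS, and SKIPPED_TAGS are hypothetical, the tag list is an assumed subset rather than the list used in core, and collapsing whitespace at the end is a choice made for this sketch.

```python
from html.parser import HTMLParser

# Assumed subset of tags that imply a word break (block-level tags,
# <br>, and the table tags named in the commit messages).
BREAKING_TAGS = {"p", "div", "br", "li", "tr", "th", "td",
                 "h1", "h2", "h3", "h4", "h5", "h6"}

# Tags whose text content is dropped entirely.
SKIPPED_TAGS = {"style", "script"}


class StripTagsParser(HTMLParser):
    """Collects text content, discarding tags (hypothetical helper)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0  # > 0 while inside <style> or <script>

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self.skip_depth += 1
        elif tag in BREAKING_TAGS:
            self.parts.append(" ")

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS:
            if self.skip_depth > 0:
                self.skip_depth -= 1
        elif tag in BREAKING_TAGS:
            self.parts.append(" ")

    def handle_data(self, data):
        # Keep text only when not inside a skipped element.
        if self.skip_depth == 0:
            self.parts.append(data)


def strip_all_tags(html):
    """Return the text of `html` with tags removed and whitespace collapsed."""
    parser = StripTagsParser()
    parser.feed(html)
    parser.close()
    # Collapse whitespace runs so injected separators do not pile up.
    return " ".join("".join(parser.parts).split())


if __name__ == "__main__":
    print(strip_all_tags('<p data-foo="a<b>c">Hello</p>'))
    print(strip_all_tags("<table><tr><td>a</td><td>b</td></tr></table>"))
    print(strip_all_tags("<style>.x{color:red}</style><p>Hi</p>"))
```

Because the tokenizer treats a quoted attribute value as a single unit, a < or > inside it does not end the tag early, which is the failure mode the delimiter-based approach had. Separating words at table cells corresponds to the change tracked as T218001.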