5 commits

Author SHA1 Message Date
Erik Bernhardson
aef02d516d Improve RemexStripTagHandler working with tables
HTML, generated by some infoboxes and perhaps other places, gets
stripped in a way that merges words together that should not be
merged. Add tr, th, and td to the list of tags that should force
word separation.

Bug: T218001
Change-Id: Ib374339628b1f543ea4e07f24aa3e3b76f3117b5
2019-03-14 13:11:59 -07:00
Kunal Mehta
cc5d9a92a2 build: Updating mediawiki/mediawiki-codesniffer to 24.0.0
Change-Id: I66b1775b7c1d36076d9ca78cbeb42787a743f2aa
2019-02-07 18:39:42 +00:00
Jakub Vrana
9f14c02e20 Remove duplicate keys from arrays
Found by PHPStan.

Change-Id: Ie0e0cfa33b3caa4a13f4dfb04c772c8a0284435a
2018-11-26 19:22:08 +01:00
Erik Bernhardson
0d779c1ac6 Preserve whitespace in search index text content
Certain HTML tags imply a word break, but our HTML stripping does not
account for that. Adjust the HTML stripping to inject whitespace
for all block-level tags (per MDN) along with the <br> element.

Bug: T195389
Change-Id: I9fbfac765ea88628e4f9b2794fb54e1cd0060203
2018-09-14 11:10:35 -07:00
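
The whitespace injection described in the two Erik Bernhardson commits above can be illustrated with a small sketch. This is not the MediaWiki code (that lives in PHP, in RemexStripTagHandler); it is a Python analogue built on the standard library's html.parser. The names BREAK_TAGS, TextExtractor and strip_tags_with_breaks are hypothetical, the tag set is a small illustrative subset rather than the full MDN block-level list, and collapsing runs of whitespace at the end is an assumption made to keep the output readable.

```python
from html.parser import HTMLParser

# Hypothetical, illustrative subset: a few block-level tags, <br>, and the
# table tags (tr, th, td) mentioned in the later commit.
BREAK_TAGS = frozenset({"p", "div", "li", "h1", "h2", "br", "tr", "th", "td"})


class TextExtractor(HTMLParser):
    """Collects text content, adding a space at every word-breaking tag."""

    def __init__(self, break_tags):
        super().__init__(convert_charrefs=True)
        self.break_tags = break_tags
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.break_tags:
            self.parts.append(" ")

    def handle_endtag(self, tag):
        if tag in self.break_tags:
            self.parts.append(" ")

    def handle_data(self, data):
        self.parts.append(data)


def strip_tags_with_breaks(html, break_tags=BREAK_TAGS):
    """Strip tags; text on either side of a break tag stays separated."""
    parser = TextExtractor(break_tags)
    parser.feed(html)
    parser.close()
    # Assumption: collapse runs of whitespace so injected spaces do not pile up.
    return " ".join("".join(parser.parts).split())


# Without break tags, adjacent paragraphs merge into one word.
print(strip_tags_with_breaks("<p>foo</p><p>bar</p>", frozenset()))  # foobar
# With break tags, paragraphs and table cells stay separate words.
print(strip_tags_with_breaks("<p>foo</p><p>bar</p>"))  # foo bar
print(strip_tags_with_breaks("<tr><td>Born</td><td>1950</td></tr>"))  # Born 1950
```

Inline tags such as <b> are deliberately left out of the set, so "a<b>b</b>c" still yields "abc".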
Roan Kattouw
ddb4913f53 Use Remex in Sanitizer::stripAllTags()
Using a real HTML tokenizer fixes bugs when < or > appear in attribute
values. The old implementation used delimiterReplace(), which didn't
handle this case:

    > print Sanitizer::stripAllTags( '<p data-foo="a&lt;b>c">Hello</p>' );
    c">Hello

We also can't use PHP's built-in strip_tags() because it doesn't handle
<?php and <? correctly:

    > print strip_tags('1<span class="<?php">2</span>3');
    1
    > print strip_tags('1<span class="<?">2</span>3');
    1

Bug: T179978
Change-Id: I53b98e6c877c00c03ff110914168b398559c9c3e
2017-11-15 17:31:31 -08:00
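
The failure mode in the commit above, where < or > inside an attribute value confuses delimiter-based stripping, can be reproduced outside PHP. The sketch below is a Python analogue, not the Sanitizer code: naive_strip is a hypothetical stand-in for delimiter-style stripping (a regex that ends a tag at the first >), and tokenizer_strip uses the standard library's html.parser as a stand-in for a real tokenizer such as Remex.

```python
import re
from html.parser import HTMLParser


def naive_strip(html):
    """Delimiter-style stripping: a tag ends at the first '>' found."""
    return re.sub(r"<[^>]*>", "", html)


class _TextOnly(HTMLParser):
    """Keeps text nodes and ignores tags; quoted attribute values are
    tokenized as a unit, so '>' inside them does not end the tag."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def tokenizer_strip(html):
    parser = _TextOnly()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


sample = '<p data-foo="a&lt;b>c">Hello</p>'
print(naive_strip(sample))      # c">Hello
print(tokenizer_strip(sample))  # Hello
```

The naive version reproduces the broken output quoted in the commit message, while the tokenizer-based version returns only the text content.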