HTML generated by some infoboxes, and perhaps elsewhere, is stripped
in a way that merges words that should stay separate. Add tr, th, and
td to the list of tags that force word separation.
Bug: T218001
Change-Id: Ib374339628b1f543ea4e07f24aa3e3b76f3117b5
Certain HTML tags imply a word break, but our HTML stripping does not
account for that. Adjust the HTML stripping to inject whitespace for
all block-level elements (per MDN) along with the <br> element.
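
A minimal sketch of the idea, assuming a hypothetical helper
stripTagsWithBreaks() and a shortened tag list (neither is the actual
implementation). It replaces each word-breaking tag with a space before
stripping the rest, and uses a regex plus strip_tags() only to keep the
example short:

```php
<?php
// Hypothetical helper for illustration only; not the code in this change.
function stripTagsWithBreaks( string $html ): string {
	// Shortened, assumed list of tags that imply a word break.
	$breaking = [ 'br', 'p', 'div', 'li', 'tr', 'th', 'td' ];
	$pattern = '#</?(?:' . implode( '|', $breaking ) . ')\b[^>]*>#i';

	// Turn each word-breaking tag into a space so neighbouring words stay apart.
	$spaced = preg_replace( $pattern, ' ', $html );

	// Strip the remaining tags, then collapse runs of whitespace.
	$text = strip_tags( $spaced );
	return trim( preg_replace( '/\s+/', ' ', $text ) );
}

// Plain stripping merges the two cells into "Born1900".
echo strip_tags( '<table><tr><th>Born</th><td>1900</td></tr></table>' ), "\n";
// With whitespace injection the cells stay separate: "Born 1900".
echo stripTagsWithBreaks( '<table><tr><th>Born</th><td>1900</td></tr></table>' ), "\n";
```

The regex here does not cope with < or > inside attribute values, so
it is only suitable for showing the whitespace injection itself.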
Bug: T195389
Change-Id: I9fbfac765ea88628e4f9b2794fb54e1cd0060203
Using a real HTML tokenizer fixes bugs when < or > appear in attribute
values. The old implementation used delimiterReplace(), which didn't
handle this case:
> print Sanitizer::stripAllTags( '<p data-foo="a<b>c">Hello</p>' );
c">Hello
We also can't use PHP's built-in strip_tags(), because it mishandles
<?php and <? inside attribute values and drops everything after them:
> print strip_tags('1<span class="<?php">2</span>3');
1
> print strip_tags('1<span class="<?">2</span>3');
1
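
To illustrate why tracking quoted attribute values matters, here is a
minimal sketch of a quote-aware scanner. The name stripTagsQuoteAware()
is hypothetical and this is not the tokenizer adopted by this change;
it ignores comments, entities, and a bare < in text:

```php
<?php
// Hypothetical helper for illustration only; not the tokenizer used here.
function stripTagsQuoteAware( string $html ): string {
	$out = '';
	$inTag = false;
	$quote = null; // quote character currently open inside a tag, if any

	$len = strlen( $html );
	for ( $i = 0; $i < $len; $i++ ) {
		$ch = $html[$i];
		if ( !$inTag ) {
			if ( $ch === '<' ) {
				$inTag = true; // start of a tag: stop emitting text
			} else {
				$out .= $ch;
			}
		} elseif ( $quote !== null ) {
			// Inside a quoted attribute value: < and > are plain data.
			if ( $ch === $quote ) {
				$quote = null;
			}
		} elseif ( $ch === '"' || $ch === "'" ) {
			$quote = $ch;
		} elseif ( $ch === '>' ) {
			$inTag = false; // an unquoted > ends the tag
		}
	}
	return $out;
}

echo stripTagsQuoteAware( '<p data-foo="a<b>c">Hello</p>' ), "\n"; // Hello
echo stripTagsQuoteAware( '1<span class="<?php">2</span>3' ), "\n"; // 123
```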
Bug: T179978
Change-Id: I53b98e6c877c00c03ff110914168b398559c9c3e