Commit graph

11 commits

Author SHA1 Message Date
C. Scott Ananian
6153062d48 Protect against long match length in CHAR_REFS_REGEX
Some malformed pages contain "character references" that were so long
that they caused PHP's `hexdec` to return a `float` instead of an
`int`.  This caused Parsoid to crash on a type hint on the argument to
Sanitizer::validateCodepoint().  MediaWiki core has the same issue,
but doesn't have the type hint (yet), so soft-fails instead of
crashes.  Add sanity checks around each call to `hexdec` to protect
against arbitrarily-long entity strings (while allowing arbitrary
zero-padding), and add a note to `intval` to explain why it is not
similarly affected.  New test cases added to SanitizerUnitTest as
well.
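
The length check described above can be sketched as follows. This is an illustrative Python version, not the PHP patch itself: the helper name `decode_hex_char_ref` and the choice to return U+FFFD for invalid input are assumptions made for the example. The idea is to drop leading zeros first (so zero-padding stays legal) and reject anything with more than six significant hex digits, since the largest code point, 0x10FFFF, needs only six.

```python
import re

MAX_CODEPOINT = 0x10FFFF
REPLACEMENT = "\ufffd"  # assumption: invalid references become U+FFFD


def decode_hex_char_ref(digits: str) -> str:
    """Decode the hex digits of a reference like &#x41; (hypothetical helper)."""
    if not re.fullmatch(r"[0-9A-Fa-f]+", digits):
        return REPLACEMENT
    # Zero-padding is harmless, so strip it before checking the length.
    significant = digits.lstrip("0")
    # 0x10FFFF has six hex digits; anything longer cannot be a valid
    # code point, so reject it before converting to a number at all.
    if len(significant) > 6:
        return REPLACEMENT
    codepoint = int(significant or "0", 16)
    is_surrogate = 0xD800 <= codepoint <= 0xDFFF
    if codepoint == 0 or codepoint > MAX_CODEPOINT or is_surrogate:
        return REPLACEMENT
    return chr(codepoint)
```

Rejecting on length before the numeric conversion is what keeps an arbitrarily long digit string from ever reaching the conversion function.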

Corresponding patch on the Parsoid side:
Ic33196961bb2b86290148fbc3ce33bcd8b28ab56
(And see T247804 re: eventually removing this duplicate code.)

Bug: T322892
Change-Id: I5085c4edbb86e282b92536d05b01ed5f9d5c615e
2022-11-17 16:47:39 -05:00
Michał Turek
1d87b8e7b3 Sanitizer: Don't consider inline var CSS insecure
T208881 ("CSS using var() to create exponential sized calc() on wiki page will crash visitor's browser") was fixed by disabling var() in inline CSS. Since then, the underlying browser crashes appear to have been fixed in Firefox, Chrome, modern Edge, and Opera.
This change reverts the fix for T208881.

Bug: T288201
Change-Id: I387a0e9fdd02faa69616890c613462c83b91b789
2022-08-24 07:17:28 +00:00
C. Scott Ananian
9f14fbd002 Add Sanitizer::removeSomeTags() which uses Remex to tokenize
The existing Sanitizer::removeHTMLtags() method, in addition to having
dodgy capitalization, uses regular expressions to parse the HTML.
That produces corner cases like T298401 and T67747 and is not guaranteed
to yield balanced or well-formed HTML.

Instead, introduce and use a new Sanitizer::removeSomeTags() method
which is guaranteed to always return balanced and well-formed HTML.
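
A rough sketch of why a tokenizer-based approach can guarantee balanced output, written in Python with the standard `html.parser` module rather than Remex. Everything here is an assumption for illustration: the names `BalancedTagFilter` and `remove_some_tags`, the default allow-list, dropping attributes, and silently dropping disallowed tags (the commit message does not say what the real method does with them).

```python
from html import escape
from html.parser import HTMLParser


class BalancedTagFilter(HTMLParser):
    """Keep only allowed tags and track them on a stack (hypothetical)."""

    def __init__(self, allowed):
        super().__init__(convert_charrefs=True)
        self.allowed = set(allowed)
        self.open_tags = []  # stack of allowed tags currently open
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag in self.allowed:
            self.open_tags.append(tag)
            self.out.append(f"<{tag}>")  # attributes dropped for brevity

    def handle_endtag(self, tag):
        # Ignore stray end tags; otherwise close everything opened
        # after the matching start tag, then the tag itself.
        if tag in self.open_tags:
            while True:
                top = self.open_tags.pop()
                self.out.append(f"</{top}>")
                if top == tag:
                    break

    def handle_data(self, data):
        self.out.append(escape(data, quote=False))


def remove_some_tags(html_text, allowed=("b", "i", "sub", "sup")):
    parser = BalancedTagFilter(allowed)
    parser.feed(html_text)
    parser.close()  # flushes any buffered trailing text
    # Close whatever is still open so the result is always balanced.
    while parser.open_tags:
        parser.out.append(f"</{parser.open_tags.pop()}>")
    return "".join(parser.out)
```

Because every emitted start tag is pushed on the stack and every emitted end tag pops it, the output cannot contain an unmatched tag, whatever the input looks like.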

Note that Sanitizer::removeHTMLtags()/::removeSomeTags() take a callback
argument which (as far as I can tell) is never used outside core. Mark
that argument as @internal, and clean up the version used by
::removeSomeTags().

Use the new ::removeSomeTags() method in the two places where
DISPLAYTITLE is handled (following up on T67747).  The use by the
legacy parser is more difficult to replace (and would have a
performance cost), so leave the old ::removeHTMLtags() method in place
for that call site for now: when the legacy parser is replaced by
Parsoid the need for the old ::removeHTMLtags() will go away.  In a
follow-up patch we'll rename ::removeHTMLtags() and mark it @internal
so that we can deprecate ::removeHTMLtags() for external use.

Some benchmarking code added.  On my machine, with PHP 7.4, the new
method tidies short 30-character title strings at a rate of about
6764/s while the tidy-based method being replaced here managed 6384/s.
Sanitizer::removeHTMLtags blazes through short strings 20x faster
(120,915/s); some of this difference is due to the set up cost of
creating the tag whitelist and the Remex pipeline, so further
optimizations could doubtless be done if Sanitizer::removeSomeTags()
is more widely used.

Bug: T299722
Bug: T67747
Change-Id: Ic864c01471c292f11799c4fbdac4d7d30b8bc50f
2022-03-04 14:06:02 -05:00
Derk-Jan Hartman
8e06927190 Make Sanitizer::stripAllTags() strip css and js tag contents
We use Sanitizer::stripAllTags() primarily to remove formatting from
HTML so that we can use it in places like notifications, emails,
search result blurbs, etc.

It is very unlikely we want the raw contents of CSS and/or JS tags
anywhere in those places, so let's suppress that content to make the
output more readable, as template styles are showing up in more and
more places.
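
The behaviour described here can be illustrated with a small Python sketch using the standard `html.parser` module. It is not the PHP implementation: the names `TextExtractor` and `strip_all_tags` are made up for the example, and collapsing whitespace at the end is an assumption of the sketch.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text, skipping the contents of style/script (hypothetical)."""

    SKIPPED = {"style", "script"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.skip_depth = 0  # > 0 while inside a skipped element
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth:
            self.parts.append(data)


def strip_all_tags(html_text):
    parser = TextExtractor()
    parser.feed(html_text)
    parser.close()
    # Collapse runs of whitespace left behind by the removed markup.
    return " ".join("".join(parser.parts).split())
```

For example, `strip_all_tags('<style>.a { color: red }</style><p>Hello <b>world</b></p>')` yields `Hello world`, with the stylesheet text gone rather than leaking into the plain-text result.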

Bug: T228856
Change-Id: I7930361068ddcf3a6c2fdebd0177d142f025b64f
2021-12-22 23:26:17 +00:00
Fomafix
3a322ef9b0 Use PHP \u{xxxx} syntax
Let PHP do the UTF-8 encoding of Unicode characters in PHP strings.

Also use faster str_replace instead of preg_replace.

Change-Id: I4e99de694a607e2b5df52c6efcd3d863bb42f76e
2021-08-27 20:53:19 +00:00
James D. Forrester
a52c933998 Drop Sanitizer::escapeId(), deprecated in MediaWiki 1.30
Hard deprecation was in b79c1e2, which shipped in MediaWiki 1.35.

Change-Id: I7186462c95d346f362ba0cf84b136c083d66a7d3
2020-07-29 17:08:45 -04:00
C. Scott Ananian
b79c1e22ad Hard-deprecate Sanitizer::escapeId()
Deprecated in MW 1.30; time to clean up any remaining uses.

Code search:
https://codesearch.wmflabs.org/deployed/?q=escapeId%5C%28&i=nope&files=&repos=

Depends-On: Ic03a5da2e1d6b8f5656555420dd573a1d698b9cc
Depends-On: I311f44a5035f73c0fb2289f727eb39b73007429b
Depends-On: I76c5b539bae5572c4ac65f28fec9c0c36381348c
Depends-On: Id4cbfc3b113b1b04f949d485187e89ffe0b487f5
Depends-On: I7d5ba4930688ed7f011a4babed5986b8e40910a0
Depends-On: I964f83ce88fb9c66a7c59037c6066f4567bcf4c9
Change-Id: I89504cfdf8e02831d54a26900bfdc63a33b4eade
2020-01-26 22:05:45 +00:00
Thiemo Kreuz
da880b7cfa Prefer assertSame() in SanitizerUnitTest
This test compares strings. I find it critical to know this test will
start failing if, for example, a method that is expected to return the
string "" starts returning null. assertEquals() will not report this
and quite a bunch of other edge-cases.

Change-Id: I9a3f19f91b95aa384ca612f9a58c7af685306d57
2019-11-23 22:22:36 +01:00
Max Semenik
15d270995d Address TODO asking for a dataProvider
Change-Id: Iaf0a277c2f246291706d65f7fca0874bc0e8bdda
2019-11-23 00:35:12 +00:00
Amir Sarabadani
d23af35764 Unset all globals unneeded for unit tests, assert correct directory
* Unset globals to avoid tests that look like unit tests but actually rely on
  globals
* Move some tests out of unit directory so that the test suite will pass.
* Assert that tests which extend MediaWikiUnitTestCase are in a directory with
  "/unit/" in its path name

Depends-On: I67b37b1bde94eaa3d4298d9bd98ac57995ce93b9
Depends-On: I90921679518ee95fe393f8b1bbd9134daf0ba032
Bug: T87781
Change-Id: I16691fc8ac063705ba0c2bc63b96c4534ca8660b
2019-07-09 14:09:29 -04:00
Amir Sarabadani
7ec9745444 Split SanitizerTest to unit and integration tests
Out of 150 tests in SanitizerTest.php, 100 are pure unit tests; they
are moved to the new file in the new structure, and the rest stay.

Change-Id: I366d37607abff4bcd624a56fb8b2299729fbc088
2019-07-08 09:48:07 +02:00