CVE-2025-32699
Ensure that Unicode NFC normalization can be applied to our HTML
output safely. Even though the W3C officially recommends against
normalizing HTML
https://www.w3.org/International/questions/qa-html-css-normalization#converting
this is still easily done inadvertently, especially when using the
MediaWiki action API which normalizes parameters and results by
default.
See also I671648603c4635a35585c860b4857f5ea085e47f in Parsoid, and
T266140 / I2e78e660ba1867744e34eda7d00ea527ec016b71 for another similar
issue.
The following changes are made:
* The various HTML serializers (Remex/Tidy-derived, as well as the
Html::* helpers) are tweaked to entity-escape U+0338 wherever it
appears.
* Similarly, Message::escaped() is tweaked to entity-escape U+0338.
* Finally, a post-processing pass is added to the OutputTransform
pipeline to catch any remaining U+0338 and entity-escape them.
This catches U+0338 added during any of the previous OutputTransform
stages (like TOC insertion, section edit links, etc).
*When backporting* this code will likely need to be moved to
ParserOutput::getText(), as the OutputTransform pipeline wasn't added
until MW 1.42.
Bug: T387130
Change-Id: I66564e14e730f5393f4fa5780b80f24de6075af5
Add doc-typehints to class properties found by the PropertyDocumentation
sniff to improve the documentation.
Once the sniff is enabled it avoids that new code is missing type
declarations. This is focused on documentation and does not change code.
Change-Id: I1da4b272a6b28c419cc8e860d142dae19ca0bbcf
* Switch out raw Exceptions, mostly for InvalidArgumentExceptions.
* Fake exceptions triggered to give Monolog a backtrace are for
some reason "traditionally" RuntimeExceptions, instead, so we
continue to use that pattern in remaining locations.
* Just entirely give up on PostgresResultWrapper's resource vs. object mess.
* Drop now-unneeded false positive hits.
Change-Id: Id183ab60994cd9c6dc80401d4ce4de0ddf2b3da0
Motivation:
* Avoid code duplication.
* Hopefully make it easier to read.
* Also order stuff from cheap to expensive, if possible.
Change-Id: I575e3f2027ce60a0d0885be5b9bd3e07bc035eee
Split out from the I44045b3b9e78e change.
This is consistent with what Parsoid will use for the TOC marker.
Bug: T287767
Bug: T270199
Bug: T311502
Depends-On: I1f607cf1ef1b61fb4d2e1880de756fb94d5a6b22
Change-Id: Ie63eed07b9bca1bfa07d4c256aba3728cedd8f93
Now largely automated:
VARS=$(grep -o "'[A-Za-z0-9_]*'" includes/MainConfigNames.php | \
tr "\n" '|' | sed "s/|$/\n/;s/'//g")
sed -i -E "s/'($VARS)'/MainConfigNames::\1/g" \
$(grep -ERIl "'($VARS)'" includes/)
Then git add -p with lots of error-prone manual checking. Then
semi-manually add all the necessary "use" lines:
vim $(grep -L 'use MediaWiki\\MainConfigNames;' \
$(git diff --cached --name-only --diff-filter=M HEAD^))
I didn't bother fixing lines that were over 100 characters unless they
were over 120 and triggered phpcs.
Bug: T305805
Change-Id: I74e0ab511abecb276717ad4276a124760a268147
This covers all occurrences of /onfig->.*get( '/ in includes/.
Undoubtedly there are still plenty more to go.
Change-Id: I33196c4153437778496f40436bcde399638ac361
Make phan stricter about null types by setting null_casts_as_any_type to
false (the default in mediawiki-phan-config)
Remaining false positive issues are suppressed.
The suppression and the setting change can only be done together
Bug: T242536
Bug: T301991
Change-Id: I0f295382b96fb3be8037a01c10487d9d591e7e01
This helps phan to detect unreachable code and also impossible types
after the functions.
It helps phan to avoid false positives for array keys
when the keys are checked before
Bug: T240141
Change-Id: I895f70e82b3053a46cd44135b15437e6f82a07b2
This is a bug fix release of RemexHtml, required by the latest version
of Parsoid.
RemexHtml migrated to a new namespace in 2.3.2. Since we don't
support aliases in our phan configuration in core, update all uses to
the new namespace to satisfy phan.
Depends-On: I30f01f4a2a5479bb82c9b952ffa68a478215828a
Depends-On: Iedf446635ee2112cfe637d8ebcf8092f0976bd17
Change-Id: I74fc929e4a66b28bfb1800ff0cd751c86e4a9f50
This is a bug fix release of RemexHtml, required by the latest version
of Parsoid.
RemexHtml migrated to a new namespace in 2.3.2 and uses aliases for
compatibility. Once we upgrade mediawiki-vendor we can rename all
the uses in core and turn off aliases again.
Due to T287419, we need to suppress some phan issues because phan
ends up running against both remex 2.3.1 *and* 2.3.2 in different
CI jobs. These suppressions are removed in the follow up
I74fc929e4a66b28bfb1800ff0cd751c86e4a9f50.
Change-Id: I42edd4fb8cd277ea20e331994fcbe56b52bf3f06
Also copied the tests that used to be in TidyTest into
RemexDriverTest, so that we're not losing coverage when MWTidy is
eventually removed.
Bug: T198214
Change-Id: I0b301f6c98d0943ce4b6dc224f1066cb7bf244d1
Choosing a particular encoding of HTML entities is logically a task
of the Remex formatter (which serializes HTML). Move it out of the
Parser so that it is part of the serialization specification.
This is a follow up to Ic8965e81882d7cf024bdced437f684064a30ac86.
Change-Id: If45907baf24d60987b39cd1f7709c5f7caf19f37
This also means we don't need to take special care for French spacing in
attributes, since it's no longer applied there.
Adds a test that captures this change.
Note that the test "Nowiki and french spacing" wonders whether this
escaping should be applied to nowiki content.
Bug: T255007
Change-Id: Ic8965e81882d7cf024bdced437f684064a30ac86
This is micro-optimization of closure code to avoid binding the closure
to $this where it is not needed.
Created by I25a17fb22b6b669e817317a0f45051ae9c608208
Change-Id: I0ffc6200f6c6693d78a3151cb8cea7dce7c21653
Remove unused variable $parent in RemexCompatMunger::comment(). Also,
RemexMungerData::dump() could have a possibility that all checks fail
and $parts is not defined. There are two ways we can handle this, i.e.
either by doing `$parts = []`(setting $parts to an empty array) or by
safe guarding using an `isset()` check.
This patch uses the former so that $parts is defined and can be used
below in the code.
Change-Id: I4d601a6fe36a1dce0945686cb9880336d08338be
<style> and <link> tags are metadata tags, they shouldn't split the <p>
tag when p-wrapping content.
Bug: T208901
Change-Id: I2ef5da68c9ccde4477d8295dfe4abf8497c5d26e
Remex is pure PHP so there is no reason to use an external tidy any
more. Configuration variables and implementation classes were
deprecated in 1.32 or earlier. We've kept only $wgTidyConfig
which can be used for experimental features or debugging Remex.
Bug: T198214
Change-Id: I99d48f858d97b6e1d1e6cd76a42c960cc2c61f9f
Let's rip the band-aid off. Remex is pure PHP so there's no reason to
be running any of the other tidy implementations any more, and we won't
be able to support them in the future.
Follow-up to 7b23382823.
Bug: T198214
Change-Id: Id3d07d44f8434231826e86e623554cac3decfa96
Use the new RemexHtml trace features. Add two more tracing modes.
Fix missing member variable declarations and remove unused local
variables.
Change-Id: I512462e1019f9a466684abfa4aab7697b324d5b1
This was naïve, the linked bug documents a case where endTag() was
called despite children of the p-wrap still being in TreeBuilder's
stack. Instead, wait for the parent of the p-wrap to have endTag()
called on it, I've submitted a patch which will clean up the node in
that case.
Bug: T200827
Change-Id: I34694813eace9cadabf2db8f9ccca83d1368cfad
The changes to the parserTests.txt highlight the differing opinions that
doBlockLevels and Remex had on whether these should be paragraph wrapped.
Since the only time they wouldn't have been was when found on a line
with other flow tags, this likely isn't a behaviour that was depended on
in practice. And, indeed, the task describes this as a bug.
A sampling of pages from an insource:/\<(ins|del)\>/ search on wiki bears
this out.
Bug: T17491
Change-Id: I311da777a63aa3c45013f2cfc090be35a022497e
In cases where we're operating on text data (and not binary data),
use e.g. "\u{00A0}" to refer directly to the Unicode character
'NO-BREAK SPACE' instead of "\xc2\xa0" to specify the bytes C2h A0h
(which correspond to the UTF-8 encoding of that character). This
makes it easier to look up those mysterious sequences, as not all
are as recognizable as the no-break space.
This is not enforced by PHP, but I think we should write those in
uppercase and zero-padded to at least four characters, like the
Unicode standard does.
Note that not all "\xNN" escapes can be automatically replaced:
* We can't use Unicode escapes for binary data that is not UTF-8
(e.g. in code converting from legacy encodings or testing the
handling of invalid UTF-8 byte sequences).
* '\xNN' escapes in regular expressions in single-quoted strings
are actually handled by PCRE and have to be dealt with carefully
(those regexps should probably be changed to use the /u modifier).
* "\xNN" referring to ASCII characters ("\x7F" and lower) should
probably be left as-is.
The replacements in this commit were done semi-manually by piping
the existing "\xNN" escapes through the following terrible Ruby
script I devised:
chars = eval('"' + ARGV[0] + '"').force_encoding('utf-8')
puts chars.split('').map{|char|
'\\u{' + char.ord.to_s(16).upcase.rjust(4, '0') + '}'
}.join('')
Change-Id: Idc3dee3a7fb5ebfaef395754d8859b18f1f8769a
The Html5Depurate driver was intended to be used with an external Java
service, but it never gained traction due to deployment concerns.
The Html5Internal (Balancer) driver was originally intended for use with
the balanced templates proposal and could also handle tidying. But it was
tightly coupled to MediaWiki, so part of it was used as the basis of the
RemexHtml library. Remex most likely can also implement the balanced
templates proposal, so there isn't any reason to keep the Balancer code
around anymore,
Change-Id: I8542d69e9cdbf0e2fb7ebbb919933a64c1b8c293
https://secure.php.net/manual/en/function.implode.php defines the order
of arguments as
string implode ( string $glue , array $pieces )
string implode ( array $pieces )
Note:
implode() can, for historical reasons, accept its parameters in
either order. For consistency with explode(), however, it may be less
confusing to use the documented order of arguments.
Change-Id: I03bf5712204e283f52d3ede54af9b9ec117d4280