Commit graph

105 commits

Author SHA1 Message Date
C. Scott Ananian
94f193a894 SECURITY: Ensure emitted HTML is safe against Unicode NFC normalization
CVE-2025-32699

Ensure that Unicode NFC normalization can be applied to our HTML
output safely.  Even though the W3C officially recommends against
normalizing HTML

https://www.w3.org/International/questions/qa-html-css-normalization#converting

this is still easily done inadvertently, especially when using the
MediaWiki action API which normalizes parameters and results by
default.

See also I671648603c4635a35585c860b4857f5ea085e47f in Parsoid, and
T266140 / I2e78e660ba1867744e34eda7d00ea527ec016b71 for another similar
issue.

The following changes are made:

* The various HTML serializers (Remex/Tidy-derived, as well as the
  Html::* helpers) are tweaked to entity-escape U+0338 wherever it
  appears.

* Similarly, Message::escaped() is tweaked to entity-escape U+0338.

* Finally, a post-processing pass is added to the OutputTransform
  pipeline to catch any remaining U+0338 and entity-escape them.
  This catches U+0338 added during any of the previous OutputTransform
  stages (like TOC insertion, section edit links, etc).
  *When backporting* this code will likely need to be moved to
  ParserOutput::getText(), as the OutputTransform pipeline wasn't added
  until MW 1.42.
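
A minimal sketch of the underlying problem, in Python rather than the PHP of the actual change. The helper name `escape_u0338` and the exact entity spelling `&#x338;` are assumptions made for illustration; the commit does not state which entity form the serializers emit.

```python
import unicodedata

# U+0338 COMBINING LONG SOLIDUS OVERLAY composes with a preceding "<", ">"
# or "=" under NFC, which can silently rewrite HTML syntax characters.
COMBINING_LONG_SOLIDUS_OVERLAY = "\u0338"


def escape_u0338(html: str) -> str:
    """Entity-escape every U+0338 so NFC has nothing to compose (hypothetical helper)."""
    return html.replace(COMBINING_LONG_SOLIDUS_OVERLAY, "&#x338;")


# Text content that starts with U+0338, right after the ">" that closes a tag.
unsafe = "<b>\u0338x</b>"

# After NFC, ">" + U+0338 has become the single character U+226F, so the
# opening tag is no longer terminated.
normalized = unicodedata.normalize("NFC", unsafe)

# With the combining character escaped, NFC leaves the markup untouched.
safe = escape_u0338(unsafe)
```

Escaping only this one code point is the approach the bullets above describe; the sketch does not attempt to cover other normalization hazards.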

Bug: T387130
Change-Id: I66564e14e730f5393f4fa5780b80f24de6075af5
2025-04-10 15:56:06 +01:00
Umherirrender
6eec17e9a9 Add missing documentation to class properties (miscellaneous classes)
Add doc-typehints to class properties found by the PropertyDocumentation
sniff to improve the documentation.

Once the sniff is enabled, it prevents new code from being added without
type declarations. This change is focused on documentation and does not
change code.

Change-Id: I1da4b272a6b28c419cc8e860d142dae19ca0bbcf
2024-09-14 10:12:18 +02:00
Umherirrender
465777f188 Use const keyword for constant list of strings or ints
Also changed the visibility of some to private.

Change-Id: I113b040321d27c84fe9b807c162736909e96fb20
2024-09-11 23:16:24 +02:00
jenkins-bot
6039650aed Merge "HtmlHelper: Fix entity encoding when $html5format = false"
2024-02-15 03:30:11 +00:00
James D. Forrester
102a4f8a35 build: Upgrade mediawiki/mediawiki-phan-config from 0.13.0 to 0.14.0 manually
* Switch out raw Exceptions, mostly for InvalidArgumentExceptions.
  * Fake exceptions triggered to give Monolog a backtrace are, for
    some reason, "traditionally" RuntimeExceptions instead, so we
    continue to use that pattern in the remaining locations.
* Just entirely give up on PostgresResultWrapper's resource vs. object mess.
* Drop now-unneeded false positive hits.

Change-Id: Id183ab60994cd9c6dc80401d4ce4de0ddf2b3da0
2024-02-10 02:22:41 +00:00
Bartosz Dziewoński
2fec813efa HtmlHelper: Fix entity encoding when $html5format = false
Follow-up to 84d0dff968.

Bug: T354361
Change-Id: I44a98f667a89d0baa25188fc6d43f92b3ad19b84
2024-02-09 21:38:23 +00:00
Dogu
29d8092f5f Replace SerializerNode, Element, and Exception qualifiers with imports
Change-Id: I34e3600632f11adb53847656c605daa3618ff0fa
2024-01-05 08:43:16 +00:00
James D. Forrester
468e69bccc Namespace Sanitizer under \MediaWiki\Parser
Bug: T166010
Change-Id: Id13dcbf7a0372017495958dbc4f601f40c122508
2023-09-21 05:39:23 +00:00
thiemowmde
9b03cde58e Merge sequences of if statements that end up doing the same thing anyway
Motivation:
* Avoid code duplication.
* Hopefully make it easier to read.
* Also order stuff from cheap to expensive, if possible.

Change-Id: I575e3f2027ce60a0d0885be5b9bd3e07bc035eee
2023-06-16 16:09:42 +02:00
Matěj Suchánek
5b34ec2c1f Remove deprecated code from tidy drivers
Change-Id: I88f35425955ed5b189e0741268aa361582d0f1db
2022-11-28 18:05:34 +01:00
Tim Starling
0077c5da15 Use short array destructuring instead of list()
Introduced in PHP 7.1. Because it's shorter and looks nice.

I used regex replacement.

Change-Id: I0555e199d126cd44501f859cb4589f8bd49694da
2022-10-21 15:33:37 +11:00
jenkins-bot
61cbd18ff3 Merge "parser: Use a <meta> tag for the internal TOC_PLACEHOLDER"
2022-09-09 21:12:34 +00:00
Arlo Breault
4703724fe8 Don't reconstruct formatting elements in figures
Similar to I3c55eb5fb8055016f8c4f76d27d953f65ff621be in Parsoid

Bug: T314059
Change-Id: I7b4e9df8490357f44d31d6a869fa9b7a15f029ea
2022-08-31 18:55:23 -04:00
C. Scott Ananian
0b10563895 parser: Use a <meta> tag for the internal TOC_PLACEHOLDER
Split out from the I44045b3b9e78e change.

This is consistent with what Parsoid will use for the TOC marker.

Bug: T287767
Bug: T270199
Bug: T311502
Depends-On: I1f607cf1ef1b61fb4d2e1880de756fb94d5a6b22
Change-Id: Ie63eed07b9bca1bfa07d4c256aba3728cedd8f93
2022-08-16 06:05:17 +00:00
Matěj Suchánek
1865180ae7 Do minor code cleanup
Remove dead code and fix typos. Should cause no change in behavior.

Change-Id: I5d293b842bc93a28b8bcd799a31b5e6e30fe692e
2022-06-24 13:52:42 +02:00
Aryeh Gregor
7b791474a5 Use MainConfigNames instead of string literals, #4
Now largely automated:

VARS=$(grep -o "'[A-Za-z0-9_]*'" includes/MainConfigNames.php | \
  tr "\n" '|' | sed "s/|$/\n/;s/'//g")
sed -i -E "s/'($VARS)'/MainConfigNames::\1/g" \
  $(grep -ERIl "'($VARS)'" includes/)

Then git add -p with lots of error-prone manual checking. Then
semi-manually add all the necessary "use" lines:

vim $(grep -L 'use MediaWiki\\MainConfigNames;' \
  $(git diff --cached --name-only --diff-filter=M HEAD^))

I didn't bother fixing lines that were over 100 characters unless they
were over 120 and triggered phpcs.
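
The sed substitution above can be sketched in Python as follows. This is an illustrative reimplementation, not part of the original change: the function name `use_config_constants` and the sample config name are assumptions, and the list of names is passed in directly instead of being grepped from includes/MainConfigNames.php.

```python
import re


def use_config_constants(source: str, names: list[str]) -> str:
    """Replace 'Name' string literals with MainConfigNames::Name (hypothetical helper).

    Only names in `names` are rewritten, mirroring the $VARS alternation
    built by the shell pipeline.
    """
    if not names:
        return source
    alternation = "|".join(re.escape(name) for name in names)
    # Match a single-quoted literal whose whole content is a known name.
    pattern = re.compile(r"'(" + alternation + r")'")
    return pattern.sub(r"MainConfigNames::\1", source)


before = "$config->get( 'Sitename' ); $x = 'NotAConfigName';"
after = use_config_constants(before, ["Sitename"])
```

Like the sed version, this rewrites any matching quoted string regardless of context, which is why the manual `git add -p` review remains necessary.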

Bug: T305805
Change-Id: I74e0ab511abecb276717ad4276a124760a268147
2022-04-26 19:03:37 +03:00
Aryeh Gregor
666ca1bdf3 Use MainConfigNames instead of string literals, #2
This covers all occurrences of /onfig->.*get( '/ in includes/.
Undoubtedly there are still plenty more to go.

Change-Id: I33196c4153437778496f40436bcde399638ac361
2022-04-13 18:55:46 +03:00
Umherirrender
1f71eccf63 phan: Disable null_casts_as_any_type setting
Make phan stricter about null types by setting null_casts_as_any_type to
false (the default in mediawiki-phan-config).
Remaining false positive issues are suppressed.
The suppression and the setting change can only be done together.

Bug: T242536
Bug: T301991
Change-Id: I0f295382b96fb3be8037a01c10487d9d591e7e01
2022-03-21 18:25:07 +00:00
Umherirrender
44fd53fee3 Using @return never documentation on always-throw-function
This helps phan detect unreachable code and impossible types after
calls to these functions.
It also helps phan avoid false positives for array keys that were
checked beforehand.

Bug: T240141
Change-Id: I895f70e82b3053a46cd44135b15437e6f82a07b2
2021-09-07 17:29:03 +02:00
C. Scott Ananian
b1f53045d7 Bump wikimedia/remex-html to 2.3.2 and drop 2.3.1
This is a bug fix release of RemexHtml, required by the latest version
of Parsoid.

RemexHtml migrated to a new namespace in 2.3.2.  Since we don't
support aliases in our phan configuration in core, update all uses to
the new namespace to satisfy phan.

Depends-On: I30f01f4a2a5479bb82c9b952ffa68a478215828a
Depends-On: Iedf446635ee2112cfe637d8ebcf8092f0976bd17
Change-Id: I74fc929e4a66b28bfb1800ff0cd751c86e4a9f50
2021-08-08 18:07:29 -04:00
C. Scott Ananian
2fa79194ad Allow core to use remex-html 2.3.2
This is a bug fix release of RemexHtml, required by the latest version
of Parsoid.

RemexHtml migrated to a new namespace in 2.3.2 and uses aliases for
compatibility.  Once we upgrade mediawiki-vendor we can rename all
the uses in core and turn off aliases again.

Due to T287419, we need to suppress some phan issues because phan
ends up running against both remex 2.3.1 *and* 2.3.2 in different
CI jobs.  These suppressions are removed in the follow up
I74fc929e4a66b28bfb1800ff0cd751c86e4a9f50.

Change-Id: I42edd4fb8cd277ea20e331994fcbe56b52bf3f06
2021-08-08 17:55:15 -04:00
Umherirrender
886643796c docs: Fix @var comments to use doc comment syntax
@var needs /**-comments to work, not /*-comments

Change-Id: If54b3f24d4ca49036fa91aa4c72fab6d841fcc9e
2021-04-29 22:48:52 +00:00
C. Scott Ananian
e99cf5c98d Deprecate MWTidy and TidyDriverBase::supportsValidate()
Also copied the tests that used to be in TidyTest into
RemexDriverTest, so that we're not losing coverage when MWTidy is
eventually removed.

Bug: T198214
Change-Id: I0b301f6c98d0943ce4b6dc224f1066cb7bf244d1
2021-03-16 12:29:55 -07:00
C. Scott Ananian
1fd4a7af4e Introduce Tidy service
Refactor the old MWTidy singleton as a DI service.

Change-Id: I95605ea5fd22f53a7f90fe07a6a73fa6c959597a
2021-03-15 17:22:36 -04:00
C. Scott Ananian
5d317c25be Parser: Move Sanitizer::normalizeCharReferences into RemexCompatFormatter
Choosing a particular encoding of HTML entities is logically a task
of the Remex formatter (which serializes HTML).  Move it out of the
Parser so that it is part of the serialization specification.

This is a follow up to Ic8965e81882d7cf024bdced437f684064a30ac86.

Change-Id: If45907baf24d60987b39cd1f7709c5f7caf19f37
2021-03-15 17:20:14 -04:00
Arlo Breault
c44a3958a3 Don't apply French spacing in raw text elements
This also means we don't need to take special care for French spacing in
attributes, since it's no longer applied there.

Adds a test that captures this change.

Note that the test "Nowiki and french spacing" wonders whether this
escaping should be applied to nowiki content.

Bug: T255007
Change-Id: Ic8965e81882d7cf024bdced437f684064a30ac86
2021-02-16 19:26:29 -05:00
Umherirrender
8de3b7d324 Use static closures where safe to use
This is micro-optimization of closure code to avoid binding the closure
to $this where it is not needed.

Created by I25a17fb22b6b669e817317a0f45051ae9c608208

Change-Id: I0ffc6200f6c6693d78a3151cb8cea7dce7c21653
2021-02-11 00:13:52 +00:00
DannyS712
94169ee873 Whitespace cleanup: Use tabs for indentation, avoid double spaces
Change-Id: I346073b59d283029bd6666356c62c81e687ea5e6
2020-06-27 07:53:07 +00:00
James D. Forrester
4f2d1efdda Coding style: Auto-fix MediaWiki.Classes.UnsortedUseStatements.UnsortedUse
Change-Id: I94a0ae83c65e8ee419bbd1ae1e86ab21ed4d8210
2020-01-10 09:32:25 -08:00
Umherirrender
0688dd7c6d Set method visibility for various constructors
Change-Id: Id3c88257e866923b06e878ccdeddded7f08f2c98
2019-12-03 20:17:30 +01:00
Umherirrender
c7ad21c25f Improve param docs
Change-Id: I746a69f6ed01c3ff000da125457df62b02d13b34
2019-11-28 19:08:59 +01:00
Derick Alangi
d3b7cb742f tidy: Remove unused var and define $parts var to avoid undefined error
Remove unused variable $parent in RemexCompatMunger::comment(). Also,
RemexMungerData::dump() could hit a case where all checks fail and
$parts is never defined. There are two ways to handle this: either
set `$parts = []` (an empty array) up front, or guard its use with an
`isset()` check.

This patch uses the former so that $parts is defined and can be used
below in the code.

Change-Id: I4d601a6fe36a1dce0945686cb9880336d08338be
2019-06-10 14:34:54 +01:00
Reedy
c13fee87d4 Collapse some nested if statements
Change-Id: I9a97325d738d09370d29d35d5254bc0dadc57ff4
2019-04-04 19:02:22 +00:00
Max Semenik
e6818e6c64 Fix unused vars/pointless assignments
Change-Id: If475c738b4af7208024c866594d4c0048af053dd
2019-03-29 16:52:48 -07:00
Brad Jorsch
4597559d84 RemexCompatMunger: Don't split p-wrapping on style/link tags
<style> and <link> tags are metadata tags, they shouldn't split the <p>
tag when p-wrapping content.
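
A toy model of the behavior described above, written in Python. It is not RemexCompatMunger's actual logic: the token kinds ("inline", "block", "meta") and the function `p_wrap` are hypothetical simplifications chosen only to show that metadata content passes through without closing an open paragraph.

```python
def p_wrap(tokens: list[tuple[str, str]]) -> str:
    """Wrap runs of inline content in <p>...</p> (simplified illustration).

    Each token is (kind, text) where kind is "inline", "block" or "meta".
    Block tokens end the current paragraph; meta tokens (standing in for
    <style> and <link>) are emitted as-is and never split it.
    """
    out = []
    in_paragraph = False
    for kind, text in tokens:
        if kind == "block":
            if in_paragraph:
                out.append("</p>")
                in_paragraph = False
            out.append(text)
        elif kind == "meta":
            # Metadata does not open or close a paragraph.
            out.append(text)
        else:
            if not in_paragraph:
                out.append("<p>")
                in_paragraph = True
            out.append(text)
    if in_paragraph:
        out.append("</p>")
    return "".join(out)


wrapped = p_wrap([
    ("inline", "foo"),
    ("meta", "<style>.a{}</style>"),
    ("inline", "bar"),
    ("block", "<div>x</div>"),
])
```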

Bug: T208901
Change-Id: I2ef5da68c9ccde4477d8295dfe4abf8497c5d26e
2019-01-30 09:10:24 -08:00
C. Scott Ananian
6db35b3c98 Remove most support for configuring Tidy, including Raggett
Remex is pure PHP so there is no reason to use an external tidy any
more. Configuration variables and implementation classes were
deprecated in 1.32 or earlier.  We've kept only $wgTidyConfig
which can be used for experimental features or debugging Remex.

Bug: T198214
Change-Id: I99d48f858d97b6e1d1e6cd76a42c960cc2c61f9f
2018-11-15 12:22:06 -05:00
C. Scott Ananian
a11a6f619f Hard deprecate non-Remex tidy modes
Let's rip the band-aid off.  Remex is pure PHP so there's no reason to
be running any of the other tidy implementations any more, and we won't
be able to support them in the future.

Follow-up to 7b23382823.

Bug: T198214
Change-Id: Id3d07d44f8434231826e86e623554cac3decfa96
2018-09-21 09:48:38 -04:00
C. Scott Ananian
7b23382823 Soft deprecate non-Remex tidy configurations
Future parsers will not be able to emit output compatible with these
configurations.

Bug: T198214
Change-Id: Id7921a166a62457f289e6c0c4bba6c8563be4760
2018-09-20 15:10:44 +00:00
Tim Starling
690bc4cb6a RemexDriver: improved tracing
Use the new RemexHtml trace features. Add two more tracing modes.

Fix missing member variable declarations and remove unused local
variables.

Change-Id: I512462e1019f9a466684abfa4aab7697b324d5b1
2018-08-14 13:40:11 -07:00
Tim Starling
10c8cfea30 RemexCompatMunger: Don't call endTag() in case B/b
This was naïve: the linked bug documents a case where endTag() was
called despite children of the p-wrap still being in TreeBuilder's
stack. Instead, wait for the parent of the p-wrap to have endTag()
called on it. I've submitted a patch which will clean up the node in
that case.

Bug: T200827
Change-Id: I34694813eace9cadabf2db8f9ccca83d1368cfad
2018-08-07 14:07:31 +10:00
Arlo Breault
5a7f860b78 <ins>/<del> elements can be phrasing or flow
The changes to the parserTests.txt highlight the differing opinions that
doBlockLevels and Remex had on whether these should be paragraph wrapped.

Since the only time they wouldn't have been was when found on a line
with other flow tags, this likely isn't a behaviour that was depended on
in practice.  And, indeed, the task describes this as a bug.

A sampling of pages from an insource:/\<(ins|del)\>/ search on wiki bears
this out.

Bug: T17491
Change-Id: I311da777a63aa3c45013f2cfc090be35a022497e
2018-07-13 11:28:10 -04:00
Umherirrender
130ec2523d Fix PhanTypeMismatchDeclaredParam
Auto fix MediaWiki.Commenting.FunctionComment.DefaultNullTypeParam sniff

Change-Id: I865323fd0295aabd06f3e3c75e0e5043fb31069e
2018-07-07 00:34:30 +00:00
Bartosz Dziewoński
0313128b10 Use PHP 7 "\u{NNNN}" Unicode codepoint escapes in string literals
In cases where we're operating on text data (and not binary data),
use e.g. "\u{00A0}" to refer directly to the Unicode character
'NO-BREAK SPACE' instead of "\xc2\xa0" to specify the bytes C2h A0h
(which correspond to the UTF-8 encoding of that character). This
makes it easier to look up those mysterious sequences, as not all
are as recognizable as the no-break space.

This is not enforced by PHP, but I think we should write those in
uppercase and zero-padded to at least four characters, like the
Unicode standard does.

Note that not all "\xNN" escapes can be automatically replaced:
* We can't use Unicode escapes for binary data that is not UTF-8
  (e.g. in code converting from legacy encodings or testing the
  handling of invalid UTF-8 byte sequences).
* '\xNN' escapes in regular expressions in single-quoted strings
  are actually handled by PCRE and have to be dealt with carefully
  (those regexps should probably be changed to use the /u modifier).
* "\xNN" referring to ASCII characters ("\x7F" and lower) should
  probably be left as-is.

The replacements in this commit were done semi-manually by piping
the existing "\xNN" escapes through the following terrible Ruby
script I devised:

  chars = eval('"' + ARGV[0] + '"').force_encoding('utf-8')
  puts chars.split('').map{|char|
    '\\u{' + char.ord.to_s(16).upcase.rjust(4, '0') + '}'
  }.join('')
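
For comparison, a rough Python equivalent of the conversion, offered as a sketch only. The function name `to_unicode_escapes` is an assumption, and unlike the Ruby script it takes the raw bytes directly instead of evaluating an escaped string from the command line.

```python
def to_unicode_escapes(raw: bytes) -> str:
    """Turn UTF-8 bytes into PHP 7 "\\u{NNNN}" escapes (hypothetical helper).

    Code points are written in uppercase hex, zero-padded to at least four
    digits, matching the convention proposed in the commit message.
    Raises UnicodeDecodeError for input that is not valid UTF-8, which
    matches the caveat that binary data cannot be converted.
    """
    return "".join("\\u{%04X}" % ord(char) for char in raw.decode("utf-8"))


# The bytes C2h A0h are the UTF-8 encoding of U+00A0 NO-BREAK SPACE.
no_break_space = to_unicode_escapes(b"\xc2\xa0")
```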

Change-Id: Idc3dee3a7fb5ebfaef395754d8859b18f1f8769a
2018-06-04 16:20:13 +00:00
Kunal Mehta
853b8fe34c tidy: Remove obsolete Depurate and Balancer drivers
The Html5Depurate driver was intended to be used with an external Java
service, but it never gained traction due to deployment concerns.

The Html5Internal (Balancer) driver was originally intended for use with
the balanced templates proposal and could also handle tidying. But it was
tightly coupled to MediaWiki, so part of it was used as the basis of the
RemexHtml library. Remex most likely can also implement the balanced
templates proposal, so there isn't any reason to keep the Balancer code
around anymore.

Change-Id: I8542d69e9cdbf0e2fb7ebbb919933a64c1b8c293
2018-05-08 15:32:49 +00:00
Umherirrender
95ebece410 Add missing use statement
Change-Id: Id14d97b5b74edf6c6bafb29b643ac9b9357bb681
2018-04-27 23:13:43 +02:00
jenkins-bot
4e7673c5b0 Merge "Immediately drop wgValidateAllHtml and related code"
2018-04-12 05:29:53 +00:00
James D. Forrester
0da97e7a03 Immediately drop wgValidateAllHtml and related code
Bug: T191670
Change-Id: If13d02ee1b30fec1c701226af9d363c6e08b3737
2018-04-10 10:51:28 -07:00
Arlo Breault
25a08cc5f9 Munge inline elements found in tidy.conf as well
Bug: T184900
Bug: T184228
Change-Id: I421c4c7cf1eeeb6c44bb64081b49ae05937d1a8b
2018-04-04 20:20:38 -04:00
Fomafix
d59af4c341 Use PHP's implode() with the suggested order of arguments
https://secure.php.net/manual/en/function.implode.php defines the order
of arguments as

 string implode ( string $glue , array $pieces )
 string implode ( array $pieces )

Note:
  implode() can, for historical reasons, accept its parameters in
  either order. For consistency with explode(), however, it may be less
  confusing to use the documented order of arguments.

Change-Id: I03bf5712204e283f52d3ede54af9b9ec117d4280
2018-02-22 20:24:00 +01:00
Thiemo Mättig
409da2d8b3 Remove leading backslashes from "use \…" tags
Change-Id: I494b029de089a07e3b946ee78293a12d5036f63e
2017-12-28 16:30:05 +01:00