Thijs/wiki.techinc.nl

Author	SHA1	Message	Date
C. Scott Ananian	94f193a894	SECURITY: Ensure emitted HTML is safe against Unicode NFC normalization CVE-2025-32699 Ensure that Unicode NFC normalization can be applied to our HTML output safely. Even though the W3C officially recommends against normalizing HTML https://www.w3.org/International/questions/qa-html-css-normalization#converting this is still easily done inadvertently, especially when using the MediaWiki action API which normalizes parameters and results by default. See also I671648603c4635a35585c860b4857f5ea085e47f in Parsoid, and T266140 / I2e78e660ba1867744e34eda7d00ea527ec016b71 for another similar issue. The following changes are made: * The various HTML serializers (Remex/Tidy-derived, as well as the Html::* helpers) are tweaked to entity-escape U+0338 wherever it appears. * Similarly, Message::escaped() is tweaked to entity-escape U+0338. * Finally, a post-processing pass is added to the OutputTransform pipeline to catch any remaining U+0338 and entity-escape them. This catches U+0338 added during any of the previous OutputTransform stages (like TOC insertion, section edit links, etc). When backporting this code will likely need to be moved to ParserOutput::getText(), as the OutputTransform pipeline wasn't added until MW 1.42. Bug: T387130 Change-Id: I66564e14e730f5393f4fa5780b80f24de6075af5	2025-04-10 15:56:06 +01:00
Umherirrender	6eec17e9a9	Add missing documentation to class properties (miscellaneous classes) Add doc-typehints to class properties found by the PropertyDocumentation sniff to improve the documentation. Once the sniff is enabled it avoids that new code is missing type declarations. This is focused on documentation and does not change code. Change-Id: I1da4b272a6b28c419cc8e860d142dae19ca0bbcf	2024-09-14 10:12:18 +02:00
Umherirrender	465777f188	Use const keyword for constant list of strings or ints Also changed visibility of some to private Change-Id: I113b040321d27c84fe9b807c162736909e96fb20	2024-09-11 23:16:24 +02:00
jenkins-bot	6039650aed	Merge "HtmlHelper: Fix entity encoding when $html5format = false"	2024-02-15 03:30:11 +00:00
James D. Forrester	102a4f8a35	build: Upgrade mediawiki/mediawiki-phan-config from 0.13.0 to 0.14.0 manually * Switch out raw Exceptions, mostly for InvalidArgumentExceptions. * Fake exceptions triggered to give Monolog a backtrace are for some reason "traditionally" RuntimeExceptions, instead, so we continue to use that pattern in remaining locations. * Just entirely give up on PostgresResultWrapper's resource vs. object mess. * Drop now-unneeded false positive hits. Change-Id: Id183ab60994cd9c6dc80401d4ce4de0ddf2b3da0	2024-02-10 02:22:41 +00:00
Bartosz Dziewoński	2fec813efa	HtmlHelper: Fix entity encoding when $html5format = false Follow-up to `84d0dff968`. Bug: T354361 Change-Id: I44a98f667a89d0baa25188fc6d43f92b3ad19b84	2024-02-09 21:38:23 +00:00
Dogu	29d8092f5f	Replace SerializerNode, Element, and Exception qualifiers with imports Change-Id: I34e3600632f11adb53847656c605daa3618ff0fa	2024-01-05 08:43:16 +00:00
James D. Forrester	468e69bccc	Namespace Sanitizer under \MediaWiki\Parser Bug: T166010 Change-Id: Id13dcbf7a0372017495958dbc4f601f40c122508	2023-09-21 05:39:23 +00:00
thiemowmde	9b03cde58e	Merge sequences of `if` that end doing the same thing anyway Motivation: * Avoid code duplication. * Hopefully make it easier to read. * Also order stuff from cheap to expensive, if possible. Change-Id: I575e3f2027ce60a0d0885be5b9bd3e07bc035eee	2023-06-16 16:09:42 +02:00
Matěj Suchánek	5b34ec2c1f	Remove deprecated code from tidy drivers Change-Id: I88f35425955ed5b189e0741268aa361582d0f1db	2022-11-28 18:05:34 +01:00
Tim Starling	0077c5da15	Use short array destructuring instead of list() Introduced in PHP 7.1. Because it's shorter and looks nice. I used regex replacement. Change-Id: I0555e199d126cd44501f859cb4589f8bd49694da	2022-10-21 15:33:37 +11:00
jenkins-bot	61cbd18ff3	Merge "parser: Use a <meta> tag for the internal TOC_PLACEHOLDER"	2022-09-09 21:12:34 +00:00
Arlo Breault	4703724fe8	Don't reconstruct formatting elements in figures Similar to I3c55eb5fb8055016f8c4f76d27d953f65ff621be in Parsoid Bug: T314059 Change-Id: I7b4e9df8490357f44d31d6a869fa9b7a15f029ea	2022-08-31 18:55:23 -04:00
C. Scott Ananian	0b10563895	parser: Use a <meta> tag for the internal TOC_PLACEHOLDER Split out from the I44045b3b9e78e change. This is consistent with what Parsoid will use for the TOC marker. Bug: T287767 Bug: T270199 Bug: T311502 Depends-On: I1f607cf1ef1b61fb4d2e1880de756fb94d5a6b22 Change-Id: Ie63eed07b9bca1bfa07d4c256aba3728cedd8f93	2022-08-16 06:05:17 +00:00
Matěj Suchánek	1865180ae7	Do minor code cleanup Remove dead code and fix typos. Should cause no change in behavior. Change-Id: I5d293b842bc93a28b8bcd799a31b5e6e30fe692e	2022-06-24 13:52:42 +02:00
Aryeh Gregor	7b791474a5	Use MainConfigNames instead of string literals, #4 Now largely automated: VARS=$(grep -o "'[A-Za-z0-9_]*'" includes/MainConfigNames.php | \ tr "\n" '|' | sed "s/|$/\n/;s/'//g") sed -i -E "s/'($VARS)'/MainConfigNames::\1/g" \ $(grep -ERIl "'($VARS)'" includes/) Then git add -p with lots of error-prone manual checking. Then semi-manually add all the necessary "use" lines: vim $(grep -L 'use MediaWiki\\MainConfigNames;' \ $(git diff --cached --name-only --diff-filter=M HEAD^)) I didn't bother fixing lines that were over 100 characters unless they were over 120 and triggered phpcs. Bug: T305805 Change-Id: I74e0ab511abecb276717ad4276a124760a268147	2022-04-26 19:03:37 +03:00
Aryeh Gregor	666ca1bdf3	Use MainConfigNames instead of string literals, #2 This covers all occurrences of /onfig->.*get( '/ in includes/. Undoubtedly there are still plenty more to go. Change-Id: I33196c4153437778496f40436bcde399638ac361	2022-04-13 18:55:46 +03:00
Umherirrender	1f71eccf63	phan: Disable null_casts_as_any_type setting Make phan stricter about null types by setting null_casts_as_any_type to false (the default in mediawiki-phan-config) Remaining false positive issues are suppressed. The suppression and the setting change can only be done together Bug: T242536 Bug: T301991 Change-Id: I0f295382b96fb3be8037a01c10487d9d591e7e01	2022-03-21 18:25:07 +00:00
Umherirrender	44fd53fee3	Using @return never documentation on always-throw-function This helps phan to detect unreachable code and also impossible types after the functions. It helps phan to avoid false positives for array keys when the keys are checked before Bug: T240141 Change-Id: I895f70e82b3053a46cd44135b15437e6f82a07b2	2021-09-07 17:29:03 +02:00
C. Scott Ananian	b1f53045d7	Bump wikimedia/remex-html to 2.3.2 and drop 2.3.1 This is a bug fix release of RemexHtml, required by the latest version of Parsoid. RemexHtml migrated to a new namespace in 2.3.2. Since we don't support aliases in our phan configuration in core, update all uses to the new namespace to satisfy phan. Depends-On: I30f01f4a2a5479bb82c9b952ffa68a478215828a Depends-On: Iedf446635ee2112cfe637d8ebcf8092f0976bd17 Change-Id: I74fc929e4a66b28bfb1800ff0cd751c86e4a9f50	2021-08-08 18:07:29 -04:00
C. Scott Ananian	2fa79194ad	Allow core to use remex-html 2.3.2 This is a bug fix release of RemexHtml, required by the latest version of Parsoid. RemexHtml migrated to a new namespace in 2.3.2 and uses aliases for compatibility. Once we upgrade mediawiki-vendor we can rename all the uses in core and turn off aliases again. Due to T287419, we need to suppress some phan issues because phan ends up running against both remex 2.3.1 and 2.3.2 in different CI jobs. These suppressions are removed in the follow up I74fc929e4a66b28bfb1800ff0cd751c86e4a9f50. Change-Id: I42edd4fb8cd277ea20e331994fcbe56b52bf3f06	2021-08-08 17:55:15 -04:00
Umherirrender	886643796c	docs: Fix @var comments to use doc comment syntax @var needs /*-comments to work, not /-comments Change-Id: If54b3f24d4ca49036fa91aa4c72fab6d841fcc9e	2021-04-29 22:48:52 +00:00
C. Scott Ananian	e99cf5c98d	Deprecate MWTidy and TidyDriverBase::supportsValidate() Also copied the tests that used to be in TidyTest into RemexDriverTest, so that we're not losing coverage when MWTidy is eventually removed. Bug: T198214 Change-Id: I0b301f6c98d0943ce4b6dc224f1066cb7bf244d1	2021-03-16 12:29:55 -07:00
C. Scott Ananian	1fd4a7af4e	Introduce Tidy service Refactor the old MWTidy singleton as a DI service. Change-Id: I95605ea5fd22f53a7f90fe07a6a73fa6c959597a	2021-03-15 17:22:36 -04:00
C. Scott Ananian	5d317c25be	Parser: Move Sanitizer::normalizeCharReferences into RemexCompatFormatter Choosing a particular encoding of HTML entities is logically a task of the Remex formatter (which serializes HTML). Move it out of the Parser so that it is part of the serialization specification. This is a follow up to Ic8965e81882d7cf024bdced437f684064a30ac86. Change-Id: If45907baf24d60987b39cd1f7709c5f7caf19f37	2021-03-15 17:20:14 -04:00
Arlo Breault	c44a3958a3	Don't apply French spacing in raw text elements This also means we don't need to take special care for French spacing in attributes, since it's no longer applied there. Adds a test that captures this change. Note that the test "Nowiki and french spacing" wonders whether this escaping should be applied to nowiki content. Bug: T255007 Change-Id: Ic8965e81882d7cf024bdced437f684064a30ac86	2021-02-16 19:26:29 -05:00
Umherirrender	8de3b7d324	Use static closures where safe to use This is micro-optimization of closure code to avoid binding the closure to $this where it is not needed. Created by I25a17fb22b6b669e817317a0f45051ae9c608208 Change-Id: I0ffc6200f6c6693d78a3151cb8cea7dce7c21653	2021-02-11 00:13:52 +00:00
DannyS712	94169ee873	Whitespace cleanup: Use tabs for indentation, avoid double spaces Change-Id: I346073b59d283029bd6666356c62c81e687ea5e6	2020-06-27 07:53:07 +00:00
James D. Forrester	4f2d1efdda	Coding style: Auto-fix MediaWiki.Classes.UnsortedUseStatements.UnsortedUse Change-Id: I94a0ae83c65e8ee419bbd1ae1e86ab21ed4d8210	2020-01-10 09:32:25 -08:00
Umherirrender	0688dd7c6d	Set method visibility for various constructors Change-Id: Id3c88257e866923b06e878ccdeddded7f08f2c98	2019-12-03 20:17:30 +01:00
Umherirrender	c7ad21c25f	Improve param docs Change-Id: I746a69f6ed01c3ff000da125457df62b02d13b34	2019-11-28 19:08:59 +01:00
Derick Alangi	d3b7cb742f	tidy: Remove unused var and define $parts var to avoid undefined error Remove unused variable $parent in RemexCompatMunger::comment(). Also, RemexMungerData::dump() could have a possibility that all checks fail and $parts is not defined. There are two ways we can handle this, i.e. either by doing `$parts = []`(setting $parts to an empty array) or by safe guarding using an `isset()` check. This patch uses the former so that $parts is defined and can be used below in the code. Change-Id: I4d601a6fe36a1dce0945686cb9880336d08338be	2019-06-10 14:34:54 +01:00
Reedy	c13fee87d4	Collapse some nested if statements Change-Id: I9a97325d738d09370d29d35d5254bc0dadc57ff4	2019-04-04 19:02:22 +00:00
Max Semenik	e6818e6c64	Fix unused vars/pointless assignments Change-Id: If475c738b4af7208024c866594d4c0048af053dd	2019-03-29 16:52:48 -07:00
Brad Jorsch	4597559d84	RemexCompatMunger: Don't split p-wrapping on style/link tags <style> and <link> tags are metadata tags, they shouldn't split the <p> tag when p-wrapping content. Bug: T208901 Change-Id: I2ef5da68c9ccde4477d8295dfe4abf8497c5d26e	2019-01-30 09:10:24 -08:00
C. Scott Ananian	6db35b3c98	Remove most support for configuring Tidy, including Raggett Remex is pure PHP so there is no reason to use an external tidy any more. Configuration variables and implementation classes were deprecated in 1.32 or earlier. We've kept only $wgTidyConfig which can be used for experimental features or debugging Remex. Bug: T198214 Change-Id: I99d48f858d97b6e1d1e6cd76a42c960cc2c61f9f	2018-11-15 12:22:06 -05:00
C. Scott Ananian	a11a6f619f	Hard deprecate non-Remex tidy modes Let's rip the band-aid off. Remex is pure PHP so there's no reason to be running any of the other tidy implementations any more, and we won't be able to support them in the future. Follow-up to `7b23382823`. Bug: T198214 Change-Id: Id3d07d44f8434231826e86e623554cac3decfa96	2018-09-21 09:48:38 -04:00
C. Scott Ananian	7b23382823	Soft deprecate non-Remex tidy configurations Future parsers will not be able to emit output compatible with these configurations. Bug: T198214 Change-Id: Id7921a166a62457f289e6c0c4bba6c8563be4760	2018-09-20 15:10:44 +00:00
Tim Starling	690bc4cb6a	RemexDriver: improved tracing Use the new RemexHtml trace features. Add two more tracing modes. Fix missing member variable declarations and remove unused local variables. Change-Id: I512462e1019f9a466684abfa4aab7697b324d5b1	2018-08-14 13:40:11 -07:00
Tim Starling	10c8cfea30	RemexCompatMunger: Don't call endTag() in case B/b This was naïve, the linked bug documents a case where endTag() was called despite children of the p-wrap still being in TreeBuilder's stack. Instead, wait for the parent of the p-wrap to have endTag() called on it, I've submitted a patch which will clean up the node in that case. Bug: T200827 Change-Id: I34694813eace9cadabf2db8f9ccca83d1368cfad	2018-08-07 14:07:31 +10:00
Arlo Breault	5a7f860b78	<ins>/<del> elements can be phrasing or flow The changes to the parserTests.txt highlight the differing opinions that doBlockLevels and Remex had on whether these should be paragraph wrapped. Since the only time they wouldn't have been was when found on a line with other flow tags, this likely isn't a behaviour that was depended on in practice. And, indeed, the task describes this as a bug. A sampling of pages from an insource:/\<(ins|del)\>/ search on wiki bears this out. Bug: T17491 Change-Id: I311da777a63aa3c45013f2cfc090be35a022497e	2018-07-13 11:28:10 -04:00
Umherirrender	130ec2523d	Fix PhanTypeMismatchDeclaredParam Auto fix MediaWiki.Commenting.FunctionComment.DefaultNullTypeParam sniff Change-Id: I865323fd0295aabd06f3e3c75e0e5043fb31069e	2018-07-07 00:34:30 +00:00
Bartosz Dziewoński	0313128b10	Use PHP 7 "\u{NNNN}" Unicode codepoint escapes in string literals In cases where we're operating on text data (and not binary data), use e.g. "\u{00A0}" to refer directly to the Unicode character 'NO-BREAK SPACE' instead of "\xc2\xa0" to specify the bytes C2h A0h (which correspond to the UTF-8 encoding of that character). This makes it easier to look up those mysterious sequences, as not all are as recognizable as the no-break space. This is not enforced by PHP, but I think we should write those in uppercase and zero-padded to at least four characters, like the Unicode standard does. Note that not all "\xNN" escapes can be automatically replaced: * We can't use Unicode escapes for binary data that is not UTF-8 (e.g. in code converting from legacy encodings or testing the handling of invalid UTF-8 byte sequences). * '\xNN' escapes in regular expressions in single-quoted strings are actually handled by PCRE and have to be dealt with carefully (those regexps should probably be changed to use the /u modifier). * "\xNN" referring to ASCII characters ("\x7F" and lower) should probably be left as-is. The replacements in this commit were done semi-manually by piping the existing "\xNN" escapes through the following terrible Ruby script I devised: chars = eval('"' + ARGV[0] + '"').force_encoding('utf-8') puts chars.split('').map{|char| '\\u{' + char.ord.to_s(16).upcase.rjust(4, '0') + '}' }.join('') Change-Id: Idc3dee3a7fb5ebfaef395754d8859b18f1f8769a	2018-06-04 16:20:13 +00:00
Kunal Mehta	853b8fe34c	tidy: Remove obsolete Depurate and Balancer drivers The Html5Depurate driver was intended to be used with an external Java service, but it never gained traction due to deployment concerns. The Html5Internal (Balancer) driver was originally intended for use with the balanced templates proposal and could also handle tidying. But it was tightly coupled to MediaWiki, so part of it was used as the basis of the RemexHtml library. Remex most likely can also implement the balanced templates proposal, so there isn't any reason to keep the Balancer code around anymore. Change-Id: I8542d69e9cdbf0e2fb7ebbb919933a64c1b8c293	2018-05-08 15:32:49 +00:00
Umherirrender	95ebece410	Add missing use statement Change-Id: Id14d97b5b74edf6c6bafb29b643ac9b9357bb681	2018-04-27 23:13:43 +02:00
jenkins-bot	4e7673c5b0	Merge "Immediately drop wgValidateAllHtml and related code"	2018-04-12 05:29:53 +00:00
James D. Forrester	0da97e7a03	Immediately drop wgValidateAllHtml and related code Bug: T191670 Change-Id: If13d02ee1b30fec1c701226af9d363c6e08b3737	2018-04-10 10:51:28 -07:00
Arlo Breault	25a08cc5f9	Munge inline elements found in tidy.conf as well Bug: T184900 Bug: T184228 Change-Id: I421c4c7cf1eeeb6c44bb64081b49ae05937d1a8b	2018-04-04 20:20:38 -04:00
Fomafix	d59af4c341	Use PHP's implode() with the suggested order of arguments https://secure.php.net/manual/en/function.implode.php defines the order of arguments as string implode ( string $glue , array $pieces ) string implode ( array $pieces ) Note: implode() can, for historical reasons, accept its parameters in either order. For consistency with explode(), however, it may be less confusing to use the documented order of arguments. Change-Id: I03bf5712204e283f52d3ede54af9b9ec117d4280	2018-02-22 20:24:00 +01:00
Thiemo Mättig	409da2d8b3	Remove leading backslashes from "use \…" tags Change-Id: I494b029de089a07e3b946ee78293a12d5036f63e	2017-12-28 16:30:05 +01:00
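
Note on commit 94f193a894 (CVE-2025-32699): the reason U+0338 needs escaping can be shown with a small sketch. Under NFC, a `>` followed by U+0338 (COMBINING LONG SOLIDUS OVERLAY) composes into U+226F (NOT GREATER-THAN), so the `>` that closed a tag disappears. The Python below is only an illustration, not MediaWiki's PHP code; the helper name `escape_u0338` and the `&#x338;` reference form are assumptions made for this example.

```python
import unicodedata

# U+0338 COMBINING LONG SOLIDUS OVERLAY, the character the commit escapes.
U0338 = "\u0338"


def escape_u0338(html: str) -> str:
    """Hypothetical helper: replace each raw U+0338 with a numeric
    character reference (the exact reference form is an assumption)."""
    return html.replace(U0338, "&#x338;")


# Text content that begins with U+0338, directly after a tag's closing '>'.
unsafe = '<b title="x">' + U0338 + "text</b>"

# NFC composes '>' + U+0338 into U+226F, which removes the tag delimiter.
normalized_unsafe = unicodedata.normalize("NFC", unsafe)

# After escaping, the string is pure ASCII and NFC leaves it unchanged.
normalized_safe = unicodedata.normalize("NFC", escape_u0338(unsafe))

print(normalized_unsafe.count(">"), normalized_safe.count(">"))
```

The same composition applies to `<` (U+226E) and `=` (U+2260), which is why escaping the combining character itself, rather than special-casing each preceding delimiter, is the simpler approach in this sketch.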


105 commits