The Preprocessor_DOM implementation doesn't interact well with PHP memory
profiling, and has some limitations not present in the Preprocessor_Hash
implementation (see T216664). There is no reason to keep around two
versions of the preprocessor: it just complicates on-going wikitext
feature development.
Hard deprecate use of Preprocessor_DOM, so we can remove the redundant
code in a future release.
Bug: T204945
Depends-On: Id38c9360e4d02b570996dbf7a660f964f02f1a2c
Change-Id: Ica5d1ad5b1e677542962fc36d582a793f941155e
These global functions were deprecated in 1.34, and services were made
available to replace them. The replacements are:
* wfFindFile() - MediaWikiServices::getInstance()->getRepoGroup()->findFile()
* wfLocalFile() - MediaWikiServices::getInstance()->getRepoGroup()->getLocalRepo()->newFile()
NOTES:
* wfFindFile() and wfLocalFile() usages in tests have been ignored
in this change per @Timo's comments about state of objects.
* includes/upload/UploadBase.php is also left unchanged for now, as
changing it causes some failures I don't fully understand. I will
investigate and handle it in a follow-up patch.
* The same applies to includes/MovePage.php.
Change-Id: I9437494de003f40fbe591321da7b42d16bb732d6
DerivedPageDataUpdater::prepareContent already locks in the revision
timestamp before insertion, so inject that into the parser options
used for any pre-save parse (e.g. for edit filters).
This means that a reparse is no longer needed within the same save
request to get the post-save canonical output. A parse will still be
required if the edit filter output used an edit stash output, since
the revision timestamp is not set at stash time.
Instead of using vary-revision, add a vary-revision-timestamp flag
for the revision timestamp words. The month/day/hour variants retain
their prior optimizations for allowing edit stash output reuse for
the post-save canonical output.
Change-Id: Ic2c13db4d21197c79a89de0de56745ca32918eb6
This avoids a double parse when the edit stash is not used,
which can be confirmed via the SaveParse log for a page
using {{REVISIONID}} when edit stashing is disabled. This
now matches the reuse for the edit stash hit case.
Change-Id: I405c39d4d7ac04e39fbdfe400f73238b734c7833
This code is functionally identical, but less error-prone (it is harder
to forget or mix up these numerical indexes).
This patch happens to touch the Parser, which might be a bit scary. We
can remove this file from this patch if you prefer.
Change-Id: I8cbe3a9a6725d1c42b86e67678c1af15fbc5961a
This only applies to content namespaces for now since
the cost of vary-revision-id is much less of a concern.
The potential harm to page save time far outweighs what
use they have, which is almost entirely just hacks to check
for preview mode. These have nothing to do with the actual
revision ID nor timestamp itself. They simply check whether
the value is the empty string. Since this magic word still
only returns an empty string in preview mode, such checks
will keep working.
Bug: T137900
Depends-On: I1809354055513a5b9d9589e2d6acda7579af76e2
Change-Id: Ieff8423ae3804b42d264f630e1a029199abf5976
It's a temporary feature flag that was not included in any release, so
remove it outright. The functionality will now always be enabled.
Bug: T205040
Change-Id: Ia9da82e6f6b2d270f1790a99fc8c35ad5e6aee5e
HTML doesn't allow certain semicolon-less HTML entities in attribute
values to avoid breaking legacy markup like:
<a href="http://example.com?foo&param=bar">...</a>
(Note that the & in that URL is not properly entity-escaped as `&amp;`.)
Unlike wikitext, HTML generally allows semicolon-less legacy entities
in text.
Our alt and link option processing shove text through
Sanitizer::stripAllTags, which does entity decoding including these
legacy semicolon-less entities. Wikitext doesn't allow semicolon-less
entities, so escape & characters where appropriate to protect alt/link
options and avoid breaking URLs.
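As a rough illustration of the escaping idea (a language-neutral sketch
in Python, not the code in this patch): Python's `html.unescape` also
decodes legacy semicolon-less entities, so it stands in here for the
decoding step. The helper name `escape_bare_ampersands` and its regex
are assumptions made for this example only.

```python
import html
import re

# Matches an "&" that does NOT start a well-formed, semicolon-terminated
# entity (named, decimal, or hexadecimal).
_BARE_AMP = re.compile(r"&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#[xX][0-9A-Fa-f]+);)")


def escape_bare_ampersands(text: str) -> str:
    """Escape "&" unless it begins a semicolon-terminated entity."""
    return _BARE_AMP.sub("&amp;", text)


url = "http://example.com?foo&param=bar"

# An HTML5-style decoder treats "&para" as a legacy entity and mangles the URL.
mangled = html.unescape(url)

# Escaping first lets the URL survive the same decoding step unchanged.
protected = html.unescape(escape_bare_ampersands(url))
```

Properly terminated entities such as `&amp;` or `&lt;` are left alone,
so text that was already escaped is not double-escaped.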
This was a "regression" in how alt options were handled starting in
ddb4913f53 when we switched to using
Remex for Sanitizer::stripAllTags -- semicolon-less entities (previously
invalid in wikitext) were now being decoded when stripAllTags was
called on alt text. This change became a problem when
ad80f0bca2 sent link option text through
Sanitizer::stripAllTags (with the new semicolon-less entity decode)
instead of PHP's strip_tags (which, in addition to its other faults,
doesn't do entity decode at all). This suddenly started decoding
"non-wikitext" entities like `&para` inside URLs, breaking links.
Filed T210437 as a follow-up to consider changing the behavior
of Sanitizer::stripAllTags() globally to prevent it from decoding
semicolon-less entities for all callers.
Bug: T209236
Change-Id: I5925e110e335d83eafa9de935c4e06806322f4a9
This adds a method to LinkFilter to build the query conditions necessary
to properly use it, and adjusts code to use it.
This also takes the opportunity to clean up the calculation of el_index:
IPs are handled more sensibly and IDNs are canonicalized.
Also weird edge cases for invalid hosts like "http://.example.com" and
corresponding searches like "http://*..example.com" are now handled more
regularly instead of being treated as if the extra dot were omitted,
while explicit specification of the DNS root like "http://example.com./"
is canonicalized to the usual implicit specification.
Note that this patch will break link searches for links where the host
is an IP or IDN until refreshExternallinksIndex.php is run.
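To show why a host index of this kind helps link searches, here is a
minimal sketch in Python. It assumes the index stores host labels in
reverse order with a trailing dot; the exact stored format, and the IP,
IDN and invalid-host handling described above, are not reproduced. The
function names `index_host` and `like_prefix_for` are hypothetical.

```python
def index_host(host: str) -> str:
    """Reverse the labels of a host name so related hosts share a prefix."""
    host = host.lower()
    # Drop one explicit DNS root dot ("example.com." -> "example.com").
    if host.endswith("."):
        host = host[:-1]
    # Most-significant label first, with a trailing dot as terminator.
    return ".".join(reversed(host.split("."))) + "."


def like_prefix_for(pattern: str) -> str:
    """Turn a search like "*.example.com" into a SQL LIKE prefix pattern.

    Escaping of LIKE metacharacters in the host is omitted for brevity.
    """
    if pattern.startswith("*."):
        # Any subdomain: match everything under the reversed parent domain.
        return index_host(pattern[2:]) + "%"
    return index_host(pattern)
```

With reversed labels, a wildcard subdomain search becomes a plain prefix
match, which a database index can serve efficiently.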
Bug: T59176
Bug: T130482
Change-Id: I84d224ef23de22dfe179009ec3a11fd0e4b5f56d
Future parsers will not support the output generated with tidy disabled.
Parser tests using untidied output will also be deprecated (and
rewritten) in a follow-up patch.
No new release notes necessary since user-visible tidy configuration
was deprecated previously (in 1.32), and individual methods which had
disabled tidy during execution were individually release-noted as they
were updated.
Bug: T198214
Depends-On: I0f417f75a49dfea873e9a2f44d81796a48b9f428
Depends-On: If5c619cdd3e7f786687cfc2ca166074d9197ca11
Change-Id: I592e0e0dfef7d929f05c60ffe4d60e09725b39cc
Previously, they were always displayed in default language unless
forced explicitly in wikitext, e.g. [[File:Foo.svg|lang=ru]].
This change adds a feature flag that would enable always trying to
display in page language.
* If enabled, Parser will pass a new parameter - 'pagelang' - to
the media handler.
* SvgHandler uses page language when determining what language to
render the image in.
* 'pagelang' can always be overridden by 'lang'.
* If no translation in page language is available, the default
language (English) will be used for thumbnail URLs, to prevent
cluttering media storage and HTTP caches with useless copies.
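The selection rules in the list above can be sketched as follows. This
is an illustrative Python sketch, not the SvgHandler code; the function
name `pick_render_lang`, its parameters, and the `available` set of
translation languages are assumptions made for the example.

```python
from typing import Optional, Set


def pick_render_lang(
    lang: Optional[str],
    pagelang: Optional[str],
    available: Set[str],
    default: str = "en",
) -> str:
    """Choose the language used to render an SVG and name its thumbnail."""
    # An explicit |lang= option always wins over the page language.
    if lang:
        return lang
    # Use the page language only if the file actually has that translation.
    if pagelang and pagelang in available:
        return pagelang
    # Otherwise fall back to the default to avoid duplicate thumbnails.
    return default
```

Falling back to the default when no translation exists means pages in
many languages share one thumbnail instead of each creating a copy.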
Performance: this requires accessing image's metadata during parsing.
My testing indicates there was no code path where this wasn't the
case already, so no performance hit is expected. However, we should
still keep an eye on page save performance.
Bug: T205040
Change-Id: I348840ef405e1370cc0c17d69051bce30153c9c0
Use Parser::stripAltText() consistently to handle link and alt options
in both Parser::makeImage() and Parser::renderImageGallery(). This
ensures that link option text can use <nowiki> to escape problematic
text so that (for example) the following works:
```
[[File:Foobar.jpg|link=<nowiki>a''b''c</nowiki>|alt=<nowiki>a''b''c</nowiki>]]
<gallery>
File:Foobar.jpg|link=<nowiki>a''b''c</nowiki>|alt=<nowiki>a''b''c</nowiki>
</gallery>
```
Previously the handling of the link option in
Parser::renderImageGallery() used a bespoke `strip_tags` invocation
which didn't replace <nowiki>s, and the handling of the link option in
Parser::makeImage() didn't strip tags at all, nor did it replace
<nowiki>s. For example, in Parser::makeImage() doubled single quotes (`''`) in
titles would be converted to embedded `<i>` tags before being passed
to Parser::parseLinkParameter(), with predictably poor results.
Tests added to confirm behavior of alt/link with HTML-escaped
entities and <nowiki>s exposed a bug in Remex: T207088. Tests
will fail on PHP 7.0 until that is fixed.
Bug: T206940
Depends-On: Ide67bba20f771868c0e119cb2874464dcf1d758a
Change-Id: Ife4c0edaa85e0cb294c5d4c1e31d5d7d828d9df4
This injects the new, unsaved RevisionRecord object into the Parser used
for Pre-Save Transform, and sets the user and timestamp on that revision,
to allow {{subst:REVISIONUSER}} and {{subst:REVISIONTIMESTAMP}} to function.
Bug: T203583
Change-Id: I31a97d0168ac22346b2dad6b88bf7f6f8a0dd9d0
This replaces the builtin taints that are removed in
Ic1e1983a51c. Additionally, parse will no longer warn about
double escaping - there are many situations where such warnings
are wrong (e.g. when using Html::rawElement()). However, this also
means that Parser::parse( wfMessage( 'foo' )->parse() ); will
no longer give a double escaping warning, which is unfortunate.
Bug: T202380
Change-Id: Ia52d37411beb62b112c6ff102438063c3d750769
This is not strictly accurate, because Parser::internalParse() actually
returns half-parsed HTML, which is not safe for output. But it is safe for
output from a parser tag.
Maybe phan-taint-check plugin needs to learn about half-parsed HTML as an
extra taint type, and make that an acceptable thing for parser tags to return,
but not other things.
But this fixes the failures for the Listings extension, so I think it's
worthwhile in the meantime.
Change-Id: Idf87f5c3dcf81dd210de73a4ff15e3b1aabd9f89
RevisionRenderer is the MCR replacement for Content::getParserOutput,
as outlined in <https://www.mediawiki.org/wiki/User:Daniel_Kinzler_(WMDE)/MCR-PageUpdater>.
Note: This change also introduces quite a bit of code for
merging ParserOutput objects.
Bug: T194048
Change-Id: I871978bf79f67c9e7954fb3fc8528d6e365f2cc1