This fixes a bug introduced inadvertently in
9f14fbd002 when the static cache for
Sanitizer::getRecognizedTagData() was refactored to reduce setup
overhead.
Bug: T303360
Change-Id: I606eeda1fcdbb6d4c62e2dc8db5b6e1659ae3f3f
Rename Sanitizer::removeHTMLtags() to an @internal method named
::internalRemoveHtmlTags() so that we can deprecate external use.
Code search:
https://codesearch.wmcloud.org/deployed/?q=removeHTMLtags&i=nope&files=&excludeFiles=&repos=
Followup-To: Ic864c01471c292f11799c4fbdac4d7d30b8bc50f
Depends-On: Iaca83ed06e9c61d8366579cd2283cba653c82319
Depends-On: I1963bfe9a99198ea02ca482a5769467ce806cd58
Depends-On: I83923d8b38d33f3638cd53958dd10f257ec21f7c
Depends-On: I018b34bb5f6e113056da9b04cc72d4318422adce
Change-Id: I202826f8b27519f7be89643e24eda47a6e3fc9f6
This moves the taint information to be directly on the method,
moving it out of the SecurityCheckPlugin. See discussion on
Ieb202ef92bd9888ce767f8dd4d97f19eeb10a073.
We also fix a legit "double-escape" issue flagged by the phan
SecurityCheckPlugin once the correct taint information has been
added.
Followup-To: Ic864c01471c292f11799c4fbdac4d7d30b8bc50f
Change-Id: I0f873618d43cb6daf9c43394a669125469462223
The existing Sanitizer::removeHTMLtags() method, in addition to having
dodgy capitalization, uses regular expressions to parse the HTML.
That produces corner cases like T298401 and T67747 and is not guaranteed
to yield balanced or well-formed HTML.
Instead, introduce and use a new Sanitizer::removeSomeTags() method
which is guaranteed to always return balanced and well-formed HTML.
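To illustrate why a parser-driven approach can promise balanced output where
regular expressions cannot, here is a minimal Python sketch. It is not the
Remex-based implementation; TagFilter and remove_some_tags are hypothetical
names, attributes are discarded, void elements are not special-cased, and
disallowed tags are simply dropped, which may differ from what the real
method does with them.

```python
from html import escape
from html.parser import HTMLParser


class TagFilter(HTMLParser):
    """Hypothetical helper: keep only allowed tags, track what is open."""

    def __init__(self, allowed):
        super().__init__(convert_charrefs=True)
        self.allowed = set(allowed)
        self.open_tags = []  # stack of currently open allowed tags
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag in self.allowed:
            self.open_tags.append(tag)
            self.out.append(f"<{tag}>")  # attributes are dropped in this sketch

    def handle_endtag(self, tag):
        if tag not in self.open_tags:
            return  # stray or disallowed close tag: ignore it
        # Close inner tags first so the output stays properly nested.
        while self.open_tags:
            top = self.open_tags.pop()
            self.out.append(f"</{top}>")
            if top == tag:
                break

    def handle_data(self, data):
        # Text was entity-decoded by the parser; re-escape it on output.
        self.out.append(escape(data, quote=False))


def remove_some_tags(text, allowed):
    parser = TagFilter(allowed)
    parser.feed(text)
    parser.close()  # flush any buffered text
    while parser.open_tags:  # close whatever the input left open
        parser.out.append(f"</{parser.open_tags.pop()}>")
    return "".join(parser.out)
```

Because output is driven by a stack of open elements rather than by pattern
matching on the raw string, unclosed or misnested input still comes out
balanced.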
Note that Sanitizer::removeHTMLtags()/::removeSomeTags() take a callback
argument which (as far as I can tell) is never used outside core. Mark
that argument as @internal, and clean up the version used by
::removeSomeTags().
Use the new ::removeSomeTags() method in the two places where
DISPLAYTITLE is handled (following up on T67747). The use by the
legacy parser is more difficult to replace (and would have a
performance cost), so leave the old ::removeHTMLtags() method in place
for that call site for now: when the legacy parser is replaced by
Parsoid the need for the old ::removeHTMLtags() will go away. In a
follow-up patch we'll rename ::removeHTMLtags() and mark it @internal
so that we can deprecate ::removeHTMLtags() for external use.
Some benchmarking code added. On my machine, with PHP 7.4, the new
method tidies short 30-character title strings at a rate of about
6764/s while the tidy-based method being replaced here managed 6384/s.
Sanitizer::removeHTMLtags blazes through short strings roughly 18x faster
(120,915/s); some of this difference is due to the setup cost of
creating the tag whitelist and the Remex pipeline, so further
optimizations could doubtless be done if Sanitizer::removeSomeTags()
is more widely used.
Bug: T299722
Bug: T67747
Change-Id: Ic864c01471c292f11799c4fbdac4d7d30b8bc50f
In PHP 8.1 the default $flags argument to htmlspecialchars() has changed
from ENT_COMPAT to ENT_QUOTES | ENT_SUBSTITUTE | ENT_HTML401. This
breaks some tests.
I changed all the calls that break unit tests, and some others
based on a quick code review. A lot of callers just use the default for
convenience, and were already over-quoting, so the default should still
be good enough for them.
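For readers unfamiliar with the flags, the visible difference is in how
single quotes are treated. The following Python sketch mimics the two
behaviours; the function names are hypothetical, and ENT_SUBSTITUTE (which
concerns invalid byte sequences) is not modelled.

```python
def escape_ent_compat(text):
    # Mimics the old default (ENT_COMPAT): double quotes are escaped,
    # single quotes are left alone. "&" is replaced first on purpose.
    return (text.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace('"', "&quot;"))


def escape_ent_quotes(text):
    # Mimics ENT_QUOTES | ENT_HTML401: single quotes become &#039; too.
    return escape_ent_compat(text).replace("'", "&#039;")


sample = "it's \"ok\""
old_default = escape_ent_compat(sample)
new_default = escape_ent_quotes(sample)
```

A test that compares output byte for byte will therefore fail on any string
containing an apostrophe unless the caller passes the flags explicitly.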
Change-Id: Ie9fbeae6f0417c6cf29dceaf429243a135f9fecb
This is a bug fix release of RemexHtml, required by the latest version
of Parsoid.
RemexHtml migrated to a new namespace in 2.3.2. Since we don't
support aliases in our phan configuration in core, update all uses to
the new namespace to satisfy phan.
Depends-On: I30f01f4a2a5479bb82c9b952ffa68a478215828a
Depends-On: Iedf446635ee2112cfe637d8ebcf8092f0976bd17
Change-Id: I74fc929e4a66b28bfb1800ff0cd751c86e4a9f50
array_fill_keys() was introduced in PHP 5.2.0 and works like
array_flip() except that it does only one thing (copying keys) instead
of two things (copying keys and values). That makes it faster and more
obvious.
When array_flip() calls were paired, I left them as is, because that
pattern is too cute. I couldn't kill something so cute.
Sometimes it was hard to figure out whether the values in the array_flip()
result were used. That's the point of this change. If you use
array_fill_keys(), the intention is obvious.
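As a rough analogy in Python (not the PHP touched by this change), the two
idioms correspond to inverting a list into a mapping versus building a
mapping with one constant value. The tag names are made up for illustration.

```python
tags = ["b", "i", "span"]

# Like array_flip(): each value becomes a key and the old index becomes the
# value, so a reader has to check whether those indexes are ever used.
flipped = {tag: index for index, tag in enumerate(tags)}

# Like array_fill_keys($tags, true): only the keys matter, and it shows.
filled = dict.fromkeys(tags, True)
```

Both support the same membership test, but only the second makes it plain
that the stored values carry no meaning.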
Change-Id: If8d340a8bc816a15afec37e64f00106ae45e10ed
Our PortableInfobox extension uses the HTML5 <aside> tag in its generated HTML.
This tag isn't recognized as a block element (in the way e.g. <div> is) by the
legacy parser, resulting in some spurious empty paragraphs in the output.
As a fix, make the legacy parser aware of <aside> tags to avoid unnecessary
p-wrapping. Also add <aside> to the Sanitizer's internal attribute check.
I3e57f55ac69d2c1ee8a1d41c21b692e56fc7e628 takes care of updating Parsoid-PHP
accordingly.
Bug: T278565
Change-Id: I89dbdf7770e13e1b62320228a366c64e64217b0b
This was only necessary when French spacing was added before
doBlockLevels.
Follow up to I654a09b0f98937379b9fad3f325134ead7f2d8a6
Change-Id: I9bff6b7599d97c39334a0bd0f731f29875da17bb
We lost some insight in c44a395 because we're no longer analysing the
entire DOM as a serialized string, but instead running our regexp on
individual text nodes.
This patch as written here just allows for the space to be at the start
of the text node. However, some git spelunking shows that in 9dc65ef,
the condition for there being a non-whitespace character previous to the
space was only because armoring French spacing happened before
doBlockLevels and wanted to protect indent pre's.
That's certainly not the case anymore, so we can probably get away with
dropping the condition altogether now.
Bug: T275918
Change-Id: I654a09b0f98937379b9fad3f325134ead7f2d8a6
This also means we don't need to take special care for French spacing in
attributes, since it's no longer applied there.
Adds a test that captures this change.
Note that the test "Nowiki and french spacing" wonders whether this
escaping should be applied to nowiki content.
Bug: T255007
Change-Id: Ic8965e81882d7cf024bdced437f684064a30ac86
This is a micro-optimization of closure code to avoid binding the closure
to $this where it is not needed.
Created by I25a17fb22b6b669e817317a0f45051ae9c608208
Change-Id: I0ffc6200f6c6693d78a3151cb8cea7dce7c21653
These are mostly easy fixes. Tests were fixed when that didn't require
any change to the tested code, and moved to /integration otherwise.
MediaWikiUnitTestCase::setTemporaryHook was removed: the
caller should provide a HookContainer, at which point it would just
become a useless wrapper around HookContainer::register. (We don't
really need it to be temporary, if proper DI is used).
The method was only used in the tests touched by this commit.
Change-Id: I2aba02560c41b77eea9dd4bff0e4d1c4bb0da9a2
This was added in f6038b0 to keep Parsoid and the legacy parser in sync.
However, in T251641, we're moving away from using it in both.
Bug: T251641
Change-Id: I148bcf09e64ae443104723f94e6bbdb4ad23a8ef
Reduce code duplication by using the authoritative HTML entity list
from Remex, instead of duplicating the table inside MediaWiki.
This also extends the set of entities accepted in wikitext to nearly
match HTML5. (HTML5 allows some entities which are not
semicolon-terminated; wikitext insists on the semicolon.)
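A small Python sketch of the semicolon rule, using Python's own html5 entity
table as a stand-in for the Remex list (an assumption made for illustration;
the two tables are not guaranteed to match). decode_named_entities is a
hypothetical name, and numeric character references are left out.

```python
import re
from html.entities import html5

# Only names that end in a semicolon are considered. Python's table also
# lists some legacy names without one; this pattern never looks them up.
_ENTITY = re.compile(r"&([A-Za-z][A-Za-z0-9]*;)")


def decode_named_entities(text):
    def replace(match):
        # Unknown names are left untouched rather than guessed at.
        return html5.get(match.group(1), match.group(0))

    return _ENTITY.sub(replace, text)
```

With this rule "&amp;" decodes while a bare "&amp" stays literal, which is
the difference from HTML5 described above.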
This patch brings the core parser closer to Parsoid output, as in most
cases Parsoid already accepted the full HTML5 entity list.
(I873a6120e4bd1c69fee9da76d266e24e97a22add is a corresponding patch to
Parsoid to unify its copy of Sanitizer.)
Also deprecate Sanitizer::hackDocType() while we're updating it, since
this method should not be public.
Bug: T94603
Change-Id: Ia08bc261c3644f83109f13df04b692101b4e8ef2
Overly-long anchors can cause OOMs later on during TOC processing, and
are needless.
The method Sanitizer::escapeIdReferenceList() is also deprecated in
this patch, since it is a way to get around the ID length limit and
appears to be unused outside the Sanitizer class. Since the use
within Sanitizer (for ARIA attributes) appears safe, we'll just make
this private in a future release and avoid the potential that someone
will misuse this.
Bug: T251506
Change-Id: Ifce057b0c436eabec310f812394e86ee7123e7c8
Deprecating something means to say something nasty about it, or to draw
its character into question. For example, "this function is lazy and good
for nothing". Deprecatory remarks by a developer are generally taken as a
warning that violence will soon be done against the function in question.
Other developers are thus warned to avoid associating with the deprecated
function.
However, since wfDeprecated() was introduced, it has become obvious that
the targets of deprecation are not limited to functions. Developers can
deprecate literally anything: a parameter, a return value, a file
format, Mondays, the concept of being, etc. wfDeprecated() requires
every deprecatory statement to begin with "use of", leading to some
awkward sentences. For example, one might say: "Use of your mouth to
cough without it being covered by your arm is deprecated since 2020."
So, introduce wfDeprecatedMsg(), which allows deprecation messages to be
specified in plain text, with the caller description being optionally
appended. Migrate incorrect or grammatically awkward uses of wfDeprecated()
to wfDeprecatedMsg().
Change-Id: Ib3dd2fe37677d98425d0f3692db5c9e988943ae8
The future Parsoid parser will not support this, and it appears to be unused.
It could be reimplemented as an extension tag once it is removed from core.
Code search:
https://codesearch.wmflabs.org/search/?q=allowimagetag&i=fosho&files=&repos=
Bug: T254802
Change-Id: I1b532a7a8794766f8df6fdf375a6ffd78fee94e5
Migrate all callers of Hooks::run() to use the new
HookContainer/HookRunner system.
General principles:
* Use DI if it is already used. We're not changing the way state is
managed in this patch.
* HookContainer is always injected, not HookRunner. HookContainer
is a service, it's a more generic interface, it is the only
thing that provides isRegistered() which is needed in some cases,
and a HookRunner can be efficiently constructed from it
(confirmed by benchmark). Because HookContainer is needed
for object construction, it is also needed by all factories.
* "Ask your friendly local base class". Big hierarchies like
SpecialPage and ApiBase have getHookContainer() and getHookRunner()
methods in the base class, and classes that extend that base class
are not expected to know or care where the base class gets its
HookContainer from.
* ProtectedHookAccessorTrait provides protected getHookContainer() and
getHookRunner() methods, getting them from the global service
container. The point of this is to ease migration to DI by ensuring
that call sites ask their local friendly base class rather than
getting a HookRunner from the service container directly.
* Private $this->hookRunner. In some smaller classes where accessor
methods did not seem warranted, there is a private HookRunner property
which is accessed directly. Very rarely (two cases), there is a
protected property, for consistency with code that conventionally
assumes protected=private, but in cases where the class might actually
be overridden, a protected accessor is preferred over a protected
property.
* The last resort: Hooks::runner(). Mostly for static, file-scope and
global code. In a few cases it was used for objects with broken
construction schemes, out of horror or laziness.
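The injection principle above can be sketched in Python as follows. This is
an analogy only: the classes are simplified stand-ins rather than the real
PHP interfaces, and ExampleService, on_page_saved and the "PageSaved" hook
name are hypothetical.

```python
class HookContainer:
    """Stand-in for the generic service: stores and runs handlers by name."""

    def __init__(self):
        self._handlers = {}

    def register(self, hook, handler):
        self._handlers.setdefault(hook, []).append(handler)

    def is_registered(self, hook):
        return bool(self._handlers.get(hook))

    def run(self, hook, args):
        for handler in self._handlers.get(hook, []):
            if handler(*args) is False:
                return False  # a handler aborted the hook
        return True


class HookRunner:
    """Thin facade with one method per hook, cheap to build from a container."""

    def __init__(self, container):
        self._container = container

    def on_page_saved(self, title):  # hypothetical hook method
        return self._container.run("PageSaved", [title])


class ExampleService:
    """Hypothetical service: receives the container, derives its own runner."""

    def __init__(self, hook_container):
        self._hook_container = hook_container
        self._hook_runner = HookRunner(hook_container)

    def save(self, title):
        return self._hook_runner.on_page_saved(title)
```

Injecting the container keeps is_registered() available to the service,
while the runner stays a private convenience that is never passed around.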
Constructors with new required arguments:
* AuthManager
* BadFileLookup
* BlockManager
* ClassicInterwikiLookup
* ContentHandlerFactory
* ContentSecurityPolicy
* DefaultOptionsManager
* DerivedPageDataUpdater
* FullSearchResultWidget
* HtmlCacheUpdater
* LanguageFactory
* LanguageNameUtils
* LinkRenderer
* LinkRendererFactory
* LocalisationCache
* MagicWordFactory
* MessageCache
* NamespaceInfo
* PageEditStash
* PageHandlerFactory
* PageUpdater
* ParserFactory
* PermissionManager
* RevisionStore
* RevisionStoreFactory
* SearchEngineConfig
* SearchEngineFactory
* SearchFormWidget
* SearchNearMatcher
* SessionBackend
* SpecialPageFactory
* UserNameUtils
* UserOptionsManager
* WatchedItemQueryService
* WatchedItemStore
Constructors with new optional arguments:
* DefaultPreferencesFactory
* Language
* LinkHolderArray
* MovePage
* Parser
* ParserCache
* PasswordReset
* Router
setHookContainer() now required after construction:
* AuthenticationProvider
* ResourceLoaderModule
* SearchEngine
Change-Id: Id442b0dbe43aba84bd5cf801d86dedc768b082c7
This behavior has been deprecated, with a tracking category, since
1.28. Time to remove the temporary parameter added to
Sanitizer::removeHTMLtags() and (finally) tweak the behavior to match
HTML5.
Bug: T134423
Change-Id: I5c725175d05854139c95a2b3d8d35ff63cb6707b
Disabling tidy has been deprecated since 1.33. This cleans up the code
paths which still used untidy output.
Bug: T198214
Change-Id: I821ef3b8f59b272d983583d407b2f0794fe1e791
This is important for keyboard focusability of elements, for example so
that users with motor impairments can reach those elements.
This patch does not allow setting tabindex="-1" or tabindex > 0.
Allowing users to set tabindex > 0 seems like a terrible idea.
I don't see any valid reason for tabindex="-1" in wikitext, so
let's not allow that for now either.
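The resulting rule is small enough to sketch in Python. sanitize_tabindex is
a hypothetical name, and whether the real check trims whitespace or accepts
forms such as "00" is an assumption not settled by this message.

```python
def sanitize_tabindex(value):
    """Return "0" if the attribute may be kept, else None (attribute dropped)."""
    # Only tabindex="0" is allowed: it makes the element focusable without
    # changing the tab order. Negative and positive values are rejected.
    return "0" if value.strip() == "0" else None
```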
Bug: T247910
Change-Id: I5065b2deeb14bdb3682dd176b87f254ac6f2cf88
HTML5 says id attributes should not have whitespace, where
whitespace is defined as LF, CR, FF, TAB or SPACE (oddly enough
VT does not count). Firefox in my testing actually was fine with
these except CR. Nonetheless we should follow the spec, so this converts
these whitespace characters to _. I don't think this will
cause any back-compat issues, since it's very hard to make these
characters in wikitext (other than space, which was already
being converted) and basically requires either Lua or HTML entities
to make these (with FF seeming to be impossible).
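A minimal Python sketch of the conversion, assuming a plain one-for-one
replacement; escape_id_whitespace is a hypothetical name. The character
class is spelled out because Python's \s would also match VT, which the
HTML5 definition excludes.

```python
import re

# HTML5 whitespace for this purpose: TAB, LF, FF, CR and SPACE (not VT).
_HTML5_WHITESPACE = re.compile(r"[\t\n\f\r ]")


def escape_id_whitespace(anchor):
    return _HTML5_WHITESPACE.sub("_", anchor)
```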
Bug: T238385
Depends-On: Ie6fa40798f06a358f6082110b4d8cc0028c80321
Change-Id: Ie2b7c9429691e2c491c3359d5b400d8f078aa789
Currently, if you combine a valid percent encoding and an unescaped
character that is reserved in URLs in a headline, the TOC
link does not work. E.g. ==`%41== needs #`%2541 but we currently
generate #`%41 which matches ==`A== instead.
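The mismatch can be reproduced with a short Python sketch. It assumes the
remedy is to escape only those percent signs that begin a valid two-digit
hex sequence; escape_percent_sequences is a hypothetical name and the actual
patch may go about it differently.

```python
import re
from urllib.parse import unquote

# A "%" followed by two hex digits would be decoded by the browser.
_VALID_PERCENT = re.compile(r"%(?=[0-9A-Fa-f]{2})")


def escape_percent_sequences(anchor):
    return _VALID_PERCENT.sub("%25", anchor)


fragment = escape_percent_sequences("`%41")
```

Decoding the escaped fragment yields the original headline text again,
whereas decoding the unescaped one yields the text of a different headline.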
Tested in Firefox and Chrome.
Bug: T238385
Change-Id: Ice2bbf79bed612d488ed6feb7510035e9dfb33af
* Deprecate WebRequest::checkUrlExtension() and have it always return
true. This reverts the security fixes made for T30235.
* Remove IEUrlExtension. This is a helper for checkUrlExtension() which
is not used in any extensions.
* Remove CSS sanitization code which is specific to IE6. This reverts
the changes made to fix T57332, and related followups. I confirmed
that the relevant test cases do not result in XSS on IE8.
* Remove related tests.
Bug: T232563
Change-Id: I7318ea4a63210252ebc64968691d4f62d79a63e9
phan-taint-check (aka SecurityCheckPlugin) doesn't recognize
Sanitizer::stripAllTags' output as tainted in certain situations.
Add a @return-taint of tainted to ensure that it does, which
may result in more issues being reported.
Bug: T230234
Change-Id: I357c168417a26882c7c460df20f36ec2be401096
These methods should be made private in the next release, but
hard-deprecate them for 1.34.
Tweak the return value of the attribute whitelist to be an
associative rather than a sequential array, which makes the
lookup of allowed attributes more efficient and avoids an
array_flip for every HTML element sanitized.
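In Python terms the change resembles building a keyed mapping once instead
of converting a list on every call. The attribute names and
filter_attributes are illustrative only, not the real whitelist.

```python
# Sequential form: membership needs a scan, or a conversion on every call.
ALLOWED_SEQUENTIAL = ["class", "id", "lang", "style", "title"]

# Associative form: built once, then each lookup is a single key test.
ALLOWED_ASSOCIATIVE = dict.fromkeys(ALLOWED_SEQUENTIAL, True)


def filter_attributes(attributes, allowed=ALLOWED_ASSOCIATIVE):
    # Keep only attributes whose names appear as keys in the whitelist.
    return {name: value for name, value in attributes.items() if name in allowed}
```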
Bug: T221677
Change-Id: I17d734937accec6c2679dbe17328cf9554bd556a