Commit graph

70 commits

Author SHA1 Message Date
jenkins-bot
be2ff28b48 Merge "Reorg: Move MagicWord related files to under parser/" 2022-12-11 18:15:48 +00:00
Amir Sarabadani
a1b4699fea Reorg: Move MagicWord related files to under parser/
This is approved as part of T166010 RFC.

Bug: T321882
Change-Id: Ia4498c0a20e38a6a288dc14065ea8242c84fbc49
2022-12-09 13:48:35 +01:00
thiemowmde
3d65c37ba4 Fix remaining link numeric ids in LinkHolderArrayTest
These specific placeholders are meant to stay untouched in the test.
Because of this it technically doesn't matter if the namespace numbers
and link ids (from Parser::nextLinkID) are numeric. Still I find it
less confusing when the test setup reflects how the code actually
behaves. An IWLINK is identified by a single numeric id, and a LINK by
a namespace:numeric id pair.

This is split from Ie994059 to make it easier to review.

Change-Id: I1141de06eb3235d199a3dc02ef0a24d0d4fe1416
2022-12-09 12:04:40 +01:00
thiemowmde
c05e0183ad Fix bogus test setup in LinkHolderArrayTest
These array keys are all meant to come from Parser::nextLinkID(). That
returns a positive integer. Let's please make the tests reflect what
the code actually does to reduce confusion.

I tried my best to come up with a mapping that still (possibly even
better) makes it obvious what's happening.

Change-Id: Ib5d5f6f48ddc8c066961ea0f36a922ae9f0c6f6c
2022-12-09 12:01:18 +01:00
thiemowmde
f846177e33 Fix bogus non-numeric namespaces in LinkHolderArrayTest
The first level in LinkHolderArray::$internals is indexed by numeric
namespace id. This must be a number. Let's please make the test
reflect what the code actually does.

Note the same mistake was already fixed before in I7e71ffc.

Change-Id: I14f717510a71adf8ea3f75e61ba844d93838f945
2022-12-09 11:46:26 +01:00
Amir Sarabadani
09b18a8f4c Reorg: Move Title-related classes to title/
These three classes:
 - TitleArray
 - TitleArrayFromResult
 - TitleFactory

We need to move these and the rest of files under title/ to Title/ (and
namespace them) but the patch will become way too big given that Title class is
also one of them.

Bug: T321882
Change-Id: Iac1688172ee457348a08a470c86e047571feb8e0
2022-11-26 09:30:32 +00:00
C. Scott Ananian
6153062d48 Protect against long match length in CHAR_REFS_REGEX
Some malformed pages contain "character references" that were so long
that they caused PHP's `hexdec` to return a `float` instead of an
`int`.  This caused Parsoid to crash on a type hint on the argument to
Sanitizer::validateCodepoint().  MediaWiki core has the same issue,
but doesn't have the type hint (yet), so soft-fails instead of
crashes.  Add sanity checks around each call to `hexdec` to protect
against arbitrarily-long entity strings (while allowing arbitrary
zero-padding), and add a note to `intval` to explain why it is not
similarly affected.  New test cases added to SanitizerUnitTest as
well.

Corresponding patch on the Parsoid side:
Ic33196961bb2b86290148fbc3ce33bcd8b28ab56
(And see T247804 re: eventually removing this duplicate code.)

Bug: T322892
Change-Id: I5085c4edbb86e282b92536d05b01ed5f9d5c615e
2022-11-17 16:47:39 -05:00
Aaron Schulz
a58e677f91 parsoid: inject UrlUtils to avoid phpunit failures in SiteConfigTest
Previously, prior changes to the global UrlUtils instance could interfere
with testInterWikiMap() and trigger a BadMethodCallException exception in
UrlUtils::expand().

Bug: T297078
Change-Id: If75a91e33c5881244de388d203d0fc4879000f46
2022-08-26 23:39:57 +00:00
Michał Turek
1d87b8e7b3 Sanitizer: Don't consider inline var CSS insecure
Since (T208881) "CSS using var() to create exponential sized calc() on wiki page will crash visitor's browser" was fixed by disabling var in inline CSS, the issue with browser crashes appears to have been fixed in Firefox, Chrome, modern Edge, and Opera.
This change reverts T208881.

Bug: T288201
Change-Id: I387a0e9fdd02faa69616890c613462c83b91b789
2022-08-24 07:17:28 +00:00
Umherirrender
167fb2a979 unit tests: Use MainConfigNames constant to refer configs
When creating ServiceOptions objects or fake HashConfigs use the
constant to refer the config name

Change-Id: I59a29f25b76e896c07e82156c6cc4494f98e64cc
2022-08-17 22:33:58 +02:00
jenkins-bot
b07027f575 Merge "Replace trivial usa of mock builder with createMock() shortcut" 2022-07-19 11:41:12 +00:00
Thiemo Kreuz
3703b30618 Tests: Use createNoOpMock() shortcut in a few more places
… instead of doing the same manually with anythingBut() and such.

Change-Id: Idb66040d1560a82df9a5bfa2a6c7e20a0649e49c
2022-07-18 21:25:31 +00:00
Thiemo Kreuz
61ae7504df Replace trivial usa of mock builder with createMock() shortcut
createMock() does the same, but is much easier to read.

A small difference is that some of the replacements made in this
patch didn't use disableOriginalConstructor() before. In case this
was relevant we should see the respective test fail. If not we can
save some CPU cycles and skip these constructors.

Change-Id: Ib98fb06e0fe753b7a53cb087a47e1159515a8ad5
2022-07-15 16:43:48 +00:00
Isabelle Hurbain-Palatin
61c14054f5 Ensure core compatibility with Parsoid external link attributes support
* Export nofollow and target settings in siteinfo API so that Parsoid's
  developer mode of ApiSiteConfig works.
* Implement SiteConfig::getNoFollowConfig and
  SiteConfig::getExternalLinkTarget, which are defined as abstract
  in the parent class in Parsoid.

Bug: T186241
Change-Id: I6a1f12335be19509d4c5a17e2cae96ecdb677103
2022-06-24 19:12:43 -04:00
jenkins-bot
b494330aa7 Merge "ParserCache: always use JSON" 2022-06-07 14:12:29 +00:00
daniel
697f28df32 ParserCache: always use JSON
When JSON support was introduced into ParserCache in 1.36, it was
controlled by a feature flag, $wgParserCacheUseJson. The feature flag
was "born deprecated" in 1.36. It can now be removed.

This means that ParserCache will always store entries as JSON.
Support for reading old non-JSON entries remains intact.
This is needed when updating wikis from a version older than 1.36
to the current version.

Change-Id: Id04e42bfb458d98414bac50e0d6c505e8878e5c0
2022-06-07 15:19:45 +02:00
Reedy
41c42d5435 Tests: Cleanup some unnecessary nested function calls
Replace ->will( ->return with ->willReturn(

Change-Id: Ia2dfafa03cac8169d86d6fa5a30b73bfad1fe9fa
2022-06-06 01:02:34 +01:00
Derick Alangi
13f6ec9e1b Rest: Migrate parsoid stashing logic from RESTbase
Add stash option to /page/html & /revision/html endpoints.
When this option is set, the PageBundle returned by Parsoid is
stashed and an etag is returned that can later be used to
make use of the stashed PageBundle.

The stash is for now backed by the BagOStuff returned by
ObjectCache::getLocalClusterInstance().

This patch adds additional data to the ParserOutput stored in ParserCache.
Old entries lacking that data will be ignored.

Bug: T267990
Co-Authored-by: Nikki <nnikkhoui@wikimedia.org>
Change-Id: Id35f1423a69e3ff63e4f9883b3f7e3f9521d81d5
2022-05-23 17:28:29 +01:00
Bartosz Dziewoński
f7705d976a ParserObserver: Only report duplicate parse if the content is the same
Bug: T303596
Change-Id: Ib3b00a8cfabeb12723ac6a441495d72fd0c0ca92
2022-05-14 02:13:25 +02:00
Aryeh Gregor
b85391120b Use UrlUtils in Parser
Change-Id: I65f851ea29efe482ee225565a200d623fa85bc20
2022-04-28 17:14:51 +03:00
Tim Starling
d6a3b6cfa8 TempUser EditPage and permissions
* Allow EditPage to create a user on page save. This has to be enabled
  in config and then activated by the UI/API caller.
* Add an autocreate source for temporary users.
* Allow editing by anonymous users via automatic account creation when
  $wgGroupPermisions['*']['edit'] = false. On an edit GET request, use
  an unsaved placeholder user to stand in for post-create permissions.
* On preview or aborted save, the username to be created is stashed in a
  session and restored on subsequent requests.
* On a (likely) successful page save, create the account.
* Put regular non-temporary users in a "named" group so that they can be
  given additional permissions.
* Use a different "~~~" signature for temporary users
* Show account creation warnings on edit and preview.

Change-Id: I67b23abf73cc371280bfb2b6c43b3ce0e077bfe5
2022-04-26 14:10:53 +10:00
Tim Starling
13c1839735 Fix SignatureValidatorFactory circular dependency
Parser is using the service container to get a SignatureValidator
because, as noted in Gerrit comments on the relevant commit, there is a
circular dependency Parser -> SignatureValidatorFactory -> Parser.

So, have SignatureValidatorFactory::__construct() take a closure which
returns a Parser, instead of an actual Parser or ParserFactory.

Change-Id: I7bf4660f84ec8c8fb1d5b3b8581fe5d82bc3156e
2022-04-13 12:38:00 +10:00
Isabelle Hurbain-Palatin
0548babd46 Revert "Add temporary ParsoidSiteConfigInit hook"
This reverts commit 5a9f0300e4573ce2a5f1b7869c9bdff2dc5d4708 on Parsoid
- that code has been moved to Core in the meantime.

Bug: T303029
Depends-On: 9b9ab2cdd6fd2dbb00e38f652b2856ca5860bffb
Change-Id: I34243250542785804ddf6c8210d22715ba9e91fc
2022-04-07 20:16:57 +00:00
C. Scott Ananian
2d66ee70a2 Copy over Parsoid's Config and ServiceWiring classes
* This is the first step of migrating Parsoid integration code into
  core and transitioning Parsoid from an extension to a pure library.

* Parsoid already has conditional code to skip loading Parsoid's
  copy of its classes, but it relies on the existence of ParsoidServices.
  Technically ParsoidServices isn't needed once Parsoid is migrated to
  core -- users can just use MediaWikiServices instead -- but we need
  to temporarily add ParsoidServices as a marker class during the
  transition.

This version of Parsoid's ServiceWiring comes from Parsoid commit
898c813fd832b3f2d7b5a37f60bd65e8368ce18f.

Bug: T302118
Change-Id: I0b388d93143a782c2c3b72e46407572e5c586e4a
2022-03-28 12:36:38 -04:00
C. Scott Ananian
9f14fbd002 Add Sanitizer::removeSomeTags() which uses Remex to tokenize
The existing Sanitizer::removeHTMLtags() method, in addition to having
dodgy capitalization, uses regular expressions to parse the HTML.
That produces corner cases like T298401 and T67747 and is not guaranteed
to yield balanced or well-formed HTML.

Instead, introduce and use a new Sanitizer::removeSomeTags() method
which is guaranteed to always return balanced and well-formed HTML.

Note that Sanitizer::removeHTMLtags()/::removeSomeTags() take a callback
argument which (as far as I can tell) is never used outside core. Mark
that argument as @internal, and clean up the version used by
::removeSomeTags().

Use the new ::removeSomeTags() method in the two places where
DISPLAYTITLE is handled (following up on T67747).  The use by the
legacy parser is more difficult to replace (and would have a
performace cost), so leave the old ::removeHTMLtags() method in place
for that call site for now: when the legacy parser is replaced by
Parsoid the need for the old ::removeHTMLtags() will go away.  In a
follow-up patch we'll rename ::removeHTMLtags() and mark it @internal
so that we can deprecate ::removeHTMLtags() for external use.

Some benchmarking code added.  On my machine, with PHP 7.4, the new
method tidies short 30-character title strings at a rate of about
6764/s while the tidy-based method being replaced here managed 6384/s.
Sanitizer::removeHTMLtags blazes through short strings 20x faster
(120,915/s); some of this difference is due to the set up cost of
creating the tag whitelist and the Remex pipeline, so further
optimizations could doubtless be done if Sanitizer::removeSomeTags()
is more widely used.

Bug: T299722
Bug: T67747
Change-Id: Ic864c01471c292f11799c4fbdac4d7d30b8bc50f
2022-03-04 14:06:02 -05:00
daniel
026133bb05 remove access to config globals from includes/parser
Loops ServiceOptions through to CoreParserFunctions and CoreTagHooks to
avoid access to the main config from static methods.

Bug: T294739
Change-Id: Ia6c97f2d0952964c2ad6189f8053ad127589b37c
2022-02-01 07:48:57 -08:00
jenkins-bot
d974510c56 Merge "Make Sanitizer::stripAllTags() strip css and js tag contents" 2021-12-23 12:44:22 +00:00
Derk-Jan Hartman
8e06927190 Make Sanitizer::stripAllTags() strip css and js tag contents
We use Sanitizer::stripAllTags primarily to remove formatting from
html so that we can use it in places like notifications, emails,
search result blurbs etc etc.

It is very unlikely we want the raw contents of css and/or js tags
anywhere in those places, so lets surpress that content, to make it
more readable as template styles are showing up in more and more
places.

Bug: T228856
Change-Id: I7930361068ddcf3a6c2fdebd0177d142f025b64f
2021-12-22 23:26:17 +00:00
Umherirrender
535aa27593 Remove Parser dependency on config LanguageCode/DisableLangConversion
Use the value from corresponding services,
for consistency if services are injected from outside of service wiring.

Change-Id: Ib0f6af20df8dbc0deae71023e5493524d43ce211
2021-12-18 16:18:19 +00:00
Umherirrender
86f7c83c03 Fix line indent in ParserFactoryTest
Change-Id: I9338f6804b741f03b3119b57e86901fdda5317d4
2021-12-18 15:31:06 +01:00
C. Scott Ananian
9632dfb041 Move ::addTrackingCategory() implementation to TrackingCategories
This moves the implementation of ParserOutput::addTrackingCategory()
to the TrackingCategories class as a non-static method. This makes
invocation from ParserOutput awkward, but when invoking as
Parser::addTrackingCategory() all the necessary services are
available.  As a result, we've also soft-deprecated
ParserOutput::addTrackingCategory(); new users should use the
TrackingCategories::addTrackingCategory() method, or else
Parser::addTrackingCategory() if the parser object is available.

The Parser class is already kind of bloated as it is (alas), but there
aren't too many callsites which invoke
ParserOutput::addTrackingCategory() and don't have the corresponding
Parser object handy; see:

https://codesearch.wmcloud.org/search/?q=%5BOo%5Dutput%28%5C%28%5C%29%29%3F-%3EaddTrackingCategory%5C%28&i=nope&files=&excludeFiles=&repos=

Change-Id: I697ce188a912e445a6a748121575548e79aabac6
2021-10-15 14:17:58 -04:00
Cindy Cicalese
eed48e402b Detect and monitor against multiple Parser invocation during edit requests
Bug: T288707
Change-Id: I0cca8f9bcf1d6e964b8b06c0c4490e83f4fb1de5
2021-09-23 16:12:40 -05:00
Fomafix
3a322ef9b0 Use PHP \u{xxxx} syntax
Let PHP do the UTF-8 encoding of Unicode characters in PHP strings.

Also use faster str_replace instead of preg_replace.

Change-Id: I4e99de694a607e2b5df52c6efcd3d863bb42f76e
2021-08-27 20:53:19 +00:00
Umherirrender
d1950d924a parser: Replace deprecated MWHttpRequest::factory
Change-Id: Id5fe298209cfbc09037799a2cdc117c9b7119172
2021-08-04 13:22:50 +00:00
daniel
4880a82555 Parser: remove Title from method signatures
Bug: T281068
Change-Id: I3280e38dd82d71845c343eeb911e71dd33bb380b
2021-04-29 18:11:46 +02:00
DannyS712
1d5bac66c4 Reduce mocking LoggerInterface
Use TestLogger when we want to ensure specific
logged messages, or NullLogger when we don't care

Change-Id: Ifebc770933d4f5313d5b8b43a52437dbe1e24432
2021-04-23 22:35:43 +00:00
Daimona Eaytoy
535d7abf59 phpunit: Mass-replace setMethods with onlyMethods and adjust
Ended up using
  grep -Prl '\->setMethods\(' . | xargs sed -r -i 's/setMethods\(/onlyMethods\(/g'

special-casing setMethods( null ) -> onlyMethods( [] )

and then manual fix of failing test (from PS2 onwards).

Bug: T278010
Change-Id: I012dca7ae774bb430c1c44d50991ba0b633353f1
2021-04-16 20:15:00 +02:00
Petr Pchelko
f642215aed Convert ParserCache to PageRecord
ParserOptions not updated cause they depend on Title::getLanguage
implementation.

Tests converted to not require a DB anymore. Can't be proper unit
tests yet due to globals in ParserOptions and fake time hacks,
but exec time does go down from 70 seconds to 9 seconds.

Page content model is still emitted in the metrics since
it was considered useful. Should be removed when we get
something like a page type concept.

Change-Id: Ib16fd0b5b87ffc3cb4d21f4aa43d1203cb7206d2
2021-04-02 21:14:54 -06:00
Petr Pchelko
0f0a11f6dc Make Parser use UserIdentity instead of User
Change-Id: Idf8578e88af1fd4824f49417a200b16befdbca51
2021-03-17 13:51:52 -06:00
C. Scott Ananian
4497f99796 Parser: initialize preprocessor in constructor
Initializing the preprocessor in the constructor allows better
dependency injection, and removes code complexity caused by
lazy initialization.  Any use of the parser is going to end
up creating the preprocessor in any case, so deferring the
initialization doesn't save any performance.  (Best performance
is given by not creating the Parser in the first place if it
is not needed, which is what DI allows.)

Old code tried to unbreak cyclic dependencies by setting the
preprocessor to null.  This is somewhat of a lost cause,
since there are a number of other cyclic dependencies
involving the parser, including StripState, LinkHolders,
etc.  The code complexity is not worth it, given how
ineffective it is in any case.

This is part of T275160 in so far as it allows
Parser::getPreprocessor() to be a simple getter, and thus
(once this patch is merged) we can safely replace any
direct access to Parser::$mPreprocessor with a call to
Parser::getPreprocessor().

Bug: T275160
Change-Id: I38c6fe7d5a97badffdbf34d8b9d725756ed86514
2021-03-16 22:37:40 +00:00
C. Scott Ananian
1fd4a7af4e Introduce Tidy service
Refactor the old MWTidy singleton as a DI service.

Change-Id: I95605ea5fd22f53a7f90fe07a6a73fa6c959597a
2021-03-15 17:22:36 -04:00
C. Scott Ananian
5d317c25be Parser: Move Sanitizer::normalizeCharReferences into RemexCompatFormatter
Choosing a particular encoding of HTML entities is logically a task
of the Remex formatter (which serializes HTML).  Move it out of the
Parser so that it is part of the serialization specification.

This is a follow up to Ic8965e81882d7cf024bdced437f684064a30ac86.

Change-Id: If45907baf24d60987b39cd1f7709c5f7caf19f37
2021-03-15 17:20:14 -04:00
DannyS712
6a93b0ca93 More misc test cleanup
* parent::setUp() should be first, and ::tearDown()
  should be last
* Move tests that directly extend PHPUnit\Framework\TestCase
  to /unit

Change-Id: I1172855c58f4f52a8f624e6d596ec43beb8c93ff
2020-12-24 00:52:06 +00:00
daniel
00a3439dce Introduce RevisionOutputCache
Bug: T267981
Change-Id: Ib1dc641ed10d786918362b25bd655780d5844ba1
2020-12-14 16:50:28 +00:00
Petr Pchelko
66cc685b45 Make ParserCache use CachedBagOStuff
Bug: T269593
Change-Id: I21e6e39eccad22b781252b142c1e5b079c1ee0b4
2020-12-07 10:28:30 -06:00
Petr Pchelko
f24125684c Clean up ParserCache construction and inject logger
Bug: T263583
Depends-On: Iceaa0e872c53aa79b7012711813895221fa62fa6
Change-Id: I6f131a078e9d6eb5da3533b0ac3730e24bd3f56f
2020-09-28 13:17:30 -07:00
Petr Pchelko
fec48eb5a4 Create ParserCacheFactory.
* Makes ParserCache take the root of the key
  as a constructor argument
* Introduces a ParserCacheFactory

Next steps:
- convert FlaggedRevs to using this.
- cleanup

This assumes that we wouldn't want to differentiate
the parser cache settings per use-case, as it is now
for default vs flaggedrevs caches. There are only two settings:
$wgParserCacheType - name of the BagOStuff to use
$wgParserCacheExpireTime - the expiration time.
I think if we wanted to have different settings for different
caches, we could add that as a next step.

Bug: T263583
Change-Id: I188772da541a95c95a5ecece7c7dd748395506c2
2020-09-25 18:17:58 -07:00
James D. Forrester
a52c933998 Drop Sanitizer::escapeId(), deprecated in MediaWiki 1.30
Hard deprecation was in b79c1e2, which shipped in MediaWiki 1.35.

Change-Id: I7186462c95d346f362ba0cf84b136c083d66a7d3
2020-07-29 17:08:45 -04:00
DannyS712
94169ee873 Whitespace cleanup: Use tabs for indentation, avoid double spaces
Change-Id: I346073b59d283029bd6666356c62c81e687ea5e6
2020-06-27 07:53:07 +00:00
Tim Starling
f7f6f0d700 Update LinkHolderArray tests for new HookContainer parameter
Change-Id: I63fc731ca1dbaef6f215279ee0b1788e735783df
2020-06-23 09:00:32 +10:00