Commit graph

600 commits

Author SHA1 Message Date
jenkins-bot
d3ecbc93a3 Merge "parser: Optimize regex patterns used in LinkHolderArray" 2023-01-07 16:21:50 +00:00
thiemowmde
69c5757243 parser: Optimize regex patterns used in LinkHolderArray
Two micro-optimizations are done in this patch:

1. We know exactly how these placeholders are built in the makeHolder()
method. In »<!--IWLINK'" 1-->« it's guaranteed to be a single number
and in »<!--LINK'" 1:2-->« it's two numbers.

The most extreme synthetic micro benchmark I did cuts the runtime of
these regular expressions down to about 25%. It won't make much of a
difference in real-world scenarios but is still worth it, I believe.

It also makes the code more specific and less confusing (see below).

2. We don't need to use the full string »<!--LINK'" 1:2-->« as array
key when the only thing that matters is the part »1:2«. Note the same
is done just a few lines below in the replaceInterwiki() method.
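
As a rough illustration of both points, here is a sketch in Python rather than MediaWiki's PHP. The two patterns and the `replace_links` helper are assumptions modeled on the placeholder formats quoted above, not the actual LinkHolderArray code.

```python
import re

# Loose pattern: accepts anything between the marker and the closing "-->".
LOOSE = re.compile(r"<!--LINK'\" (.*?)-->")
# Specific pattern: exactly "<number>:<number>", as the placeholders are built.
SPECIFIC = re.compile(r"<!--LINK'\" (\d+:\d+)-->")


def replace_links(text, replacements):
    """Replace placeholders, using only the "1:2" part as the lookup key."""
    def substitute(match):
        key = match.group(1)
        # Unknown placeholders are left untouched.
        return replacements.get(key, match.group(0))
    return SPECIFIC.sub(substitute, text)


text = "See <!--LINK'\" 0:1--> and <!--LINK'\" 10:2-->."
print(replace_links(text, {"0:1": "<a>Foo</a>", "10:2": "<a>Talk:Bar</a>"}))
```

The specific pattern rejects malformed placeholders outright, and the short key means the lookup table no longer has to repeat the full placeholder string.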

This code does have outstanding test coverage via all the parser tests,
I believe. Any change here that doesn't make a test fail should be safe.

Note the unit tests have been written many years later via I2c12cc7,
using "dummy" strings and such instead of the expected numeric
namespace and link ids. Most of this is already fixed via previous
patches. The last mistake addressed in this patch is that
getPrefixedDBkey() is supposed to be a title. It can't contain one of
these placeholders.

Follow-Up: I2c12cc76a9bf01eb527db3ea038e4adc59446cac
Change-Id: Ie994059092df8861ddb97c098acd082698d45c53
2023-01-07 13:25:33 +00:00
Amir Sarabadani
523ab7cff8 Reorg: Move RawMessage to under language/
To follow Message. This is approved as part of RFC T166010.

Also namespace it but doing it properly with PSR-4 would require
namespacing every class under language/ and that will take some time.

Bug: T321882
Change-Id: I195cf4c67bd51410556c2dd1e33cc9c1033d5d18
2022-12-16 11:30:19 +01:00
Umherirrender
fd516a98e1 Fix whitespaces after comma
Change-Id: Ide6de0a53661e6f650099d7b1f274a02699441df
2022-12-15 01:24:14 +01:00
jenkins-bot
be2ff28b48 Merge "Reorg: Move MagicWord related files to under parser/" 2022-12-11 18:15:48 +00:00
Amir Sarabadani
a1b4699fea Reorg: Move MagicWord related files to under parser/
This is approved as part of T166010 RFC.

Bug: T321882
Change-Id: Ia4498c0a20e38a6a288dc14065ea8242c84fbc49
2022-12-09 13:48:35 +01:00
thiemowmde
800fd1d4c4 Fix bogus nextLinkID in LinkHolderArrayIntegrationTest
Parser::nextLinkID cannot return a string. It returns a positive
integer number.

Note a very similar mistake was already fixed before via I7e71ffc.

Change-Id: Ifce71d0f4db31787bf0eb84e621cfdeb07c674ef
2022-12-09 11:45:09 +01:00
Reedy
0cb2c3c106 Fix casing of class and function name usages
Bug: T253628
Change-Id: I5c64f436d3cf757390b751ce3e34bfc7872bc176
2022-12-04 19:09:30 +00:00
Subramanya Sastry
bcb7009c41 Use real section metadata in tests
* Most of the files were generated from the validate* script.
* Post-processing of these generated files to fix problems:
  - Some of the files were binary-edited via "vi -b" to fix some
    issues with bad property names used in the prior step.
    1.36, 1.38, 1.39 files were all fixed up this way.
  - In addition, the 1.36 file had bad data (not sure if the wrong
    php version was used) but I fixed this by splicing in data
    from the 1.38 file to revert incorrect changes to "Categories"
    and "IndexPolicy" properties.
  - The 1.35 data file was binary edited by splicing data from the
    now 1.36 version.

Change-Id: I4e22b94ce30c2ad9b1f544c15e1c3cd0dd0bce6b
2022-11-23 12:45:27 -05:00
Subramanya Sastry
623625e8f2 Followup to fb747bc0: Fix bad property names
Change-Id: I362b0cf8feca13a91fd91961d400579f2e4ea97e
2022-11-18 16:12:06 -06:00
Subramanya Sastry
fb747bc038 Add section metadata parsercache serialization tests for MW 1.40
* Generate data files for 1.40 only since the new formats only
  showed up in 1.40 and won't be present in the parser cache
  for older MW versions.

Change-Id: I6f297e3091ec2faab7c2203c138800551b01e32a
2022-11-17 15:48:15 -06:00
daniel
118d4980b2 Track the reason for rendering.
Allow the causeAction that triggers page rendering to be looped through
to ParserCache, so we can count what causes writes to the cache.

Change-Id: I6ad8e105a3ce457e3ab4f85cd154f47a32085e0d
2022-11-09 09:38:57 +00:00
daniel
8c1c1ae35a Enable pig-latin variant for testing
Having pig-latin enabled per default in dev environments is convenient
for manual testing. More importantly, it will allow us to write
end-to-end tests for variant conversion.

Depends-On: I9dc2f743ac487b0f7cfb667150c0f6950d5e7fce
Depends-On: I85b66c85be3959d48a048733af17197bc4cf70af
Change-Id: Ia80ad33cbf5e311fa8b84bd765a8df8d156f4c38
2022-11-08 17:45:51 +05:30
Tim Starling
0077c5da15 Use short array destructuring instead of list()
Introduced in PHP 7.1. Because it's shorter and looks nice.

I used regex replacement.

Change-Id: I0555e199d126cd44501f859cb4589f8bd49694da
2022-10-21 15:33:37 +11:00
C. Scott Ananian
d96207ab86 Auto-discover core parser test files
Make parser test discovery in core work the same way as it does in
extensions: any file ending with *.txt under tests/parser is run
as a parser test file.

This search is recursive, which is motivation to also move some
unrelated files under tests/parser/preprocess over to
tests/phpunit/data/preprocess where they belong; they are used
by tests/phpunit/includes/parser/PreprocessorTest.php and are
unrelated to the parser test infrastructure.
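
The discovery rule described above can be sketched roughly as follows. This is a Python illustration, not the PHP test runner; the function name `discover_parser_tests` is hypothetical, and the directory layout in the demo is made up.

```python
import tempfile
from pathlib import Path


def discover_parser_tests(root):
    """Return every *.txt file under root, including subdirectories."""
    return sorted(p for p in Path(root).rglob("*.txt") if p.is_file())


# Demo with a throwaway directory tree (file names are made up).
with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp)
    (base / "sub").mkdir()
    (base / "parserTests.txt").write_text("")
    (base / "sub" / "extra.txt").write_text("")
    (base / "sub" / "notes.expect").write_text("")
    for path in discover_parser_tests(base):
        print(path.relative_to(base).as_posix())
```

Because the search is recursive, any stray `*.txt` file below the root is treated as a test file, which is why unrelated data files have to live elsewhere.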

Change-Id: I8c84b4b853e1309929dceb700aab1e79a598d8ab
2022-10-13 10:41:15 -04:00
Jon Robson
d1662dca59 Parser: Use linkAnchor in section definition as well as anchor
The anchor property comes from Sanitizer::escapeIdForAttribute() and
should be used if you want to (eg) look up an element by ID using
document.getElementById(). The linkAnchor property comes from
Sanitizer::escapeIdForLink() and contains additional escaping
appropriate for use in a URL fragment, and should be used (eg) if you
are creating the href attribute of an <a> tag.

Bug: T315222
Change-Id: Icecf9640a62117c2729dca04af343fb1ddaaf8f8
2022-09-14 12:54:36 -04:00
jenkins-bot
61cbd18ff3 Merge "parser: Use a <meta> tag for the internal TOC_PLACEHOLDER" 2022-09-09 21:12:34 +00:00
Subramanya Sastry
c8a944a94b Add support to enable Scribunto & Parsoid to handle nowikis properly
* Lua modules have been written to inspect nowiki strip state markers
  and extract nowiki content to further process them. Callers might have
  used nowikis in arguments for any number of reasons including needing
  to have the argument be treated as raw text instead of wikitext.

  While we might add first-class typing features to wikitext, templates,
  extensions, and the like in the future which would let Parsoid process
  template arguments based on type info (rather than as wikitext always),
  we need a solution now to enable modules to work properly with Parsoid.

* The core issue is the decoupled model used by Parsoid where
  transclusions are preprocessed before further processing. Since
  nowikis cannot be processed and stripped during preprocessing,
  Lua modules don't have access to nowiki strip markers in this model.

* In this patch, we change extension tag processing for nowikis.

  When generating HTML, nowikis are replaced with a 'nowiki' strip
  marker with the nowiki's "innerXML" (only tag contents).

  In this patch, during preprocessing, instead of adding a 'general'
  strip marker with the "outerXML" (tag contents and the tag wrapper),
  we add a 'nowiki' strip marker with its "outerXML".

* Since Parsoid (and any clients using the preprocessed output) will
  unstrip all strip markers, the shift from a general to nowiki
  strip marker won't make a difference.

* To support Scribunto and Lua modules unstrip usage, this patch adds
  new functionality to StripState to replace the (preprocessing-)nowiki
  strip markers with whatever its users want. So, Scribunto could
  pass in a callback that replaces these with the "innerXML" by
  stripping out the tag wrapper.

* Hat tip to Tim Starling for recommending this strategy.

* Updated strip state tests.
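
A toy model of the callback idea, sketched in Python. The class name, method names, and the simplified marker format are all assumptions for illustration; they do not mirror the real StripState API.

```python
import re


class StripStateSketch:
    """Toy model: markers map to stored tag text, grouped by marker type."""

    # Simplified, hypothetical marker format.
    MARKER = re.compile(r"\x7fUNIQ-(\d+)-QINU\x7f")

    def __init__(self):
        self.items = {"nowiki": {}, "general": {}}
        self.counter = 0

    def add(self, kind, content):
        """Store content under a fresh marker and return the marker."""
        marker = f"\x7fUNIQ-{self.counter}-QINU\x7f"
        self.items[kind][marker] = content
        self.counter += 1
        return marker

    def replace_nowikis(self, text, callback):
        """Replace each nowiki marker with callback(stored outerXML)."""
        def substitute(match):
            marker = match.group(0)
            if marker in self.items["nowiki"]:
                return callback(self.items["nowiki"][marker])
            return marker  # other marker types are left alone
        return self.MARKER.sub(substitute, text)


def inner_xml(outer):
    """Strip the <nowiki> wrapper, keeping only the tag contents."""
    match = re.fullmatch(r"<nowiki[^>]*>(.*)</nowiki>", outer, re.DOTALL)
    return match.group(1) if match else outer


state = StripStateSketch()
marker = state.add("nowiki", "<nowiki>[[raw]]</nowiki>")
print(state.replace_nowikis(f"arg={marker}", inner_xml))
```

Passing the replacement as a callback keeps the policy (here, "use the innerXML") with the caller, which is the role the commit message assigns to Scribunto.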

Bug: T272507
Bug: T299103
Depends-On: Id6ea611549e98893f53094116a3851e9c42b8dc8
Change-Id: Ied0295feab06027a8df885b3215435e596f0353b
2022-09-01 21:04:42 +00:00
Bartosz Dziewoński
f7158c396d Add markup to page titles to distinguish the namespace and the main text
Pages outside of the main namespace now have the following markup in
their <h1> page titles, using 'Talk:Hello' as an example:

<h1>
  <span class="mw-page-title-namespace">Talk</span>
  <span class="mw-page-title-separator">:</span>
  <span class="mw-page-title-main">Hello</span>
</h1>
(line breaks and spaces added for readability)

Pages in the main namespace only have the last part, e.g. for 'Hello':

<h1>
  <span class="mw-page-title-main">Hello</span>
</h1>
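
A minimal sketch of how such markup could be assembled, written in Python for illustration. The function below is hypothetical and only reproduces the span structure shown above; it is not the signature of Parser::formatPageTitle().

```python
from html import escape


def format_page_title(namespace, main, separator=":"):
    """Build the span markup; namespace is empty for the main namespace."""
    parts = []
    if namespace:
        parts.append(
            f'<span class="mw-page-title-namespace">{escape(namespace)}</span>'
        )
        parts.append(
            f'<span class="mw-page-title-separator">{escape(separator)}</span>'
        )
    parts.append(f'<span class="mw-page-title-main">{escape(main)}</span>')
    return "".join(parts)


print(format_page_title("Talk", "Hello"))
print(format_page_title("", "Hello"))
```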

The change is motivated by a desire to style the titles differently on
talk pages in the DiscussionTools extension (T313636), but it could
also be used for other things:
* Language-specific tweaks (e.g. adding typographically-correct spaces
  around the colon separator: T249149, or replacing it with a
  different character: T36295)
* Site-specific tweaks (e.g. de-emphasize or emphasize specific
  namespaces like 'Draft': T62973 / T236215)

The markup is also added to automatically language-converted titles.

It is not added when the title is overridden using the wikitext
`{{DISPLAYTITLE:…}}` or `-{T|…}-` forms. I think this is a small
limitation, as those forms are mostly used in the main namespace, where
the extra markup isn't very helpful anyway. This may be improved in
the future. As a workaround, users could also just add the same HTML
markup to their wikitext (as those forms accept it).

It is also not added when the title is overridden by an extension
like Translate. Maybe we'll have a better API before anyone wants
to do that. If not, one could un-mark Parser::formatPageTitle()
as @internal, and use that method to add the markup themselves.

Bug: T306440
Change-Id: I62b17ef22de3606d736e6c261e542a34b58b5a05
2022-08-16 23:36:21 +00:00
C. Scott Ananian
0b10563895 parser: Use a <meta> tag for the internal TOC_PLACEHOLDER
Split out from the I44045b3b9e78e change.

This is consistent with what Parsoid will use for the TOC marker.

Bug: T287767
Bug: T270199
Bug: T311502
Depends-On: I1f607cf1ef1b61fb4d2e1880de756fb94d5a6b22
Change-Id: Ie63eed07b9bca1bfa07d4c256aba3728cedd8f93
2022-08-16 06:05:17 +00:00
C. Scott Ananian
fa8646ca7b parser: Prepare to use a <meta> tag for the internal TOC_PLACEHOLDER
Split out from the I44045b3b9e78e and Ie63eed07b9bca changes.  We
first add code to handle the new tag as well as the old tag in
ParserCache contents. This will allow us to safely rollback if needed
when deploying the follow-on patch which actually changes the tag
used.
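
The two-step rollout can be sketched like this in Python. The exact tag strings are assumptions about what the old and new placeholders look like, and `inject_toc` is a hypothetical helper; the point is only that the reader accepts both forms before the writer switches.

```python
import re

# Assumed placeholder strings, for illustration only.
OLD_TOC = "<mw:tocplace></mw:tocplace>"
NEW_TOC = '<meta property="mw:PageProp/toc" />'

# Accept either form, so cached output written before or after the
# switch (or after a rollback) is still handled.
TOC_EITHER = re.compile("|".join(re.escape(t) for t in (OLD_TOC, NEW_TOC)))


def inject_toc(html, toc_html):
    """Replace whichever placeholder is present with the rendered TOC."""
    return TOC_EITHER.sub(lambda match: toc_html, html)


print(inject_toc("<p>intro</p>" + OLD_TOC, "<div>TOC</div>"))
print(inject_toc("<p>intro</p>" + NEW_TOC, "<div>TOC</div>"))
```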

Bug: T287767
Bug: T270199
Bug: T311502
Change-Id: Ib3e5e010b9f5ca2c4ea7c4fe28080170b6a88812
2022-08-15 18:54:52 -04:00
Derick Alangi
5e8cd2c838 Migrate from setMwGlobals() to overrideConfigValue(s)
Change-Id: I3f167d0e7d59a5aa091c3095a7d96c889d6e7e78
2022-08-02 10:14:10 +01:00
Brian Wolff
f79ea41072 parser: Mock WikiPage::getContentModel in ParserCacheTest to fix php8.1
PHP 8.1 doesn't like this returning null.

Bug: T313663
Change-Id: I59eb21301aab946b6362fea956b398337af8d971
2022-07-25 20:51:51 +00:00
Thiemo Kreuz
61ae7504df Replace trivial usages of mock builder with createMock() shortcut
createMock() does the same, but is much easier to read.

A small difference is that some of the replacements made in this
patch didn't use disableOriginalConstructor() before. In case this
was relevant we should see the respective test fail. If not we can
save some CPU cycles and skip these constructors.

Change-Id: Ib98fb06e0fe753b7a53cb087a47e1159515a8ad5
2022-07-15 16:43:48 +00:00
Umherirrender
246bc931f6 tests: Set wgLang with MediaWikiIntegrationTestCase::setUserLang
Change-Id: Ic1247a6719032b3a0ea1f76514edc5ffd5a7854a
2022-07-13 00:59:46 +02:00
Umherirrender
047c184bfe tests: Use Title::makeTitle instead of Title::newFromText
Avoid parsing known titles in tests to improve performance

Change-Id: Ibfccfe696f0b8bfda0b99abae324e60bbecef7d8
2022-07-06 00:44:00 +02:00
Derick Alangi
d01e3ed739 Replace deprecated calls ParserOptions::newCanonical( 'canonical' )
This is a quick find & replace of calls to the deprecated method
ParserOptions::newCanonical() when the context is the string literal
'canonical'. This can be safely replaced by calling newFromAnon().

Change-Id: If7bb68459b11e0c5f5de188f10fdae85ad1a78bf
2022-06-16 14:22:24 +01:00
jenkins-bot
b494330aa7 Merge "ParserCache: always use JSON" 2022-06-07 14:12:29 +00:00
daniel
697f28df32 ParserCache: always use JSON
When JSON support was introduced into ParserCache in 1.36, it was
controlled by a feature flag, $wgParserCacheUseJson. The feature flag
was "born deprecated" in 1.36. It can now be removed.

This means that ParserCache will always store entries as JSON.
Support for reading old non-JSON entries remains intact.
This is needed when updating wikis from a version older than 1.36
to the current version.
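
The "write one format, read both" behaviour can be sketched as follows. This is a Python illustration in which `pickle` merely stands in for the old non-JSON serialization; the function names are hypothetical and the detection logic is an assumption, not ParserCache's actual code.

```python
import json
import pickle


def encode_entry(entry):
    """New entries are always written as JSON."""
    return json.dumps(entry).encode("utf-8")


def decode_entry(blob):
    """Read JSON, falling back to the legacy format for old entries."""
    try:
        return json.loads(blob.decode("utf-8"))
    except (UnicodeDecodeError, json.JSONDecodeError):
        # Legacy path. Never unpickle data from an untrusted source.
        return pickle.loads(blob)


new_blob = encode_entry({"text": "<p>hi</p>"})
old_blob = pickle.dumps({"text": "<p>old</p>"})
print(decode_entry(new_blob))
print(decode_entry(old_blob))
```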

Change-Id: Id04e42bfb458d98414bac50e0d6c505e8878e5c0
2022-06-07 15:19:45 +02:00
Reedy
41c42d5435 Tests: Cleanup some unnecessary nested function calls
Replace ->will( ->return with ->willReturn(

Change-Id: Ia2dfafa03cac8169d86d6fa5a30b73bfad1fe9fa
2022-06-06 01:02:34 +01:00
Umherirrender
8557249ac6 tests: Update namespace for MediaWiki\SpecialPage\SpecialPageFactory
MediaWiki\Special\SpecialPageFactory is deprecated since 1.35

Change-Id: I558a59e781edef4a78b4e902961809ba07f4f695
2022-05-28 01:31:53 +02:00
Nikki Nikkhoui
b5fe60a7e1 Introduce PageBundleJsonTrait for serialization
New trait for PageBundle class to serialize & deserialize
PageBundle object into json before stashing and after unstashing.

Change-Id: I486fab5b3d01bcef2b535af579cd9672403b2102
2022-05-23 17:54:48 +01:00
Brian Wolff
bec8dada48 Clarify generate-html and make ParserOutput behave as expected
Previously:
* It was unclear that generate-html is an optional optimization
* Most of MediaWiki core was doing $parserOutput->setText('') if
html wasn't generated. However this is wrong and will cause
$parserOutput->hasText() to return true and also potentially cause
cache pollution if a content handler both does that and supports
parser cache (Like MassMessage; see T299896)
* The default value of mText in the constructor was '', and most
of the time MW used that default. This doesn't seem right. If
setText() is never called, the ParserOutput should not be considered
to have text
* It was impossible to set mText to null, as $parserOutput->setText(null)
was a no-op. Docs implied you were supposed to do this, so it was very
confusing.

This patch clarifies docs, changes the default value for ParserOutput::$mText
from '' to null, and makes $parserOutput->setText(null) do what you
expect it to. The last two are arguably breaking changes, although
the previous behaviours were unexpected, mostly undocumented and
based on a code search do not appear to be relied on.

It seems like the main reason this only broke MassMessage is most
content handlers either don't support generateHtml, or they don't
support parser cache.
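
The difference between "no text" and "empty text" can be illustrated with a small Python model. The class and method names are stand-ins chosen to echo the ones above; this is not the PHP ParserOutput class.

```python
class ParserOutputSketch:
    """Toy model of the new behaviour: None means no text was generated."""

    def __init__(self, text=None):
        self._text = text

    def set_text(self, text):
        # Setting None is no longer a no-op; it clears the text.
        self._text = text

    def has_text(self):
        return self._text is not None


out = ParserOutputSketch()
print(out.has_text())   # False: nothing was ever set
out.set_text("")
print(out.has_text())   # True: an empty string still counts as text
out.set_text(None)
print(out.has_text())   # False: explicitly cleared
```

Under this model, writing an empty string when HTML generation was skipped would wrongly report that text exists, which is the cache pollution risk described above.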

Bug: T306591
Change-Id: I49cdf21411c6b02ac9a221a13393bebe17c7871e
Depends-On: I68ad491735b2df13951399312a4f9c37b63a08fa
2022-05-03 11:23:08 +02:00
Aryeh Gregor
b85391120b Use UrlUtils in Parser
Change-Id: I65f851ea29efe482ee225565a200d623fa85bc20
2022-04-28 17:14:51 +03:00
Tim Starling
d6a3b6cfa8 TempUser EditPage and permissions
* Allow EditPage to create a user on page save. This has to be enabled
  in config and then activated by the UI/API caller.
* Add an autocreate source for temporary users.
* Allow editing by anonymous users via automatic account creation when
  $wgGroupPermissions['*']['edit'] = false. On an edit GET request, use
  an unsaved placeholder user to stand in for post-create permissions.
* On preview or aborted save, the username to be created is stashed in a
  session and restored on subsequent requests.
* On a (likely) successful page save, create the account.
* Put regular non-temporary users in a "named" group so that they can be
  given additional permissions.
* Use a different "~~~" signature for temporary users
* Show account creation warnings on edit and preview.

Change-Id: I67b23abf73cc371280bfb2b6c43b3ce0e077bfe5
2022-04-26 14:10:53 +10:00
Umherirrender
2909d06a08 Use new namespace for revision related classes
All revision related classes are namespaced MediaWiki\Revision
instead of MediaWiki\Storage since 1.32. The old namespaced
class names are deprecated and only kept for backwards-compatibility.

Bug: T305784
Change-Id: I34e492d84d9fc4bc78481667202716d93b3c43cb
2022-04-14 23:03:43 +02:00
Tim Starling
13c1839735 Fix SignatureValidatorFactory circular dependency
Parser is using the service container to get a SignatureValidator
because, as noted in Gerrit comments on the relevant commit, there is a
circular dependency Parser -> SignatureValidatorFactory -> Parser.

So, have SignatureValidatorFactory::__construct() take a closure which
returns a Parser, instead of an actual Parser or ParserFactory.
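
A rough Python sketch of breaking the cycle with a closure. All class names here are hypothetical stand-ins, and the plain `services` dictionary is only a simplified substitute for a service container.

```python
class ValidatorSketch:
    def __init__(self, parser):
        self.parser = parser


class SignatureValidatorFactorySketch:
    def __init__(self, parser_getter):
        # Store a callable instead of a parser, so the parser does not
        # need to exist yet when the factory is constructed.
        self._parser_getter = parser_getter

    def new_validator(self):
        # The parser is only resolved when a validator is requested.
        return ValidatorSketch(self._parser_getter())


class ParserSketch:
    def __init__(self, validator_factory):
        self.validator_factory = validator_factory


services = {}
factory = SignatureValidatorFactorySketch(lambda: services["parser"])
services["parser"] = ParserSketch(factory)

validator = services["parser"].validator_factory.new_validator()
print(validator.parser is services["parser"])
```

Deferring the lookup means the factory and the parser can be constructed in either order, as long as both exist before the first validator is requested.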

Change-Id: I7bf4660f84ec8c8fb1d5b3b8581fe5d82bc3156e
2022-04-13 12:38:00 +10:00
jenkins-bot
0827d5ffea Merge "Fix notice from ParserCacheSerializationTestCases" 2022-04-10 15:22:58 +00:00
Alexander Vorwerk
62a70ec7c7 Use new namespace for revision related classes
All revision related classes are namespaced MediaWiki\Revision
instead of MediaWiki\Storage since 1.32. The old namespaced
class names are deprecated and only kept for backwards-compatibility.

Bug: T305784
Change-Id: Ia0030814ce2176d06e2898acffe533d31633fccb
2022-04-09 20:22:36 +02:00
Tim Starling
0d94c44743 Fix notice from ParserCacheSerializationTestCases
Change-Id: I6e65952367dd6de30916bfc574d1e4a5db84b998
2022-04-08 10:57:46 +10:00
jenkins-bot
1a91fcb41e Merge "Emit deprecation warnings for ParserOutput::addOutputHook()" 2022-04-07 21:27:33 +00:00
C. Scott Ananian
05eda60400 Emit deprecation warnings for ParserOutput::addOutputHook()
Once no one is calling ::addOutputHook() we can stub out ::getOutputHooks()
to just return an empty array.

Code search:
 https://codesearch.wmcloud.org/deployed/?q=-%3E%28addOutputHook%7CgetOutputHooks%29%5C%28&i=nope&files=&excludeFiles=&repos=

Bug: T292321
Change-Id: I1081696c4cc2e67c3c38b8f6e53054e62ac71502
2022-04-07 02:48:57 +00:00
C. Scott Ananian
c1a326f44e Emit warnings when accessing deprecated public properties of Parser
Code search:
 https://codesearch.wmcloud.org/deployed/?q=-%3E%28mLinkID%7CmIncludeSizes%7CmDoubleUnderscores%7CmShowToc%7CmRevisionId%7CmRevisionTimestamp%7CmRevisionUser%7CmRevisionSize%7CmInputSize%7CmInParse%7CmFirstCall%7CmGeneratedPPNodeCount%29&i=nope&files=&excludeFiles=&repos=

The following @deprecated properties are not included in this patch in
order to keep it conservative:

* Hard to code search because of generic name:
  $mTitle, $ot, $mOptions
* Should be @internal, not @deprecated, because they are used internally:
  $mPPNodeCount, $mHighestExpansionDepth
* Used by SyntaxHighlight_GeSHi and TemplateStyles extensions (even though
  they could/should use their own independent unique ID):
  $mMarkerIndex
* Used by test cases for Wikibase:
  $mExpensiveFunctionCount

Change-Id: I1dadff934ead767cbd25615c08768e8e935d6b2e
2022-03-31 19:25:33 -04:00
Alexander Vorwerk
82739980fd parser: change 'level' in parse api back to string
We changed to operate on an int internally in I92daeb0f7be8a0.
Let's cast it back to a string for the api in order to prevent
a breaking change, which is not really necessary.

Bug: T304171
Change-Id: I5f5a9203b4dd085cb5defba72c6650532bc9e8d1
2022-03-18 19:52:24 +01:00
jenkins-bot
c268687d46 Merge "Hard deprecate Sanitizer::removeHTMLtags()" 2022-03-08 19:29:55 +00:00
jenkins-bot
d1cfc0317d Merge "Add explicit casts between scalar types" 2022-03-08 17:32:26 +00:00
Umherirrender
6ea3d6ac2c Add explicit casts between scalar types
PHP internal functions like floor/round/ceil are documented to return
float. In most cases the result is used as an int, so casts were added.

Found by phan strict checks

Change-Id: I92daeb0f7be8a0566fd9258f66ed3aced9a7b792
2022-03-08 16:59:01 +00:00
C. Scott Ananian
d6576e5dc6 Hard deprecate Sanitizer::removeHTMLtags()
Rename Sanitizer::removeHTMLtags() into an @internal method named
::internalRemoveHtmlTags() so that we can deprecate external use.

Code search:
https://codesearch.wmcloud.org/deployed/?q=removeHTMLtags&i=nope&files=&excludeFiles=&repos=

Followup-To: Ic864c01471c292f11799c4fbdac4d7d30b8bc50f
Depends-On: Iaca83ed06e9c61d8366579cd2283cba653c82319
Depends-On: I1963bfe9a99198ea02ca482a5769467ce806cd58
Depends-On: I83923d8b38d33f3638cd53958dd10f257ec21f7c
Depends-On: I018b34bb5f6e113056da9b04cc72d4318422adce
Change-Id: I202826f8b27519f7be89643e24eda47a6e3fc9f6
2022-03-07 22:04:56 -05:00
C. Scott Ananian
9f14fbd002 Add Sanitizer::removeSomeTags() which uses Remex to tokenize
The existing Sanitizer::removeHTMLtags() method, in addition to having
dodgy capitalization, uses regular expressions to parse the HTML.
That produces corner cases like T298401 and T67747 and is not guaranteed
to yield balanced or well-formed HTML.

Instead, introduce and use a new Sanitizer::removeSomeTags() method
which is guaranteed to always return balanced and well-formed HTML.
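
To show why a tokenizer-based approach can guarantee balanced output, here is a small Python sketch built on the standard library's `html.parser`. It is not Remex and not the Sanitizer code: the class, the function name `remove_some_tags`, and the choices to drop attributes and to escape disallowed tags as text are all assumptions for illustration.

```python
from html import escape
from html.parser import HTMLParser


class TagFilter(HTMLParser):
    """Keep allowed tags (without attributes), escape everything else."""

    def __init__(self, allowed):
        super().__init__(convert_charrefs=True)
        self.allowed = set(allowed)
        self.out = []
        self.stack = []  # currently open allowed tags

    def handle_starttag(self, tag, attrs):
        if tag in self.allowed:
            self.out.append(f"<{tag}>")
            self.stack.append(tag)
        else:
            self.out.append(escape(self.get_starttag_text()))

    def handle_endtag(self, tag):
        if tag in self.stack:
            # Close any tags opened inside this one first.
            while self.stack:
                open_tag = self.stack.pop()
                self.out.append(f"</{open_tag}>")
                if open_tag == tag:
                    break
        elif tag not in self.allowed:
            self.out.append(escape(f"</{tag}>"))
        # A stray end tag for an allowed but unopened tag is dropped.

    def handle_data(self, data):
        self.out.append(escape(data))


def remove_some_tags(html, allowed):
    parser = TagFilter(allowed)
    parser.feed(html)
    parser.close()  # flush any buffered text
    while parser.stack:  # close whatever is still open
        parser.out.append(f"</{parser.stack.pop()}>")
    return "".join(parser.out)


print(remove_some_tags("<b>bold <i>both</b> <blink>x</blink>", {"b", "i"}))
```

Because output is driven by a stack of open tags rather than by pattern matching on the input, every emitted start tag is eventually paired with an end tag.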

Note that Sanitizer::removeHTMLtags()/::removeSomeTags() take a callback
argument which (as far as I can tell) is never used outside core. Mark
that argument as @internal, and clean up the version used by
::removeSomeTags().

Use the new ::removeSomeTags() method in the two places where
DISPLAYTITLE is handled (following up on T67747).  The use by the
legacy parser is more difficult to replace (and would have a
performance cost), so leave the old ::removeHTMLtags() method in place
for that call site for now: when the legacy parser is replaced by
Parsoid the need for the old ::removeHTMLtags() will go away.  In a
follow-up patch we'll rename ::removeHTMLtags() and mark it @internal
so that we can deprecate ::removeHTMLtags() for external use.

Some benchmarking code added.  On my machine, with PHP 7.4, the new
method tidies short 30-character title strings at a rate of about
6764/s while the tidy-based method being replaced here managed 6384/s.
Sanitizer::removeHTMLtags blazes through short strings 20x faster
(120,915/s); some of this difference is due to the set up cost of
creating the tag whitelist and the Remex pipeline, so further
optimizations could doubtless be done if Sanitizer::removeSomeTags()
is more widely used.

Bug: T299722
Bug: T67747
Change-Id: Ic864c01471c292f11799c4fbdac4d7d30b8bc50f
2022-03-04 14:06:02 -05:00
jenkins-bot
24aa34d06c Merge "phpcs: Disable Generic.Files.LineLength for test files" 2022-02-21 15:51:29 +00:00