1480 commits

Author SHA1 Message Date
Arlo Breault
5ed94aba15 Drop comments in cleanUpTocLine
Needed-By: Ie6760dd25f937d4f6acbab1c0e1475b54878d4ed
Change-Id: I10f96435f892b188cffe64b92cdf2701a3e2058b
2024-02-22 19:06:15 -05:00
Arlo Breault
909043c539 Remove empty spans while traversing in cleanUpTocLine
Change-Id: I2d75bc6aa03c112c6e1dccd9a3b4f608cafde6cb
2024-02-20 19:09:00 -05:00
Arlo Breault
b05e4b98ce Walk the dom instead of using a queryselector in cleanUpTocLine
Change-Id: Ic59a4883f5b830c0c513e1836ad0de7c29a4b96d
2024-02-20 18:54:40 -05:00
Arlo Breault
89ddae6805 Remove metadata content while traversing all nodes in cleanUpTocLine
Change-Id: I900cff697b1d644140d0a8755ba601d8f94abb3e
2024-02-20 18:52:28 -05:00
Subramanya Sastry
e55cc517da Move Parser to MediaWiki\Parser namespace
Bug: T166010
Co-Authored-By: Daimona Eaytoy <daimona.wiki@gmail.com>
Co-Authored-By: James Forrester <jforrester@wikimedia.org>
Co-Authored-By: Subramanya Sastry <ssastry@wikimedia.org>
Change-Id: I79b4e732c45095eedbaa80afa5eb7479b387ed8a
2024-02-16 09:18:38 -05:00
C. Scott Ananian
f7ba84855a Parser::getExternalLinkAttribs: Don't set rel attribute to null
Parser::getExternalLinkRel() is defined to return `null` if there's
no attribute to add, but then ParserOptions::getExternalLinkTarget()
may try to append to it, and external users might try to actually pass
the $attribs to (e.g.) Xml::element() and become unhappy if the value
is `null`.

Bug: T357668
Followup-To: Ifec733a923f193b72eaba9a1e604ad4e56c0aef2
Change-Id: I907c22ef070616d81b9a50b0e807a7b8f78b59b5
2024-02-15 17:32:28 -05:00
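
The idea in the commit above (leave the `rel` key out entirely rather than storing `null`) can be sketched in Python. This is only an illustration: `external_link_attribs` is a hypothetical helper, and appending `noopener` when a target is set is an assumed stand-in for whatever the real code appends.

```python
from typing import Dict, Optional


def external_link_attribs(rel: Optional[str], target: Optional[str]) -> Dict[str, str]:
    """Build link attributes, omitting keys that have no value (hypothetical helper)."""
    attribs: Dict[str, str] = {}
    if rel is not None:
        # Only store "rel" when there is something to store; never store None.
        attribs["rel"] = rel
    if target is not None:
        attribs["target"] = target
        # Assumed behaviour for illustration: append a token to "rel".
        # Because a missing key is read as "", this cannot fail on None.
        attribs["rel"] = (attribs.get("rel", "") + " noopener").strip()
    return attribs
```

A caller that serializes the returned mapping never sees a `None` value, which is the property the fix is after.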
C. Scott Ananian
e72e1cd163 Revert "Move section heading formatting to post-cache transform"
This reverts commit de0646843a.

Reason for revert: caused T357723.

Change-Id: I4690c03a34e8796090563e19a214d8ede63fe5d1
2024-02-15 20:58:32 +00:00
Bartosz Dziewoński
de0646843a Move section heading formatting to post-cache transform
Previously, Parser.php used Linker::makeHeadline() in order to
generate the `<h2><span class="mw-headline" id="...">...</span></h2>`
markup for section headings, and this was saved in the parser cache.
Now it generates heading tags with placeholder attributes like
`<h2 data-mw-...="..." ...>...</h2>`, and they are replaced in a
post-cache transform to generate the final heading markup, similarly
to how section edit links already worked.

The purpose of these changes is to allow changing the final markup
depending on skin options without splitting the parser cache (T13555).

Deployment and undeployment safety:
* The new post-cache transform has been already added in commit
  Ibce512b3c4a52f74b2d2124f0159e306f2689ea5 for forward-compatibility
  (so that if this patch is reverted, new parser cache entries
  will still be shown correctly).

Implementation notes:
* There are many ways to keep the temporary information other than
  `data-mw-...` attributes, but this way is the easiest to handle
  in a post-cache transform (everything is on the DOM node we want
  to modify), is compatible with other heading-enhancing code in
  DiscussionTools and MobileFrontend, and remains human-readable
  if the post-cache transform doesn't run.
* Sadly this code can't be reused to add section heading markup and
  section edit links to Parsoid (T269630), because it lacks some of
  the necessary metadata, and exposes the rest in ways that are
  trickier to handle in a post-cache transform (on other DOM nodes
  or outside the document).

Bug: T13555
Change-Id: I4eae18d9d16f54391daba0de82ad05e50f07f9eb
2024-02-15 13:09:08 -05:00
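
A minimal Python sketch of the post-cache transform idea described in the commit above: the cached HTML carries a placeholder attribute on the heading, and a later pass rewrites it into the final `mw-headline` markup. The commit message elides the real attribute names, so `data-example-anchor` and `finalize_headings` are hypothetical names chosen only for this illustration, and a regular expression stands in for whatever the real transform uses.

```python
import re

# Hypothetical placeholder attribute; the real names are elided in the commit message.
PLACEHOLDER_HEADING = re.compile(
    r'<h([1-6]) data-example-anchor="([^"]*)">(.*?)</h\1>',
    re.DOTALL,
)


def finalize_headings(cached_html: str) -> str:
    """Rewrite placeholder headings into the final heading markup."""
    def build(match: "re.Match[str]") -> str:
        level, anchor, content = match.group(1), match.group(2), match.group(3)
        return (
            f'<h{level}><span class="mw-headline" id="{anchor}">'
            f"{content}</span></h{level}>"
        )

    return PLACEHOLDER_HEADING.sub(build, cached_html)
```

Because the transform runs after the cache lookup, the final markup could vary (for example per skin) without storing several cached copies of the same page.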
jenkins-bot
cb6d6e8bae Merge "Parser: Convert wikitext entities to HTML entities in TOC"
2024-02-12 19:52:55 +00:00
James D. Forrester
102a4f8a35 build: Upgrade mediawiki/mediawiki-phan-config from 0.13.0 to 0.14.0 manually
* Switch out raw Exceptions, mostly for InvalidArgumentExceptions.
  * Fake exceptions triggered to give Monolog a backtrace are for
    some reason "traditionally" RuntimeExceptions, instead, so we
    continue to use that pattern in remaining locations.
* Just entirely give up on PostgresResultWrapper's resource vs. object mess.
* Drop now-unneeded false positive hits.

Change-Id: Id183ab60994cd9c6dc80401d4ce4de0ddf2b3da0
2024-02-10 02:22:41 +00:00
Bartosz Dziewoński
fb1be73a07 Parser: Convert wikitext entities to HTML entities in TOC
Bug: T355386
Bug: T324763
Change-Id: Ic0a805f29c928d0c2edf266ea045b0d29bb45a28
2024-02-09 02:00:38 +00:00
jenkins-bot
e831aa9c8b Merge "Namespace includes/context"
2024-02-08 18:04:34 +00:00
James D. Forrester
4bae64d1c7 Namespace includes/context
Bug: T353458
Change-Id: I4dbef138fd0110c14c70214282519189d70c94fb
2024-02-08 11:07:01 -05:00
C. Scott Ananian
242c6d2cf9 Introduce ParserOutput::setFromParserOptions() and use for preview flag
Bug: T341010
Co-Authored-by: cananian <cananian@wikimedia.org>
Co-Authored-by: ihurbain <ihurbainpalatin@wikimedia.org>
Change-Id: I03125fdaa7dd71ba57d593e85ecb98be6806f3f6
2024-02-07 21:22:06 -05:00
Daimona Eaytoy
7acfa6a0a5 Replace more instances of unchecked MWException
Most (all?) of the remaining usages are caught somewhere and will be
migrated later.

Bug: T328220
Change-Id: I5c36693a5361dd75b4f1e7a0bab5ad48626ed75c
2024-01-23 16:20:53 +00:00
Arlo Breault
4318039a23 Remove redundant internal tag
Change-Id: I09b282324ae8d6307ae963bede4848dbdfb2a150
2024-01-17 17:55:52 -05:00
Arlo Breault
4b987168d0 Remove unnecessary null check from Parser::braceSubstitution
Parser::braceSubstitution is only called from PPFrame_Hash::expand with
the result of PPNode_Hash_Tree::splitRawTemplate which always sets
'parts' to a PPNode_Hash_Array

Parser::argSubstitution is similarly called without the unnecessary null
check.

The comment was introduced in e002df9 and, although true, even then
the ternary may have been made redundant by a previous refactor.

Change-Id: Ia1c5b8570c65c8e174c723dbd292e11c3a72f54d
2024-01-17 17:42:10 -05:00
Bartosz Dziewoński
c2c4645fa2 Parser: Normalize dot segments in URL paths
Bug: T352827
Change-Id: Id90a26b656067481039fa77080417f34347f9c22
2024-01-04 01:46:33 +01:00
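
Dot-segment normalization of the kind named in the commit above is specified in RFC 3986, section 5.2.4. The following Python sketch shows the general technique for absolute paths only (paths that start with `/`); it is not the MediaWiki implementation, and `remove_dot_segments` is a name used here for illustration.

```python
from typing import List


def remove_dot_segments(path: str) -> str:
    """Resolve "." and ".." segments in an absolute URL path.

    Assumes the path starts with "/"; relative paths are out of scope here.
    """
    segments = path.split("/")
    output: List[str] = []
    for index, segment in enumerate(segments):
        is_last = index == len(segments) - 1
        if segment == ".":
            # "." refers to the current directory: drop it, but keep a
            # trailing slash if it was the final segment.
            if is_last:
                output.append("")
        elif segment == "..":
            # ".." removes the previous segment, but never the leading
            # empty segment that represents the root "/".
            if len(output) > 1:
                output.pop()
            if is_last:
                output.append("")
        else:
            output.append(segment)
    return "/".join(output)
```

For example, `remove_dot_segments("/a/b/c/./../../g")` yields `/a/g`, matching the example in the RFC.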
Fomafix
45c450aacb Parser: Remove hard-deprecated getCustomDefaultSort and setDefaultSort
getCustomDefaultSort and setDefaultSort are unused:
* https://codesearch.wmcloud.org/search/?q=getCustomDefaultSort
* https://codesearch.wmcloud.org/search/?q=setDefaultSort
and are hard-deprecated since dc3d489156 included in MediaWiki 1.38.

Change-Id: Ib9a9622d50a5807f55be91885e473b90f98c2cb9
2023-12-29 11:19:28 +00:00
James D. Forrester
9bfb75ff90 Namespace ParserOutput
Most used non-namespaced class!

Bug: T353458
Change-Id: I4c2cbb0a808b3881a4d6ca489eee5d8c8ebf26cf
2023-12-14 14:57:34 -05:00
jenkins-bot
c57120300a Merge "ParserOutput: Allow passing LinkTarget to title-related methods"
2023-12-11 18:02:25 +00:00
Isabelle Hurbain-Palatin
a3f51c732d Refactor DefaultOutputTransform into a pipeline of transforms
Bug: T348253
Change-Id: I53551ec6d6471569709c71c1155729e550f64de8
2023-12-08 18:06:19 -05:00
C. Scott Ananian
4b83285954 ParserOutput: Allow passing LinkTarget to title-related methods
Broadened the argument type to allow passing LinkTarget to:
* ParserOutput::addCategory()
* ParserOutput::addLanguageLink()
* ParserOutput::addLink()
* ParserOutput::addImage()
* ParserOutput::addTemplate()

This allows for a tighter interface with Parsoid's
ContentMetadataCollector class and avoids errors caused by passing the
wrong form of string title ("text" with spaces versus "dbkey" with
underscores).

There are a few performance problems remaining after this patch, which
only apply to use by Parsoid (not the legacy parser):

1. ::addLink() does inefficient db requests to fetch the page id for
each link if the optional $id parameter is not passed.  These lookups
should be deferred and a LinkBatch used.  (The legacy parser always
passes $id.)

2. ::addTemplate() similarly requires $page_id (and $rev_id) to be
passed, so is not currently usable by Parsoid.

3. ::addLanguageLink() uses Title::getFullText() which is not present
in LinkTarget and is currently implemented as a full Title lookup.
This is not an issue for the legacy parser, because it already has a
Title object so the lookup is a no-op, but could be improved for
Parsoid's use.

Bug: T296023
Change-Id: If21ec8563c8a619bdde7c0cb6534bb9009480a21
2023-12-08 17:50:29 -05:00
jenkins-bot
b7fc1b2f43 Merge "Only cache expensive renderings"
2023-11-30 21:24:34 +00:00
daniel
e3fb964439 Only cache expensive renderings
Pages that are fast to render can be omitted from the parser cache
to preserve disk space and cache write operations.

The threshold is configurable per namespace, so the tradeoff can
be evaluated based on different access patterns. For example, pages
that are accessed rarely, like file description pages on commons,
may have a high threshold configured, while pages that are read
frequently, like wikipedia articles, may be configured to be always
cached, using a 0 threshold.

Filtering is based on a time profile recorded in the ParserOutput.
A generic mechanism for capturing the timing profile is implemented
in the ContentHandler base class. Subclasses may implement a more
rigorous capture mechanism.

Bug: T346765
Change-Id: I38a6f3ef064f98f3ad6a7c60856b0248a94fe9ac
2023-11-30 20:56:12 +00:00
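
The per-namespace threshold described in the commit above can be sketched as a single comparison. This Python illustration uses hypothetical names (`should_cache`, `thresholds`, `default_threshold`) and assumes the recorded render time is in seconds; the real configuration format is not given in the commit message.

```python
from typing import Dict


def should_cache(
    render_seconds: float,
    namespace: int,
    thresholds: Dict[int, float],
    default_threshold: float = 0.0,
) -> bool:
    """Return True if a rendering was expensive enough to be worth caching.

    A threshold of 0 means "always cache", as the commit message describes
    for frequently read pages.
    """
    threshold = thresholds.get(namespace, default_threshold)
    return render_seconds >= threshold
```

With `thresholds = {0: 0.0, 6: 1.0}` (articles always cached, file description pages only when rendering takes at least one second), a 0.2 second render in namespace 6 is skipped while any render in namespace 0 is cached.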
Martin Urbanec
29af4dd074 Move user options related classes into their own namespace
There are a couple of user options related classes already,
and the T321527 work on dynamic defaults is going to add
even more. Let's move them into a separate namespace
to make core a bit more organized.

Old name is kept as an alias for compatibility purposes.

Bug: T321527
Bug: T352284
Change-Id: I9822eb1553870b876d0b8a927e4e86c27d83bd52
2023-11-29 13:27:13 +01:00
Subramanya Sastry
00d64e4156 Revert "Parsoid DataAccess: Stop processing extensions as top-level docs"
This reverts commit 0791724ead.

Reason for revert: Breaks math rendering in Parsoid (and hence for all clients)

Change-Id: I9abe07060e5d11a9a1a2c953344eb50d4536e8c4
2023-11-28 03:59:19 +00:00
Subramanya Sastry
0791724ead Parsoid DataAccess: Stop processing extensions as top-level docs
* See T351461 and T303015 for examples where calling top-level doc
  parser hooks during extension processing causes problems further
  downstream.

  The hooks are: ParserAfterTidy and ParserAfterParse

* Since any extension that relies on those two hooks will need a
  Parsoid-equivalent implementation to work properly with Parsoid,
  we don't need to preemptively run those hooks on a sublevel doc.

  We can instead let the Parsoid-compatible implementation process
  the full doc.

* Accordingly, this patch removes the parseExtensionTagAsTopLevelDoc
  method from Parser.php and has DataAccess::parseWikitext simply
  call Parser::recursiveTagParseFully instead.

Change-Id: I58e693499e1a53e0814911dc2ea424aa822b8320
2023-11-26 22:23:35 -06:00
C. Scott Ananian
3f23b09748 [parser] Broaden TOC placeholder regular expression
* This broke in 0e1b889a.
* HtmlHolder (via Remex) serializes self-closing meta tags without a
  trailing / char.
* Separately, worth exploring if HtmlHolder should use Parsoid's
  XML serializer.

Co-Authored-By: C. Scott Ananian <cscott@cscott.net>
Co-Authored-By: Subramanya Sastry <ssastry@wikimedia.org>
Change-Id: I9fba68a8cfe63540fec83eb9c886e2956ba75660
2023-11-21 17:26:54 +00:00
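
To illustrate the broadened match described in the commit above, here is a Python sketch of a pattern that accepts a self-closing meta tag with or without the trailing `/`. The `property="mw:PageProp/toc"` value and the helper name `replace_toc_placeholder` are assumptions made for this example; the commit message does not spell out the placeholder or the final expression.

```python
import re

# Assumed placeholder shape; the optional "/" covers both serializations.
TOC_PLACEHOLDER_RE = re.compile(r'<meta\s+property="mw:PageProp/toc"\s*/?>')


def replace_toc_placeholder(html: str, toc_html: str) -> str:
    """Replace the first TOC placeholder with the rendered table of contents."""
    # A function replacement avoids backslash interpretation in toc_html.
    return TOC_PLACEHOLDER_RE.sub(lambda _match: toc_html, html, count=1)
```

Both `<meta property="mw:PageProp/toc" />` and `<meta property="mw:PageProp/toc">` are matched, so the replacement no longer depends on which serializer produced the HTML.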
Bartosz Dziewoński
68ccfa46ad Use DOM to clean up headings for the table of contents (TOC)
Parse the heading contents as HTML. This makes it easier to strip out
some HTML tags using DOM operations, and ensures that we generate
balanced HTML at the end (T218330).

There are a few minor changes in behavior:

* [improvement] Fixed inconsistency with Parsoid in whitespace
  handling around stripped tags (see changed test case 1)

* [bug fix] Allows `<span dir>` even when `dir` is not the first
  attribute (see changed test case 2)

* [improvement] Unnecessary entities are no longer preserved in
  the TOC (see changed test case 3a)

* [bug fix] Underscores in headings are preserved in section edit
  link title (see changed test case 3b)

* [bug fix] Attributes on `<q>` tags are now correctly removed
  (this behavior wasn't covered by a test case)

Bug: T218330
Change-Id: Ibad7480088b82a1fd515831a9813ce18c2b1f3ea
2023-11-17 18:27:46 +01:00
thiemowmde
10a828ba72 Deprecate MagicWordFactory::getSubstIDs
The main motivation is to further reduce the complexity of the class:
* There is no code that ever writes to $this->mSubstIDs. It's
  effectively a constant.
* According to CodeSearch the getSubstIDs() method is not used
  anywhere. It's @internal to the parser.
* I find it weird that the parser needs to call 2 factory methods to
  do 1 thing.
* I still find it a good idea to keep the knowledge encapsulated in
  the factory and not have the [ 'subst', 'safesubst' ] array in the
  parser. That's why I propose the new method.

Change-Id: I5c147c75200c3c34a410d93a0328b56ea00a050f
2023-11-13 11:10:24 +01:00
jenkins-bot
c544883e84 Merge "Strip state from attributes before inserting them"
2023-10-23 14:35:42 +00:00
jenkins-bot
3285c8d5d3 Merge "parser: Add strict type constraints to MagicWord… classes"
2023-10-18 15:32:31 +00:00
jenkins-bot
70ef48b846 Merge "Improve performance of trivial encoding/decoding regexes"
2023-10-17 20:54:11 +00:00
thiemowmde
2e0301e634 parser: Add strict type constraints to MagicWord… classes
This patch is intentionally "incomplete". It's limited to places
where we can be 100% sure about the type just from looking at the
code. More to be done in later patches.

Change-Id: Ideea49ea9603127038ef08c6a9805f40a0b86b6d
2023-10-16 10:36:36 +02:00
jenkins-bot
f98ae5faa9 Merge "parser: Improve PHPDoc type hints in MagicWord… classes"
2023-10-13 00:34:27 +00:00
thiemowmde
bef3da3210 parser: Improve PHPDoc type hints in MagicWord… classes
Intentionally split across multiple patches. This is only about
documentation and cannot break anything (other than Phan).

MagicWordArray::matchAndRemove is particularly confusing because the
documentation and structure of the returned array make it look like
it would support parameters. But it never (!) did.

The method was added like this in 2008 via commit 269a9103 (r31113).

There was always only a single caller in the Parser class. The
parser never used the array values, only the keys (via isset). Which
makes sense because that code in the parser is about "double
underscore" magic words (e.g. __NOTOC__). These don't support
parameters anyway.

Change-Id: Ife92fc3d6d5b03606ba2b209a886cadef3451fea
2023-10-11 00:07:19 +00:00
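
The "match and remove" behaviour for double-underscore magic words described in the commit above can be sketched as follows. This Python version is a simplified illustration, not the MagicWordArray code: it matches case-sensitively, takes a plain list of word names, and returns only the set of matched keys, since the commit notes the parser never used the array values.

```python
import re
from typing import Dict, List, Tuple


def match_and_remove(text: str, words: List[str]) -> Tuple[Dict[str, bool], str]:
    """Remove every __WORD__ occurrence and report which words were found."""
    found: Dict[str, bool] = {}
    for word in words:
        pattern = re.compile(re.escape(f"__{word}__"))
        text, count = pattern.subn("", text)
        if count:
            # Only the key matters to the caller; no parameters are captured.
            found[word] = True
    return found, text
```

A caller then checks `"NOTOC" in found`, the equivalent of the `isset` test mentioned above.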
mainframe98
8451cbfa87 Parser: remove usages of $wgTitle
Change-Id: Iaff236f096c2b8a966da01479b80e98b76e80425
2023-10-10 01:34:18 +00:00
mainframe98
fdfc99e01f Parser: Remove ability to initialize mTitle to null
Setting mTitle to null has been deprecated since 1.34.
Enforce this with a type declaration, now that this is possible in PHP 7.4.

To keep existing behavior, have getPage return null if mTitle is set
to Special:Badtitle/Missing. getTitle never returned null to begin with.

Change-Id: I2e0f87265f88ed6db97957af4faee8733e27df79
2023-10-09 19:32:37 +02:00
Isabelle Hurbain-Palatin
6c109970a8 Strip state from attributes before inserting them
This patch fixes the referenced bug by resolving strip markers before they
get stashed in an attribute.

There is some concern about breaking out of the attribute (see
https://gerrit.wikimedia.org/r/plugins/gitiles/mediawiki/core/+/refs/heads/master/includes/parser/Parser.php#175
around that topic), but this seems to be taken care of by the wrapping
in htmlspecialchars.

Bug: T347552
Change-Id: I6ce45e56c00ce8eff7e178746502afa946aba768
2023-10-09 18:19:34 +02:00
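
The order of operations in the commit above (resolve strip markers first, then escape the result before placing it in an attribute) can be sketched in Python. The marker format, `unstrip`, and `build_attribute` are hypothetical; `html.escape` plays the role that htmlspecialchars plays in the commit message.

```python
import html
from typing import Dict


def unstrip(text: str, strip_state: Dict[str, str]) -> str:
    """Replace each strip marker with the content it stands for."""
    for marker, content in strip_state.items():
        text = text.replace(marker, content)
    return text


def build_attribute(name: str, raw_value: str, strip_state: Dict[str, str]) -> str:
    """Resolve markers, then escape, so the content cannot break out of the attribute."""
    resolved = unstrip(raw_value, strip_state)
    return f'{name}="{html.escape(resolved, quote=True)}"'
```

Escaping after unstripping matters: if the marker were resolved later, its content would land inside the attribute unescaped.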
jenkins-bot
1005e7c9b9 Merge "parser: Fix detection of variable with whitespace after subst:"
2023-10-07 19:43:48 +00:00
thiemowmde
06051e1256 Replace complex preg_replace_callback with strtr/preg_replace
The complexity is really not needed in these cases. strtr() does have
the behavior we want: It does all replacements at the same time instead
of sequentially.

We are also adding test cases for the previously uncovered
StringUtils::escapeRegexReplacement() we rely on in this patch.

Bug: T308395
Change-Id: I6741303775d6d54f3ad0d50635a986ff992ae8f4
2023-10-05 10:47:46 +02:00
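
The difference between simultaneous and sequential replacement that the commit above relies on can be shown with a small Python sketch. Python has no direct `strtr()` equivalent, so `replace_simultaneously` emulates it with one alternation pattern (longest keys first, non-empty keys assumed); both function names are illustrative.

```python
import re
from typing import Dict


def replace_simultaneously(text: str, pairs: Dict[str, str]) -> str:
    """Apply all replacements in one pass; replaced text is never rescanned."""
    if not pairs:
        return text
    # Longest keys first so that a longer key wins over its own prefix.
    keys = sorted(pairs, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(key) for key in keys))
    return pattern.sub(lambda match: pairs[match.group(0)], text)


def replace_sequentially(text: str, pairs: Dict[str, str]) -> str:
    """Apply replacements one after another; later ones see earlier output."""
    for old, new in pairs.items():
        text = text.replace(old, new)
    return text
```

Swapping two characters shows the difference: with `{"a": "b", "b": "a"}`, the simultaneous version turns `ab` into `ba`, while the sequential version ends up with `aa`.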
thiemowmde
f5cd1ba7ca Improve performance of trivial encoding/decoding regexes
Instead of replacing 1 character at a time the functions used here
can replace sequences of any length. This can dramatically reduce the
function call overhead.

Also make use of the `fn ()` syntax because we can.

Change-Id: I2dbc2271aa7847d9b687703f837cb0d850596ef0
2023-10-04 11:09:44 +02:00
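
The saving described in the commit above comes from matching whole runs of characters rather than single characters, so the callback runs once per run. The Python sketch below uses a made-up percent-style encoding purely to count callback invocations; it is not the encoding the patch touches.

```python
import re
from typing import Tuple


def encode_per_char(text: str) -> Tuple[str, int]:
    """Encode one character per callback call; returns (result, call count)."""
    calls = 0

    def callback(match: "re.Match[str]") -> str:
        nonlocal calls
        calls += 1
        return f"%{ord(match.group(0)):02X}"

    return re.sub(r"[^A-Za-z0-9]", callback, text), calls


def encode_per_run(text: str) -> Tuple[str, int]:
    """Encode a whole run per callback call; returns (result, call count)."""
    calls = 0

    def callback(match: "re.Match[str]") -> str:
        nonlocal calls
        calls += 1
        return "".join(f"%{ord(char):02X}" for char in match.group(0))

    return re.sub(r"[^A-Za-z0-9]+", callback, text), calls
```

Both functions produce the same string; for an input with three consecutive spaces the first makes three callback calls and the second makes one.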
Umherirrender
b718462479 parser: Hard-deprecate Parser::getFreshParser
Bug: T325959
Depends-On: I301cfecd95db04585e0f65b7919ea1c2e2bbff2a
Change-Id: I97938348407e3096187cfb41adb433a09ac77866
2023-10-03 17:01:22 +02:00
Umherirrender
87fadf2484 parser: Fix detection of variable with whitespace after subst:
The subst: magic word gets removed from $part1, but the whitespace that
follows it is not, so trim $part1 after the removal. This ensures the next
step can detect the variable: it uses a regex that does not allow leading
whitespace and assumes the code has already trimmed.

Bug: T340806
Change-Id: I8eea173bdf992511989b8a433c11032d3864abc1
2023-10-01 18:30:15 +00:00
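
The fix in the commit above amounts to trimming the remainder once the `subst:` prefix has been removed. A minimal Python sketch, with `split_subst` as a hypothetical helper and a case-insensitive prefix match assumed for illustration:

```python
import re
from typing import Tuple

SUBST_PREFIX = re.compile(r"\s*(?:safesubst|subst):", re.IGNORECASE)


def split_subst(part1: str) -> Tuple[bool, str]:
    """Strip a leading subst:/safesubst: prefix and trim what remains."""
    match = SUBST_PREFIX.match(part1)
    if match is None:
        return False, part1
    # Trimming here lets a later exact-name lookup succeed even when the
    # wikitext had whitespace after the colon.
    return True, part1[match.end():].strip()
```

With the trim in place, `subst: CURRENTYEAR` and `subst:CURRENTYEAR` both yield the name `CURRENTYEAR` for the variable lookup that follows.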
James D. Forrester
468e69bccc Namespace Sanitizer under \MediaWiki\Parser
Bug: T166010
Change-Id: Id13dcbf7a0372017495958dbc4f601f40c122508
2023-09-21 05:39:23 +00:00
James D. Forrester
1d0b7ae1e2 Namespace User under \MediaWiki\User
Bug: T166010
Change-Id: I7257302b485588af31384d4f7fc8e30551f161f1
2023-09-19 19:18:16 +00:00
jenkins-bot
14e52d187d Merge "Parser: use PHPDoc comments on properties, typed private properties"
2023-09-19 05:53:21 +00:00
James D. Forrester
5bc2a04b08 Namespace remaining Title-related classes under \MediaWiki\Title
Bug: T166010
Change-Id: Ia2e5a7367cc8cdbd8a7b845ae2fd5d776ff22891
2023-09-19 05:21:23 +00:00
James D. Forrester
b16be7a36c Namespace TitleFormatter under \MediaWiki\Title
One of the big ones, so doing this alone.

Bug: T166010
Change-Id: Ic2d59eb6764b1a273ed7162ecabf641f638b8f66
2023-09-19 05:17:18 +00:00