Commit graph

151 commits

Author SHA1 Message Date
Subramanya Sastry
e55cc517da Move Parser to Mediawiki\Parser namespace
Bug: T166010
Co-Authored-By: Daimona Eaytoy <daimona.wiki@gmail.com>
Co-Authored-By: James Forrester <jforrester@wikimedia.org>
Co-Authored-By: Subramanya Sastry <ssastry@wikimedia.org>
Change-Id: I79b4e732c45095eedbaa80afa5eb7479b387ed8a
2024-02-16 09:18:38 -05:00
Umherirrender
28ae6795f8 Move array destructuring into foreach
Change-Id: I54c98085b21f1fe48ccf575d1b9dd60d3b855c58
2023-07-08 19:52:46 +00:00
Umherirrender
a602dd8139 parser: Improvements to Preprocessor::buildDomTreeArrayFromText
- Use a non-capturing group and positive look-ahead assertion, use the
overall capture as result
- Use substr_compare to compare strings
- Remove unneeded variable assigment
- Use array_pop to get the last element
- Remove unneeded isset/empty checks
- Change behaviour of $wsEnd to avoid +1 everywhere

Change-Id: Ifc1728b09743ec91908b59c09ac34598d9cd8b49
2023-01-06 22:43:33 +01:00
Umherirrender
20c83e9852 parser: Small improvements to Preprocessor::buildDomTreeArrayFromText
- Use str_contains
- Always use strict comparision
- Remove unneeded parenthesis
- Break some lines
- Remove unneeded variable assigment
- Move variable closer to the usage
- Unpack $matches to named variables

Change-Id: I31b92afe32243748198dbb692144b4b575772554
2023-01-06 18:42:02 +01:00
Umherirrender
8384d832b2 parser: Improve $noMoreClosingTag cache in preprocessor
The regex searching for </$name> is case-insensitive (/i),
the cache to avoid the regex should be case-insensitive as well to avoid
that the regex is triggered for same tag name varying in case.

Change-Id: I5cdbc623be1ebd3cb8439c0ed4781da86f246c94
2022-12-13 19:52:44 +01:00
Aaron Schulz
8609728119 parser: allow segmentation of large preprocessor cache values
Since "segmentable" requires the use of strings, serialize the
tree as JSON, as it was before de6eeead21.

Bug: T309126
Change-Id: Ib5a902c6a8e03a129dec47426c2aa5f85e17c99f
2022-11-10 10:45:58 -08:00
thiemowmde
8ac6410f75 Fix off-by-1 error in Preprocessor_Hash
Here is an example wikitext to play around with:
a
<!----> <!---> <!--->
b

a
<!----> <!---> <!--->b

https://www.mediawiki.org/w/api.php?action=parse&text=a%0A%3C!----%3E%20%3C!---%3E%20%3C!---%3E%0Ab%0A%0Aa%0A%3C!----%3E%20%3C!---%3E%20%3C!---%3Eb&prop=parsetree

Notice how the comments are parsed different in the two situations.
<comment>&lt;!----&gt; </comment><comment>&lt;!---&gt; </comment><comment>&lt;!---&gt;\n</comment>
<comment>&lt;!----&gt;</comment> <comment>&lt;!---&gt; &lt;!---&gt;</comment>

The first one is wrong because <!---> is not an HTML comment. There
is one dash missing. It should be parsed as <!-- followed by the
content "-> <!-" followed by the closing -->, just as in the second
example.

Change-Id: Iaf8e52c5d41ca3f28890e8b66cfdf1d86126c357
2022-11-07 21:14:25 +01:00
Umherirrender
1b342a8893 Various doc fixes about false and null on method arguments/return types
Doc-only changes

Change-Id: Ice974b3ba41708859dfe646e94b31c5ebbf26410
2022-11-03 18:55:47 +01:00
Aryeh Gregor
7b4b0135b9 Use str_starts_with/str_ends_with
All the other ways of doing it were ridiculous and much harder to read,
and usually required repeating the needle expression (to get its
length). I found these occurrences by grepping for various expressions,
but I undoubtedly missed some.

I didn't try replacing the many instances of strpos(...) === 0 with
str_starts_with(...), because I think they're readable enough as-is
(although less efficient). Likewise I didn't try porting strpos(...) !==
false to str_contains(...). For case-insensitive comparisons, Tim
Starling requested that we stick with substr_compare() because it's more
efficient than calling strtolower().

On PHP < 8 these functions will be included with a polyfill via
vendor/autoload.php. This is included at the beginning of
includes/AutoLoader.php, so if our autoloader has been included the
polyfill will be available. This means it should be safe to call these
functions from any code that would not be usable without our autoloader.

Three uses that Tim Starling identified as being performance-sensitive
have been split out to a separate commit for porting after the switch to
PHP 8.

Change-Id: I113a8d052b6845852c15969a2f0e6fbbe3e9f8d9
2022-05-02 10:59:58 +03:00
Reedy
7bf779524a Remove or replace usages of "sanity"
Bug: T254646
Change-Id: I2b120f0b9c9e1dc1a6c216bfefa3f2463efe1001
2021-11-19 23:19:42 +00:00
Umherirrender
07b499fbcf build: Update mediawiki/mediawiki-phan-config to 0.11.0
Addition and remove of suppression needs to be done with the version
update.

Change-Id: I3288b3cefa744b507eadebb67b8ab08c86517c1c
2021-09-07 17:19:05 +02:00
Aaron Schulz
de6eeead21 parser: convert Preprocessor to WANCache and inject dependencies
Make the caching logic use getWithSetCallback() and simplify
the code given that there is only one Preprocessor subclass.
Also, keep the cached values JSON serializable but rely on
the serialization in BagOStuff instead for simplicity.

Add related class constants for injecting preprocessor flags.

Bug: T254608
Change-Id: I72f9f0c0bc352ed5120469090c71294ff0c24999
2021-01-11 20:21:10 -08:00
C. Scott Ananian
c64e71615e Replace $wgDisable{Lang,Title}Conversion with LanguageConverterFactory methods
Replace direct access to $wgDisableLangConversion with
LanguageConverterFactory::isConversionDisabled(), and replace direct
access to $wgDisableTitleConversion with
LanguageConverterFactory::isTitleConversionDisabled().  However, most
places that check ::isTitleConversionDisabled() actually want
::isLinkConversionDisabled(), so add that too (and deprecate
isTitleConversionDisabled()).

Code search:
https://codesearch.wmcloud.org/search/?q=Disable%28Lang|Title%29Conversion&i=nope&files=&repos=

This change removes a number of spurious dependencies on the global
configuration and reduces code duplication (for example, if the logic
for disabling language conversion were ever to change).

Depends-On: I6fa8230ae97b0e34c381003548e61f9b7387d363
Change-Id: Icc4687638ff1815003dd903854efdbd904854f1e
2020-11-25 12:47:26 -05:00
Florian
43d5307715 Fix PHP 8 compat with strcspn() $length parameter exceeding string
Bug: T264502
Change-Id: I25cb8f56e6f56a9233e38844dc62cda9a06cb5e6
2020-10-04 01:27:47 +00:00
Reedy
b038d6333a Fix even more PSR12.Properties.ConstantVisibility.NotFound
Change-Id: I6d98efcfac1f1c0ab6a442e0af6d5daa6ef7801a
2020-05-16 00:28:41 +00:00
Daimona Eaytoy
15e4968ee9 parser: Declare some dynamic properties
Mostly via the @property annotation. This is to make phan a little
happier.

Change-Id: I3fde33955240dab20870821e9db93caba163845b
2019-09-08 19:03:03 +00:00
Daimona Eaytoy
c659bc6308 Unsuppress another phan issue (part 7)
Bug: T231636
Depends-On: I2cd24e73726394e3200a570c45d5e86b6849bfa9
Depends-On: I4fa3e6aad872434ca397325ed7a83f94973661d0
Change-Id: Ie6233561de78457cae5e4e44e220feec2d1272d8
2019-09-03 17:19:21 +00:00
Umherirrender
dd25f3df6b Improve type hints in parser related classes
Change-Id: Ia07a2eb32894f96b195fa3189fb5f617e68f2581
2019-07-05 21:29:32 +00:00
Zoranzoki21
4226fada45 Split parser related files to have one class in one file
Change-Id: I36b26609ccb3f135a22961b32a46cdc06603b3e4
2019-04-27 00:41:47 +00:00
Derick Alangi
fb46f4fc3f parser: Fix return type for methods and match phpdoc comments
Change-Id: I867d7eb6fc56cc52eb8e129977b7a62607a11268
2019-04-12 23:33:47 +00:00
Aryeh Gregor
7b4489e019 Get rid of unnecessary func_get_args() and friends
HHVM does not support variadic arguments with type hints.  This is
mostly not a big problem, because we can just drop the type hint, but
for some reason PHPUnit adds a type hint of "array" when it creates
mocks, so a class with a variadic method can't be mocked (at least in
some cases).  As such, I left alone all the classes that seem like
someone might like to mock them, like Title and User.  If anyone wants
to mock them in the future, they'll have to switch back to
func_get_args().  Some of the changes are definitely safe, like
functions and test classes.

In most cases, func_get_args() (and/or func_get_arg(), func_num_args() )
were only present because the code was written before we required PHP
5.6, and writing them as variadic functions is strictly superior. In
some cases I left them alone, aside from HHVM compatibility:

* Forwarding all arguments to another function. It's useful to keep
  func_get_args() here where we want to keep the list of expected
  arguments and their meanings in the function signature line for
  documentation purposes, but don't want to copy-paste a long line of
  argument names.
* Handling deprecated calling conventions.
* One or two miscellaneous cases where we're basically using the
  arguments individually but want to use them as an array as well for
  some reason.

Change-Id: I066ec95a7beb7c0665146195a08e7cce1222c788
2019-04-12 20:17:01 +00:00
Fomafix
43244db9a2 Use PHP 7 '??' operator instead of if-then-else
Change-Id: If9d4be5d88c8927f63cbb84dfc8181baf62ea3eb
2018-10-21 21:46:46 +02:00
Umherirrender
a4caa4d0c6 build: Updating mediawiki/mediawiki-codesniffer to 22.0.0
Added spaces around .
Removed empty return statement which are not required
Removed return after phpunit markTestIncomplete,
which is throwing to exit the test, no need for a return

Change-Id: I2c80b965ee52ba09949e70ea9e7adfc58a1d89ce
2018-09-16 15:51:11 +00:00
Bartosz Dziewoński
485f66f174 Use PHP 7 '??' operator instead of '?:' with 'isset()' where convenient
Find: /isset\(\s*([^()]+?)\s*\)\s*\?\s*\1\s*:\s*/
Replace with: '\1 ?? '

(Everywhere except includes/PHPVersionCheck.php)
(Then, manually fix some line length and indentation issues)

Then manually reviewed the replacements for cases where confusing
operator precedence would result in incorrect results
(fixing those in I478db046a1cc162c6767003ce45c9b56270f3372).

Change-Id: I33b421c8cb11cdd4ce896488c9ff5313f03a38cf
2018-05-30 18:06:13 -07:00
Kunal Mehta
f093fb4192 Don't globally disable PHPCS's prohibition of assert()
Whitelist the remaining usages of assert(), and reinstate the PHPCS sniff
that forbids usage of it. Add FIXME comments as well, so any casual readers
of the code will not think that the disabling and usage is intentional.

Change-Id: I7cabe715c0e6aa6a9ef3ffe5657f3de7fd8e662b
2018-05-07 17:15:16 +00:00
Arlo Breault
f27c50b580 Clarify -{ => {{ transition
Ensure we have the correct rule on the stack.

Change-Id: Ie814df7b759a2381be0b815eeefdb5d1f7adcde0
2018-03-15 13:04:59 -04:00
Arlo Breault
7c450dff64 Remove "dash" case in preprocessToObj
This was introduced in 2877402 and removed in 186a182

Change-Id: Ibfa1ae1597bfc50ae6ea49402c7966ca042f12e5
2018-03-09 17:46:53 -05:00
Umherirrender
3124a990a2 Use ::class to resolve class names in includes files
This helps to find renamed or misspelled classes earlier.
Phan will check the class names

Change-Id: I07a925c2a9404b0865e8a8703864ded9d14aa769
2018-01-27 20:34:29 +01:00
Umherirrender
255d76f2a1 build: Updating mediawiki/mediawiki-codesniffer to 15.0.0
Clean up use of @codingStandardsIgnore
- @codingStandardsIgnoreFile -> phpcs:ignoreFile
- @codingStandardsIgnoreLine -> phpcs:ignore
- @codingStandardsIgnoreStart -> phpcs:disable
- @codingStandardsIgnoreEnd -> phpcs:enable

For phpcs:disable always the necessary sniffs are provided.
Some start/end pairs are changed to line ignore

Change-Id: I92ef235849bcc349c69e53504e664a155dd162c8
2018-01-01 14:10:16 +01:00
jenkins-bot
1a40e0cc86 Merge "Change php extract() to explicit code" 2017-12-28 09:44:59 +00:00
Huji Lee
e74bfe13f6 Require indentation of CASE statements in PHP code
Bug: T182546
Change-Id: I91a9555893a08e4ec58da97c6cc4d1e70000ff6b
2017-12-10 22:07:50 -05:00
Umherirrender
3d560be428 Change php extract() to explicit code
Avoid php magic and make var settings more visible

Change-Id: I223874fd871104b0ac6a80d7f39c6dd997d0551d
2017-12-08 14:46:33 +01:00
Umherirrender
f739a8f368 Improve some parameter docs
Add missing @return and @param to function docs and fixed some @param

Change-Id: I810727961057cfdcc274428b239af5975c57468d
2017-09-10 20:32:31 +02:00
Umherirrender
3f1a52805e Use short type bool/int in param documentation
Enable the phpcs sniffs for this and used phpcbf

Change-Id: Iaa36687154ddd2bf663b9dd519f5c99409d37925
2017-08-20 13:20:59 +02:00
WMDE-Fisch
6df9ed1ad6 update mediawiki-codesniffer to 0.11.0 and fix issues
- mostly auto fixes
- some too long lines fixed
- ignore amp space in one case  passing by reference

Change-Id: I6472f83bc3cbf4bd629d83050cc3319b19ec465c
2017-08-11 22:27:51 +02:00
Umherirrender
b5cddfb27b Remove empty lines at begin of function, if, foreach, switch
Organize phpcs.xml a bit

Change-Id: Ifb767729b481b4b686e6d6444cf48b1f580cc478
2017-07-01 11:34:16 +00:00
C. Scott Ananian
186a182a15 Protect language converter markup in the preprocessor (take 2).
This revises 2877402276, which was
reverted in master due to unexpected issues with `-{{...}} ` markup
on translatewiki and enwiki.  Test cases are added to ensure that this
is parsed as a template, not as language converter markup.

https://www.mediawiki.org/wiki/Preprocessor_ABNF is the canonical
documentation for the preprocessor; this will be updated after this
patch is merged.  The basic principles described in that page are
maintained in this patch:

* Rightmost opening structure has precedence: `-{{` is parsed as a
dash followed by template opening.

* `{{{` has precedence over `{{` and `-{`: `-{{{{` is parsed as
`-{` `{{{` since we first grab the rightmost `{{{`.

A bunch of test cases were added to verify the "ideal precedence"
order described on that wiki page.

This patch introduced some minor incompatibilities in existing
markup, in particular with chemical formulae in templates.
Fixes for these are being tracked at
https://www.mediawiki.org/wiki/Parsoid/Language_conversion/Preprocessor_fixups

Bug: T146304
Bug: T153761
Change-Id: I2f0c186c75e392c95e1a3d89266cae2586349150
2017-05-23 15:43:49 +01:00
James D. Forrester
9635dda73a includes: Replace implicit Bugzilla bug numbers with Phab ones
It's unreasonable to expect newbies to know that "bug 12345" means "Task T14345"
except where it doesn't, so let's just standardise on the real numbers.

Change-Id: I6f59febaf8fc96e80f8cfc11f4356283f461142a
2017-02-21 18:13:24 +00:00
C. Scott Ananian
046c463635 Revert "Protect language converter markup in the preprocessor."
This effectively reverts commit 2877402276 in
order to unblock the deploy train.  The underlying behavior might not be
incorrect, but it was unexpected.

Bug: T153761
Change-Id: Ifc9c7cf3482dd5d222ff4da24a6d4cc401e9d965
2017-01-03 17:23:28 -05:00
C. Scott Ananian
2877402276 Protect language converter markup in the preprocessor.
This ensures that `{{echo|-{R|foo}-}}` is parsed correctly as
a template invocation with a single argument, not as two separate
arguments split by the `|`.

Bug: T146304
Change-Id: I709d007c70a3fd19264790055042c615999b2f67
2016-12-15 23:50:44 +00:00
Tim Starling
6a04d86149 Remove all assert() calls with string parameters
These fail when HHVM is in RepoAuthoritative mode

Change-Id: Ifb1628f8269b2b651154b740b95cc14163a1b186
2016-08-15 23:11:18 +00:00
Tim Starling
b2f7bb4d76 Preprocessor_Hash: use child arrays instead of linked lists
The singly-linked list data structure of Preprocessor_Hash was causing
stack exhaustion due to the need for a recursion depth proportional to
the number of children of a given PPNode, in serialize() and on
object destruction. So, switch to array-based storage. PPNode_* becomes
a temporary proxy around the underlying storage, which avoids circular
references and keeps the storage very compact. Preprocessor_DOM uses
similar temporary PPNode objects, so the fact that

  $node->getFirstChild() !== $node->getFirstChild()

should not cause any new problems.

* Increment cache version
* Use JSON serialization of the store array instead of serialize(),
  since JSON is more compact, even after gzipping.
* For efficiency, make $accum a plain array, and use it as an array
  where possible, instead of using helper functions.

Performance and memory usage for typical input are slightly improved:
something like 4% faster for the whole parse, and 20% less memory for
the tree.

Bug: T73486
Change-Id: I0d6c162b790d6dc1ddb0352aba6e4753854f4c56
2016-07-22 05:25:11 +00:00
Bartosz Dziewoński
674e8388cb Preprocessor: Don't allow unclosed extension tags (matching until end of input)
(Previously done in f51d0d9a81 and
reverted in 543f46e9c08e0ff8c5e8b4e917fcc045730ef1bc.)

I think it's saner to treat this as invalid syntax, and output the
mismatched tag code verbatim. The current behavior is particularly
annoying for <ref> tags, which often swallow everything afterwards.

This does not affect HTML tags, though. Assuming Tidy is enabled, they
are still auto-closed at the end of the page content. (For tags that
"shadow" a HTML tag name, this results in the tag being treated as a
HTML tag. This currently only affects <pre> tags: if unclosed, they
are still displayed as preformatted text, but without suppressing
wikitext formatting.)

It also does not affect <includeonly>, <noinclude> and <onlyinclude>
tags. Changing this behavior now would be too disruptive to existing
content, and is the reason why previous attempt was reverted. (They
are already special-cased enough that this isn't too weird, for example
mismatched closing tags are hidden.)

Related to T17712 and T58306. I think this brings the PHP parser closer
to Parsoid's interpretation.

It reduces performance somewhat in the worst case, though. Testing with
https://phabricator.wikimedia.org/F3245989 (a 1 MB page starting with
3000 opening tags of 15 different types), parsing time rises from
~0.2 seconds to ~1.1 seconds on my setup. We go from O(N) to O(kN),
where N is bytes of input and k is the number of types of tags present
on the page. Maximum k shouldn't exceed 30 or so in reasonable setups
(depends on installed extensions, it's 20 on English Wikipedia).

Change-Id: Ide8b034e464eefb1b7c9e2a48ed06e21a7f8d434
2016-04-05 12:28:10 -07:00
Thiemo Mättig
6906de45c1 Fix @param and @return types on all PPFrame::getArgument methods
This is about template parameters. They can be indexed by position (int) or
name (string). The returned value is always a string, or false (bool) on
failure.

Change-Id: I565210ad485505281246ef2bb3086a675b905976
2016-03-29 06:12:18 +00:00
umherirrender
74534f9ba6 Fix unmatched @codingStandardsIgnore in parser folder
Fix outstanding phpcs errors

Change-Id: I7b857be88354f2ffa27d76406253ec9e9710b91d
2016-02-17 21:26:30 +01:00
Kunal Mehta
6e9b4f0e9c Convert all array() syntax to []
Per wikitech-l consensus:
 https://lists.wikimedia.org/pipermail/wikitech-l/2016-February/084821.html

Notes:
* Disabled CallTimePassByReference due to false positives (T127163)

Change-Id: I2c8ce713ce6600a0bb7bf67537c87044c7a45c4b
2016-02-17 01:33:00 -08:00
Legoktm
543f46e9c0 Revert "Preprocessor: Don't allow unclosed extension tags (matching until end of input)"
This reverts commit f51d0d9a81.

Breaks templates with non-closed </noinclude> tags, which
were previously acceptable.

Bug: T125754
Change-Id: I8bafb15eefac4e1d3e727c1c84782636d8b82c2b
2016-02-04 00:38:35 +00:00
Bartosz Dziewoński
f51d0d9a81 Preprocessor: Don't allow unclosed extension tags (matching until end of input)
I think it's saner to treat this as invalid syntax, and output the
mismatched tag code verbatim. The current behavior is particularly
annoying for <ref> tags, which often swallow everything afterwards.

This does not affect HTML tags, though. Assuming Tidy is enabled, they
are still auto-closed at the end of the page content.

Related to T17712 and T58306. I think this brings the PHP parser closer
to Parsoid's interpretation.

It reduces performance somewhat in the worst case, though. Testing with
https://phabricator.wikimedia.org/F3245989 (a 1 MB page starting with
3000 opening tags of 15 different types), parsing time rises from
~0.2 seconds to ~1.1 seconds on my setup. We go from O(N) to O(kN),
where N is bytes of input and k is the number of types of tags present
on the page. Maximum k shouldn't exceed 30 or so in reasonable setups
(depends on installed extensions, it's 20 on English Wikipedia).

To consider:
* Should we keep previous behavior for unclosed <includeonly> /
  <noinclude>? This would be particularly disruptive for these if
  someone relied on the old behavior, and they're already
  special-cased in places.
* Unclosed <pre> tags are now treated as HTML tags, and are still
  displayed as preformatted text, but without suppressing wikitext
  formatting.

Change-Id: Ia2f24dbfb3567c4b0778761585e6c0303d11ddd0
2016-01-21 04:22:34 +00:00
umherirrender
54c1e18eec Remove various double empty newlines
The double empty newline is not needed between functions, variable or at
end of file

Change-Id: Ib866a95084c4601ac150a2b402cfa184ebc18afa
2015-12-27 18:55:12 +00:00
Brad Jorsch
2f95b102f5 Fix PPNode_Hash_Tree::getChildrenOfType return value
PPNode defines it as returning an array-type PPNode, not an array.

Change-Id: I9a6c5cea408aae449bfbf808d067837c4337c672
2015-12-17 22:18:14 +00:00