Autolinking free external links is clever about making sure that trailing
punctuation isn't included in the link. But if an HTML entity happens to
terminate the URL, the semicolon from the entity is stripped from the URL,
breaking it.
Fix this corner case. This also unifies autolink parsing with Parsoid.
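
A minimal sketch of the idea, in Python rather than the parser's PHP.
The punctuation set and the names split_trailing_punctuation, TRAILING
and ENTITY_AT_END are assumptions made for illustration, not the actual
parser code. Trailing punctuation is peeled off one character at a
time, but a ';' that closes an HTML entity stays in the link.

```python
import re

# Punctuation moved out of a free link when it trails the URL.
# (Assumed set, for illustration only.)
TRAILING = ",;.:!?"

# A named, decimal, or hex HTML entity that ends exactly at end of string.
ENTITY_AT_END = re.compile(r"&(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#[xX][0-9A-Fa-f]+);\Z")


def split_trailing_punctuation(url):
    """Return (link, trail): the URL without trailing punctuation, and the tail."""
    end = len(url)
    while end > 0 and url[end - 1] in TRAILING:
        # A ';' that terminates an entity belongs to the URL, so stop here.
        if url[end - 1] == ";" and ENTITY_AT_END.search(url[:end]):
            break
        end -= 1
    return url[:end], url[end:]
```

With this, "http://example.com/a&amp;." keeps the entity intact and only
the final "." ends up in the trail.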
See: I5ae8435322c78dd1df170d7a3543fff3642759b1
Change-Id: I5482782c25e12283030b0fd2150ac55092f7979b
The behavior of the different preprocessors differs when given \r or
\r\n newlines. We already normalize the latter here, so may as well do
the former here too.
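
For illustration only, the normalization amounts to something like the
following Python sketch (normalize_newlines is a hypothetical name, not
the function touched by this change):

```python
def normalize_newlines(text):
    """Convert \\r\\n and bare \\r line endings to \\n."""
    # Replace \r\n first so a Windows line ending does not become two newlines.
    return text.replace("\r\n", "\n").replace("\r", "\n")
```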
Bug: T78488
Change-Id: Id6390f64a73ea01088729f25d79103388c1fe7e8
Ensure that there is a \b boundary before and after RFC, PMID, and ISBN
links. (Previously we enforced \b boundaries only before free external
links and after ISBN links.) Consistency is a good thing!
In addition:
* \b is not a PHP escape sequence, so you don't need to write \\b inside
a string.
* \b before the numeric part of an ISBN is pointless: by the structure
of the regexp there will always be a space on the left and a word
character (a digit) on the right.
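
A rough Python sketch of the boundary rule. The pattern is simplified
and assumed for illustration; it is not the regexp used in Parser.php.
Each keyword is preceded by \b and each number is followed by \b, while
the numeric part of the ISBN has no leading \b because whitespace
always precedes it.

```python
import re

# Simplified magic-link pattern with \b before the keyword and after the number.
MAGIC_LINK = re.compile(
    r"\b(?:RFC|PMID)\s+[0-9]+\b"
    r"|\bISBN\s+(?:97[89][ -]?)?(?:[0-9][ -]?){9}[0-9Xx]\b"
)


def find_magic_links(text):
    """Return every RFC/PMID/ISBN reference found in text."""
    return [m.group(0) for m in MAGIC_LINK.finditer(text)]
```

Under this pattern "xRFC 1234" and "RFC 1234x" are not linked, because
the boundary check fails on one side.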
Bug: 65278
Change-Id: Ic315b988091a5c7530a8285b9249804db72e55db
- Added/removed spaces around parentheses
- Added newline in empty blocks
- Added space after switch/foreach/function
- Used tabs at the beginning of lines
- Added newline at end of file
Change-Id: I244cdb2c333489e1020931bf4ac5266a87439f0d
* Added a standard getFunctionStats() method for Profilers to return
per function data as maps. This is not toolbar specific like getRawData().
* Cleaned up the interface of SectionProfiler::getFunctionStats() a bit.
* Removed unused cpu_sq, real_sq fields from profiler UDP output.
* Moved getTime/getInitialTime to ProfilerStandard.
Co-Authored-By: Aaron Schulz <aschulz@wikimedia.org>
Change-Id: I266ed82031a434465f64896eb327f3872fdf1db1
includes/parser/Parser.php
* Pull out a chunk of code we need to reuse from parse() to
internalParseHalfParsed(). This is a fully backwards-compatible
change.
Code changes:
* Add a guard for running ParserBeforeTidy and ParserAfterTidy
hooks, as extensions might not expect them to be called for
snippets, only full page content.
* Change $options to $this->mOptions.
The bulk of parsing work is now done in internalParse() and
internalParseHalfParsed(); parse() only handles four things:
* Resetting parser state when a parse starts/finishes
* Page title language conversion
* Outputting limit report and limitation warnings
* Running ParserAfterParse hook (dunno why, but it's documented)
* Expand documentation for recursiveTagParse(), with some uppercase
warnings so that no one does the stupid thing I did ever again.
* Add new public method recursiveTagParseFully(), which is a
recursive parser entry point that produces fully parsed HTML ready
for inclusion in HTML output. Compared to Parser::parse(), it
doesn't produce limit reports and doesn't run the ParserAfterParse
hook.
includes/parser/CoreTagHooks.php
* Use the new recursiveTagParseFully() method.
* Use Parser::stripOuterParagraph() to remove silly tags.
Bug: 72887
Change-Id: I89ae9a50b82245f9a9e4a903563aeb1c51b6103e
Breaks extensions, doesn't entirely fix the problem it was meant to fix.
This reverts commit 6da3f169ac.
Change-Id: Ic193abcff8c72b0c8b434fcac514f88603a45beb
The JIT compiler in newer versions of PCRE experiences lock contention
when multithreaded applications perform a high rate of concurrent
compilations. We are seeing some performance impact on HHVM under normal
production traffic.
The random part of the strip marker is just there to protect against
deliberate insertion of strip markers into the source text, which is
very rare. So use a generic regex to find strip markers, and check in
the callback whether the random state ID is correct.
StripState::killMarkers() will be slower when it has to remove many
strip markers, but most calls to it will not match any strip markers, so
overall performance should be improved due to reduced JIT compilation.
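
The approach can be sketched in Python as follows. The marker layout,
the StripState methods shown, and the GENERIC_MARKER name are
assumptions for illustration and do not mirror the PHP class exactly.
One fixed pattern is shared by all instances, and the per-parse random
ID is compared inside the replacement callback instead of being
compiled into the regex.

```python
import re

# One generic pattern, compiled once and reused by every parse.
# (Assumed marker layout: \x7fUNIQ<random>-<name>-<number>-QINU\x7f.)
GENERIC_MARKER = re.compile(r"\x7fUNIQ([0-9a-f]+)-([a-z]+-[0-9]+)-QINU\x7f")


class StripState:
    def __init__(self, state_id):
        self.state_id = state_id  # random per-parse ID (hex string)
        self.items = {}           # marker key -> stripped content

    def add(self, key, content):
        """Store content and return the marker that stands in for it."""
        self.items[key] = content
        return "\x7fUNIQ%s-%s-QINU\x7f" % (self.state_id, key)

    def unstrip(self, text):
        """Replace this state's markers with their content."""
        def replace(match):
            random_part, key = match.group(1), match.group(2)
            # The ID check happens here, not in the regex, so markers
            # forged with a different ID are left untouched.
            if random_part != self.state_id or key not in self.items:
                return match.group(0)
            return self.items[key]
        return GENERIC_MARKER.sub(replace, text)
```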
Bug: 72205
Change-Id: I8d37ae929a8c669c9e39adc8096b89e5732b68d0
addTrackingCategory is more in line with ParserOutput's functionality
(addLink, addCategory etc), and tracking categories are useful even for
content types which do not use the parser at all. There is no reason to
require the caller to obtain a Parser object just to be able to add
tracking categories.
Change-Id: I89d9ea1db3a4e6486e77eee940bd438f7753b776
If you have a reference *to* an object field (anywhere in the call
stack) when you clone the object, the field will be cloned as a
reference rather than as a value.
So we have to break those unexpected references in the cloned object
manually, which is easy enough by making a non-reference copy and then
rebinding the cloned object's reference to this copy.
Bug: 56226
Change-Id: I9c600e9c0845b4fde0366126ce3809d74e2240b4
Add Parser::fetchCurrentRevisionOfTitle(). By default, this just calls
Revision::newFromTitle, but a callback can be set in ParserOptions that
will override it. Anything that runs as part of a parse should use this
wherever possible.
Bug: 70495
Change-Id: I521f1f68ad819cf0f37e63240806f10c1cceef9c
The previous implementation would unescape '&', '=', '+', and '%'. The
first three will break the URL when unescaped in the query string, and
the last will break when unescaped anywhere.
The code is now changed to treat the path, query, and fragment parts of
the URL separately when unescaping. We also escape any unsafe characters
and ensure all percent-encodings use uppercase hexits.
And since the old name is no longer accurate,
Parser::replaceUnusualEscapes is deprecated in favor of
Parser::normalizeLinkUrl.
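
A simplified Python sketch of the per-component handling. The
keep-encoded character sets and all names are assumptions for
illustration, and the step that escapes unsafe raw characters is left
out. Each part is processed with its own set of characters that must
stay percent-encoded, and any encoding that is kept gets uppercase
hexits.

```python
import re

PERCENT = re.compile(r"%([0-9A-Fa-f]{2})")

# Characters that must remain percent-encoded (assumed sets).
KEEP_ALWAYS = '"#%<>[\\]^`{|}'
KEEP_IN_PATH = KEEP_ALWAYS + "/?"
KEEP_IN_QUERY = KEEP_ALWAYS + "&=+;"
KEEP_IN_FRAGMENT = KEEP_ALWAYS


def _normalize_component(component, keep_encoded):
    def replace(match):
        code = int(match.group(1), 16)
        char = chr(code)
        # Decode only printable ASCII that is harmless in this component.
        if 32 < code < 127 and char not in keep_encoded:
            return char
        return "%" + match.group(1).upper()
    return PERCENT.sub(replace, component)


def normalize_link_url(url):
    """Unescape path, query and fragment separately."""
    rest, hash_mark, fragment = url.partition("#")
    path, question, query = rest.partition("?")
    return (
        _normalize_component(path, KEEP_IN_PATH)
        + question + _normalize_component(query, KEEP_IN_QUERY)
        + hash_mark + _normalize_component(fragment, KEEP_IN_FRAGMENT)
    )
```

In this sketch "%26" and "%3d" in a query string stay encoded (the
latter becoming "%3D"), while "%7E" in the path is decoded to "~".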
Bug: 57909
Change-Id: I77dc308d0d016c395ad737c08cf10a7711e25bbd
In Parser.php an array was built only so that its elements could be
used afterwards; replaced it with local vars.
In ParserOutput.php, also use local vars to make the code more readable.
Also inlined a private callback by using an anonymous function.
Change-Id: I1c31c9e4855f93a8fb65e1c21faba46fcdcb1f4b
This regex looked something like /^(?i)bitcoin:|ftp://|ftps://|.../, which
meant the anchoring ^ only applied to the first name. This meant that any
link= value that happened to contain a URL protocol anywhere within it
(e.g. wikinews:Foo containing "news:") got incorrectly matched by this
regex.
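
The precedence problem is easy to reproduce. The following Python
sketch uses a shortened, assumed protocol list and is not the PHP code
changed here. Without a group, ^ binds only to the first alternative;
wrapping the alternation in (?:...) anchors all of them.

```python
import re

# Shortened protocol list, for illustration only.
PROTOCOLS = ["bitcoin:", "ftp://", "ftps://", "news:"]
alternation = "|".join(re.escape(p) for p in PROTOCOLS)

# Buggy form: ^ applies only to "bitcoin:".
unanchored = re.compile("^" + alternation, re.IGNORECASE)
# Fixed form: the non-capturing group makes ^ apply to every protocol.
anchored = re.compile("^(?:" + alternation + ")", re.IGNORECASE)


def looks_like_url(value, pattern):
    """Report whether pattern finds a protocol in value."""
    return pattern.search(value) is not None
```

With the buggy form, "wikinews:Foo" is treated as a URL because it
contains "news:"; with the grouped form it is not.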
Bug: 69317
Change-Id: Ide1c4f64137666db99f8e3b6816df01ef5099c8e
This solution is somewhat imperfect, as the logic being added here to
MediaWikiTitleCodec really belongs in the parser. However, given the
current state of this code, this is the cleanest possible solution at
the moment.
Modified the existing release note for this.
Bug: 68802
Change-Id: I38309186bdcad23f49e23beb26daaf3ef5bceea1
- Swap "$variable type" to "type $variable"
- Added missing types
- Fixed spacing inside docs
- Capitalized the beginning of @param/@return/@var/@throws
- Changed some types to match the more common spelling
Change-Id: I8ebfbcea0e2ae2670553822acedde49c1aa7e98d
This was noticed on enwiki after w: was marked as a local interwiki prefix
there. Links like [[w:de:Foo]] ought to act like [[:de:Foo]], not
[[de:Foo]].
Also adding a number of additional parser tests related to interwiki links.
Bug: 68085
Change-Id: If39af06edb4af2da85c9bcf43df7088181809fcf
It is needed for PageImages to collect information about galleries, improving results
for Commons mainspace.
Bug: 66510
Change-Id: I3136d648ef2c1841767db0ab33855cd168e3de3e
Add {{!}} as a magic word that expands to a pipe. Parsoid already does
this, so we know it isn't going to cause major breakage.
Change-Id: I1f857417d224d6443504074a5add852df3975b89
If you transclude a special page, OutputPage::addWikiText can cause
problems. This prevents that from happening, by using a new object
if currently in a parsing operation.
Bug: 14562
Bug: 65826
Change-Id: I7c38fa9e2fbd270e45f73f522612451e77ab8cbb
This brings the image syntax in gallery tags in line with normal
syntax. Handle <gallery>File:foo.png|link=bar#baz</gallery>
properly.
Bug: 62343
Change-Id: If6149ccc19f70605ad4481e4da2ca55676d6001d
$wgExtraInterlanguageLinkPrefixes holds a list of interwiki prefixes to be
treated as language codes if $wgInterwikiMagic is true.
To set the display text for the interlanguage links generated by this
code, you need to create MediaWiki:Interlanguage-link-foo, where "foo" is
the interwiki prefix. To provide a friendly site name for the link title
text, use MediaWiki:Interlanguage-link-sitename-foo. On the WMF cluster,
these messages could be set using the WikimediaMessages extension.
Information about extra language links (in the site language only) is
provided via the API in meta=siteinfo&prop=interwikimap.
Bug: 32189
Change-Id: I3d04760e2d9fb3320bb71e3d5ad115eed54a899c
There are so many slightly different understandings of what a
"section" is or can be. I'm aware the documentation was improved
just a few weeks ago. I still find it incomplete and confusing.
1. I renamed it to $sectionId to make it more clear what it
really is.
2. Sections are usually numbers. 0, 1 and so on. There is no
reason to disallow the use of ints or even floats (this works
because the string representation of 0.0 is "0"). The code never
disallowed numbers.
3. 'T1' never was supported, as far as I can tell. 'T-1' is
supported. See Parser::extractSections().
4. null and false and '' all mean "the whole page" in
WikiPage::replaceSectionAtRev() but for some reason this meaning got
lost in WikitextContent::replaceSection(). I made it the same again.
Change-Id: Icc3997722d2ed742bf7703cd7c06d09199225720
This improves on commit 34bd573144 by matching
Parsoid's newline handling in the PHP parser. It is the outcome of a
discussion with Erwin, where we agreed that
* foo
* bar
should produce
<ul><li>foo</li>
<li>bar</li></ul>
See the discussion in https://gerrit.wikimedia.org/r/#/c/94443/
The original rendering issue this tried to address is no longer present after
a change to the template. The pure CSS solution is now working.
Bug: 39617
Bug: 56809
Change-Id: Ib7aa9449bbd994cb23b83b3f23cff944b1cddadf
Most wikitext is safe to parse once and then cache for when that same
wikitext is used again, such as for multiple transclusions of the same
template within a page. There are occasions, though, where some piece of
wikitext has side effects and so should not be cached; a prominent
example of such wikitext is the <ref> and <references> tags in Cite.php.
This change adds PPFrame::setVolatile so parser hooks such as <ref> and
<references> can indicate that they have done something that should not
be cached, and PPFrame::isVolatile so that callers of PPFrame::expand
can know when to avoid caching.
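
A hedged Python sketch of how a caller might use such a flag. Frame and
expand_cached are illustrative names, not the PPFrame API. An expansion
result is stored only if nothing marked the frame volatile while it was
being produced.

```python
class Frame:
    """Records whether anything expanded in this frame had side effects."""

    def __init__(self):
        self._volatile = False

    def set_volatile(self, flag=True):
        self._volatile = flag

    def is_volatile(self):
        return self._volatile


def expand_cached(cache, key, expand):
    """Expand via expand(frame), caching the result unless it was volatile."""
    if key in cache:
        return cache[key]
    frame = Frame()
    result = expand(frame)
    # A hook with side effects (a <ref>-like tag, say) flags the frame,
    # so its output is recomputed on every use.
    if not frame.is_volatile():
        cache[key] = result
    return result
```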
Bug: 46815
Bug: 31834
Change-Id: I95b3cf8781cf047cdb63da221cef45f3e7d1632e
Remove the parser's global $mTplExpandCache, and replace it with an
alternative that is separated by parent frame. This allows the integrity
of the empty-frame expansion cache to be maintained while also allowing
parent frame access.
A page with 3 copies of
http://ja.wikipedia.org/wiki/%E4%B8%AD%E5%A4%AE%E7%B7%9A_(%E9%9F%93%E5%9B%BD)
has the following statistics: Without this change, there are 4625 cache hits
on this page, and a sample of 3 parses took 16.6, 16.9, and 16.8 seconds.
With this change, there are 2588 cache hits, and a sample of 3 parses took
16.7, 16.7, and 17.0 seconds.
Change-Id: I621e9075e0f136ac188a4d2f53418b7cc957408d
We've had the logic for stripping the outer <p/> element in three
separate places. The version in OutputPage was missing the '$' at the
end of the regex, most likely a mistake caused by the duplication.
Also, extend the logic in order not to generate invalid HTML if the
input contains more than one <p/> tag. Added tests for this and the
previous behaviour.
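
For illustration, the combined rule could look like this Python sketch
(strip_outer_paragraph and OUTER_P are assumed names, and the pattern is
an approximation rather than the PHP regex). The wrapper is removed only
when the pattern is anchored at both ends and the captured content holds
no further closing </p>.

```python
import re

# Anchored at both ends; \Z plays the role of the '$' that was missing.
OUTER_P = re.compile(r"^<p>(.*?)\n?</p>\n?\Z", re.DOTALL)


def strip_outer_paragraph(html):
    """Remove a single wrapping <p>...</p>, leaving other input unchanged."""
    match = OUTER_P.match(html)
    # More than one paragraph: stripping would leave unbalanced tags.
    if match and "</p>" not in match.group(1):
        return match.group(1)
    return html
```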
https://www.mail-archive.com/mediawiki-api@lists.wikimedia.org/msg03188.html
Change-Id: I6bb3597898324556df912a23a7ffc9ff250b8f58