This aims to provide an interface similar to setOutputFlag for string
sets, such as the ones used in CSP properties.
Change-Id: I6f103bd88802e66611e483403a2f8a540d54aae9
This is in preparation for changes on the Parsoid side to make
sure its signature is compatible with the ContentMetadataCollector
interface there.
Change-Id: Ife4ae81dbc304097da7dcba40b143f7030b959f3
The Hooks class contains deprecated functions and the whole class is
going to get removed, so remove the convenience function and inline the
code.
Bug: T335536
Change-Id: I8ef3468a64a0199996f26ef293543fcacdf2797f
Eventually we should merge the "title text" and "display title" in
ParserOutput (T293514) but for now mirror the logic in
ParserOutput::mergeHtmlMetadataFrom() and update the title text
from the source if it hasn't already been set in the destination.
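As a rough sketch of the "only fill in the title text if the destination
has none" rule described above (TitleHolder and mergeTitleFrom() are
hypothetical stand-ins, not the real ParserOutput API, and the empty
string is assumed to mean "not set"):

```php
<?php
// Hypothetical stand-in for the title-text portion of a ParserOutput.
final class TitleHolder {
	/** @var string Empty string is assumed to mean "not set". */
	private string $titleText = '';

	public function getTitleText(): string {
		return $this->titleText;
	}

	public function setTitleText( string $text ): void {
		$this->titleText = $text;
	}

	// Copy the source's title text only when the destination has none.
	public function mergeTitleFrom( TitleHolder $source ): void {
		if ( $this->titleText === '' ) {
			$this->titleText = $source->getTitleText();
		}
	}
}

$source = new TitleHolder();
$source->setTitleText( 'Custom <i>title</i>' );

$emptyDest = new TitleHolder();
$emptyDest->mergeTitleFrom( $source );   // takes the source's title

$setDest = new TitleHolder();
$setDest->setTitleText( 'Already set' );
$setDest->mergeTitleFrom( $source );     // keeps its own title
```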
This patch ensures that after page properties are merged during
metadata collection, the title text is suitably updated if the
'displaytitle' property is set.
This will let Parsoid pass displaytitle (metadata) tests in integrated
mode since Parsoid relies on merging metadata from multiple ParserOutput
objects (in the DataAccess object that is used to expand templates, etc.)
Once this patch is merged, Parsoid patches may start failing CI till
we submit a patch there to fix up the integrated test failures list
since some previously failing tests may now pass.
Bug: T293514
Bug: T294621
Change-Id: Ia673f1261ccd03caf455122b71cfb9769b02f22e
* TOCData in Parsoid expects to process non-string-key indexed arrays.
* Don't use 'null' as the default for maxtoclevel to ensure that
TOC is always displayed even when it isn't passed in as a param
by callers.
* Follows up on 05535be6 which only partially fixed the breakage
caused by 153a4157 and 439656e0
Bug: T334551
Change-Id: I8883b58574ea8ed0566de2c44dba3408a47d2d0c
This is an initial quick-and-dirty implementation. The
ParsoidParser class will eventually inherit from \Parser,
but this is an initial placeholder to unblock other Parsoid
read views work.
Currently Parsoid does not fully implement all the ParserOutput
metadata set by the legacy parser, but we're working on it.
This patch also addresses T300325 by ensuring that the Page HTML
APIs use ParserOutput::getRawText(), which will return the entire
Parsoid HTML document without post-processing. This is what
the Parsoid team refers to as "edit mode" HTML. The
ParserOutput::getText() method returns only the <body> contents
of the HTML, and applies several transformations, including
inserting Table of Contents and style deduplication; this is
the "read views" flavor of the Parsoid HTML.
We need to be careful of the interaction of the `useParsoid` flag with
the ParserCacheMetadata. Effectively `useParsoid` should *always* be
marked as "used" or else the ParserCache will assume its value doesn't
matter and will serve legacy content for Parsoid requests and
vice-versa. T330677 is a follow-up to address this more thoroughly by
splitting the parser cache in ParserOutputAccess; the stopgap in this
patch is fragile and, because it doesn't fork the ParserCacheMetadata
cache, may corrupt the ParserCacheMetadata in the case when Parsoid
and the legacy parser consult different sets of options to render a
page.
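To illustrate why `useParsoid` has to be recorded as "used", here is a
minimal sketch of a cache key built only from the used options. The
cacheKey() helper and the key layout are assumptions made for this
example and do not reflect the actual ParserCache key format:

```php
<?php
// Hypothetical: build a cache key from only those options recorded as used.
function cacheKey( string $page, array $options, array $usedOptions ): string {
	sort( $usedOptions );
	$parts = [ $page ];
	foreach ( $usedOptions as $name ) {
		$parts[] = $name . '=' . json_encode( $options[$name] ?? null );
	}
	return implode( '!', $parts );
}

$legacy = [ 'useParsoid' => false, 'userlang' => 'en' ];
$parsoid = [ 'useParsoid' => true, 'userlang' => 'en' ];

// Only 'userlang' recorded as used: both renderings share one key, so
// legacy content could be served for a Parsoid request.
$collides = cacheKey( 'Main_Page', $legacy, [ 'userlang' ] )
	=== cacheKey( 'Main_Page', $parsoid, [ 'userlang' ] );

// Always marking 'useParsoid' as used keeps the two entries apart.
$used = [ 'userlang', 'useParsoid' ];
$distinct = cacheKey( 'Main_Page', $legacy, $used )
	!== cacheKey( 'Main_Page', $parsoid, $used );
```

This only models the key collision; it does not model the shared
ParserCacheMetadata that the paragraph above warns about.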
Bug: T300191
Bug: T330677
Bug: T300325
Change-Id: Ica09a4284c00d7917f8b6249e946232b2fb38011
The TOC used to be language-converted in ParserOutput::getText(), but
it wasn't possible to apply custom rules defined in the wikitext
article body at ::getText() time. Remove the various hacks that we'd
added in an attempt to do so, which were made unnecessary by
I321cd31dae64bbf845d53282e5d28a55bc4ec319.
Bug: T306862
Change-Id: Ib12cd02e9ade91d5794462e8833f2aa3b45a51f2
The tag has been <mw:editsection> since at least 2011 (f0fd318a4e),
so we no longer need to include the ancient <editsection> variant in
our regexp and test cases.
Change-Id: I5fd783556810ea13b07a69066ea6762d1a1863e1
Provide a way for backend code to determine the primary language of a
ParserOutput, e.g. for setting the Content-Language header of an API
response.
This is read-only and backed by extension data at the moment for
transition purposes; if this API sticks we'll graduate it to a
"real" property in the future, with appropriate serialization
to/from JSON (T303329).
Similarly, this patch only includes the most basic code to handle
the various ParserOutput merge cases in
ParserOutput::merge{Internal,Html,Tracking}MetaDataFrom(),
ParserOutput::collectMetadata(), and
OutputPage::addParserOutput{Content,Metadata,Text,}(); the merge
behavior is mostly inherited from the fact that the storage is backed
by extension data at the moment.
Generally only the "top-level" parser output gets to set the
primary language; we'll presumably need to ensure that the
language is consistent during merge.
Change-Id: I767daba22805a877d9b806fd77334e508902844b
This undocumented method returns a reference to ParserOutput's private
storage array, yet very few callers actually require a reference or try
to use this to mutate the internal storage. Further, the keys of the
array can be converted to `int` when the category names are numeric,
which can further confuse users. Most users found through codesearch
can/should use ::getCategoryNames() instead.
Add a new ::getCategorySortKey() method to provide access to the sort
keys for those few callers who require them, in a manner which doesn't
expose that the internal `mCategories` array stores numeric category
names as 'int'.
Bug: T331727
Change-Id: I8dc85e76bfbb9ed49a603d990c14b7ee798bd821
Numeric category strings like '1' are converted to ints when they are
used as array keys. Convert back to strings as needed to ensure this
doesn't surprise any clients.
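A small demonstration of the PHP behavior in question. The
categoryNames() helper is hypothetical and only shows one way of
casting the keys back to strings:

```php
<?php
// Category names as keys, sort keys as values.
$categories = [ '1' => 'one', '2023' => '', 'Foo' => 'foo' ];

// PHP casts decimal-integer-like string keys to int; 'Foo' stays a string.
$rawNames = array_keys( $categories );

// Hypothetical helper: hand the names back to callers as strings.
function categoryNames( array $categories ): array {
	return array_map( 'strval', array_keys( $categories ) );
}

$names = categoryNames( $categories );

// A strict search for the string '1' misses the raw int key,
// but finds the normalized name.
$foundRaw = in_array( '1', $rawNames, true );
$foundNormalized = in_array( '1', $names, true );
```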
Bug: T331084
Change-Id: Ib39707216d213e414c09226a6378047ffaf43892
When running PHPUnit integration tests locally for
Extension:GrowthExperiments, $toc['extensionData'] isn't
defined, leading to failures for various tests.
Follows-Up: I67397c49f2d0764e5c755101264631bea6603e16
Change-Id: I3ef45a86c236863dbeafbd121f1a5951947c5dc6
In order to break a cyclic dependency, Parsoid doesn't know about
core's `ParserOutput` class; it defines its own
`ContentMetadataCollector` interface which exposes those portions
of the ParserOutput metadata which the parser needs to supply.
Other bits of the ParserOutput metadata are specific to MediaWiki
internals and Parsoid doesn't have to explicitly know about them:
extensions and core implementations of parser functions (eg) can
take the ContentMetadataCollector supplied by Parsoid and downcast
it back to a ParserOutput in order to propagate internal information
(like ParserCache lifetimes) "behind Parsoid's back" - aka, without
violating abstraction boundaries by exposing every implementation
detail of MediaWiki to Parsoid.
When Parsoid calls into core to expand magic words like
`currenttimestamp`, the core implementations update the cache TTL in
the ParserOutput using this mechanism. Using
ParserOutput::collectMetadata() ensures these values are propagated to
the final ParserOutput, even though Parsoid doesn't (and shouldn't
have to) explicitly know about them.
Bug: T329067
Change-Id: Ia92efff4293841330674df09e82897d0775ef4d6
Before 1.39 we used <mw:toc> and in 1.39 we switched to <mw:tocplace/>
(commit 24949480eb). This was changed to a <meta> tag in 1.40 (commits
0b10563895 and fa8646ca7b) and the old content has long since expired
from the ParserCache. Clean up the old ParserCache transition code.
Change-Id: I3254d0acba31e107b50767797a2b0ad28aba59ee
The TOCData should be serialized with the JsonCodec which will also
allow preserving the TOC top-level extension data. But for now, use a
hack to ensure it is not lost when we use the "legacy" associative
array format to serialize/deserialize TOCData.
Change-Id: I67397c49f2d0764e5c755101264631bea6603e16
* Rather than computing TOC HTML in Parser and setting it in
ParserOutput, compute it on demand based on section metadata.
This will let Parsoid set section metadata in ParserOutput
and have the TOC generated automatically.
* This required fixing some "bugs" in Linker's generateTOC
which didn't properly close tags and relied on Tidy to fix
up unclosed li and ul tags.
* This patch relies on converting section metadata objects to
array objects, but Linker::generateTOC could be converted to
use TOC data instead.
* Since TOC generation is now moved to getText(), this is done
post-PC load and this eliminates the parser cache split on
user language for TOC heading localization.
Bug: T293513
Change-Id: Ief1bba326d3612b40930440c872a61abadffab10
* ParserOutput::setSections()/::getSections() are expected
to be deprecated. Uses in extensions and skins will need to be
migrated in follow-up patches once the new interface has stabilized.
* In the skins code, the metadata is converted back to an array.
Downstream skin TOC consumers will need to be migrated as well
before we can remove the toLegacy() conversion.
* Fixed SerializationTestTrait's validation method
- Not sure if this is overkill but should handle all future
complex objects we might stuff into the ParserCache.
* This patch emits a backward-compatible Sections property in order to
avoid changing the parser cache serialization format. T327439 has
been filed to eventually use the JsonCodec support for object
serialization, but for this initial patch it makes sense to avoid
the need for a concurrent ParserCache format migration by using a
backward-compatible serialization.
* TOCData is nullable because the intent is that
ParserOutput::setTOCData() is MW_MERGE_STRATEGY_WRITE_ONCE; that is,
only the top-level fragment composing a page will set the TOCData.
This will be enforced in the future via wfDeprecated() (T327429),
but again our first patch is as backward-compatible as possible.
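The write-once intent for a nullable field could be sketched as below.
OutputSketch is a hypothetical class, and throwing on a second write is
an assumption made for illustration: it is stricter than this patch,
which does not yet enforce anything.

```php
<?php
// Hypothetical stand-in showing a nullable, write-once field.
final class OutputSketch {
	/** @var array|null Null until the top-level fragment sets it. */
	private ?array $tocData = null;

	public function setTocData( array $toc ): void {
		if ( $this->tocData !== null ) {
			// Assumption: fail loudly; the patch itself stays permissive.
			throw new LogicException( 'TOC data may only be set once' );
		}
		$this->tocData = $toc;
	}

	public function getTocData(): ?array {
		return $this->tocData;
	}
}

$output = new OutputSketch();
$before = $output->getTocData();          // null: nothing set yet
$output->setTocData( [ 'sections' => [] ] );

$secondWriteRejected = false;
try {
	$output->setTocData( [ 'sections' => [ 'ignored' ] ] );
} catch ( LogicException $e ) {
	$secondWriteRejected = true;
}
```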
Bug: T296025
Depends-On: I1b267d23cf49d147c5379b914531303744481b68
Co-Authored-By: C. Scott Ananian <cananian@wikimedia.org>
Co-Authored-By: Subramanya Sastry <ssastry@wikimedia.org>
Change-Id: I8329864535f0b1dd5f9163868a08d6cb1ffcb78f
Add a type annotation when encoding `stdClass` objects so that we can
be sure to decode them as objects instead of arrays.
This avoids issues such as that seen in the Graph extension (T312589)
where the value of an extension data key is stored as a stdClass. If
the ParserOutput was computed fresh, a subsequent getExtensionData()
call would return a stdClass object, but if the ParserOutput was
cached, getExtensionData() would return an array. After this change
the return type is always consistent.
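The idea of annotating objects so they decode as objects can be
sketched as follows. This is not the JsonCodec implementation: the
TYPE_KEY marker name and the encodeValue()/decodeValue() helpers are
assumptions made for this example, and only stdClass is handled.

```php
<?php
// Hypothetical marker key; the real codec's annotation may differ.
const TYPE_KEY = '_type_';

// Recursively turn stdClass objects into arrays tagged with their type.
function encodeValue( $value ) {
	if ( $value instanceof stdClass ) {
		$out = [ TYPE_KEY => 'stdClass' ];
		foreach ( get_object_vars( $value ) as $key => $inner ) {
			$out[$key] = encodeValue( $inner );
		}
		return $out;
	}
	if ( is_array( $value ) ) {
		return array_map( 'encodeValue', $value );
	}
	return $value;
}

// Recursively restore tagged arrays to stdClass; plain arrays stay arrays.
function decodeValue( $value ) {
	if ( !is_array( $value ) ) {
		return $value;
	}
	$isObject = ( $value[TYPE_KEY] ?? null ) === 'stdClass';
	unset( $value[TYPE_KEY] );
	$decoded = array_map( 'decodeValue', $value );
	return $isObject ? (object)$decoded : $decoded;
}

$fresh = (object)[
	'graph' => (object)[ 'width' => 100 ],
	'tags' => [ 'a', 'b' ],
];

// Without the annotation, an array-mode decode loses the object type.
$naive = json_decode( json_encode( $fresh ), true );

// With the annotation, the cached copy has the same shape as the fresh one.
$cached = decodeValue( json_decode( json_encode( encodeValue( $fresh ) ), true ) );
```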
Properly handle nested objects: encode all object values returned by
JsonSerializable::jsonSerialize() (so that the client is not responsible
for implementing this correctly), and decode all object values *before*
calling JsonUnserializable::newFromJsonArray (again, so that the
client is not responsible for decoding its property values). The new
behavior matches how serialize/unserialize is handled in the 'naive'
JsonUnserializable{Sub,Super}Class test cases; ParserOutput (the only
user of JsonCodec in core) was doing an extra manual decode for
the ExtensionData array in ParserOutput::initFromJson that is no longer
necessary.
The GrowthExperiments and SemanticMediaWiki extensions were working
around the non-recursive nature of JsonCodec; this patch depends on
patches to GrowthExperiments to make it agnostic about whether object
unserialization occurs before or after ::newFromJsonArray() is called,
which can then be further cleaned up once this is released.
A pull request for SemanticMediaWiki has also been submitted.
Bug: T312589
Depends-On: I3413609251f056893d3921df23698aeed40754ed
Change-Id: Id7d0695af40b9801b42a9b82f41e46118da288dc
To follow Message. This is approved as part of RFC T166010.
Also namespace it, but doing it properly with PSR-4 would require
namespacing every class under language/ and that will take some time.
Bug: T321882
Change-Id: I195cf4c67bd51410556c2dd1e33cc9c1033d5d18
There are many, many more. I touch only a few where I'm sure it's
never anything but an array of strings.
Change-Id: I8b798f2e9d48f07a241b95ce0ace8fa9d981695d
This addresses the common case patched by
I530d71d0f9279b40a263cd62467d3ef8c76975c3,
If6267f3389b166043fc94d7f952bc54122b1a378 and probably
the code in Article.php from I44045b3b9e78e7ab793da3f37e3c0dbc91cd7d39
by ensuring that "injectTOC" in the options passed to
ParserOutput::getText() defaults to the correct value based on the skin
being used by OutputPage.
Bug: T317333
Change-Id: Ica30569efbb5730eff5b807e8fc34beb2e13e74f
Map values can include JsonUnserializable objects, and strict
(reference) equality comparison of these objects is not going to
reflect value equality. Serialize the values and compare strings
instead; this case should be hit very infrequently given that
rewriting the same extension data key is discouraged.
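A minimal sketch of comparing by serialized form rather than by
identity. Point and sameValue() are hypothetical, and plain
json_encode() stands in for whatever serializer the real code uses:

```php
<?php
// Hypothetical value object standing in for a JsonUnserializable value.
final class Point implements JsonSerializable {
	private int $x;
	private int $y;

	public function __construct( int $x, int $y ) {
		$this->x = $x;
		$this->y = $y;
	}

	public function jsonSerialize(): array {
		return [ 'x' => $this->x, 'y' => $this->y ];
	}
}

// Compare two values by their serialized form instead of by identity.
function sameValue( $a, $b ): bool {
	return json_encode( $a ) === json_encode( $b );
}

$a = new Point( 1, 2 );
$b = new Point( 1, 2 );

$identical = ( $a === $b );          // distinct instances
$equalByValue = sameValue( $a, $b ); // same serialized form
$differs = sameValue( $a, new Point( 1, 3 ) );
```

Note that a string comparison like this is sensitive to key order in
associative arrays, which is acceptable only because this path is
expected to be hit rarely.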
Bug: T312588
Change-Id: I942e7fa662b2f1a5e32fd55ef65eaa10a22afcfb
The PHP `isset(...)` construct covers a multitude of possible "wrong
types" for the left hand side of an array access, but it still crashes
(with "Cannot use object of type stdClass as array") if the left hand
side is an object.
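The failure mode, plus one possible guard, can be reproduced as below.
The getNested() helper is hypothetical and is not the fix applied in
this patch:

```php
<?php
// isset() tolerates scalars and null on the left-hand side...
$scalar = 5;
$scalarResult = isset( $scalar['extensionData'] );

// ...but throws an Error when the left-hand side is a plain object.
$bad = [ 'toc' => new stdClass() ];
$message = null;
try {
	$unused = isset( $bad['toc']['extensionData'] );
} catch ( Error $e ) {
	$message = $e->getMessage();
}

// Hypothetical helper: only index into values that really are arrays.
function getNested( $container, string $outer, string $inner ) {
	if ( !is_array( $container ) || !is_array( $container[$outer] ?? null ) ) {
		return null;
	}
	return $container[$outer][$inner] ?? null;
}

$safe = getNested( $bad, 'toc', 'extensionData' );
$good = getNested( [ 'toc' => [ 'extensionData' => [ 'k' => 'v' ] ] ], 'toc', 'extensionData' );
```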
Bug: T312242
Change-Id: I35026c573fb941004764d46d5652ebcddc559c03
When JSON support was introduced into ParserCache in 1.36, it was
controlled by a feature flag, $wgParserCacheUseJson. The feature flag
was "born deprecated" in 1.36. It can now be removed.
This means that ParserCache will always store entries as JSON.
Support for reading old non-JSON entries remains intact.
This is needed when updating wikis from a version older than 1.36
to the current version.
Change-Id: Id04e42bfb458d98414bac50e0d6c505e8878e5c0
Follow-up to I9d1f0f6bab1305552a0350667d6142a24bc04049. That patch was
not collecting data at all (not even overwriting them over and over
again) - the assignment operation was, in practice, a NOP. This patch
fixes this.
Bug: T303014
Bug: T303015
Change-Id: I7d09b532f3270edf4327c16e032d665353d992f6