This is an initial quick-and-dirty implementation. The
ParsoidParser class will eventually inherit from \Parser,
but this is an initial placeholder to unblock other Parsoid
read views work.
Currently Parsoid does not fully implement all the ParserOutput
metadata set by the legacy parser, but we're working on it.
This patch also addresses T300325 by ensuring the the Page HTML
APIs use ParserOutput::getRawText(), which will return the entire
Parsoid HTML document without post-processing. This is what
the Parsoid team refers to as "edit mode" HTML. The
ParserOutput::getText() method returns only the <body> contents
of the HTML, and applies several transformations, including
inserting Table of Contents and style deduplication; this is
the "read views" flavor of the Parsoid HTML.
We need to be careful of the interaction of the `useParsoid` flag with
the ParserCacheMetadata. Effectively `useParsoid` should *always* be
marked as "used" or else the ParserCache will assume its value doesn't
matter and will serve legacy content for parsoid requests and
vice-versa. T330677 is a follow up to address this more thoroughly by
splitting the parser cache in ParserOutputAccess; the stop gap in this
patch is fragile and, because it doesn't fork the ParserCacheMetadata
cache, may corrupt the ParserCacheMetadata in the case when Parsoid
and the legacy parser consult different sets of options to render a
page.
Bug: T300191
Bug: T330677
Bug: T300325
Change-Id: Ica09a4284c00d7917f8b6249e946232b2fb38011
It is very easy for developers and maintainers to mix up "internal
MediaWiki language codes" and "BCP-47 language codes"; the latter are
standards-compliant and used in web protocols like HTTP, HTML, and
SVG; but much of WMF production is very dependent on historical codes
used by MediaWiki which in some cases predate the IANA standardized
name for the language in question.
Phan and other static checking tools aren't much help distinguishing
BCP-47 from internal codes when both are represented with the PHP
string type, so the wikimedia/bcp-47-code package introduced a very
lightweight wrapper type in order to uniquely identify BCP-47 codes.
Language implements Bcp47Code, and LanguageFactory::getLanguage() is
an easy way to convert (or downcast) between Bcp47Code and Language
objects.
This patch updates the Parsoid integration code and the associated
REST handlers to use Bcp47Code in APIs so that the standalone Parsoid
library does not need to know anything about MediaWiki-internal codes.
The principle has been, first, to try to convert a string to a
Bcp47Code as soon as possible and as close to the original input as
possible, so it is easy to see *why* a given string is a BCP-47 code
(usually, because it is coming from HTTP/HTML/etc) and we're not stuck
deep inside some method trying to figure out where a string we're
given is coming from and therefore what sort of string code it might
be. Second, we've added explicit compatibility code to accept
MediaWiki internal codes and convert them to Bcp47Code for backward
compatibility with existing clients, using the @internal
LanguageCode::normalizeNonstandardCodeAndWarn() method. The intention
is to gradually remove these backward compatibility thunks and replace
them with HTTP 400 errors or wfDeprecated messages in order to
identify and repair callers who are incorrectly using
non-standard-compliant language codes in web standards
(HTTP/HTML/SVG/etc).
Finally, maintaining a code as a Bcp47Code and not immediately
converting to Language helps us delay or even avoid full loading of a
Language object in some cases, which is another reason to occasionally
push Bcp47Code (instead of Language) down the call stack.
Bug: T327379
Depends-On: I830867d58f8962d6a57be16ce3735e8384f9ac1c
Change-Id: I982e0df706a633b05dcc02b5220b737c19adc401
Mixing Handlers with Helpers doesn't look nice for consistency
reasons. Helpers should be in their own place (grouped) in the
Handlers directory as they're really "helpers for the handlers".
Change-Id: Ieeb7a0a706a4cb38778f312bfbfe781a1f366d14
This introduces an interface HtmlOutputHelper that is implemented
by both HtmlMessageOutputHelper or HtmlOutputRendererHelper based
on the page we're dealing with.
Bug: T323558
Change-Id: I1fb8dcc5cc05ce3f32f3c1862b88045f1c8e612b
This allows extensions like VisualEditor to safely instantiate REST
helper objects. It also reduces the number of services that need to be
injected into REST handlers from route definitions.
Change-Id: I10af85b2da96568cfffd03867d1cb299645fb371
* We will have several kinds of HTML transformations.
Rename HTMLTransform to indicate that its for converting HTML to Content
objects.
* Using Naming Convention 'Html' instead of 'HTML'
Change-Id: I506f3303ae8f9e4db17299211366bef1558f142c
Variant conversion is based on the Accept-Language header. Updated
the HtmlOutputRendererHelper to set the HTTP headers related to
variant conversion.
Bug: T317019
Change-Id: I5e11452f1c531a757e8d860f9c727b5810406bce
This adds setters to HtmlOutputRendererHelper which allow it to be used
more conveniently in different contexts. This is aimed specifically at
making it easier for DirectParsoidClient in the VisualEditor extension
to re-use this code.
NOTE: HtmlOutputRendererHelper is declared @unstable, but the changes in
this patch need to be backwards compatible at least temporarily, to
allow the VisualEditor extension to be updated in a follow-up.
Change-Id: I18c8bc6f5aa7c204f0faa56919bfe64026761bd4
NOTE: stats key has been updated to reflect this change so we'll
no longer get data on the "parsoidhtmlhelper..." key after this
is deployed.
Change-Id: I599b1fd22c2d962b57e80beb84fe6f3a335f488c
This patch introduces a ParsoidOutputAccess service for
getting parsoid outputs and warms the cache with pregenerated
outputs.
It also introduces a config variable in ParsoidCacheConfig that
is turned off by default for controlling the cache warming.
Bug: T301371
Change-Id: I6152c42ea765d94093d8d62598b1b4278314adec
Cache the parsoid outputs only if a certain time is exceeded on
parse and consider the parse operation within this time limit as
not expensive per that wiki and not cache the parsoid output at all.
Bug: T308588
Change-Id: I7793b77feab13400ccd04343e7878ad701f5e6a7
As a means of understanding the usage of the stash FEAT for
/page/html & /revision/html endpoints used by VE extension,
this patch introduces the collection of stats using the
StatsDataFactory.
Bug: T309017
Change-Id: I4e17d50e79da263637bdd55ab62e993df441fe38
This patch enables the response from PageHTMLHandler and
RevisionHTMLHandler to have different eTags for different
output modes and varying flavors.
Before, the only difference we got was when the stashing
option is set or not, but we need more flavors.
Bug: T308744
Change-Id: I2e9679e46a31955a2106a52af4eb612b32799c8c
Add stash option to /page/html & /revision/html endpoints.
When this option is set, the PageBundle returned by Parsoid is
stashed and an etag is returned that can later be used to
make use of the stashed PageBundle.
The stash is for now backed by the BagOStuff returned by
ObjectCache::getLocalClusterInstance().
This patch adds additional data to the ParserOutput stored in ParserCache.
Old entries lacking that data will be ignored.
Bug: T267990
Co-Authored-by: Nikki <nnikkhoui@wikimedia.org>
Change-Id: Id35f1423a69e3ff63e4f9883b3f7e3f9521d81d5