Commit graph

37 commits

Author SHA1 Message Date
Subramanya Sastry
c8d0470f4b Make ParsoidOutputAccess a wrapper over ParserOutputAccess
* Updated ParserOutput to set Parsoid render ids that REST API
  functionality expects in ParserOutput objects.
* CacheThresholdTime functionality no longer exists since it was
  implemented in ParsoidOutputAccess and ParserOutputAccess doesn't
  support it. This is tracked in T346765.
* Enforce the constraint that uncacheable parses are only for fake or
  mutable revisions. Updated tests that violated this constraint to
  use 'getParseOutput' instead of calling the parse method directly.
* Had to make some changes in ParsoidParser around use of preferredVariant
  passed to Parsoid. I also left some TODO comments for future fixes.
  T267067 is also relevant here.

PARSOID-SPECIFIC OPTIONS:
* logLinterData: linter data is always logged by default -- removed
  support to disable it. Linter extension handles stale lints properly
  and it is better to let it handle it rather than add special cases
  to the API.
* offsetType: Moved this support to ParsoidHandler as a post-processing
  of byte-offset output. This eliminates the need to support this
  Parsoid-specific option in the ContentHandler hierarchies.
* body_only / wrapSections: Handled this in HtmlOutputRendererHelper
  as a post-processing of regular output by removing sections and
  returning the body content only. This does result in some useless
  section-wrapping work with Parsoid, but the simplification is probably
  worth it. If in the future, we support Parsoid-specific options in
  the ContentHandler hierarchy, we could re-introduce this. But, in any
  case, this "fragment" flavor option is likely to get moved out of
  core into the VisualEditor extension code.
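
As an illustration of the offsetType post-processing described above, a
byte offset into UTF-8 text can be converted to another offset type after
the fact. This is a minimal Python sketch, not the ParsoidHandler code; the
function name and the 'byte' / 'char' / 'ucs2' labels are assumptions made
for the example.

```python
def convert_byte_offset(text: str, byte_offset: int, target: str) -> int:
    """Convert a UTF-8 byte offset into `text` to another offset type."""
    data = text.encode("utf-8")
    if not 0 <= byte_offset <= len(data):
        raise ValueError("byte offset out of range")
    # Decode the prefix; this raises if the offset splits a character.
    prefix = data[:byte_offset].decode("utf-8")
    if target == "byte":
        return byte_offset
    if target == "char":
        return len(prefix)  # Unicode code points
    if target == "ucs2":
        return len(prefix.encode("utf-16-le")) // 2  # UTF-16 code units
    raise ValueError(f"unknown offset type: {target}")
```

Doing the conversion as a final pass means the parse itself only ever has
to produce one kind of offset.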

DEPLOYMENT:
* This patch changes the cache key by setting the useParsoid option
  in ParserOptions. The parent patch handles this to ensure we don't
  encounter a cold cache on deploy.

TESTS:
* Updated tests and mocks to reflect new reality.
* Do we need any new tests?

Bug: T332931
Change-Id: Ic9b7cc0fcf365e772b7d080d76a065e3fd585f80
2023-10-13 15:03:03 -05:00
thiemowmde
46bed8ac6d Make use of assertStatusGood/Error and such in tests
Change-Id: I11eace3d9823ca28a1d9a64f959f5f8ca2945821
2023-10-04 17:16:00 +00:00
Subramanya Sastry
062fd08e51 Remove all Parsoid debugApi references and uses
* Was used during the Parsoid JS -> PHP port and is no longer used.
* This also eliminated the need to inject ParsoidSettings into some
  classes.
* Once this merges and lands in core, I'll remove this from the Parsoid
  repo as well.

Change-Id: I008d30ea81f5a3db26e512c87762b90e3ca3c4ff
2023-09-14 14:48:48 -05:00
Amir Sarabadani
f4e68e055f Reorg: Move Status to MediaWiki\Status\
This class is used heavily basically everywhere, so moving it to Utils
wouldn't make much sense. Also with this change, we can move
StatusValue to MediaWiki\Status as well.

Bug: T321882
Depends-On: I5f89ecf27ce1471a74f31c6018806461781213c3
Change-Id: I04c1dcf5129df437589149f0f3e284974d7c98fa
2023-08-25 15:44:17 +02:00
Daimona Eaytoy
77d4c2c454 phpunit: Randomize and improve default test page names
UTPage is badly named, because it doesn't give any information as to
what test caused the page to be created. It also sort of encourages test
authors to rely on this "UTPage" page being created by the framework for
them.

Both these things are dangerous, or at least very questionable. Use a
random page title instead, but include the caller name in case someone
needs to investigate where a test page is coming from.

Do the same for summary and content, too.
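
The naming scheme can be sketched roughly as follows. This is a Python
illustration only, not the PHPUnit framework code; the helper name and the
exact title format are hypothetical.

```python
import inspect
import secrets


def random_test_page_title() -> str:
    """Build a random title that still names the function that asked for it."""
    # Frame 0 is this helper; frame 1 is whoever called it.
    caller = inspect.stack()[1].function
    return f"{caller}-{secrets.token_hex(4)}"
```

A title produced this way is unpredictable, so tests cannot depend on it,
yet a leftover page can still be traced back to its caller.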

In getExistingTestPage, add a check to make sure that the page was
created successfully. Do not use assert* to avoid adding assertions
extraneous to the test.

addCoreDBData is not changed because that method will be removed
entirely (T342428).

Fix tests that are now failing:
- ParsoidOutputAccessTest was relying on the content of
  getExistingTestPage to be UTContent.
- HTMLHandlerTestTrait did not account for spaces in the page name (also
  change the signature to reflect the fact that WikiPage is always
  passed in).
- HtmlInputTransformHelperTest was relying on the fake test page to be
  there.
- PoolWorkArticleViewTest is leaving pages behind, and for some reason
  that's making SpecialRecentchangesTest fail.

Bug: T341344
Change-Id: I9c2dc1cf1f184c8062864756d2747ee56e886086
2023-08-15 20:39:25 +00:00
Arlo Breault
0f8aac2de8 Catch RevisionAccessException in ParsoidOutputAccess
A shared get content assertion is added to PageConfigFactory::create

Bug: T338925
Bug: T336501
Follows-Up: I647ed253691970bbf39321a3cd652ea14bc11278
Change-Id: Iaf3898e5c53f1673ade639f7990911e4595801a8
2023-06-27 14:04:19 -04:00
jenkins-bot
6c32cc6698 Merge "Allow setting a ParserOption to generate Parsoid HTML"
2023-03-27 08:08:56 +00:00
C. Scott Ananian
cfd9c516e1 Allow setting a ParserOption to generate Parsoid HTML
This is an initial quick-and-dirty implementation.  The
ParsoidParser class will eventually inherit from \Parser,
but this is an initial placeholder to unblock other Parsoid
read views work.

Currently Parsoid does not fully implement all the ParserOutput
metadata set by the legacy parser, but we're working on it.

This patch also addresses T300325 by ensuring that the Page HTML
APIs use ParserOutput::getRawText(), which will return the entire
Parsoid HTML document without post-processing.  This is what
the Parsoid team refers to as "edit mode" HTML. The
ParserOutput::getText() method returns only the <body> contents
of the HTML, and applies several transformations, including
inserting Table of Contents and style deduplication; this is
the "read views" flavor of the Parsoid HTML.

We need to be careful of the interaction of the `useParsoid` flag with
the ParserCacheMetadata.  Effectively `useParsoid` should *always* be
marked as "used" or else the ParserCache will assume its value doesn't
matter and will serve legacy content for parsoid requests and
vice-versa.  T330677 is a follow up to address this more thoroughly by
splitting the parser cache in ParserOutputAccess; the stop gap in this
patch is fragile and, because it doesn't fork the ParserCacheMetadata
cache, may corrupt the ParserCacheMetadata in the case when Parsoid
and the legacy parser consult different sets of options to render a
page.
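
The cache hazard can be sketched with a toy model. This is illustrative
Python, not the ParserCache implementation; `ToyParserCache`, its
`used_options` metadata and the key format are all hypothetical. The point
it shows is that an option not recorded as "used" does not vary the key.

```python
class ToyParserCache:
    """Toy cache whose key varies only on options recorded as 'used'."""

    def __init__(self):
        self.used_options = {}  # page -> set of option names (shared metadata)
        self.entries = {}       # key -> rendered output

    def _key(self, page, options):
        used = sorted(self.used_options.get(page, set()))
        return page + "!" + "!".join(f"{name}={options.get(name)}" for name in used)

    def save(self, page, options, used, output):
        # The metadata is stored once per page, not per parser flavor.
        self.used_options[page] = set(used)
        self.entries[self._key(page, options)] = output

    def get(self, page, options):
        return self.entries.get(self._key(page, options))
```

In this model, saving legacy output with `used=[]` and then calling `get`
with `{"useParsoid": True}` returns the legacy output; saving with
`used=["useParsoid"]` makes the same lookup miss instead.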

Bug: T300191
Bug: T330677
Bug: T300325
Change-Id: Ica09a4284c00d7917f8b6249e946232b2fb38011
2023-03-26 21:46:05 -04:00
Tim Starling
be3018b268 Just another 80 or so PHPStorm inspection fixes (#4)
* Unnecessary regex modifier. I agree with this inspection which flags
  /s modifiers on regexes that don't use a dot.
* Property declared dynamically.
* Unused local variable. But it's acceptable for an unused local
  variable to take the return value of a method under test, when it is
  being tested for its side-effects. And it's acceptable for an unused
  local variable to document unused list expansion elements, or the
  nature of array keys in a foreach.
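
The first inspection can be checked by hand. The sketch below uses Python's
`re.DOTALL` flag as a stand-in for the PCRE /s modifier; it is an analogy
for illustration, not code from this change.

```python
import re

text = "line one\nline two"

# Without a dot in the pattern, DOTALL (the /s analogue) changes nothing.
assert re.findall(r"line \w+", text) == re.findall(r"line \w+", text, re.DOTALL)

# With a dot, the flag decides whether "." may match the newline.
assert re.search(r"one.line", text) is None
assert re.search(r"one.line", text, re.DOTALL) is not None
```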

Change-Id: I067b5b45dd1138c00e7269b66d3d1385f202fe7f
2023-03-25 00:39:06 +00:00
Tim Starling
5e30a927bc tests: Make some PHPUnit data providers static
Just methods where adding "static" to the declaration was enough, I
didn't do anything with providers that used $this.

Initially by search and replace. There were many mistakes which I
found mostly by running the PHPStorm inspection which searches for
$this usage in a static method. Later I used the PHPStorm "make static"
action which avoids the more obvious mistakes.

Bug: T332865
Change-Id: I47ed6692945607dfa5c139d42edbd934fa4f3a36
2023-03-24 02:53:57 +00:00
jenkins-bot
5434c71393 Merge "Use Bcp47Code when interfacing with Parsoid"
2023-03-13 19:11:03 +00:00
C. Scott Ananian
5ad8dea80a Use Bcp47Code when interfacing with Parsoid
It is very easy for developers and maintainers to mix up "internal
MediaWiki language codes" and "BCP-47 language codes"; the latter are
standards-compliant and used in web protocols like HTTP, HTML, and
SVG; but much of WMF production is very dependent on historical codes
used by MediaWiki which in some cases predate the IANA standardized
name for the language in question.

Phan and other static checking tools aren't much help distinguishing
BCP-47 from internal codes when both are represented with the PHP
string type, so the wikimedia/bcp-47-code package introduced a very
lightweight wrapper type in order to uniquely identify BCP-47 codes.
Language implements Bcp47Code, and LanguageFactory::getLanguage() is
an easy way to convert (or downcast) between Bcp47Code and Language
objects.

This patch updates the Parsoid integration code and the associated
REST handlers to use Bcp47Code in APIs so that the standalone Parsoid
library does not need to know anything about MediaWiki-internal codes.
The principle has been, first, to try to convert a string to a
Bcp47Code as soon as possible and as close to the original input as
possible, so it is easy to see *why* a given string is a BCP-47 code
(usually, because it is coming from HTTP/HTML/etc) and we're not stuck
deep inside some method trying to figure out where a string we're
given is coming from and therefore what sort of string code it might
be.  Second, we've added explicit compatibility code to accept
MediaWiki internal codes and convert them to Bcp47Code for backward
compatibility with existing clients, using the @internal
LanguageCode::normalizeNonstandardCodeAndWarn() method.  The intention
is to gradually remove these backward compatibility thunks and replace
them with HTTP 400 errors or wfDeprecated messages in order to
identify and repair callers who are incorrectly using
non-standard-compliant language codes in web standards
(HTTP/HTML/SVG/etc).

Finally, maintaining a code as a Bcp47Code and not immediately
converting to Language helps us delay or even avoid full loading of a
Language object in some cases, which is another reason to occasionally
push Bcp47Code (instead of Language) down the call stack.
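
The wrapper-type idea can be sketched in a few lines. This is a Python
illustration, not the wikimedia/bcp-47-code API; the function names are
hypothetical and the mapping table is a small illustrative subset, not an
authoritative list.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Bcp47Code:
    """Thin wrapper marking a string as a BCP-47 code."""
    code: str


# Illustrative subset of internal-to-BCP-47 mappings (assumed for the example).
NONSTANDARD_TO_BCP47 = {"zh-classical": "lzh", "simple": "en-simple"}


def from_client_input(raw: str) -> Bcp47Code:
    """Convert at the boundary, accepting internal codes for compatibility."""
    return Bcp47Code(NONSTANDARD_TO_BCP47.get(raw.lower(), raw))


def set_content_language(code: Bcp47Code) -> str:
    """Reject bare strings so the two kinds of code cannot be mixed up."""
    if not isinstance(code, Bcp47Code):
        raise TypeError("expected Bcp47Code, got " + type(code).__name__)
    return code.code
```

Converting once, close to the input, means every later function can demand
the wrapper type instead of guessing what kind of string it was handed.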

Bug: T327379
Depends-On: I830867d58f8962d6a57be16ce3735e8384f9ac1c
Change-Id: I982e0df706a633b05dcc02b5220b737c19adc401
2023-03-13 13:25:09 -04:00
C. Scott Ananian
bce63d1912 Preserve non-PageBundle metadata set by Parsoid
The Parsoid entrypoints should always have a "real" ParserOutput
passed as the ContentMetadataCollector object, so that recursive
invocations of extensions, etc, can set appropriate metadata
properties in the ParserOutput.

This is part of a belt-and-suspenders fix for T331084, where a
StubMetadataCollector is being used in production -- production should
never use a stub, it should always use a real ParserOutput object.
The other fix for T331084 is
I30ea2bb24e6c9b0950a8f46dc8e5b9bf5ee3378b, which ensures that if you
*were* to use a StubMetadataCollector in production, it wouldn't throw
an error when a numeric category string was encountered.

Bug: T331084
Change-Id: I8711a51fc1bcac48eae92ab1ba15a33fe05937ed
2023-03-13 11:24:57 -04:00
Derick Alangi
1afd52e3e4 REST: Move Helper classes to their own namespace
Mixing Handlers with Helpers doesn't look nice for consistency
reasons. Helpers should be in their own place (grouped) in the
Handlers directory as they're really "helpers for the handlers".

Change-Id: Ieeb7a0a706a4cb38778f312bfbfe781a1f366d14
2023-01-16 21:16:09 +01:00
Derick Alangi
d51522dfd3 ParsoidOutputAccess: Mark dummy parser output as non-cacheable
Bug: T311728
Change-Id: I55f153d21f91ef93fe5c788fd054fa481fc2ab10
2023-01-10 12:43:30 +01:00
Derick Alangi
cdd49f7536 ParsoidOutputAccess: Completely handle unsupported content models
Pages with content models not supported by Parsoid already had dummy
parser output written to the parser cache, but we still got an internal
server error because the output didn't have a render key.

This patch fixes the issue: when we try to render a page with an
unsupported content model, we get the dummy parser output.

Bug: T311728
Change-Id: I49ebbfc0475fb296f2a906ce7dce237641fb375b
2023-01-10 08:38:51 +00:00
daniel
e1c3af9177 ParsoidOutputAccess should support all models that serialize to wikitext.
The motivation is to restore parsoid support for the content models
defined in the Proofread extension.

Bug: T246403
Change-Id: I33d269e42fede28139f7c923504326a77d11ee13
2022-12-16 12:20:10 +01:00
daniel
5559d2d471 Parsoid: Enable lint data and parser cache together
The previous attempt at restoring the call to the ParserLogLinterData hook
had the undesirable effect of bypassing the parser cache. This change
optionally enables the call to the lint data hook without disabling
the parser cache.

This patch is working under the assumption that we only need to log lint
data for canonical parses.

Follow-Up to I1f69498ef75

Change-Id: I39ab54ca6e9f9a6a58b59cdd6feea07638fc908f
2022-12-12 22:57:07 +01:00
daniel
75d6892134 ParsoidOutputAccess: generate dummy output for unsupported models.
While RESTbase insists on getting Parsoid renderings for any content
model, don't waste CPU cycles trying to render garbage. Just output
dummy content. Nobody should ever see it.

Bug: T324711
Change-Id: I407171a5f515b594603b287a7a9e49f54e837161
2022-12-12 19:57:31 +00:00
daniel
9842811e75 ParsoidOutputAccess: only cache output for wikitext
ChangeProp is currently requesting a parsoid parse for all page updates,
regardless of content model. Parsoid renderings of non-wikitext content are
unusable, so we shouldn't bother the parser cache with them. This is
especially true for wikidata items.

Bug: T324711
Change-Id: I6f6325f2b8581dfcc9a8bcd97281ccf4caa7e8f1
2022-12-08 18:29:24 +00:00
jenkins-bot
1b3d4d17d4 Merge "Throw a 400 when asking parsoid to render an unknown content model."
2022-12-07 22:44:45 +00:00
daniel
a79b73e722 Make parsoid accept all content models.
This allows parsoid to render anything, even if the output is garbage.
This is a quick fix pending the real solution (T311728).

Bug: T324711
Change-Id: If4e4eb8582ab8a62f592394820b30c1b28fb1216
2022-12-07 22:23:54 +00:00
daniel
d34e73b0c1 Throw a 400 when asking parsoid to render an unknown content model.
Bug: T324711
Change-Id: I0a78e5c57e2b8449b393bccc86148aee4ad87bc8
2022-12-07 23:09:58 +01:00
daniel
c6a0d433ec HtmlOutputRendererHelper: allow parser cache to be disabled.
This is needed so we can ramp up parser cache writes in a controlled
manner.

Bug: T322672
Change-Id: I7d97c9e2d4009029dc64f9c0a369f68098185520
2022-11-28 09:43:12 +00:00
daniel
860f8ebee8 Make HtmlOutputRendererHelper more flexible
This adds setters to HtmlOutputRendererHelper which allow it to be used
more conveniently in different contexts. This is aimed specifically at
making it easier for DirectParsoidClient in the VisualEditor extension
to re-use this code.

NOTE: HtmlOutputRendererHelper is declared @unstable, but the changes in
this patch need to be backwards compatible at least temporarily, to
allow the VisualEditor extension to be updated in a follow-up.

Change-Id: I18c8bc6f5aa7c204f0faa56919bfe64026761bd4
2022-10-17 10:53:05 +00:00
daniel
994e50d24f Fix passing the wikiId into ParsoidOutputAccess.
It's not clear if Parsoid still needs this, but let's err on the side of
caution.

Change-Id: I7cef2827da23af3c3466cb855de5f42e05375515
2022-10-07 17:50:38 +02:00
Derick Alangi
ab7849ed47 ParsoidOutputAccess: Add support for fragment flavor
This is needed by VE when performing Wikitext -> HTML transformation
during editing.

Also, this patch introduces the new flavor: fragment, that is passed in
via $envOptions to activate VisualEditor's body only mode functionality.

NOTE: This patch also fixes a PHPUnit test that broke by correctly
injecting the appropriate parsoid instance for checking error handling.

Bug: T308743
Change-Id: I838a3b05d7d8523a469236cf112158349063283c
2022-10-06 20:41:48 +01:00
daniel
a02be0b3f8 HtmlInputTransformHelper: Fall back to ParserCache
If a render ID is given via the use-cache parameter, but the key is not
found in the parsoid stash, look at the most recent known rendering of
the revision, and use it if it matches the render ID.

This patch moves the responsibility for looking up RevisionRecords and
PageRecords into ParsoidOutputAccess. This way, callers only need to
have a PageIdentity, and optionally a revision ID.
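
The lookup order described above can be sketched as follows. This is a
simplified Python model with plain dictionaries standing in for the stash
and the parser cache; the function name and data shapes are assumptions,
not the HtmlInputTransformHelper code.

```python
def find_original_rendering(render_id, stash, parser_cache, revision_id):
    """Return the rendering for render_id, or None if it cannot be found."""
    # First choice: the stash, keyed by render ID.
    rendering = stash.get(render_id)
    if rendering is not None:
        return rendering
    # Fallback: the most recent known rendering of the revision,
    # accepted only if its render ID matches the requested one.
    cached = parser_cache.get(revision_id)
    if cached is not None and cached["render_id"] == render_id:
        return cached
    return None
```

The render ID check on the fallback path matters: a newer rendering of the
same revision is not a valid base for the edit being transformed.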

Bug: T318395
Change-Id: I1aa5b0fd9fb1acaa2544d5a58125fa3810a0eb39
2022-09-30 15:56:23 +00:00
daniel
f31cd9f1d3 REST: HtmlInputTransformHelper: Load original data from stash
Parsoid needs the original rendering in order to apply
selective serialization (selser). The page/{title}/html endpoint
can stash the rendering, and now the transform endpoint can make use
of the stashed rendering.

Bug: T310464
Change-Id: Ia58043ed3aa1eb12731d82aa87606c82ec63f663
2022-09-29 19:52:27 +02:00
jenkins-bot
e0e430c049 Merge "Add PageBundleParserOutputConverter"
2022-09-26 13:16:23 +00:00
msantos
d3a86cfc6f Fix parse() and getParserOutput() interfaces
In Ie87f823e721ed5ae9d49cf7ead8e77cbef254cd7, we changed the signature
of `parse()` to accept a PageIdentity instead of PageRecord and it broke
some tests in other places, specifically: HtmlOutputRendererHelperTest,
so this patch fixes the interfaces.

Change-Id: I35685412c52f7d4ae9e63960695e686fb2bb9b21
2022-09-26 11:40:19 +01:00
Abijeet
7400456b1a Add PageBundleParserOutputConverter
Move code to create ParserOutput from PageBundle and vice versa to a
separate final class. A final class was used instead of a trait
because traits do not support constants in PHP versions before 8.2.

The plan is to use this final class in various interfaces in order
to avoid exposing them to Parsoid concepts.

Bug: T317019
Change-Id: I33076c359ee45719c1c4ef63f77c1f1285951d0c
2022-09-26 15:11:47 +05:30
msantos
f29803e2d9 Support access to outputs of non-existent pages
* Introduce a method in ParsoidOutputAccess that parses and returns
  a parse output directly without caring about cache.

* Parse a non-existent page with the new method when the page object
  is not a PageRecord, but a PageIdentity

Change-Id: Ie87f823e721ed5ae9d49cf7ead8e77cbef254cd7
2022-08-31 20:52:41 +01:00
daniel
2ba27ab06e Protect against passing unsupported content models to Parsoid.
Parsoid currently only supports wikitext (and JSON), so don't give it anything else.

NOTE: ParsoidOutputAccess will fail on content that is unsupported by parsoid.
This will however not affect the /transform and /page endpoints in the
parsoid extension, since they use the ParsoidHandler base class, which doesn't
rely on ParsoidOutputAccess.

Bug: T301371
Change-Id: I6bc9b978947b31455a4bce6385b7bdf64ed4043c
2022-06-30 14:54:42 +00:00
daniel
8ce08c0cbc Move knowledge about HTTP status out of ParsoidOutputAccess
This removes a cyclic dependency:
ParsoidHTMLHelper in the REST component uses ParsoidOutputAccess in the
parser component. So ParsoidOutputAccess cannot use LocalizedHttpException
from the REST component.

This also improves separation of concerns: the parsing component should
not be concerned with HTTP status codes.
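
The resulting layering can be sketched as follows. This is a Python
illustration of the pattern only; the class and function names are
hypothetical, and the 400 status is chosen for the example.

```python
class UnsupportedContentModel(Exception):
    """Raised by the parsing layer; carries no HTTP knowledge."""


def parse(content_model: str, text: str) -> str:
    """Parsing layer: reports failures as domain exceptions only."""
    if content_model != "wikitext":
        raise UnsupportedContentModel(content_model)
    return "<p>" + text + "</p>"


def rest_handler(content_model: str, text: str):
    """REST layer: the only place that maps failures to HTTP status codes."""
    try:
        return 200, parse(content_model, text)
    except UnsupportedContentModel as e:
        return 400, f"unsupported content model: {e}"
```

The dependency now points one way: the REST layer imports the parsing
layer's exception, and the parsing layer imports nothing from REST.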

Bug: T301371
Change-Id: I2e661fe3ce0824dbfd7579650972f9019c92ed59
2022-06-28 12:30:44 +02:00
daniel
1271faa381 Move access to the page bundle into ParsoidOutputAccess
This isolates ParsoidHTMLHelper from the internals of
ParsoidOutputAccess. The corresponding test cases were changed to use a
mock ParsoidOutputAccess, and to not test the behavior of
ParsoidOutputAccess.

Bug: T301371
Change-Id: Id693fae2264f15e5d35f28acc5adc4239b2ae24f
2022-06-28 11:49:36 +02:00
Derick Alangi
1854fb02d9 Storage: Warm parsoid parser cache with parsoid outputs
This patch introduces a ParsoidOutputAccess service for
getting parsoid outputs and warms the cache with pregenerated
outputs.

It also introduces a config variable in ParsoidCacheConfig that
is turned off by default for controlling the cache warming.

Bug: T301371
Change-Id: I6152c42ea765d94093d8d62598b1b4278314adec
2022-06-28 09:05:41 +00:00