27 commits

Author SHA1 Message Date
C. Scott Ananian
5735f94648 Add Parsoid HTML version to wrapper div
Followup-To: I941d31479eebb12ea1f4dcdb0a1737033ddc8ac1
Depends-On: I95be56e3662f9cffd1eb5c03bbc0379d4e0a9ee0
Change-Id: I4aaa4b9e800271c2bcfc2fd74f09853b31ee6859
2024-05-06 15:56:02 -04:00
C. Scott Ananian
242c6d2cf9 Introduce ParserOutput::setFromParserOptions() and use for preview flag
Bug: T341010
Co-Authored-by: cananian <cananian@wikimedia.org>
Co-Authored-by: ihurbain <ihurbainpalatin@wikimedia.org>
Change-Id: I03125fdaa7dd71ba57d593e85ecb98be6806f3f6
2024-02-07 21:22:06 -05:00
C. Scott Ananian
1858e1cdd7 Rename ParserOutput::{get,set}Timestamp() to ::{get,set}RevisionTimestamp()
This avoids confusion with the "render timestamp" held by the cache,
and is consistent with ::get*RevisionId() etc.

The old ::getTimestamp() and ::setTimestamp() methods have been
deprecated.

Change-Id: Idb5e687709c98086c5d3075d31885c58a0723197
2024-02-07 21:22:06 -05:00
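A rename like the one in 1858e1cdd7 typically keeps the old accessors as thin wrappers that warn and delegate. The Python sketch below only illustrates that pattern; the class `RevisionInfo` and its snake_case methods are hypothetical stand-ins, not MediaWiki's PHP code.

```python
import warnings


class RevisionInfo:
    """Hypothetical stand-in for an object carrying a revision timestamp."""

    def __init__(self):
        self._revision_timestamp = None

    def set_revision_timestamp(self, timestamp):
        self._revision_timestamp = timestamp

    def get_revision_timestamp(self):
        return self._revision_timestamp

    # The old names survive as wrappers that warn and delegate to the new ones.
    def set_timestamp(self, timestamp):
        warnings.warn(
            "set_timestamp() is deprecated; use set_revision_timestamp()",
            DeprecationWarning,
            stacklevel=2,
        )
        self.set_revision_timestamp(timestamp)

    def get_timestamp(self):
        warnings.warn(
            "get_timestamp() is deprecated; use get_revision_timestamp()",
            DeprecationWarning,
            stacklevel=2,
        )
        return self.get_revision_timestamp()
```

Existing callers keep working while the warning points them at the new name.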
C. Scott Ananian
0de13d7662 Add ParserOutput::{get,set}RenderId() and set render id in ContentRenderer
Set the render ID for each parse stored into cache so that we are able
to identify a specific parse when there are dependencies (for example
in an edit based on that parse).  This is recorded as a property added
to the ParserOutput, not the parent CacheTime interface.  Even though
the render ID is /related/ to the CacheTime interface, CacheTime is
also used directly as a parser cache key, and the UUID should not be
part of the lookup key.

In general we are trying to move the location where these cache
properties are set as early as possible, so we check at each location
to ensure we don't overwrite a previously-set value.  Eventually we
can convert most of these checks into assertions that the cache
properties have already been set (T350538).  The primary location for
setting cache properties is the ContentRenderer.

Moved setting the revision timestamp into ContentRenderer as well, as
it was set along the same code paths.  An extra parameter was added to
ContentRenderer::getParserOutput() to support this.

Added merge code to ParserOutput::mergeInternalMetaDataFrom() which
should ensure that cache time, revision, timestamp, and render id are
all set properly when multiple slots are combined together in MCR.

In order to ensure the render ID is set on all codepaths we needed to
plumb the GlobalIdGenerator service into ContentRenderer, ParserCache,
ParserCacheFactory, and RevisionOutputCache.  Eventually (T350538) it
should only be necessary in the ContentRenderer.

Bug: T350538
Bug: T349868
Followup-To: Ic9b7cc0fcf365e772b7d080d76a065e3fd585f80
Change-Id: I72c5e6f86b7f081ab5ce7a56f5365d2f75067a78
2024-02-07 21:22:06 -05:00
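Two ideas from 0de13d7662 can be sketched together: a render ID is assigned only when no earlier stage has set one, and it stays out of the cache lookup key. This is a minimal Python illustration under assumptions; `RenderedOutput`, `ensure_render_id`, the key format, and the use of `uuid.uuid4()` in place of the GlobalIdGenerator service are all hypothetical.

```python
import uuid
from dataclasses import dataclass
from typing import Optional


@dataclass
class RenderedOutput:
    """Hypothetical stand-in for a parse result with cache properties."""

    page_id: int
    revision_id: int
    render_id: Optional[str] = None

    def cache_key(self) -> str:
        # The render ID is deliberately left out of the lookup key.
        return f"page:{self.page_id}:rev:{self.revision_id}"


def ensure_render_id(output: RenderedOutput) -> RenderedOutput:
    """Assign a fresh render ID only if an earlier stage has not set one."""
    if output.render_id is None:
        output.render_id = str(uuid.uuid4())
    return output
```

Two parses of the same revision then share a lookup key but carry distinct render IDs, and a second call never overwrites the first ID.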
jenkins-bot
368d3f22a4 Merge "Don't use Parsoid\Config\PageConfig::getTitle()"
2024-01-27 00:54:03 +00:00
C. Scott Ananian
5d1c43cdb4 Don't use Parsoid\Config\PageConfig::getTitle()
This has been replaced by ::getLinkTarget(), which returns a Parsoid
LinkTarget.  This is identical to the core LinkTarget interface, but
we can't quite alias them for technical reasons (sigh).  In actual
practice, LinkTargets generated by core are usually Title objects, so
Title::newFromLinkTarget() is a no-op that just returns the argument
after a type check.

It appears that newer code uses a TitleFormatter rather than calling
methods on Title, but TitleFormatter currently takes LinkTarget not a
ParsoidLinkTarget.  That would force us to go via
TitleValue::newFromLinkTarget() which isn't a simple type check.

Change-Id: I490bb38108d0202b43ea2a9b391b2e664e7d2d48
2024-01-26 19:29:14 -05:00
jenkins-bot
58ca755ce5 Merge "[ParsoidParser] Move parsoid skinning module from Article"
2024-01-02 17:32:46 +00:00
C. Scott Ananian
e6a9f1ab26 [ParsoidParser] Move parsoid skinning module from Article
This relocates the code added in 95d3c025b0.

Also: this is just a small bit of extra CSS, so it can be a ModuleStyle
not a full Module.

Bug: T335157
Depends-On: I9320e3083d2e71db42fb1348dcd3bea01d22cc5c
Change-Id: Iadedba5b41190ea4665f28db61f9565d914774b3
2024-01-02 17:10:59 +00:00
James D. Forrester
9bfb75ff90 Namespace ParserOutput
Most used non-namespaced class!

Bug: T353458
Change-Id: I4c2cbb0a808b3881a4d6ca489eee5d8c8ebf26cf
2023-12-14 14:57:34 -05:00
daniel
e3fb964439 Only cache expensive renderings
Pages that are fast to render can be omitted from the parser cache
to preserve disk space and cache write operations.

The threshold is configurable per namespace, so the tradeoff can
be evaluated based on different access patterns. For example, pages
that are accessed rarely, like file description pages on Commons,
may have a high threshold configured, while pages that are read
frequently, like Wikipedia articles, may be configured to be always
cached, using a 0 threshold.

Filtering is based on a time profile recorded in the ParserOutput.
A generic mechanism for capturing the timing profile is implemented
in the ContentHandler base class. Subclasses may implement a more
rigorous capture mechanism.

Bug: T346765
Change-Id: I38a6f3ef064f98f3ad6a7c60856b0248a94fe9ac
2023-11-30 20:56:12 +00:00
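The per-namespace threshold described in e3fb964439 amounts to a simple comparison. The Python sketch below is an illustration, not the MediaWiki implementation; the function name `should_cache`, the mapping shape, and the threshold values are assumptions.

```python
from typing import Mapping


def should_cache(
    render_seconds: float,
    namespace: int,
    thresholds: Mapping[int, float],
    default_threshold: float = 0.0,
) -> bool:
    """Return True when a rendering was slow enough to be worth caching."""
    threshold = thresholds.get(namespace, default_threshold)
    # A threshold of 0 means "always cache", since no render time is below it.
    return render_seconds >= threshold


# Hypothetical configuration: namespace 0 (articles) is always cached,
# namespace 6 (file description pages) only when rendering took >= 1 second.
EXAMPLE_THRESHOLDS = {0: 0.0, 6: 1.0}
```

With this configuration, a file description page that renders in 0.2 seconds is skipped, while any article is cached regardless of render time.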
C. Scott Ananian
0e1b889a0f [parsoid] Fix Parsoid relative links
Bug: T350952
Change-Id: I60165a9946a35cfb42a78ed2f833c34570fefffc
2023-11-16 16:28:55 -05:00
C. Scott Ananian
5bdc436c2c Fix Parsoid <base href>
Bug: T350952
Change-Id: I9323e7048c3f6887157f5e1ce7b7e6e80d43abde
2023-11-10 17:06:33 +00:00
Subramanya Sastry
6e5413b1d8 ParsoidParser: Record page title in ParserCache entries
* This lets post-cache transforms have access to the title.
* Specifically, DiscussionTools uses this to post-process the HTML.

Bug: T341010
Change-Id: I328f533e6cdb11c0c3a873d23bab1a113dfa39be
2023-10-30 13:36:36 -05:00
Subramanya Sastry
17b0ebd3ac includes/parser/Parsoid/*: Use typed class properties
* I had already used this on one property of one file here
  and noticed that Isabelle used this on a newly created
  class in output transform and that prompted me to switch
  over all these files.

* I am about to start adding new files here for new hooks for
  DiscussionTools and updated everything in this namespace
  to keep usage consistent.

* This exposed initialization and bad typing issues in
  SiteConfig.php and LanguageVariantConverter.php

Change-Id: I35f131a8f584ccc82a915dbfb1b50b3ef1ec6b06
2023-10-23 17:37:14 -05:00
Subramanya Sastry
225be51fa7 ParsoidParser: Register watcher after creating ParserOutput object
* Updated documentation around this point
* Adjust tests to reflect this change.
* While it initially appeared that this can cause ParserCache impacts,
  'disableContentConversion' isn't part of the cache key and thus
  has no deployment impacts.

Change-Id: I535cb21cc104a358aa70829b030ae3751b76ae00
2023-10-17 17:51:19 -05:00
Subramanya Sastry
b1c3914a21 Fix typos in comments found during code reading
Change-Id: Id8bd3ed449f8fb50107b40a9d813abe353aca161
2023-10-16 21:49:33 -05:00
Subramanya Sastry
c8d0470f4b Make ParsoidOutputAccess a wrapper over ParserOutputAccess
* Updated ParserOutput to set Parsoid render ids that REST API
  functionality expects in ParserOutput objects.
* CacheThresholdTime functionality no longer exists since it was
  implemented in ParsoidOutputAccess and ParserOutputAccess doesn't
  support it. This is tracked in T346765.
* Enforce the constraint that uncacheable parses are only for fake or
  mutable revisions. Updated tests that violated this constraint to
  use 'getParseOutput' instead of calling the parse method directly.
* Had to make some changes in ParsoidParser around use of preferredVariant
  passed to Parsoid. I also left some TODO comments for future fixes.
  T267067 is also relevant here.

PARSOID-SPECIFIC OPTIONS:
* logLinterData: linter data is always logged by default -- removed
  support to disable it. Linter extension handles stale lints properly
  and it is better to let it handle it rather than add special cases
  to the API.
* offsetType: Moved this support to ParsoidHandler as a post-processing
  of byte-offset output. This eliminates the need to support this
  Parsoid-specific option in the ContentHandler hierarchies.
* body_only / wrapSections: Handled this in HtmlOutputRendererHelper
  as a post-processing of regular output by removing sections and
  returning the body content only. This does result in some useless
  section-wrapping work with Parsoid, but the simplification is probably
  worth it. If in the future, we support Parsoid-specific options in
  the ContentHandler hierarchy, we could re-introduce this. But, in any
  case, this "fragment" flavor option is likely to get moved out of
  core into the VisualEditor extension code.

DEPLOYMENT:
* This patch changes the cache key by setting the useParsoid option
  in ParserOptions. The parent patch handles this to ensure we don't
  encounter a cold cache on deploy.

TESTS:
* Updated tests and mocks to reflect new reality.
* Do we need any new tests?

Bug: T332931
Change-Id: Ic9b7cc0fcf365e772b7d080d76a065e3fd585f80
2023-10-13 15:03:03 -05:00
Subramanya Sastry
77423d5ee0 ParsoidParser: Inject Parsoid into constructor
This makes it possible to more easily use a mock Parsoid object
in testing.

Change-Id: I7cfb2fe5975c91cc38d5d488224495ce405673c6
2023-09-07 14:37:47 -05:00
Subramanya Sastry
83ea46ff65 Reconcile Parsoid opts in ParsoidOutputAccess & ParserOutputAccess
* Explicitly set wrapSections to true. This has no significant
  impact since it defaults to true within Parsoid.
* 'pageName' and 'prefix' removed from ParsoidOutputAccess since
  they are not needed / used in Parsoid.
* 'logLinterData' needs to be set in the ParserOutputAccess paths.
* A bunch of documentation FIXMEs as I was digging through the code.
* Record a FIXME that ParsoidOutputAccess and ParsoidParser (which
  is used in the ParserOutputAccess use case) differ in how they
  handle the language value (whether the default value of the title /
  page or the pageLanguageOverride from the REST API). ParsoidParser
  computes a preferred variant whereas ParsoidOutputAccess right now
  does NOT do that. So, as part of the switchover to ParserOutputAccess,
  we will need to set disableContentConversion in ParserOptions.

  That will happen in a later patch.

Bug: T332931
Change-Id: I7326ae3452a7d496a57f5c4ff2ddeaf0daa7ab70
2023-08-10 23:40:26 +00:00
C. Scott Ananian
cb371f2d91 Bcp47Code fixes to ParsoidParser and LanguageVariantConverterUnitTest
LanguageVariantConverterUnitTest: don't mock a method in the Parsoid
class that no longer exists.

ParsoidParser: pass a Bcp47Code (in the form of a Language object),
not a string, when selecting the preferred variant for the output

Followup-To: Ib8554f98b1c653df3864110e0e66796b8da67b5f
Change-Id: I32fd64a9495b8aed729b0b5b00535180006e0223
2023-08-07 17:31:04 -04:00
C. Scott Ananian
7cb30eceb3 Remove Parsoid back-compat code
Now that the latest Parsoid has been released to mediawiki-vendor,
the method_exists() calls aren't necessary.

Bug: T343155
Followup-To: I9da2566cc003e2f05cae16229444dcf3baf61fa4
Change-Id: I081225a268d608f763814245f9cab1c44bf49bad
2023-07-31 18:07:51 -04:00
Umherirrender
511842f9f9 parser: Remove phan-suppression after parsoid 0.18.0-a20 update
The method_exists() calls are kept, since it is unclear whether old
objects are still in any cache.

Follow-Up: I9da2566cc003e2f05cae16229444dcf3baf61fa4
Bug: T343155
Change-Id: I0aaa3dce26df1619bedc39696a115145a61d4d14
2023-07-31 22:01:08 +02:00
C. Scott Ananian
0b92c4bedb Record Parsoid version in extension data to allow rollback if necessary
This allows any bad cached parses due to a train deploy to be selectively
rolled back in the RejectParserCacheValue hook, which provides some
operational insurance against corrupted caches.  The version is also
added to the debug information in the HTML footer to aid diagnosis
of any issue in real time.

Depends-On: I3d3caabd959c1ba16f4dc702c2eae38d5d4dcb14
Change-Id: Ibb37a82ec0ce764aefd8c9fab2868073a66301ec
2023-07-27 19:02:24 -04:00
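The rollback idea in 0b92c4bedb can be pictured as a predicate that a cache-rejection hook might apply to each cached entry. This Python sketch is illustrative only; the function `reject_cached_value`, the key name `parsoid-version`, and the version strings are hypothetical, and the real check would live in a RejectParserCacheValue hook handler.

```python
from typing import Mapping, Set


def reject_cached_value(
    extension_data: Mapping[str, str],
    bad_versions: Set[str],
    version_key: str = "parsoid-version",
) -> bool:
    """Return True if a cached parse was produced by a known-bad version."""
    recorded = extension_data.get(version_key)
    # Entries without a recorded version are kept in this sketch.
    return recorded is not None and recorded in bad_versions


# Hypothetical data: one entry from a bad release, one from a good one.
BAD_VERSIONS = {"9.9.9-bad"}
stale_entry = {"parsoid-version": "9.9.9-bad"}
good_entry = {"parsoid-version": "9.9.8"}
```

Only entries tagged with a listed version are rejected and re-rendered; everything else stays cached.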
Subramanya Sastry
68805e2f50 ParsoidParser: Record ParserOptions watcher on ParserOutput object
* ParsoidParser hadn't registered a watcher on ParserOptions so far.
  Because of this, you can see that the current parser cache key
  (in deployed production code) doesn't have 'useParsoid=1' in it.

  Ex: View source on enwiki:Hospet shows that the parser cache key
  there is "enwiki:parsoid-pcache:idhash:2360619-0!canonical".

  The only reason this doesn't conflict with legacy parser output
  is because we use "parsoid-pcache", a different cache instance than
  "pcache" used for legacy parser output. But if/when we decide to use
  the same parser cache instance, this could cause cache corruptions.

  With FlaggedRevisions, where a single "stable-pcache" parser cache
  instance is used, in local testing, this was causing Parsoid HTML to be
  saved without "useParsoid=1", and so Parsoid HTML was being returned
  for legacy parser cache requests.

* In addition, fix the code in PageBundleParserOutputConverter to copy
  over internal metadata (which includes used options). This ensures
  that any tracked parser options aren't lost and the right parser cache
  key is constructed later on.

* Added / updated a number of new tests that verify that usedOptions
  is tracked correctly in the useParsoid code paths. The tests fail
  without the code changes in this patch.

Bug: T340703
Bug: T335157
Needed-By: I0e954949768044eea6ec275a36d0d6d7ed457e8e
Change-Id: I076d5d362bdfd9d4b2ca8886bf6b30c1a746aee7
2023-07-11 10:53:11 -05:00
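The failure mode described in 68805e2f50 can be reproduced in miniature: if no watcher is registered, option reads are not recorded, and the cache key omits an option that actually shaped the output. The Python sketch below is a simplified model under assumptions; `TrackedOptions`, `ParseResult`, `render`, and the key format are hypothetical and do not mirror the real ParserOptions or ParserCache code.

```python
class TrackedOptions:
    """Hypothetical options bag that reports each read to a watcher."""

    def __init__(self, values):
        self._values = dict(values)
        self._watcher = None

    def register_watcher(self, callback):
        self._watcher = callback

    def get(self, name):
        if self._watcher is not None:
            self._watcher(name)  # record that this option influenced output
        return self._values[name]

    def peek(self, name):
        """Read a value without notifying the watcher."""
        return self._values[name]


class ParseResult:
    """Hypothetical output object that accumulates the options it used."""

    def __init__(self):
        self.html = ""
        self.used_options = set()

    def record_option(self, name):
        self.used_options.add(name)


def render(options, watch):
    result = ParseResult()
    if watch:
        options.register_watcher(result.record_option)
    flavor = "parsoid" if options.get("useParsoid") else "legacy"
    result.html = f"<p>{flavor}</p>"
    return result


def cache_key(page_id, options, used_options):
    # Only options recorded as "used" contribute to the key.
    parts = [f"{name}={options.peek(name)}" for name in sorted(used_options)]
    return f"idhash:{page_id}!" + ("!".join(parts) if parts else "canonical")
```

Calling `render(options, watch=False)` leaves `used_options` empty, so the key falls back to the bare `canonical` form even though `useParsoid` determined the HTML; with `watch=True` the option appears in the key.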
Subramanya Sastry
ec0499d7a1 ParsoidParser: set wrapper div class to ensure wrapper is added
Change-Id: I81f9dee209631d7b5744b6249e4362e96c82058a
2023-06-20 12:54:46 -05:00
Bartosz Dziewoński
6ba47296d9 Fix Phan suppressions related to Title::castFrom*() and friends
There is no way to express that Title::castFromPageIdentity(),
Title::castFromPageReference() and Title::castFromLinkTarget()
can only return null when the parameter is null. We need to add
Phan suppressions or explicit types almost everywhere that these
methods are used with parameters that are known to not be null.

Instead, introduce new methods Title::newFromPageIdentity() and
Title::newFromPageReference() (Title::newFromLinkTarget() already
exists), without the null-coalescing behavior, and use them when
the parameter is not null. This lets static analysis tools, and
humans, easily understand where nulls can't appear.

Do the same with the corresponding TitleFactory methods.

Change the obvious uses of castFrom*() to newFrom*() (if there is
a Phan suppression, a type check, or a method call on the result).

Change-Id: Ida4da75953cf3bca372a40dc88022443109ca0cb
2023-04-22 16:45:09 +02:00
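The distinction drawn in 6ba47296d9 between null-tolerant castFrom*() and strict newFrom*() methods maps onto optional versus non-optional return types. The Python sketch below illustrates the idea with type hints; the `Title` class here is a hypothetical stand-in that wraps a plain string and is not MediaWiki's Title.

```python
from typing import Optional


class Title:
    """Hypothetical stand-in wrapping a plain string target."""

    def __init__(self, text: str):
        self.text = text

    @classmethod
    def cast_from_link_target(cls, target: Optional[str]) -> Optional["Title"]:
        """Nullable in, nullable out: every caller must handle None."""
        return None if target is None else cls(target)

    @classmethod
    def new_from_link_target(cls, target: str) -> "Title":
        """Non-null in, non-null out: no suppression or extra check needed."""
        if target is None:
            raise TypeError("target must not be None")
        return cls(target)


# A caller that knows the target is present can use the result directly.
title_text = Title.new_from_link_target("Main Page").text
```

A static checker sees that `new_from_link_target()` never returns `None`, so the attribute access on its result needs no guard.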
C. Scott Ananian
cfd9c516e1 Allow setting a ParserOption to generate Parsoid HTML
This is an initial quick-and-dirty implementation.  The
ParsoidParser class will eventually inherit from \Parser,
but this is an initial placeholder to unblock other Parsoid
read views work.

Currently Parsoid does not fully implement all the ParserOutput
metadata set by the legacy parser, but we're working on it.

This patch also addresses T300325 by ensuring the Page HTML
APIs use ParserOutput::getRawText(), which will return the entire
Parsoid HTML document without post-processing.  This is what
the Parsoid team refers to as "edit mode" HTML. The
ParserOutput::getText() method returns only the <body> contents
of the HTML, and applies several transformations, including
inserting Table of Contents and style deduplication; this is
the "read views" flavor of the Parsoid HTML.

We need to be careful of the interaction of the `useParsoid` flag with
the ParserCacheMetadata.  Effectively `useParsoid` should *always* be
marked as "used" or else the ParserCache will assume its value doesn't
matter and will serve legacy content for Parsoid requests and
vice-versa.  T330677 is a follow up to address this more thoroughly by
splitting the parser cache in ParserOutputAccess; the stop gap in this
patch is fragile and, because it doesn't fork the ParserCacheMetadata
cache, may corrupt the ParserCacheMetadata in the case when Parsoid
and the legacy parser consult different sets of options to render a
page.

Bug: T300191
Bug: T330677
Bug: T300325
Change-Id: Ica09a4284c00d7917f8b6249e946232b2fb38011
2023-03-26 21:46:05 -04:00