71 commits

Author SHA1 Message Date
Arlo Breault
7e588012e3 Remove error_log
Follows-Up: I647ed253691970bbf39321a3cd652ea14bc11278
Change-Id: Ic404c6f0eb5f0e07ace53d4157f87877f9cbfe22
2023-06-13 14:21:11 -04:00
Arlo Breault
9825ee1bcb Check early for a badrevision exception when Parsoid tries to getContent
RevisionRecord::getContent can return null when errors are encountered.
Check for those early in ParsoidHandler::tryToCreatePageConfig so we can
return an appropriate HttpException to clients.

Bug: T336501
Change-Id: I647ed253691970bbf39321a3cd652ea14bc11278
2023-06-05 16:52:24 +00:00
sbailey
5f288b11bc Filter out large-tables category lints from Parsoid REST API
* Temporary solution to stop flooding the LintHint gadget with
  irrelevant large-tables category lints, which reduce the utility
  of LintHint.

* This can be removed once Parsoid has improved handling for
  "hidden" lint categories.

Bug: T337275
Change-Id: Iad03976e546a13e05134f72718895414ffe063c8
2023-05-26 22:19:04 +05:30
jenkins-bot
5434c71393 Merge "Use Bcp47Code when interfacing with Parsoid"
2023-03-13 19:11:03 +00:00
daniel
74d6e57e6a TransformHandler: Load stashed page bundle based on ETag.
Allow clients to use an If-Match header with the
transform/html/to/wikitext endpoint.

This follows up on Ida81a314f015e205f2081c68a82d486145097c92
(reverted and reapplied)
It adds support for stashing in wt2html, enabling it for Parsoid's
page/html endpoint. It also ensures we are only emitting ETags if
stashing is enabled.

This also removes handling for use-stash from ParsoidHandler,
which did nothing.

Bug: T310464
Bug: T331629
Needed-By: I08f1388faaccef6c1d9a393f8011011d30a25ec7
Change-Id: I9d6eaf45d5b4978afc17493720777e77f0e645b2
2023-03-13 18:18:03 +00:00
C. Scott Ananian
5ad8dea80a Use Bcp47Code when interfacing with Parsoid
It is very easy for developers and maintainers to mix up "internal
MediaWiki language codes" and "BCP-47 language codes"; the latter are
standards-compliant and used in web protocols like HTTP, HTML, and
SVG; but much of WMF production is very dependent on historical codes
used by MediaWiki which in some cases predate the IANA standardized
name for the language in question.

Phan and other static checking tools aren't much help distinguishing
BCP-47 from internal codes when both are represented with the PHP
string type, so the wikimedia/bcp-47-code package introduced a very
lightweight wrapper type in order to uniquely identify BCP-47 codes.
Language implements Bcp47Code, and LanguageFactory::getLanguage() is
an easy way to convert (or downcast) between Bcp47Code and Language
objects.
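
The wrapper-type idea can be sketched in Python. This is an illustration only, not the wikimedia/bcp-47-code package: the `Bcp47Tag` class, the helper names, and the small mapping table are hypothetical and were chosen for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Bcp47Tag:
    """Hypothetical wrapper that marks a string as a BCP-47 code."""
    value: str


# Illustrative pairs of internal code -> BCP-47 code (assumed for this
# sketch; not read from MediaWiki's own mapping).
INTERNAL_TO_BCP47 = {
    "zh-classical": "lzh",
    "zh-min-nan": "nan",
    "als": "gsw",
}


def from_internal(code: str) -> Bcp47Tag:
    """Convert an internal code at the boundary, as early as possible."""
    return Bcp47Tag(INTERNAL_TO_BCP47.get(code, code))


def content_language_header(tag: Bcp47Tag) -> str:
    """Accept only the wrapper type, so a bare string is rejected."""
    if not isinstance(tag, Bcp47Tag):
        raise TypeError("expected a Bcp47Tag, got a plain string")
    return f"Content-Language: {tag.value}"
```

Because the header helper refuses plain strings, an internal code cannot reach an HTTP header without first passing through the conversion step.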

This patch updates the Parsoid integration code and the associated
REST handlers to use Bcp47Code in APIs so that the standalone Parsoid
library does not need to know anything about MediaWiki-internal codes.
The principle has been, first, to try to convert a string to a
Bcp47Code as soon as possible and as close to the original input as
possible, so it is easy to see *why* a given string is a BCP-47 code
(usually, because it is coming from HTTP/HTML/etc) and we're not stuck
deep inside some method trying to figure out where a string we're
given is coming from and therefore what sort of string code it might
be.  Second, we've added explicit compatibility code to accept
MediaWiki internal codes and convert them to Bcp47Code for backward
compatibility with existing clients, using the @internal
LanguageCode::normalizeNonstandardCodeAndWarn() method.  The intention
is to gradually remove these backward compatibility thunks and replace
them with HTTP 400 errors or wfDeprecated messages in order to
identify and repair callers who are incorrectly using
non-standard-compliant language codes in web standards
(HTTP/HTML/SVG/etc).

Finally, maintaining a code as a Bcp47Code and not immediately
converting to Language helps us delay or even avoid full loading of a
Language object in some cases, which is another reason to occasionally
push Bcp47Code (instead of Language) down the call stack.

Bug: T327379
Depends-On: I830867d58f8962d6a57be16ce3735e8384f9ac1c
Change-Id: I982e0df706a633b05dcc02b5220b737c19adc401
2023-03-13 13:25:09 -04:00
James D. Forrester
ad06527fb4 Reorg: Namespace the Title class
This is moderately messy.

Process was principally:

* xargs rg --files-with-matches '^use Title;' | grep 'php$' | \
  xargs -P 1 -n 1 sed -i -z 's/use Title;/use MediaWiki\\Title\\Title;/1'
* rg --files-without-match 'MediaWiki\\Title\\Title;' . | grep 'php$' | \
  xargs rg --files-with-matches 'Title\b' | \
  xargs -P 1 -n 1 sed -i -z 's/\nuse /\nuse MediaWiki\\Title\\Title;\nuse /1'
* composer fix

Then manual fix-ups for a few files that don't have any use statements.

Bug: T166010
Follows-Up: Ia5d8cb759dc3bc9e9bbe217d0fb109e2f8c4101a
Change-Id: If8fc9d0d95fc1a114021e282a706fc3e7da3524b
2023-03-02 08:46:53 -05:00
Amir Sarabadani
4bb2886562 Reorg: Migrate WikiMap to WikiMap/ out of includes
And WikiReference

Bug: T321882
Change-Id: I60cf4b9ef02b9d58118caa39172677ddfe03d787
2023-02-27 05:19:46 +01:00
Umherirrender
ee73e6ac1b Remove unused local variable assignment
Dead code found by phan

Change-Id: I9fc404d546a4fb1c61394cb6359eb774fd94383a
2023-02-04 22:16:31 +01:00
Derick Alangi
1afd52e3e4 REST: Move Helper classes to their own namespace
Mixing Handlers with Helpers doesn't look nice for consistency
reasons. Helpers should be in their own place (grouped) in the
Handlers directory as they're really "helpers for the handlers".

Change-Id: Ieeb7a0a706a4cb38778f312bfbfe781a1f366d14
2023-01-16 21:16:09 +01:00
daniel
4f22f967c5 Parsoid: implicitly enable linting in API endpoints
Logging linter data should be enabled automatically by
HtmlOutputRendererHelper.

This change enables linting data for requests coming in via the
v1/page/{title}/html endpoint.

Change-Id: Idafd29784ec712547e36fea88a8c159784b97f2b
2022-12-13 13:35:06 +01:00
Arlo Breault
5ac33424c2 Don't enable logging linter data in the /lint/ endpoint
We only want to log full page parses of the latest revision and that's
already covered by the wt2html endpoint and changeprop should ensure it
always happens.  In the future, we may want to restore this when
exploring ways to avoid the performance cost of doing the linting during
the canonical parse, but that's T325031.

Follow up to I1f69498ef759f7a82ad8ad9002d7212636e92ffe

Bug: T325031
Change-Id: Icefb46b416629c2714a7d4f282cd55cbca271323
2022-12-12 23:54:38 +00:00
Arlo Breault
8e7aa04917 Log linter data while parsing full pages
This regressed in Ic48db1b5fdff1dfd4f2d2643d64252e5fc721e79

Bug: T246403
Change-Id: I1f69498ef759f7a82ad8ad9002d7212636e92ffe
2022-12-12 15:36:02 -05:00
daniel
5f2026c31c ParsoidHandler: test wt2html with old revision
Update the test for wt2html to assert that it works properly with an old
revision.

Bug: T324801
Change-Id: Ia2a7e28cd999712b1bd890eed48d0a5de931700f
2022-12-09 19:33:18 +01:00
Subramanya Sastry
91eceaae3c Followup to 5cb38845: Don't drop revid info
* Also, not sure why the code is marked as accepting PageIdentity
  or PageConfig when it is private and its only user is clearly
  passing a PageConfig object.

* But, I'll leave that for Daniel to resolve since there may be
  a reason he left it there.

* Will also leave adding tests as a followup to Daniel.

Bug: T324801
Change-Id: I24b2e95b479a3019ad65c62d624f980dfc2bf349
2022-12-08 23:15:38 -06:00
daniel
9ff8edfa1e HtmlOutputRendererHelper: fix semantics of getRevisionId
getRevisionId is documented to return 0 for fake revisions, but it was
returning 0 for the current revision as well. This patch makes a clear
distinction, with 0 meaning current (like elsewhere in the code), and
null meaning a fake revision.
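
The three-way distinction described above can be sketched in Python, with `None` standing in for PHP's null. The function name is hypothetical and the code is not taken from HtmlOutputRendererHelper.

```python
from typing import Optional


def describe_revision_id(rev_id: Optional[int]) -> str:
    """Interpret a revision ID under the semantics described above."""
    # Test for None first: a plain truthiness check would treat 0 and
    # None alike, which is the ambiguity the patch removes.
    if rev_id is None:
        return "fake revision"
    if rev_id == 0:
        return "current revision"
    return f"revision {rev_id}"
```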

This patch includes a fix for redirect handling in ParsoidHandler::wt2html.
This fix is needed here because it previously relied on getRevisionId()
to return the actual revision ID; this would fail, since getRevisionId()
will return 0 when the current revision of a page is requested.

Change-Id: I33d1ab54023c6ac96c6bb5e4750b980e381cb464
2022-12-06 23:06:25 +01:00
Daniel Kinzler
f36a28ff21 [Fix] ParsoidHandler: use HtmlOutputRendererHelper in wt2html
Fixes the reason for reverting Ie430acd0753880d88370bb9f22bb40a0f9ded917:

The issue was that with my patch, the transform/wikitext/to/html endpoint
started ignoring the offsetType field in the body. So the offsetType used
in the response (or stashed data) would always be 'byte'.
But the roundtrip-test.js script requests 'ucs2'.

This causes an error when sending the HTML and data-parsoid back to
transform/html/to/wikitext, again with offsetType:'ucs2': the offsetType
embedded in data-parsoid will be byte, and the mismatch causes a 400
to be returned. This broke the roundtrip-test.js script.
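
Why the two offset types cannot be mixed is easiest to see on a string with non-ASCII characters. The Python sketch below is an illustration, not Parsoid code: the function name and sample text are assumptions, and it treats 'byte' as UTF-8 bytes and 'ucs2' as UTF-16 code units.

```python
def offset_in_units(text: str, char_index: int, offset_type: str) -> int:
    """Offset of the character at char_index, in the requested unit."""
    prefix = text[:char_index]
    if offset_type == "byte":
        # Number of UTF-8 bytes before the character.
        return len(prefix.encode("utf-8"))
    if offset_type == "ucs2":
        # Number of UTF-16 code units before the character.
        return len(prefix.encode("utf-16-le")) // 2
    raise ValueError(f"unsupported offset type: {offset_type}")


if __name__ == "__main__":
    sample = "é😀x"
    print(offset_in_units(sample, 2, "byte"))
    print(offset_in_units(sample, 2, "ucs2"))
```

The same position gets a different number under each offset type, so offsets recorded as 'byte' are wrong for a consumer that reads them as 'ucs2'.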

The fix is to no longer ignore the offsetType specified in the request body.

Change-Id: Ief721c23ed9a57d781cfdac625a62113f22f87a5
2022-12-05 18:49:30 +00:00
Daniel Kinzler
5cb388455b [Re-apply] ParsoidHandler: use HtmlOutputRendererHelper in wt2html
This restores change Ie430acd0753880d88370bb9f22bb40a0f9ded917.
This reverts commit ab6baad1a5.

NOTE: Also needs the patch that fixes the original reason for the
revert: Ief721c23ed9a57d781cfdac625a62113f22f87a5

Change-Id: Ic48db1b5fdff1dfd4f2d2643d64252e5fc721e79
2022-12-05 18:43:51 +00:00
Daniel Kinzler
ab6baad1a5 Revert "ParsoidHandler: use HtmlOutputRendererHelper in wt2html"
This reverts commit e82f11c246.

Reason for revert: Breaks parsoid CI

1) Parsoid round-trip e2e testing with MW REST endpoints
     rt-testing e2e:
     AssertionError: expected 1 to equal 0
     + expected - actual
     -1
     +0

     at Context.<anonymous> (tests/api-testing/RoundTrip.js:59:10)
     at processTicksAndRejections (internal/process/task_queues.js:95:5)

Change-Id: Ib94f964c2717885f777c1fe0c9c443cd6a5ed3ae
2022-12-01 21:17:34 +00:00
daniel
e82f11c246 ParsoidHandler: use HtmlOutputRendererHelper in wt2html
NOTE: This causes Parsoid output to be written to the parser cache.
This should be unconditional in the future, but for now it is
controlled by wgTemporaryParsoidHandlerParserCacheWriteRatio.

This change affects the following endpoints that use the wt2html method:
* /coredev/v0/transform/wikitext/to/html in core
* /{domain}/v3/transform/wikitext/to/html from parsoid
* /{domain}/v3/page/html/{title} from parsoid

The /v1/page/{title}/html endpoint is not affected, since it
doesn't use wt2html, but has always been using HtmlOutputRendererHelper
directly.

Bug: T322672
Depends-On: Ic37f606bb51504c8164d005af55ca9a65f595041
Change-Id: Ie430acd0753880d88370bb9f22bb40a0f9ded917
2022-12-01 10:14:49 +00:00
daniel
2ec1791d40 Introduce PageRestHelperFactory
This allows extensions like VisualEditor to safely instantiate REST
helper objects. It also reduces the number of services that need to be
injected into REST handlers from route definitions.

Change-Id: I10af85b2da96568cfffd03867d1cb299645fb371
2022-11-21 07:23:26 +00:00
daniel
1dfe1f9f51 ParsoidHandler: remove subst feature from wt2html
Per discussion with Subbu and Bartosz, this is unused and dubious.
The client side feature that needs this ability was never implemented,
it has been sitting around since 2013.

Bug: T73161
Bug: T51904
Change-Id: I81dd90189d267b2799b63c972d7d8cf5f431d7b0
2022-11-10 14:08:38 +01:00
daniel
f545d5efeb Rename HTMLTransform to HtmlToContentTransform
* We will have several kinds of HTML transformations.
Rename HTMLTransform to indicate that it's for converting HTML to Content
objects.

* Using Naming Convention 'Html' instead of 'HTML'

Change-Id: I506f3303ae8f9e4db17299211366bef1558f142c
2022-11-03 16:47:36 +01:00
daniel
4ad9c9b035 variant transform: allow input content-language to be a variant
When submitting HTML to transform/html/to/html, the language specified
by the input's content-language header should be allowed to be the
source variant.

It should also be possible to just specify the source variant, and
derive the base language from that rather than the content-language
header or the page language.

Change-Id: I703c112358a921a8b0c9e63b70fd820ae3ea16fc
2022-11-02 01:30:36 -04:00
Daimona Eaytoy
947ff7c0f5 build: Update mediawiki/mediawiki-phan-config to 0.12.0
This patch only adds and removes suppressions, which must be done in the
same patch as the version bump.

Bug: T298571
Change-Id: I4044d4d9ce82b3dae7ba0af85bf04f22cb1dd347
2022-10-08 15:45:42 +02:00
Abijeet
715080cfd5 LanguageVariantConverter: Use content language code from HTTP header
Use the content language from the header, and give that the highest
priority when identifying the page language.

Bug: T317019
Change-Id: Ibb0671f1b873ef83a4d53824a9c4c17726e68635
2022-10-07 20:28:57 +05:30
jenkins-bot
ca5814e21f Merge "Re-apply: Introduce LanguageVariantConverter"
2022-10-06 11:25:39 +00:00
daniel
5b0d1cfd35 Re-apply: Introduce LanguageVariantConverter
This reverts Ib73841bcc6c101bbe8a76f76dc81553290726039 and re-applies
I55a58f9824329893575a532cd10b9422ededb9ba with some changes: The source
variant is passed in explicitly. More complete handling of the input
language will be added in a follow-up.

Original description:

This class is used in ParsoidHandler::languageConversion

It uses Parsoid to perform the actual conversion of the content
to a language variant.

The source language is determined using the PageBundle or the page
language from the Title.

To encapsulate Parsoid related concepts, the class has the ability
to create Parsoid\Config\PageConfig if not provided.

Bug: T317019
Change-Id: Ida1a040628c26ac2ef108b0c90a3d3285a493b0e
2022-10-04 20:29:54 +02:00
jenkins-bot
05d701a2a4 Merge "ParsoidHandler: use metrics from SiteConfig"
2022-10-04 17:05:27 +00:00
daniel
79cc21beaf ParsoidHandler: use metrics from SiteConfig
ParsoidHandler should pass the metrics object from the
SiteConfig to HtmlInputTransformHelper, instead of using the global
metrics instance. Otherwise, the metricsPrefix defined in the parsoid
settings is ignored.

Change-Id: Ie85f2306e8b0f123b9fdd737faffdd85117015b1
2022-10-04 16:49:36 +00:00
jenkins-bot
bfea62061c Merge "Revert "Introduce LanguageVariantConverter""
2022-10-04 12:35:33 +00:00
Daniel Kinzler
c5bc391b2b Revert "Introduce LanguageVariantConverter"
This reverts commit 5c49a09e89.

Reason for revert: See https://phabricator.wikimedia.org/T319282

Bug: T319282
Change-Id: Ib73841bcc6c101bbe8a76f76dc81553290726039
2022-10-04 11:52:09 +00:00
jenkins-bot
641d01b0ac Merge "Introduce LanguageVariantConverter"
2022-10-03 17:41:24 +00:00
Abijeet
5c49a09e89 Introduce LanguageVariantConverter
This class is used in ParsoidHandler::languageConversion

It uses Parsoid to perform the actual conversion of the content
to a language variant.

The source language is determined using the PageBundle or the page
language from the Title.

To encapsulate Parsoid related concepts, the class has the ability
to create Parsoid\Config\PageConfig if not provided.

Bug: T317019
Change-Id: I55a58f9824329893575a532cd10b9422ededb9ba
2022-10-03 16:13:29 +00:00
daniel
a02be0b3f8 HtmlInputTransformHelper: Fall back to ParserCache
If a render ID is given via the use-cache parameter, but the key is not
found in the parsoid stash, look at the most recent known rendering of
the revision, and use it if it matches the render ID.
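
The lookup order described above can be sketched as follows. This is a simplified Python illustration, not the HtmlInputTransformHelper code: the function name, the dictionary standing in for the stash, and the `(render_id, html)` tuple standing in for the most recent cached rendering are all assumptions.

```python
from typing import Dict, Optional, Tuple


def find_original_rendering(
    render_id: str,
    stash: Dict[str, str],
    latest_cached: Optional[Tuple[str, str]],
) -> Optional[str]:
    """Return the HTML for render_id, or None if it cannot be found."""
    # 1. Prefer the stash, keyed by render ID.
    if render_id in stash:
        return stash[render_id]
    # 2. Fall back to the most recent known rendering, but only if its
    #    render ID is the one the client asked for.
    if latest_cached is not None and latest_cached[0] == render_id:
        return latest_cached[1]
    # 3. Otherwise the original rendering is unavailable.
    return None
```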

This patch moves the responsibility for looking up RevisionRecords and
PageRecords into ParsoidOutputAccess. This way, callers only need to
have a PageIdentity, and optionally a revision ID.

Bug: T318395
Change-Id: I1aa5b0fd9fb1acaa2544d5a58125fa3810a0eb39
2022-09-30 15:56:23 +00:00
daniel
f31cd9f1d3 REST: HtmlInputTransformHelper: Load original data from stash
Parsoid needs the original rendering in order to apply
selective serialization (selser). The page/{title}/html endpoint
can stash the rendering, and now the transform endpoint can make use
of the stashed rendering.

Bug: T310464
Change-Id: Ia58043ed3aa1eb12731d82aa87606c82ec63f663
2022-09-29 19:52:27 +02:00
daniel
4107333069 Introduce HtmlInputTransformHelper
The HtmlInputTransformHelper is intended to provide code sharing
between VisualEditor's DirectParsoidClient and the ParsoidHandler
base class used by TransformHandler.

Bug: T310376
Change-Id: I9c15f075cfc5f198e290758fc23d25990b47a185
2022-09-26 12:58:17 +00:00
jenkins-bot
3c1a16b7c6 Merge "HTMLTransform: do not presume wikitext"
2022-09-22 17:40:04 +00:00
daniel
d6140952ed HTMLTransform: do not presume wikitext
Parsoid supports other source formats besides wikitext.
This patch improves support for non-wikitext content by removing
assumptions about the source type.

Change-Id: I5480ff200a93026cea7f1542e12834b06ac6f730
2022-09-22 17:41:48 +01:00
Umherirrender
b15e689d49 Remove unused local variables
Various variables left over from an earlier refactor are now unused
and can be removed to make the code easier to read

Change-Id: Id51770af1f08e85c7e7a02234a2cd2ab5b47ee7a
2022-09-19 23:07:07 +02:00
daniel
24a26ec25b REST: make ParsoidHandler use HTMLTransformFactory
This also moves the creation of PageConfig from HTMLTransformFactory
into HTMLTransform, to ensure all relevant info, particularly the
page language, is known.

Change-Id: Id354862d6497816e0c007b9cb3b0d183c9d4b719
2022-09-16 18:46:17 +02:00
jenkins-bot
1c7adc8f8b Merge "Split setOriginalData( ... ) to more related setters for encapsulation"
2022-08-25 18:10:33 +00:00
daniel
df0744f402 Split setOriginalData( ... ) to more related setters for encapsulation
By splitting the setOriginalData methods into several setters, we remove
any knowledge about the structure of the request body from HTMLTransform.
It also allows us to be specific about which data to operate on.

This also removes the concept of page bundles from the public interface
of HTMLTransform. PageBundle objects are used only internally.

Change-Id: If97a74ce251f281b7d980928a01b764d6ec0d0a4
2022-08-25 18:40:26 +02:00
Umherirrender
0f49ae2759 Use MainConfigNames constant to refer configs
Change-Id: Iddef589423d1e3f609b3cfbf6cc7437c6ad830b0
2022-08-17 21:27:48 +02:00
Derick Alangi
b078f598f9 Move transformHtmlToWikitext() and getSelserData() to HTMLTransform
This patch moves remaining transformation logic to a renamed (from
HTMLTransformInput -> HTMLTransform) class. Also, the HTMLTransform
class is moved to the correct directory, hence namespace (including
tests).

Some data files have been copied over to their own sub-directory in
the correct place since HTMLTransformTest needs it. ParsoidHandler
class is fine where it is because its operation is what happens in
the REST land.

NOTE: The 2 remaining methods moved into HTMLTransform are the last
ones we intended to move into this class to make the refactoring of
html2wt() method complete in this context.

Change-Id: I8929931e1b0acf247abe9d826eef57f3e0d4e132
2022-08-11 07:50:53 +01:00
daniel
7d5815b574 ParsoidHandler: do not emit etag for wt2html
Emitting a random ETag without storing it or the corresponding content
is not useful: The ETag cannot possibly ever match anything if the
client uses it in a later request with If-Match or If-None-Match.
If-Match will always fail, and If-None-Match will always succeed.
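
The argument can be checked with a small Python sketch. It is a deliberately simplified model of the two conditional headers (no weak validators, no `*`, no ETag lists); the function names and the use of a set as the ETag store are assumptions made for illustration.

```python
import uuid
from typing import Set


def if_match_passes(header_etag: str, stored_etags: Set[str]) -> bool:
    """If-Match passes only when the ETag names a stored rendering."""
    return header_etag in stored_etags


def if_none_match_passes(header_etag: str, stored_etags: Set[str]) -> bool:
    """If-None-Match passes only when no stored rendering has that ETag."""
    return header_etag not in stored_etags


# A random ETag that was emitted but never stored (hypothetical value).
random_etag = f'"{uuid.uuid4()}"'
stored: Set[str] = set()  # nothing was stashed for it

if __name__ == "__main__":
    print(if_match_passes(random_etag, stored))
    print(if_none_match_passes(random_etag, stored))
```

With an empty store, the outcome of both checks is fixed before the request is made, so the header carries no information.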

NOTE: When RESTbase proxies the response generated by this endpoint, it
will assign its own ETag to support stashing. It does not rely on the
ETag returned here.

Bug: T310710
Change-Id: I32b77a89549c37d32502adb101102747bc9ca45f
2022-08-09 03:07:32 +00:00
daniel
32c27772dc ParsoidHandler: pass metrics object to HTMLTransformInput
We were failing to collect some metrics because of this.

Change-Id: I20b4bbf04416fc74e6692e306dc40bf175664c07
2022-08-02 21:24:45 +02:00
jenkins-bot
9af6aa5d86 Merge "Fix $validateXMLNames flag when parsing HTML"
2022-08-01 17:05:56 +00:00
daniel
00c4f11ab6 Fix $validateXMLNames flag when parsing HTML
Change-Id: I6cbd2e8a7096b96814e9e0afe0193e1ca781af45
2022-08-01 17:23:03 +02:00
daniel
891c06816d ParsoidHandler: measure input size in characters
In If09afc4b933 we made ParsoidHandler measure input size in bytes
consistently, rather than using sometimes bytes, and sometimes
characters.

However, that was going to cause input limits to trigger early for
languages that use a lot of multibyte characters. So now we are switching
everything to measuring in characters.
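
The gap between the two measures can be shown with a short Python sketch. It is not the PHP code in ParsoidHandler; the function names and the sample strings are assumptions made for illustration.

```python
def size_in_bytes(text: str) -> int:
    """Length of the UTF-8 encoding: what a byte-based limit sees."""
    return len(text.encode("utf-8"))


def size_in_chars(text: str) -> int:
    """Number of Unicode code points: what a character-based limit sees."""
    return len(text)


if __name__ == "__main__":
    for sample in ("wikitext", "ウィキテキスト"):
        print(sample, size_in_bytes(sample), size_in_chars(sample))
```

For ASCII input the two counts agree; for the katakana sample the byte count is three times the character count, which is why a byte-based limit triggers early for such text.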

NOTE: this may cause the html2wt.timePerInputKB to report worse values.
It also makes the name slightly misleading, since it's no longer in KB,
it's in kilo-chars.

Change-Id: I41872db6d1f5d96776fef54624428cc3ee5f21b3
2022-07-31 01:04:02 +02:00