Commit graph

18 commits

Author SHA1 Message Date
James D. Forrester
9bfb75ff90 Namespace ParserOutput
Most used non-namespaced class!

Bug: T353458
Change-Id: I4c2cbb0a808b3881a4d6ca489eee5d8c8ebf26cf
2023-12-14 14:57:34 -05:00
Amir Sarabadani
beb3261b8d Remove language converter for Kazakh
This has been constantly mentioned as buggy and broken and there is no
official version of Latin or Arabic (see the ticket for more details).

This can be brought back as an extension if needed by third-party users.

Bug: T350684
Bug: T268143
Depends-On: I6180dca2c49b3119751766268acc56087aaf8414
Change-Id: Ifbf3c8954d885daf891f8d9efc11743d898302f0
2023-11-20 10:31:16 -05:00
Subramanya Sastry
ce89bee18b Followup to cf3f68b6: Handle bogus target variant codes
* Discovered through a failing API test in the Parsoid repo.
* Added a new phpunit test to catch this in the future.

Change-Id: Ic6326b409c9420fec676060566879f9a37a80961
2023-11-01 10:51:48 -05:00
Subramanya Sastry
062fd08e51 Remove all Parsoid debugApi references and uses
* Was used during the Parsoid JS -> PHP port and is no longer used.
* This also eliminated the need to inject ParsoidSettings into some
  classes.
* Once this merges and lands in core, I'll remove this from the Parsoid
  repo as well.

Change-Id: I008d30ea81f5a3db26e512c87762b90e3ca3c4ff
2023-09-14 14:48:48 -05:00
Tim Starling
5e30a927bc tests: Make some PHPUnit data providers static
These are just the methods where adding "static" to the declaration was
enough; I didn't do anything with providers that used $this.

Initially by search and replace. There were many mistakes which I
found mostly by running the PHPStorm inspection which searches for
$this usage in a static method. Later I used the PHPStorm "make static"
action which avoids the more obvious mistakes.

Bug: T332865
Change-Id: I47ed6692945607dfa5c139d42edbd934fa4f3a36
2023-03-24 02:53:57 +00:00
Tim Starling
f600d07ec4 Fix tests that fail when $wgUsePigLatinVariant = false
* ParserTestRunner: LocalisationCache needs to be reset since it has a
  reference to LanguageNameUtils which has a copy of
  $wgUsePigLatinVariant. Also factor out some
  MediaWikiServices::getInstance() calls.
* In some other tests, set the variable.

Change-Id: I6c1e9bfad9790cf805809c28a3f8d45952cbb981
2023-03-17 19:56:32 +11:00
jenkins-bot
5434c71393 Merge "Use Bcp47Code when interfacing with Parsoid"
2023-03-13 19:11:03 +00:00
C. Scott Ananian
5ad8dea80a Use Bcp47Code when interfacing with Parsoid
It is very easy for developers and maintainers to mix up "internal
MediaWiki language codes" and "BCP-47 language codes"; the latter are
standards-compliant and used in web protocols like HTTP, HTML, and
SVG; but much of WMF production is very dependent on historical codes
used by MediaWiki which in some cases predate the IANA standardized
name for the language in question.

Phan and other static checking tools aren't much help distinguishing
BCP-47 from internal codes when both are represented with the PHP
string type, so the wikimedia/bcp-47-code package introduced a very
lightweight wrapper type in order to uniquely identify BCP-47 codes.
Language implements Bcp47Code, and LanguageFactory::getLanguage() is
an easy way to convert (or downcast) between Bcp47Code and Language
objects.

This patch updates the Parsoid integration code and the associated
REST handlers to use Bcp47Code in APIs so that the standalone Parsoid
library does not need to know anything about MediaWiki-internal codes.
The principle has been, first, to try to convert a string to a
Bcp47Code as soon as possible and as close to the original input as
possible, so it is easy to see *why* a given string is a BCP-47 code
(usually, because it is coming from HTTP/HTML/etc) and we're not stuck
deep inside some method trying to figure out where a string we're
given is coming from and therefore what sort of string code it might
be.  Second, we've added explicit compatibility code to accept
MediaWiki internal codes and convert them to Bcp47Code for backward
compatibility with existing clients, using the @internal
LanguageCode::normalizeNonstandardCodeAndWarn() method.  The intention
is to gradually remove these backward compatibility thunks and replace
them with HTTP 400 errors or wfDeprecated messages in order to
identify and repair callers who are incorrectly using
non-standard-compliant language codes in web standards
(HTTP/HTML/SVG/etc).

Finally, maintaining a code as a Bcp47Code and not immediately
converting to Language helps us delay or even avoid full loading of a
Language object in some cases, which is another reason to occasionally
push Bcp47Code (instead of Language) down the call stack.
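
A minimal sketch of the wrapper-type idea described above, written in
Python rather than PHP and not taken from the patch itself. The names
Bcp47Tag, from_request_header and LEGACY_TO_BCP47 are hypothetical; the
single sr-el / sr-Latn pair is only an illustration of an internal code
and its BCP-47 equivalent.

```python
from dataclasses import dataclass

# Illustrative one-entry table of internal codes and BCP-47 equivalents.
# A real table would be larger; this content is an assumption.
LEGACY_TO_BCP47 = {"sr-el": "sr-Latn"}


@dataclass(frozen=True)
class Bcp47Tag:
    """Hypothetical wrapper so a type checker can tell a BCP-47 tag
    apart from an arbitrary string."""
    value: str


def from_request_header(raw: str) -> Bcp47Tag:
    """Wrap a code at the boundary, where it is known to come from HTTP."""
    code = raw.strip()
    # Backward-compatibility thunk: accept an internal code and convert it.
    legacy = LEGACY_TO_BCP47.get(code.lower())
    return Bcp47Tag(legacy if legacy is not None else code)
```

Code that receives a Bcp47Tag no longer has to guess which kind of
string it was given; only the boundary function deals with raw input.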

Bug: T327379
Depends-On: I830867d58f8962d6a57be16ce3735e8384f9ac1c
Change-Id: I982e0df706a633b05dcc02b5220b737c19adc401
2023-03-13 13:25:09 -04:00
C. Scott Ananian
bce63d1912 Preserve non-PageBundle metadata set by Parsoid
The Parsoid entrypoints should always have a "real" ParserOutput
passed as the ContentMetadataCollector object, so that recursive
invocations of extensions, etc, can set appropriate metadata
properties in the ParserOutput.

This is part of a belt-and-suspenders fix for T331084, where a
StubMetadataCollector is being used in production -- production should
never use a stub, it should always use a real ParserOutput object.
The other fix for T331084 is
I30ea2bb24e6c9b0950a8f46dc8e5b9bf5ee3378b, which ensures that if you
*were* to use a StubMetadataCollector in production, it wouldn't throw
an error when a numeric category string was encountered.

Bug: T331084
Change-Id: I8711a51fc1bcac48eae92ab1ba15a33fe05937ed
2023-03-13 11:24:57 -04:00
Abijeet
5c113a833a LanguageVariantConverter: Add fallback to core LanguageConverter
If variant conversion is not supported by Parsoid, fall back to using
the old LanguageConverter.

We still call Parsoid to perform variant conversion in order to add
metadata that is missing when the core language converter is used.
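
The control flow described above can be sketched roughly as follows.
This is Python rather than PHP and is not the code in this change: the
function name, the two callables and the set of supported variants are
all assumptions made for illustration, as is the order of the calls.

```python
from typing import Callable, Set

Converter = Callable[[str, str], str]


def convert_variant(
    html: str,
    target: str,
    parsoid_supported: Set[str],
    parsoid_convert: Converter,
    core_convert: Converter,
) -> str:
    """Convert html to the target variant, preferring Parsoid."""
    if target in parsoid_supported:
        return parsoid_convert(html, target)
    # Fallback: let the core converter do the text conversion ...
    converted = core_convert(html, target)
    # ... then still pass the result through Parsoid so that the
    # metadata the core converter does not produce gets added.
    return parsoid_convert(converted, target)
```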

Bug: T318401
Change-Id: I0499c853b4e301f135339fc137054bd760ee237d
Depends-On: Ie94aaa11963ec1e9e99136af469a05fa4005710d
2022-12-11 12:12:33 +05:30
Abijeet
803092d4af tests: Remove unnecessary override to use pig-latin
Pig latin is enabled by default since
Ia80ad33cbf5e311fa8b84bd765a8df8d156f4c38

Change-Id: I0cd922bb0ee1fd7bce2ced2eacbdb6ed25ada7d8
2022-12-08 17:52:00 +05:30
daniel
b7ab24c218 Fix LanguageVariantConverter test
Accept sr-Latn as well as sr-el as the language code for Serbian with
Latin script.

This was broken when the Parsoid library started to use BCP-47 codes
rather than internal MediaWiki codes. For now, we accept both, so we are
compatible with the version of the Parsoid lib currently in the vendor
repo as well as the version picked by composer update.

Bug: T323985
Change-Id: If0b02be4f391b31fb75e2ad51e199a83707b0e3c
2022-11-29 15:34:42 +01:00
daniel
e61b9b6680 page/{title}/html: handle unknown variant gracefully
Language conversion shouldn't crash with a 500 when a variant is
requested for a language that does not support variants. This behavior
is especially annoying when manually calling REST endpoints with a
browser, since browsers routinely send Accept-Language headers.

Change-Id: I31a14cb184a7bf940b7d178c12b2e7829d2eca0f
2022-11-22 23:03:55 +01:00
daniel
4ad9c9b035 variant transform: allow input content-language to be a variant
When submitting HTML to transform/html/to/html, the language specified
by the input's content-language header should be allowed to be the
source variant.

It should also be possible to just specify the source variant, and
derive the base language from that rather than the content-language
header or the page language.

Change-Id: I703c112358a921a8b0c9e63b70fd820ae3ea16fc
2022-11-02 01:30:36 -04:00
Abijeet
715080cfd5 LanguageVariantConverter: Use content language code from HTTP header
Use the content language from the header, and give that the highest
priority when identifying the page language.
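
A minimal sketch of such a priority order, in Python rather than PHP and
not taken from this change. The function and parameter names are
hypothetical, and the order of the sources after the header is an
assumption.

```python
from typing import Optional


def pick_page_language(
    header_lang: Optional[str],
    bundle_lang: Optional[str],
    title_lang: str,
) -> str:
    """Return the first available language code, header first."""
    # The content-language header wins whenever it is present.
    for candidate in (header_lang, bundle_lang):
        if candidate:
            return candidate
    # Otherwise fall back to the page language.
    return title_lang
```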

Bug: T317019
Change-Id: Ibb0671f1b873ef83a4d53824a9c4c17726e68635
2022-10-07 20:28:57 +05:30
daniel
5b0d1cfd35 Re-apply: Introduce LanguageVariantConverter
This reverts Ib73841bcc6c101bbe8a76f76dc81553290726039 and re-applies
I55a58f9824329893575a532cd10b9422ededb9ba with some changes: The source
variant is passed in explicitly. More complete handling of the input
language will be added in a follow-up.

Original description:

This class is used in ParsoidHandler::languageConversion

It uses Parsoid to perform the actual conversion of the content
to a language variant.

The source language is determined using the PageBundle or the page
language from the Title.

To encapsulate Parsoid-related concepts, the class has the ability
to create Parsoid\Config\PageConfig if not provided.

Bug: T317019
Change-Id: Ida1a040628c26ac2ef108b0c90a3d3285a493b0e
2022-10-04 20:29:54 +02:00
Daniel Kinzler
c5bc391b2b Revert "Introduce LanguageVariantConverter"
This reverts commit 5c49a09e89.

Reason for revert: See https://phabricator.wikimedia.org/T319282

Bug: T319282
Change-Id: Ib73841bcc6c101bbe8a76f76dc81553290726039
2022-10-04 11:52:09 +00:00
Abijeet
5c49a09e89 Introduce LanguageVariantConverter
This class is used in ParsoidHandler::languageConversion

It uses Parsoid to perform the actual conversion of the content
to a language variant.

The source language is determined using the PageBundle or the page
language from the Title.

To encapsulate Parsoid-related concepts, the class has the ability
to create Parsoid\Config\PageConfig if not provided.

Bug: T317019
Change-Id: I55a58f9824329893575a532cd10b9422ededb9ba
2022-10-03 16:13:29 +00:00