wiki.techinc.nl/includes/parser/Parsoid/ParsoidParserFactory.php

73 lines
2.1 KiB
PHP
Raw Normal View History

Allow setting a ParserOption to generate Parsoid HTML This is an initial quick-and-dirty implementation. The ParsoidParser class will eventually inherit from \Parser, but this is an initial placeholder to unblock other Parsoid read views work. Currently Parsoid does not fully implement all the ParserOutput metadata set by the legacy parser, but we're working on it. This patch also addresses T300325 by ensuring the the Page HTML APIs use ParserOutput::getRawText(), which will return the entire Parsoid HTML document without post-processing. This is what the Parsoid team refers to as "edit mode" HTML. The ParserOutput::getText() method returns only the <body> contents of the HTML, and applies several transformations, including inserting Table of Contents and style deduplication; this is the "read views" flavor of the Parsoid HTML. We need to be careful of the interaction of the `useParsoid` flag with the ParserCacheMetadata. Effectively `useParsoid` should *always* be marked as "used" or else the ParserCache will assume its value doesn't matter and will serve legacy content for parsoid requests and vice-versa. T330677 is a follow up to address this more thoroughly by splitting the parser cache in ParserOutputAccess; the stop gap in this patch is fragile and, because it doesn't fork the ParserCacheMetadata cache, may corrupt the ParserCacheMetadata in the case when Parsoid and the legacy parser consult different sets of options to render a page. Bug: T300191 Bug: T330677 Bug: T300325 Change-Id: Ica09a4284c00d7917f8b6249e946232b2fb38011
2022-05-27 16:38:32 +00:00
<?php
namespace MediaWiki\Parser\Parsoid;
use MediaWiki\Languages\LanguageConverterFactory;
use MediaWiki\Parser\Parsoid\Config\PageConfigFactory;
use ParserFactory;
use Wikimedia\Parsoid\Config\DataAccess;
use Wikimedia\Parsoid\Config\SiteConfig;
use Wikimedia\Parsoid\Parsoid;
Make ParsoidOutputAccess a wrapper over ParserOutputAccess * Updated ParserOutput to set Parsoid render ids that REST API functionality expects in ParserOutput objects. * CacheThresholdTime functionality no longer exists since it was implemented in ParsoidOutputAccess and ParserOutputAccess doesn't support it. This is tracked in T346765. * Enforce the constraint that uncacheable parses are only for fake or mutable revisions. Updated tests that violated this constraint to use 'getParseOutput' instead of calling the parse method directly. * Had to make some changes in ParsoidParser around use of preferredVariant passed to Parsoid. I also left some TODO comments for future fixes. T267067 is also relevant here. PARSOID-SPECIFIC OPTIONS: * logLinterData: linter data is always logged by default -- removed support to disable it. Linter extension handles stale lints properly and it is better to let it handle it rather than add special cases to the API. * offsetType: Moved this support to ParsoidHandler as a post-processing of byte-offset output. This eliminates the need to support this Parsoid-specific options in the ContentHandler hierarchies. * body_only / wrapSections: Handled this in HtmlOutputRendererHelper as a post-processing of regular output by removing sections and returning the body content only. This does result in some useless section-wrapping work with Parsoid, but the simplification is probably worth it. If in the future, we support Parsoid-specific options in the ContentHandler hierarchy, we could re-introduce this. But, in any case, this "fragment" flavor options is likely to get moved out of core into the VisualEditor extension code. DEPLOYMENT: * This patch changes the cache key by setting the useParsoid option in ParserOptions. The parent patch handles this to ensure we don't encounter a cold cache on deploy. TESTS: * Updated tests and mocks to reflect new reality. * Do we need any new tests? Bug: T332931 Change-Id: Ic9b7cc0fcf365e772b7d080d76a065e3fd585f80
2023-08-29 20:13:43 +00:00
use Wikimedia\UUID\GlobalIdGenerator;
Allow setting a ParserOption to generate Parsoid HTML This is an initial quick-and-dirty implementation. The ParsoidParser class will eventually inherit from \Parser, but this is an initial placeholder to unblock other Parsoid read views work. Currently Parsoid does not fully implement all the ParserOutput metadata set by the legacy parser, but we're working on it. This patch also addresses T300325 by ensuring the the Page HTML APIs use ParserOutput::getRawText(), which will return the entire Parsoid HTML document without post-processing. This is what the Parsoid team refers to as "edit mode" HTML. The ParserOutput::getText() method returns only the <body> contents of the HTML, and applies several transformations, including inserting Table of Contents and style deduplication; this is the "read views" flavor of the Parsoid HTML. We need to be careful of the interaction of the `useParsoid` flag with the ParserCacheMetadata. Effectively `useParsoid` should *always* be marked as "used" or else the ParserCache will assume its value doesn't matter and will serve legacy content for parsoid requests and vice-versa. T330677 is a follow up to address this more thoroughly by splitting the parser cache in ParserOutputAccess; the stop gap in this patch is fragile and, because it doesn't fork the ParserCacheMetadata cache, may corrupt the ParserCacheMetadata in the case when Parsoid and the legacy parser consult different sets of options to render a page. Bug: T300191 Bug: T330677 Bug: T300325 Change-Id: Ica09a4284c00d7917f8b6249e946232b2fb38011
2022-05-27 16:38:32 +00:00
/**
* ParserFactory which uses a ParsoidParser.
*
* This is similar to \ParserFactory, but simplified since we don't need
* to try to reuse parser objects. Eventually we'll be able to simplify
* \ParserFactory the same way.
*
* @since 1.41
* @internal May be combined with \ParserFactory or otherwise refactored
*
* @file
* @ingroup Parser
*/
class ParsoidParserFactory /* eventually this may extend \ParserFactory */ {
private SiteConfig $siteConfig;
private DataAccess $dataAccess;
private PageConfigFactory $pageConfigFactory;
private LanguageConverterFactory $languageConverterFactory;
private ParserFactory $legacyParserFactory;
private GlobalIdGenerator $globalIdGenerator;
Make ParsoidOutputAccess a wrapper over ParserOutputAccess * Updated ParserOutput to set Parsoid render ids that REST API functionality expects in ParserOutput objects. * CacheThresholdTime functionality no longer exists since it was implemented in ParsoidOutputAccess and ParserOutputAccess doesn't support it. This is tracked in T346765. * Enforce the constraint that uncacheable parses are only for fake or mutable revisions. Updated tests that violated this constraint to use 'getParseOutput' instead of calling the parse method directly. * Had to make some changes in ParsoidParser around use of preferredVariant passed to Parsoid. I also left some TODO comments for future fixes. T267067 is also relevant here. PARSOID-SPECIFIC OPTIONS: * logLinterData: linter data is always logged by default -- removed support to disable it. Linter extension handles stale lints properly and it is better to let it handle it rather than add special cases to the API. * offsetType: Moved this support to ParsoidHandler as a post-processing of byte-offset output. This eliminates the need to support this Parsoid-specific options in the ContentHandler hierarchies. * body_only / wrapSections: Handled this in HtmlOutputRendererHelper as a post-processing of regular output by removing sections and returning the body content only. This does result in some useless section-wrapping work with Parsoid, but the simplification is probably worth it. If in the future, we support Parsoid-specific options in the ContentHandler hierarchy, we could re-introduce this. But, in any case, this "fragment" flavor options is likely to get moved out of core into the VisualEditor extension code. DEPLOYMENT: * This patch changes the cache key by setting the useParsoid option in ParserOptions. The parent patch handles this to ensure we don't encounter a cold cache on deploy. TESTS: * Updated tests and mocks to reflect new reality. * Do we need any new tests? Bug: T332931 Change-Id: Ic9b7cc0fcf365e772b7d080d76a065e3fd585f80
2023-08-29 20:13:43 +00:00
Allow setting a ParserOption to generate Parsoid HTML This is an initial quick-and-dirty implementation. The ParsoidParser class will eventually inherit from \Parser, but this is an initial placeholder to unblock other Parsoid read views work. Currently Parsoid does not fully implement all the ParserOutput metadata set by the legacy parser, but we're working on it. This patch also addresses T300325 by ensuring the the Page HTML APIs use ParserOutput::getRawText(), which will return the entire Parsoid HTML document without post-processing. This is what the Parsoid team refers to as "edit mode" HTML. The ParserOutput::getText() method returns only the <body> contents of the HTML, and applies several transformations, including inserting Table of Contents and style deduplication; this is the "read views" flavor of the Parsoid HTML. We need to be careful of the interaction of the `useParsoid` flag with the ParserCacheMetadata. Effectively `useParsoid` should *always* be marked as "used" or else the ParserCache will assume its value doesn't matter and will serve legacy content for parsoid requests and vice-versa. T330677 is a follow up to address this more thoroughly by splitting the parser cache in ParserOutputAccess; the stop gap in this patch is fragile and, because it doesn't fork the ParserCacheMetadata cache, may corrupt the ParserCacheMetadata in the case when Parsoid and the legacy parser consult different sets of options to render a page. Bug: T300191 Bug: T330677 Bug: T300325 Change-Id: Ica09a4284c00d7917f8b6249e946232b2fb38011
2022-05-27 16:38:32 +00:00
/**
* @param SiteConfig $siteConfig
* @param DataAccess $dataAccess
* @param PageConfigFactory $pageConfigFactory
* @param LanguageConverterFactory $languageConverterFactory
* @param ParserFactory $legacyParserFactory
*/
public function __construct(
SiteConfig $siteConfig,
DataAccess $dataAccess,
PageConfigFactory $pageConfigFactory,
LanguageConverterFactory $languageConverterFactory,
Make ParsoidOutputAccess a wrapper over ParserOutputAccess * Updated ParserOutput to set Parsoid render ids that REST API functionality expects in ParserOutput objects. * CacheThresholdTime functionality no longer exists since it was implemented in ParsoidOutputAccess and ParserOutputAccess doesn't support it. This is tracked in T346765. * Enforce the constraint that uncacheable parses are only for fake or mutable revisions. Updated tests that violated this constraint to use 'getParseOutput' instead of calling the parse method directly. * Had to make some changes in ParsoidParser around use of preferredVariant passed to Parsoid. I also left some TODO comments for future fixes. T267067 is also relevant here. PARSOID-SPECIFIC OPTIONS: * logLinterData: linter data is always logged by default -- removed support to disable it. Linter extension handles stale lints properly and it is better to let it handle it rather than add special cases to the API. * offsetType: Moved this support to ParsoidHandler as a post-processing of byte-offset output. This eliminates the need to support this Parsoid-specific options in the ContentHandler hierarchies. * body_only / wrapSections: Handled this in HtmlOutputRendererHelper as a post-processing of regular output by removing sections and returning the body content only. This does result in some useless section-wrapping work with Parsoid, but the simplification is probably worth it. If in the future, we support Parsoid-specific options in the ContentHandler hierarchy, we could re-introduce this. But, in any case, this "fragment" flavor options is likely to get moved out of core into the VisualEditor extension code. DEPLOYMENT: * This patch changes the cache key by setting the useParsoid option in ParserOptions. The parent patch handles this to ensure we don't encounter a cold cache on deploy. TESTS: * Updated tests and mocks to reflect new reality. * Do we need any new tests? Bug: T332931 Change-Id: Ic9b7cc0fcf365e772b7d080d76a065e3fd585f80
2023-08-29 20:13:43 +00:00
ParserFactory $legacyParserFactory,
GlobalIdGenerator $globalIdGenerator
Allow setting a ParserOption to generate Parsoid HTML This is an initial quick-and-dirty implementation. The ParsoidParser class will eventually inherit from \Parser, but this is an initial placeholder to unblock other Parsoid read views work. Currently Parsoid does not fully implement all the ParserOutput metadata set by the legacy parser, but we're working on it. This patch also addresses T300325 by ensuring the the Page HTML APIs use ParserOutput::getRawText(), which will return the entire Parsoid HTML document without post-processing. This is what the Parsoid team refers to as "edit mode" HTML. The ParserOutput::getText() method returns only the <body> contents of the HTML, and applies several transformations, including inserting Table of Contents and style deduplication; this is the "read views" flavor of the Parsoid HTML. We need to be careful of the interaction of the `useParsoid` flag with the ParserCacheMetadata. Effectively `useParsoid` should *always* be marked as "used" or else the ParserCache will assume its value doesn't matter and will serve legacy content for parsoid requests and vice-versa. T330677 is a follow up to address this more thoroughly by splitting the parser cache in ParserOutputAccess; the stop gap in this patch is fragile and, because it doesn't fork the ParserCacheMetadata cache, may corrupt the ParserCacheMetadata in the case when Parsoid and the legacy parser consult different sets of options to render a page. Bug: T300191 Bug: T330677 Bug: T300325 Change-Id: Ica09a4284c00d7917f8b6249e946232b2fb38011
2022-05-27 16:38:32 +00:00
) {
$this->siteConfig = $siteConfig;
$this->dataAccess = $dataAccess;
$this->pageConfigFactory = $pageConfigFactory;
$this->languageConverterFactory = $languageConverterFactory;
$this->legacyParserFactory = $legacyParserFactory;
Make ParsoidOutputAccess a wrapper over ParserOutputAccess * Updated ParserOutput to set Parsoid render ids that REST API functionality expects in ParserOutput objects. * CacheThresholdTime functionality no longer exists since it was implemented in ParsoidOutputAccess and ParserOutputAccess doesn't support it. This is tracked in T346765. * Enforce the constraint that uncacheable parses are only for fake or mutable revisions. Updated tests that violated this constraint to use 'getParseOutput' instead of calling the parse method directly. * Had to make some changes in ParsoidParser around use of preferredVariant passed to Parsoid. I also left some TODO comments for future fixes. T267067 is also relevant here. PARSOID-SPECIFIC OPTIONS: * logLinterData: linter data is always logged by default -- removed support to disable it. Linter extension handles stale lints properly and it is better to let it handle it rather than add special cases to the API. * offsetType: Moved this support to ParsoidHandler as a post-processing of byte-offset output. This eliminates the need to support this Parsoid-specific options in the ContentHandler hierarchies. * body_only / wrapSections: Handled this in HtmlOutputRendererHelper as a post-processing of regular output by removing sections and returning the body content only. This does result in some useless section-wrapping work with Parsoid, but the simplification is probably worth it. If in the future, we support Parsoid-specific options in the ContentHandler hierarchy, we could re-introduce this. But, in any case, this "fragment" flavor options is likely to get moved out of core into the VisualEditor extension code. DEPLOYMENT: * This patch changes the cache key by setting the useParsoid option in ParserOptions. The parent patch handles this to ensure we don't encounter a cold cache on deploy. TESTS: * Updated tests and mocks to reflect new reality. * Do we need any new tests? Bug: T332931 Change-Id: Ic9b7cc0fcf365e772b7d080d76a065e3fd585f80
2023-08-29 20:13:43 +00:00
$this->globalIdGenerator = $globalIdGenerator;
Allow setting a ParserOption to generate Parsoid HTML This is an initial quick-and-dirty implementation. The ParsoidParser class will eventually inherit from \Parser, but this is an initial placeholder to unblock other Parsoid read views work. Currently Parsoid does not fully implement all the ParserOutput metadata set by the legacy parser, but we're working on it. This patch also addresses T300325 by ensuring the the Page HTML APIs use ParserOutput::getRawText(), which will return the entire Parsoid HTML document without post-processing. This is what the Parsoid team refers to as "edit mode" HTML. The ParserOutput::getText() method returns only the <body> contents of the HTML, and applies several transformations, including inserting Table of Contents and style deduplication; this is the "read views" flavor of the Parsoid HTML. We need to be careful of the interaction of the `useParsoid` flag with the ParserCacheMetadata. Effectively `useParsoid` should *always* be marked as "used" or else the ParserCache will assume its value doesn't matter and will serve legacy content for parsoid requests and vice-versa. T330677 is a follow up to address this more thoroughly by splitting the parser cache in ParserOutputAccess; the stop gap in this patch is fragile and, because it doesn't fork the ParserCacheMetadata cache, may corrupt the ParserCacheMetadata in the case when Parsoid and the legacy parser consult different sets of options to render a page. Bug: T300191 Bug: T330677 Bug: T300325 Change-Id: Ica09a4284c00d7917f8b6249e946232b2fb38011
2022-05-27 16:38:32 +00:00
}
/**
* Creates a new Parsoid parser.
* @return ParsoidParser
* @since 1.41
* @unstable
*/
public function create(): ParsoidParser {
return new ParsoidParser(
new Parsoid( $this->siteConfig, $this->dataAccess ),
Allow setting a ParserOption to generate Parsoid HTML This is an initial quick-and-dirty implementation. The ParsoidParser class will eventually inherit from \Parser, but this is an initial placeholder to unblock other Parsoid read views work. Currently Parsoid does not fully implement all the ParserOutput metadata set by the legacy parser, but we're working on it. This patch also addresses T300325 by ensuring the the Page HTML APIs use ParserOutput::getRawText(), which will return the entire Parsoid HTML document without post-processing. This is what the Parsoid team refers to as "edit mode" HTML. The ParserOutput::getText() method returns only the <body> contents of the HTML, and applies several transformations, including inserting Table of Contents and style deduplication; this is the "read views" flavor of the Parsoid HTML. We need to be careful of the interaction of the `useParsoid` flag with the ParserCacheMetadata. Effectively `useParsoid` should *always* be marked as "used" or else the ParserCache will assume its value doesn't matter and will serve legacy content for parsoid requests and vice-versa. T330677 is a follow up to address this more thoroughly by splitting the parser cache in ParserOutputAccess; the stop gap in this patch is fragile and, because it doesn't fork the ParserCacheMetadata cache, may corrupt the ParserCacheMetadata in the case when Parsoid and the legacy parser consult different sets of options to render a page. Bug: T300191 Bug: T330677 Bug: T300325 Change-Id: Ica09a4284c00d7917f8b6249e946232b2fb38011
2022-05-27 16:38:32 +00:00
$this->pageConfigFactory,
$this->languageConverterFactory,
Make ParsoidOutputAccess a wrapper over ParserOutputAccess * Updated ParserOutput to set Parsoid render ids that REST API functionality expects in ParserOutput objects. * CacheThresholdTime functionality no longer exists since it was implemented in ParsoidOutputAccess and ParserOutputAccess doesn't support it. This is tracked in T346765. * Enforce the constraint that uncacheable parses are only for fake or mutable revisions. Updated tests that violated this constraint to use 'getParseOutput' instead of calling the parse method directly. * Had to make some changes in ParsoidParser around use of preferredVariant passed to Parsoid. I also left some TODO comments for future fixes. T267067 is also relevant here. PARSOID-SPECIFIC OPTIONS: * logLinterData: linter data is always logged by default -- removed support to disable it. Linter extension handles stale lints properly and it is better to let it handle it rather than add special cases to the API. * offsetType: Moved this support to ParsoidHandler as a post-processing of byte-offset output. This eliminates the need to support this Parsoid-specific options in the ContentHandler hierarchies. * body_only / wrapSections: Handled this in HtmlOutputRendererHelper as a post-processing of regular output by removing sections and returning the body content only. This does result in some useless section-wrapping work with Parsoid, but the simplification is probably worth it. If in the future, we support Parsoid-specific options in the ContentHandler hierarchy, we could re-introduce this. But, in any case, this "fragment" flavor options is likely to get moved out of core into the VisualEditor extension code. DEPLOYMENT: * This patch changes the cache key by setting the useParsoid option in ParserOptions. The parent patch handles this to ensure we don't encounter a cold cache on deploy. TESTS: * Updated tests and mocks to reflect new reality. * Do we need any new tests? Bug: T332931 Change-Id: Ic9b7cc0fcf365e772b7d080d76a065e3fd585f80
2023-08-29 20:13:43 +00:00
$this->legacyParserFactory,
$this->globalIdGenerator
Allow setting a ParserOption to generate Parsoid HTML This is an initial quick-and-dirty implementation. The ParsoidParser class will eventually inherit from \Parser, but this is an initial placeholder to unblock other Parsoid read views work. Currently Parsoid does not fully implement all the ParserOutput metadata set by the legacy parser, but we're working on it. This patch also addresses T300325 by ensuring the the Page HTML APIs use ParserOutput::getRawText(), which will return the entire Parsoid HTML document without post-processing. This is what the Parsoid team refers to as "edit mode" HTML. The ParserOutput::getText() method returns only the <body> contents of the HTML, and applies several transformations, including inserting Table of Contents and style deduplication; this is the "read views" flavor of the Parsoid HTML. We need to be careful of the interaction of the `useParsoid` flag with the ParserCacheMetadata. Effectively `useParsoid` should *always* be marked as "used" or else the ParserCache will assume its value doesn't matter and will serve legacy content for parsoid requests and vice-versa. T330677 is a follow up to address this more thoroughly by splitting the parser cache in ParserOutputAccess; the stop gap in this patch is fragile and, because it doesn't fork the ParserCacheMetadata cache, may corrupt the ParserCacheMetadata in the case when Parsoid and the legacy parser consult different sets of options to render a page. Bug: T300191 Bug: T330677 Bug: T300325 Change-Id: Ica09a4284c00d7917f8b6249e946232b2fb38011
2022-05-27 16:38:32 +00:00
);
}
}