* Refactored the parser. See my huge entry in RELEASE-NOTES for details.

* Made it possible to configure the parser class being used, via $wgParserConf.
* Moved defines from the top of Parser.php to either class constants or Defines.php
* Added Parser_DiffTest, a differential parser class for regression testing
* Added Parser_OldPP, a parser class which operates like the parser before this commit. I made one breaking change: a bugfix to avoid losing whitespace when adding MWTEMPLATESECTION markers. 
* Made internal tidy work with PHP 5
* Added the ability to supply a hook for template fetching via ParserOptions. This is handy for testing.
* Updated parserTests.txt to account for the various breaking changes I made. Removed a few parser tests that no longer test for anything useful.
Tim Starling 2007-11-20 10:55:08 +00:00
parent 8a355e04ff
commit b6dba5bcfd
10 changed files with 6356 additions and 923 deletions

@ -174,6 +174,71 @@ it from source control: http://www.mediawiki.org/wiki/Download_from_SVN
* Make a better rate-limiting error message (i.e. a normal MW error,
rather than an "Internal Server Error").
== Parser changes in 1.12 ==
The parser pass order has changed from
* Extension tag strip and render
* HTML normalisation and security
* Template expansion
* Main section...
to
* Template and extension tag parse to intermediate representation
* Template expansion and extension rendering
* HTML normalisation and security
* Main section...
The main effect of this for the user is that the rules for uncovered syntax
have changed.
Uncovered main-pass syntax, such as HTML tags, is now generally valid, whereas
previously in some cases it was escaped. For example, you could have "<ta" in
one template, and "ble>" in another template, and put them together to make a
valid <table> tag. Previously the result would have been "&lt;table&gt;".
Uncovered preprocessor syntax is generally not recognised. For example, if you
have "{{a" in Template:A and "b}}" in Template:B, then "{{a}}{{b}}" will be
converted to a literal "{{ab}}" rather than the contents of Template:Ab. This
was the case previously in HTML output mode, and is now uniformly the case in
the other modes as well. HTML-style comments uncovered by template expansion
will not be recognised by the preprocessor and hence will not prevent template
expansion within them, but they will be stripped by the following HTML security
pass.
The rules for template expansion during message transformation were
counterintuitive, mostly accidental and buggy. There are a few small changes in
this version: for example, templates with dynamic names, as in "{{ {{a}} }}",
are fully expanded as they are in HTML mode, whereas previously only the inner
template was expanded. I'd like to make some larger breaking changes to message
transformation, after a review of typical use cases.
The header identification routines for section edit and for numbering section
edit links have been merged. This removes a significant failure mode and fixes a
whole category of bugs (tracked by bug #4899). Wikitext headings uncovered by
template expansion or comment removal will still be rendered into a heading tag,
and will get an entry in the TOC, but will not have a section edit link.
HTML-style headings will also not have a section edit link. Valid wikitext
headings present in the template source text will get a template section edit
link. This is a major break from previous behaviour, but I believe the effects
are almost entirely beneficial.
The main motivation for making these changes was performance. The new two-pass
preprocessor can skip "dead branches" in template expansion, such as unfollowed
#switch cases and unused defaults for template arguments. This provides a
significant performance improvement in template-heavy test cases taken from
Wikipedia. Parser function hooks can participate in this performance improvement
by using the new SFH_OBJECT_ARGS flag during registration.
The pre-expand include size limit has been removed, since there's no efficient
way to calculate such a figure, and it would now be meaningless for performance
anyway. The "preprocessor node count" takes its place, with a generous default
limit.
The context in which XML-style extension tags are called has changed, so
extensions which make use of the parser state may need compatibility changes.
=== API changes in 1.12 ===
Full API documentation is available at http://www.mediawiki.org/wiki/API

@ -7,6 +7,8 @@ ini_set('unserialize_callback_func', '__autoload' );
function __autoload($className) {
global $wgAutoloadClasses;
# Locations of core classes
# Extension classes are specified with $wgAutoloadClasses
static $localClasses = array(
# Includes
'AjaxDispatcher' => 'includes/AjaxDispatcher.php',
@ -133,9 +135,11 @@ function __autoload($className) {
'ReverseChronologicalPager' => 'includes/Pager.php',
'TablePager' => 'includes/Pager.php',
'Parser' => 'includes/Parser.php',
'Parser_OldPP' => 'includes/Parser_OldPP.php',
'Parser_DiffTest' => 'includes/Parser_DiffTest.php',
'ParserCache' => 'includes/ParserCache.php',
'ParserOutput' => 'includes/ParserOutput.php',
'ParserOptions' => 'includes/ParserOptions.php',
'ParserCache' => 'includes/ParserCache.php',
'PatrolLog' => 'includes/PatrolLog.php',
'ProfilerSimple' => 'includes/ProfilerSimple.php',
'ProfilerSimpleUDP' => 'includes/ProfilerSimpleUDP.php',

@ -881,6 +881,8 @@ $wgMaxNameChars = 255; # Maximum number of bytes in username
$wgMaxSigChars = 255; # Maximum number of Unicode characters in signature
$wgMaxArticleSize = 2048; # Maximum article size in kilobytes
$wgMaxPPNodeCount = 1000000; # A complexity limit on template expansion
$wgExtraSubtitle = '';
$wgSiteSupportPage = ''; # A page where your users can make donations
@ -1875,7 +1877,7 @@ $wgAlwaysUseTidy = false;
$wgTidyBin = 'tidy';
$wgTidyConf = $IP.'/includes/tidy.conf';
$wgTidyOpts = '';
$wgTidyInternal = function_exists( 'tidy_load_config' );
$wgTidyInternal = extension_loaded( 'tidy' );
/** See list of skins and their symbolic names in languages/Language.php */
$wgDefaultSkin = 'monobook';
@ -2782,3 +2784,19 @@ $wgDisableOutputCompression = false;
*/
$wgSlaveLagWarning = 10;
$wgSlaveLagCritical = 30;
/**
* Parser configuration. Associative array with the following members:
*
* class The class name
*
* The entire associative array will be passed through to the constructor as
* the first parameter. Note that only Setup.php can use this variable:
* the configuration will change at runtime via $wgParser member functions, so
* the contents of this variable will be out-of-date. The variable can only be
* changed in LocalSettings.php; in particular, it can't be changed during
* an extension setup function.
*/
$wgParserConf = array(
'class' => 'Parser',
);
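
As a hedged illustration of the setting above, a LocalSettings.php fragment along these lines should select the compatibility parser class added in this commit. It is a sketch that assumes only what the doc comment states: the array is passed to the class constructor, and it must be set in LocalSettings.php rather than in an extension setup function.

```php
# LocalSettings.php (fragment, illustrative only)
# Use the parser class that behaves like the parser before this commit.
# 'class' is the only member documented for $wgParserConf.
$wgParserConf = array(
	'class' => 'Parser_OldPP',
);
```

Removing the override restores the default of 'Parser'.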

@ -263,4 +263,17 @@ define( 'UTF8_TAIL', true );
# Hook support constants
define( 'MW_SUPPORTS_EDITFILTERMERGED', 1 );
# Allowed values for Parser::$mOutputType
# Parameter to Parser::startExternalParse().
define( 'OT_HTML', 1 );
define( 'OT_WIKI', 2 );
define( 'OT_MSG' , 3 );
define( 'OT_PREPROCESS', 4 );
# Flags for Parser::setFunctionHook
define( 'SFH_NO_HASH', 1 );
define( 'SFH_OBJECT_ARGS', 2 );
# Flags for Parser::replaceLinkHolders
define( 'RLH_FOR_UPDATE', 1 );

File diff suppressed because it is too large

@ -21,7 +21,9 @@ class ParserOptions
var $mTidy; # Ask for tidy cleanup
var $mInterfaceMessage; # Which lang to call for PLURAL and GRAMMAR
var $mMaxIncludeSize; # Maximum size of template expansions, in bytes
var $mMaxPPNodeCount; # Maximum number of nodes touched by PPFrame::expand()
var $mRemoveComments; # Remove HTML comments. ONLY APPLIES TO PREPROCESS OPERATIONS
var $mTemplateCallback; # Callback for template fetching
var $mUser; # Stored user object, just used to initialise the skin
@ -36,7 +38,9 @@ class ParserOptions
function getTidy() { return $this->mTidy; }
function getInterfaceMessage() { return $this->mInterfaceMessage; }
function getMaxIncludeSize() { return $this->mMaxIncludeSize; }
function getMaxPPNodeCount() { return $this->mMaxPPNodeCount; }
function getRemoveComments() { return $this->mRemoveComments; }
function getTemplateCallback() { return $this->mTemplateCallback; }
function getSkin() {
if ( !isset( $this->mSkin ) ) {
@ -65,7 +69,9 @@ class ParserOptions
function setSkin( $x ) { $this->mSkin = $x; }
function setInterfaceMessage( $x ) { return wfSetVar( $this->mInterfaceMessage, $x); }
function setMaxIncludeSize( $x ) { return wfSetVar( $this->mMaxIncludeSize, $x ); }
function setMaxPPNodeCount( $x ) { return wfSetVar( $this->mMaxPPNodeCount, $x ); }
function setRemoveComments( $x ) { return wfSetVar( $this->mRemoveComments, $x ); }
function setTemplateCallback( $x ) { return wfSetVar( $this->mTemplateCallback, $x ); }
function __construct( $user = null ) {
$this->initialiseFromUser( $user );
@ -83,6 +89,7 @@ class ParserOptions
function initialiseFromUser( $userInput ) {
global $wgUseTeX, $wgUseDynamicDates, $wgInterwikiMagic, $wgAllowExternalImages;
global $wgAllowExternalImagesFrom, $wgAllowSpecialInclusion, $wgMaxArticleSize;
global $wgMaxPPNodeCount;
$fname = 'ParserOptions::initialiseFromUser';
wfProfileIn( $fname );
if ( !$userInput ) {
@ -111,7 +118,9 @@ class ParserOptions
$this->mTidy = false;
$this->mInterfaceMessage = false;
$this->mMaxIncludeSize = $wgMaxArticleSize * 1024;
$this->mMaxPPNodeCount = $wgMaxPPNodeCount;
$this->mRemoveComments = true;
$this->mTemplateCallback = array( 'Parser', 'statelessFetchTemplate' );
wfProfileOut( $fname );
}
}

@ -0,0 +1,62 @@
<?php
class Parser_DiffTest
{
var $parsers, $conf;
function __construct( $conf ) {
if ( !isset( $conf['parsers'] ) ) {
throw new MWException( __METHOD__ . ': no parsers specified' );
}
$this->conf = $conf;
}
function init() {
if ( !is_null( $this->parsers ) ) {
return;
}
foreach ( $this->conf['parsers'] as $i => $parserConf ) {
if ( !is_array( $parserConf ) ) {
$class = $parserConf;
$parserConf = array( 'class' => $class );
} else {
$class = $parserConf['class'];
}
$this->parsers[$i] = new $class( $parserConf );
}
}
function __call( $name, $args ) {
$this->init();
$results = array();
$mismatch = false;
$lastResult = null;
$first = true;
foreach ( $this->parsers as $i => $parser ) {
$currentResult = call_user_func_array( array( &$this->parsers[$i], $name ), $args );
if ( $first ) {
$first = false;
} else {
if ( $lastResult !== $currentResult ) {
$mismatch = true;
}
}
$results[$i] = $currentResult;
$lastResult = $currentResult;
}
if ( $mismatch ) {
throw new MWException( "Parser_DiffTest: results mismatch on call to $name\n" .
'Arguments: ' . var_export( $args, true ) . "\n" .
'Results: ' . var_export( $results, true ) . "\n" );
}
return $lastResult;
}
function setFunctionHook( $id, $callback, $flags = 0 ) {
$this->init();
foreach ( $this->parsers as $i => $parser ) {
$parser->setFunctionHook( $id, $callback, $flags );
}
}
}
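
The class above reads its sub-parsers from the 'parsers' member of its configuration array, so a differential test run could plausibly be enabled with a LocalSettings.php fragment like the following. This is a sketch, not a documented recipe: the pairing of 'Parser' with 'Parser_OldPP' is an assumption chosen for illustration. The array form is used for each entry because init() takes the class name from the 'class' key.

```php
# LocalSettings.php (fragment, illustrative only)
# Route every $wgParser call through both parser classes and compare results.
$wgParserConf = array(
	'class' => 'Parser_DiffTest',
	'parsers' => array(
		array( 'class' => 'Parser' ),       # the refactored parser
		array( 'class' => 'Parser_OldPP' ), # the pre-refactoring behaviour
	),
);
```

With this in place, any call whose results differ between the two parsers raises an MWException listing the arguments and each parser's result, as the __call() method above shows.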

4918
includes/Parser_OldPP.php Normal file

File diff suppressed because it is too large

@ -235,7 +235,8 @@ $wgRequest->interpolateTitle();
$wgUser = new StubUser;
$wgLang = new StubUserLang;
$wgOut = new StubObject( 'wgOut', 'OutputPage' );
$wgParser = new StubObject( 'wgParser', 'Parser' );
$wgParser = new StubObject( 'wgParser', $wgParserConf['class'], array( $wgParserConf ) );
$wgMessageCache = new StubObject( 'wgMessageCache', 'MessageCache',
array( $parserMemc, $wgUseDatabaseMessages, $wgMsgCacheExpiry, wfWikiID() ) );

@ -3878,7 +3878,7 @@ Bug 2304: HTML attribute safety (unsafe breakout parameter; 2309)
!! input
{{div style|"><script>alert(document.cookie)</script>}}
!! result
<div>Magic div</div>
<div style="float: right;">&lt;script&gt;alert(document.cookie)&lt;/script&gt;"&gt;Magic div</div>
!! end
@ -3887,7 +3887,7 @@ Bug 2304: HTML attribute safety (unsafe breakout parameter 2; 2309)
!! input
{{div style|" ><script>alert(document.cookie)</script>}}
!! result
<div style="float: right;">Magic div</div>
<div style="float: right;">&lt;script&gt;alert(document.cookie)&lt;/script&gt;"&gt;Magic div</div>
!! end
@ -4151,7 +4151,7 @@ array(0) {
!! test
Parser hook: case insensetive
Parser hook: case insensitive
!! input
<TAG>input</TAG>
!! result
@ -4165,7 +4165,7 @@ array(0) {
!! test
Parser hook: case insensetive, redux
Parser hook: case insensitive, redux
!! input
<TaG>input</TAg>
!! result
@ -4724,8 +4724,8 @@ MOVE YOUR MOUSE CURSOR OVER THIS TEXT
|
!! result
<table>
<u class="&#124;">} &gt;
{{{|
<u class="&#124;">}}}} &gt;
<br style="onmouseover=&#39;alert(document.cookie);&#39;" />
MOVE YOUR MOUSE CURSOR OVER THIS TEXT
@ -4749,8 +4749,10 @@ noxml
>
}}}blah" onmouseover="alert('hello world');" align="left"'''MOVE MOUSE CURSOR OVER HERE
!! result
<p>{{{|
</p>
<li class="&#124;&#124;">
blah" onmouseover="alert('hello world');" align="left"<b>MOVE MOUSE CURSOR OVER HERE</b>
}}}blah" onmouseover="alert('hello world');" align="left"<b>MOVE MOUSE CURSOR OVER HERE</b>
!! end
@ -5251,10 +5253,11 @@ Section extraction test with comment after heading (section 1)
section=1
!! input
==a==
==legal== <!-- a legal section -->
==unmarked== <!-- an unmarked section -->
==b==
!! result
==a==
==unmarked== <!-- an unmarked section -->
!! end
!! test
@ -5263,10 +5266,10 @@ Section extraction test with comment after heading (section 2)
section=2
!! input
==a==
==legal== <!-- a legal section -->
==unmarked== <!-- an unmarked section -->
==b==
!! result
==legal== <!-- a legal section -->
==b==
!! end
!! test
@ -5295,102 +5298,79 @@ section=2
!! end
# Formerly testing for bug 2587, now resolved by the use of unmarked sections
# instead of respecting commented sections
!! test
Section extraction prefixed by comment (section 1) (bug 2587)
Section extraction prefixed by comment (section 1)
!! options
section=1
!! input
<!-- -->==sec1==
==sec2==
!!result
<!-- -->==sec1==
==sec2==
!!end
!! test
Section extraction prefixed by comment (section 2) (bug 2587)
Section extraction prefixed by comment (section 2)
!! options
section=2
!! input
<!-- -->==sec1==
==sec2==
!!result
==sec2==
!!end
# Formerly testing for bug 2607, now resolved by the use of unmarked sections
# instead of respecting HTML-style headings
!! test
Section extraction, mixed wiki and html (section 1) (bug 2607)
Section extraction, mixed wiki and html (section 1)
!! options
section=1
!! input
<h2>1</h2>
<h2>unmarked</h2>
unmarked
==1==
one
==2==
two
==3==
three
!! result
<h2>1</h2>
==1==
one
!! end
!! test
Section extraction, mixed wiki and html (section 2) (bug 2607)
Section extraction, mixed wiki and html (section 2)
!! options
section=2
!! input
<h2>1</h2>
<h2>unmarked</h2>
unmarked
==1==
one
==2==
two
==3==
three
!! result
==2==
two
!! end
# Formerly testing for bug 3342
!! test
Section extraction, heading surrounded by <noinclude> (bug 3342)
Section extraction, heading surrounded by <noinclude>
!! options
section=1
!! input
<noinclude>==a==</noinclude>
text
<noinclude>==unmarked==</noinclude>
==marked==
!! result
<noinclude>==a==</noinclude>
text
==marked==
!!end
!! test
Section extraction, HTML heading subsections (bug 5272)
!! options
section=1
!! input
<h2>a</h2>
<h3>aa</h3>
<h2>b</h2>
!! result
<h2>a</h2>
<h3>aa</h3>
!! end
!! test
Section extraction, HTML headings should be ignored in extensions (bug 3476)
!! options
section=2
!! input
<h2>a</h2>
<tag>
<h2>not b</h2>
</tag>
<h2>b</h2>
!! result
<h2>b</h2>
!! end
!! test
Section replacement test (section 0)
!! options
@ -5722,94 +5702,6 @@ xxx
!! end
!! test
Section extraction, HTML headings not at line boundaries (section 0)
!! options
section=0
!! input
<h2>Evil</h2><i>blah blah blah</i>
evil blah
<h2>Nice</h2>
nice blah
<i>extra evil</i><h2>Extra nasty</h2>
extra nasty
!! result
!! end
!! test
Section extraction, HTML headings not at line boundaries (section 1)
!! options
section=1
!! input
<h2>Evil</h2><i>blah blah blah</i>
evil blah
<h2>Nice</h2>
nice blah
<i>extra evil</i><h2>Extra nasty</h2>
extra nasty
!! result
<h2>Evil</h2><i>blah blah blah</i>
evil blah
!! end
!! test
Section extraction, HTML headings not at line boundaries (section 2)
!! options
section=2
!! input
<h2>Evil</h2><i>blah blah blah</i>
evil blah
<h2>Nice</h2>
nice blah
<i>extra evil</i><h2>Extra nasty</h2>
extra nasty
!! result
<h2>Nice</h2>
nice blah
<i>extra evil</i>
!! end
!! test
Section extraction, HTML headings not at line boundaries (section 3)
!! options
section=3
!! input
<h2>Evil</h2><i>blah blah blah</i>
evil blah
<h2>Nice</h2>
nice blah
<i>extra evil</i><h2>Extra nasty</h2>
extra nasty
!! result
<h2>Extra nasty</h2>
extra nasty
!! end
!! test
Section extraction, heading followed by pre with 20 spaces (bug 6398)
!! options