wiki.techinc.nl/includes/Tokenizer.php

<?php
class Tokenizer {
	/* private */ var $mText, 		# Text to be processed by the tokenizer
			  $mPos,		# current position of tokenizer in text
			  $mTextLength,		# Length of $mText
			  $mCount,		# number of regex matches, computed in preParse
			  $mMatch,		# matches of tokenizer regex, computed in preParse
			  $mMatchPos;		# index of the current match in $mMatch. Each match can
			  			# yield up to two tokens: the text before it and the matched token.

	/* private */ function Tokenizer()
	{
		$this->mPos=0;
	}

	# factory function
	function newFromString( $s )
	{
		$t = new Tokenizer();
		$t->mText = $s;
		$t->preParse();
		$t->mTextLength = strlen( $s );
		return $t;
	}

	function preParse()
	{
		global $wgLang;

		# build up the regex, step by step.
		# Basic features: Quotes for <em>/<strong> and hyphens for <hr>
		$regex = "\'\'\'\'\'|\'\'\'|\'\'|\n-----*";
		# Append regex for linkPrefixExtension 
		if (  $wgLang->linkPrefixExtension() ) {
			$regex .= "|([a-zA-Z\x80-\xff]+)\[\[";
		} else {
			# opening link tag, which may start with three [ (as in [[[link]]])
			$regex .= "|\[\[\[?";
		}
		# Closing link
		$regex .= "|\]\]";
		# Magic words that automatically generate links
		$regex .= "|ISBN |RFC ";
		# Language-specific additions
		$regex .= $wgLang->tokenizerRegex();
		# Finalize regex
		$regex = "/(" . $regex . ")/";

		# Apply the regex to the text
		$this->mCount = preg_match_all( $regex, $this->mText, $this->mMatch,
						PREG_PATTERN_ORDER|PREG_OFFSET_CAPTURE);
		$this->mMatchPos=0;
	}

	function nextToken()
	{
		$token = $this->previewToken();
		if ( $token ) {
			$this->mMatchPos = $token["mMatchPos"];
			$this->mPos = $token["mPos"];
		}
		return $token;
	}


	function previewToken()
	{
		if ( $this->mMatchPos < $this->mCount  ) {
			$token["pos"] = $this->mPos;
			if ( $this->mPos < $this->mMatch[0][$this->mMatchPos][1] ) {
				$token["type"] = "text";
				$token["text"] = substr( $this->mText, $this->mPos,
							 $this->mMatch[0][$this->mMatchPos][1] - $this->mPos );
				# Where the pointers would move to if this were not just a preview
				$token["mMatchPos"] = $this->mMatchPos; 
				$token["mPos"] = $this->mMatch[0][$this->mMatchPos][1];
			} else {
				# If linkPrefixExtension is set, $this->mMatch[2][$this->mMatchPos][0]
				# contains the link prefix, or is empty if no link prefix exists.
				if ( isset( $this->mMatch[2] ) && $this->mMatch[2][$this->mMatchPos][0] )
				{
					# prefixed link open tag, [0] is "prefix[["
					$token["type"] = "[[";
					$token["text"] = $this->mMatch[2][$this->mMatchPos][0]; # the prefix
				} else {
					$token["type"] = $this->mMatch[0][$this->mMatchPos][0];
					if ( substr($token["type"],1,4) == "----" )
					{
						# Four or more hyphens make an <hr>.
						# Normalize the token type to exactly four.
						$token["type"]="----";
					}
				}
				# Where the pointers would move to if this were not just a preview
				$token["mPos"] = $this->mPos + strlen( $this->mMatch[0][$this->mMatchPos][0] );
				$token["mMatchPos"] = $this->mMatchPos + 1;
			}
		} elseif ( $this->mPos < $this->mTextLength ) {
			$token["type"] = "text";
			$token["text"] = substr( $this->mText, $this->mPos );
			# Where the pointers would move to if this were not just a preview
			$token["mPos"] = $this->mTextLength;
			$token["mMatchPos"] = $this->mMatchPos;
		} else {
			$token = FALSE;
		}
		return $token;
	}

		
}
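
The walk that previewToken() and nextToken() perform over the preg_match_all() result can be sketched as a single loop. This is a minimal illustration, not part of the class: sketchTokenize() is a hypothetical helper, and it assumes that $wgLang->linkPrefixExtension() is false and that $wgLang->tokenizerRegex() returns an empty string, so only the base pattern from preParse() is used.

```php
<?php
# Hypothetical helper, not part of Tokenizer. It splits $text the way
# previewToken()/nextToken() step through the preg_match_all() result.
function sketchTokenize( $text )
{
	# Base pattern from preParse(). Assumption: no link prefix extension
	# and no language-specific additions.
	$regex = "/(\'\'\'\'\'|\'\'\'|\'\'|\n-----*|\[\[\[?|\]\]|ISBN |RFC )/";
	preg_match_all( $regex, $text, $match, PREG_PATTERN_ORDER|PREG_OFFSET_CAPTURE );

	$tokens = array();
	$pos = 0;
	# With PREG_OFFSET_CAPTURE each entry is array( matched string, byte offset ).
	foreach ( $match[0] as $m ) {
		list( $matched, $offset ) = $m;
		if ( $pos < $offset ) {
			# plain text between the previous match and this one
			$tokens[] = array( "text", substr( $text, $pos, $offset - $pos ) );
		}
		# hyphen runs collapse to the "----" type, as in previewToken()
		$type = ( substr( $matched, 1, 4 ) == "----" ) ? "----" : $matched;
		$tokens[] = array( $type, $matched );
		$pos = $offset + strlen( $matched );
	}
	if ( $pos < strlen( $text ) ) {
		# trailing text after the last match
		$tokens[] = array( "text", substr( $text, $pos ) );
	}
	return $tokens;
}

# Print the token types, one per line.
foreach ( sketchTokenize( "A ''b'' [[c]]" ) as $t ) {
	echo $t[0], "\n";
}
```

Each match contributes at most two entries, the text in front of it and the matched token itself, which is why the class tracks both a text position ($mPos) and a match index ($mMatchPos).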
Added real parser/tokenizer. Tokenizer is a new class that splits a text into tokens. Parser calls the tokenizer to get one token by another and handle them one by one. Parser:doAllQuotes and Parser:replaceInternalLinks have been replaced by the new parser. Image thumbnailing now allows links in the captions. 2004-02-28 23:38:08 +00:00
renamed variables for better readability 2004-02-29 11:00:30 +00:00
extended tokenizer to handle prefixed links 2004-02-29 13:33:51 +00:00
Added hook to tokenizer and to parser for language specific processing. Using this hook, added a conversion of spaces to non-breaking spaces for the French wikipedia. Switched ----- -> <hr> processing to tokenizer. 2004-03-02 20:23:56 +00:00
Moved ISBN magic to new parser 2004-03-06 20:04:25 +00:00
Added RFC link magic, similar to ISBN magic 2004-03-06 21:30:42 +00:00
Fixed what seems to be an off-by-one error (it tried to access one past the end of the array quite consistently). Someone who understands this code, please check. 2004-03-08 02:46:27 +00:00
Fix sourceforge bug 872981 Render [[[link]]] as [<a href...>link</a>] Render [[[link|text]]] as [<a href...>text</a>] UNTESTED with $wgLang->linkPrefixExtension() true 2004-03-16 02:17:33 +00:00