* Use a doubly-linked list for the active formatting elements (AFE)
list, instead of an array, allowing efficient insertion and removal
from the middle, and trivial O(1) lookup of existing elements.
* Use a hashtable of singly-linked lists for storing Noah's Ark buckets,
instead of iterating through the entire AFE list on every push.
* Store attributes in an array instead of serializing them in the
tokenizer. This allows us to avoid sorting them in the output. For the
Noah's Ark clause, the array is copied and then sorted on demand.
* Use XHTML-style serialization with self-closing tags.
* Clear the AFE list in stopParsing(). Otherwise all the BalanceElement
objects are kept alive until after serialization, using O(N^2) memory
for a stack of depth N, since the full serialization is stored at each
stack level.
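
The first three items above can be illustrated with a minimal sketch. It is not the code from this change: the class names `SketchElement` and `SketchAFEList`, the `prevAFE`/`nextAFE`/`nextNoah`/`inAFE` properties, and the `noahKey()` helper are all hypothetical, and markers on the AFE list are ignored for brevity. It assumes PHP 7 or later.

```php
<?php
// Hypothetical sketch; names are illustrative, not taken from the change.
class SketchElement {
    public $name;
    public $attribs;          // attributes kept as an unsorted array
    public $prevAFE = null;   // doubly-linked AFE list pointers
    public $nextAFE = null;
    public $nextNoah = null;  // singly-linked Noah's Ark bucket chain
    public $inAFE = false;

    public function __construct( $name, array $attribs = [] ) {
        $this->name = $name;
        $this->attribs = $attribs;
    }

    // Copy and sort the attributes only when a bucket key is needed.
    public function noahKey() {
        $sorted = $this->attribs; // arrays are copied by value in PHP
        ksort( $sorted );
        return $this->name . "\0" . serialize( $sorted );
    }
}

class SketchAFEList {
    private $head = null;
    private $tail = null;
    private $buckets = [];    // noahKey => newest element with that key

    public function push( SketchElement $elt ) {
        $key = $elt->noahKey();
        // Walk only the matching bucket (newest first), not the whole list.
        $count = 0;
        $third = null;
        for ( $e = $this->buckets[$key] ?? null; $e !== null; $e = $e->nextNoah ) {
            if ( ++$count === 3 ) {
                $third = $e;  // the earliest of three matching entries
            }
        }
        if ( $third !== null ) {
            $this->remove( $third );
        }
        // Append to the tail of the doubly-linked list.
        $elt->prevAFE = $this->tail;
        if ( $this->tail !== null ) {
            $this->tail->nextAFE = $elt;
        } else {
            $this->head = $elt;
        }
        $this->tail = $elt;
        $elt->inAFE = true;
        // Prepend to the bucket chain.
        $elt->nextNoah = $this->buckets[$key] ?? null;
        $this->buckets[$key] = $elt;
    }

    public function remove( SketchElement $elt ) {
        // O(1) unlink: the element carries its own neighbours.
        if ( $elt->prevAFE !== null ) {
            $elt->prevAFE->nextAFE = $elt->nextAFE;
        } else {
            $this->head = $elt->nextAFE;
        }
        if ( $elt->nextAFE !== null ) {
            $elt->nextAFE->prevAFE = $elt->prevAFE;
        } else {
            $this->tail = $elt->prevAFE;
        }
        $elt->prevAFE = $elt->nextAFE = null;
        $elt->inAFE = false;
        // Unlink from the bucket chain, which holds at most three entries.
        $key = $elt->noahKey();
        $prev = null;
        for ( $e = $this->buckets[$key]; $e !== $elt; $e = $e->nextNoah ) {
            $prev = $e;
        }
        if ( $prev !== null ) {
            $prev->nextNoah = $elt->nextNoah;
        } elseif ( $elt->nextNoah !== null ) {
            $this->buckets[$key] = $elt->nextNoah;
        } else {
            unset( $this->buckets[$key] );
        }
        $elt->nextNoah = null;
    }

    // Return the element names from head to tail.
    public function names() {
        $out = [];
        for ( $e = $this->head; $e !== null; $e = $e->nextAFE ) {
            $out[] = $e->name;
        }
        return $out;
    }
}

// Usage: a fourth matching push evicts the earliest matching entry.
$list = new SketchAFEList();
for ( $i = 0; $i < 4; $i++ ) {
    $list->push( new SketchElement( 'b', [ 'id' => 'y', 'class' => 'x' ] ) );
}
echo count( $list->names() ), "\n"; // prints 3
```

In this sketch each push touches only one short bucket chain, and removal from the middle of the list needs no search because the element stores its own neighbours.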

Change-Id: I517129c0658f03eb2ddee61fdf33ffe6fbd48509

This is an HTML5-compliant parse/serialize tidy implementation, with
well-delineated hacks to support the <p>-wrapping done by legacy tidy.

Change-Id: I4fd433fd6f1847061b0bf4b3e249c918720d4fae

This adds an implementation of the HTML5 Tree Builder algorithm to PHP,
along with tree builder test cases derived from the html5lib-tests
package on GitHub. The test cases were preprocessed into JSON for the
`domino` HTML5 parser, and we use that JSON form of the tests.

The implementation follows both the language of the HTML5 specification
and the implementation in `domino` very closely, easing updates if the
specification changes.

This code is used in follow-on commits to support an HTML5-based
"tidy" for MediaWiki and the `{{#balance}}` parser function, which
ensures that a template expands to properly balanced HTML, with all
tags closed and nothing left on the HTML active formatting elements
list.

See: https://github.com/fgnass/domino

Change-Id: I6f4d20a43510dd819776bb333b639315b19d150d