Commit graph

3 commits

Author SHA1 Message Date
Bartosz Dziewoński
0313128b10 Use PHP 7 "\u{NNNN}" Unicode codepoint escapes in string literals
In cases where we're operating on text data (and not binary data),
use e.g. "\u{00A0}" to refer directly to the Unicode character
'NO-BREAK SPACE' instead of "\xc2\xa0" to specify the bytes C2h A0h
(which correspond to the UTF-8 encoding of that character). This
makes it easier to look up those mysterious sequences, as not all
are as recognizable as the no-break space.
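
As a small illustration of the equivalence described above, here is a
sketch in Ruby, whose double-quoted strings happen to support the same
"\u{NNNN}" notation. The variable names are illustrative only and are
not part of the commit.

```ruby
# Two spellings of the same character, U+00A0 NO-BREAK SPACE.
by_codepoint = "\u{00A0}"   # names the Unicode code point directly
by_bytes     = "\xC2\xA0"   # spells out the UTF-8 bytes C2h A0h

# Both literals hold exactly the same byte sequence.
same_bytes = by_codepoint.bytes == by_bytes.bytes

# The code point form is the one you can look up in the Unicode charts.
codepoint_label = format("U+%04X", by_codepoint.ord)

puts same_bytes
puts codepoint_label
```

Only the notation differs; the bytes stored in the string are unchanged.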

PHP does not enforce this, but I think we should write the code points
in uppercase, zero-padded to at least four digits, as the Unicode
standard does.

Note that not all "\xNN" escapes can be automatically replaced:
* We can't use Unicode escapes for binary data that is not UTF-8
  (e.g. in code converting from legacy encodings or testing the
  handling of invalid UTF-8 byte sequences).
* '\xNN' escapes in regular expressions in single-quoted strings
  are actually handled by PCRE and have to be dealt with carefully
  (those regexps should probably be changed to use the /u modifier).
* "\xNN" referring to ASCII characters ("\x7F" and lower) should
  probably be left as-is.

The replacements in this commit were done semi-manually by piping
the existing "\xNN" escapes through the following terrible Ruby
script I devised:

  chars = eval('"' + ARGV[0] + '"').force_encoding('utf-8')
  puts chars.split('').map{|char|
    '\\u{' + char.ord.to_s(16).upcase.rjust(4, '0') + '}'
  }.join('')
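
The same conversion can be sketched as a function that also honours the
caveats listed above. This is a hedged variant, not the script that was
actually used: the name to_php_unicode_escapes is hypothetical, and it
assumes the input is the already-decoded byte string rather than the
escaped source text. It does not address the PCRE case.

```ruby
# Hypothetical helper: turn decoded bytes into PHP 7 "\u{NNNN}" escapes.
# Returns nil when the bytes are not valid UTF-8, since binary data
# cannot be expressed as Unicode escapes and must keep its "\xNN" form.
def to_php_unicode_escapes(bytes)
  text = bytes.dup.force_encoding(Encoding::UTF_8)
  return nil unless text.valid_encoding?

  text.each_char.map do |char|
    if char.ord <= 0x7F
      # ASCII is kept as a "\xNN" escape (assumption: the input came
      # from "\xNN" escapes, so this leaves such characters as-is).
      format('\\x%02X', char.ord)
    else
      # Uppercase hex, zero-padded to at least four digits.
      format('\\u{%04X}', char.ord)
    end
  end.join
end

puts to_php_unicode_escapes("\xC2\xA0".b)
puts to_php_unicode_escapes("\xFF".b).inspect
```

Returning nil for invalid UTF-8 makes the "cannot be replaced
automatically" case explicit instead of emitting a wrong escape.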

Change-Id: Idc3dee3a7fb5ebfaef395754d8859b18f1f8769a
2018-06-04 16:20:13 +00:00
Thiemo Mättig
e16191caa3 Remove unused and unnecessary imports
Change-Id: I26e623a4e4ba965c07670369a90c8a95185ea1e4
2017-06-12 15:50:43 +00:00
Tim Starling
9341a00ed1 RemexHtml tidy driver with p-wrapping
Pull in the RemexHtml library, which is an HTML 5 library I recently
created.

RemexCompatMunger mutates the event stream, inserting <mw:p-wrap>
elements where necessary, and occasionally taking even more invasive
action such as reparenting and removing nodes maintained in Serializer's
tree.

RemexCompatFormatter produces a MediaWiki-style serialization which is
relatively compatible with existing parser tests. It also does final
empty element handling, including translating <mw:p-wrap> to <p>.

Tests are imported from both Html5Depurate and Subbu's pwrap.js.

Depends-On: I864f31d9afdffdde49bfd39f07a0fb7f4df5c5d9
Change-Id: I900155b7dd199b0ae2a3b9cdb6db5136fc4f35a8
2017-03-08 16:54:13 +11:00