Pages

08 October 2011

Regexes Parse XML Just Fine, Actually

Despite what you may believe, or may have heard from a particular famous Lovecraftian answer on Stack Overflow, you can actually use regular expressions to parse arbitrary XML. Of course, they’re not regular expressions in the strict mathematical sense, but rather in the sense with which most of us are familiar—Perl regexen.

This relies on a seemingly little-known but tremendously handy feature of Perl regular expressions: (?N) will recursively match an instance at the current position of whatever is in capture group N. This lets you recursively match delimiters, limited to a depth of 50 unless you build yourself a special Perl.

The only bit you can’t do in the regular expression engine itself is assert that the names of two matching tags are equal, because you can’t match backreferences in recursive submatches. This isn’t a problem if you assume your input is well-formed. I’m talking about parsing XML, not checking whether some input actually is XML. Correctness is a Boolean, after all: invalid XML is not XML.


I wrote a quick Perl script to demonstrate this using a number of test strings, both those well formed and those less than perfect. This was in part because I feel like doing “impossible” things lately, and in part because I wanted to brush up on my Perl. The point of it is not to be a good solution, but rather the opposite: there is a certain joy in getting something to work in completely the wrong way.

Anyway, the body of it goes like this:

my $element = qr{ ( $STag ( $CharData? (?: $Reference | $CDSect | $PI
    | $Comment | (?1) )* $CharData? ) $ETag ) }sx;
                 # ^
                 # The morsel that matches $element recursively.

my @stack = '<html><head><title>Title</title></head><body><h1>Lies.</h1>'
    . '<p><i>You</em>, my friend, have been told them.</p></body></html>';

$" = "\n\n";
while (@stack) {
    my $string = shift @stack // '';
    my @groups = $string =~ m/$element/g;
    print "@groups\n\n" if @groups;
    unshift @stack, map {
        s/^<($Name)(?:$S*$Attribute)*$S?>//; my $a = $1;
        s/<\/($Name)$S*>$//;                 my $b = $1;
        $a eq $b ? $_ : undef
    } @groups;
}

This recursively enumerates all tags and their contents in some rough semblance of hierarchical order:
  • <html>…</html>
  • <head>…</head><body>…</body>
  • <head>…</head>
  • <title>…</title>
  • <body>…</body>
  • <h1>…</h1><p>…</p>
  • <h1>…</h1>
  • Lies.
  • <p>…</p>
  • <i>…</em>, my friend, have been told them.
  • <i>…</em>
  • You
  • <title>…</title>
  • Title
Here for your enjoyment is the script in its entirety, which ought to match all matched tags that are valid according to the XML specification, including weird tag names, comments, <![CDATA[...]]>, and all that other fun stuff.

I’m admittedly unsure of whether the rules involving lookahead and lookbehind actually match the spec, but it was what came to mind. The rules in question are $CData, $CharData, $PITarget, $PI, and $Comment (but they seem to work okay from what little testing I’ve done).

#!/usr/bin/perl

use warnings;
use strict;

my $S = qr{ \x20 | \x09 | \x0D | \x0A }x;

my $NameStartChar = qr{ : | [A-Z] | _ | [a-z] | [\xC0-\xD6] | [\xD8-\xF6]
    | [\xF8-\x{2FF}] | [\x{370}-\x{37D}] | [\x{37F}-\x{1FFF}]
    | [\x{200C}-\x{200D}] | [\x{2070}-\x{218F}] | [\x{2C00}-\x{2FEF}]
    | [\x{3001}-\x{D7FF}] | [\x{F900}-\x{FDCF}] | [\x{FDF0}-\x{FFFD}]
    | [\x{10000}-\x{EFFFF}] }x;

my $NameChar = qr{ $NameStartChar | - | \. | [0-9] | \xB7
    | [\x{0300}-\x{036F}] | [\x{203F}-\x{2040}] }x;

my $Char = qr{ \x09 | \x0A | \x0D | [\x20-\x{D7FF}] | [\x{E000}-\x{FFFD}]
    | [\x{10000}-\x{10FFFF}] }x;

my $Name = qr{ $NameStartChar $NameChar* }x;

my $EntityRef = qr{ & $Name ; }x;

my $CharRef = qr{ % $Name ; }x;

my $Reference = qr{ $EntityRef | $CharRef }x;

my $AttValue = qr{ " (?: [^<&"] | $Reference )* " }x;

my $Attribute = qr{ $Name $S? = $S? $AttValue }x;

my $STag = qr{ < $Name (?: $S* $Attribute )* $S? > }x;

my $ETag = qr{ </ $Name $S? > }x;

my $CDStart = qr{ <!\[CDATA\[ }x;

my $CDEnd = qr{ \]\]> }x;

my $CData = qr{ (?<! $CDEnd ) $Char* (?! $CDEnd ) }x;

my $CharData = qr{ (?<! $CDEnd ) [^<&]* (?! $CDEnd ) }x;

my $CDSect = qr{ $CDStart $CData $CDEnd }x;

my $PITarget = qr{ (?! [Xx][Mm][Ll] ) $Name }x;

my $PI = qr{ <\? $PITarget (?: $S (?: (?<! \?> ) $Char* (?! \?> )))? \?> }x;

my $Comment = qr{ <!-- (?: (?: (?! - ) $Char ) | (?: - (?! - ) $Char ) )*
    --> }x;

my $element = qr{ ( $STag ( $CharData? (?: $Reference | $CDSect | $PI
    | $Comment | (?1) )* $CharData? ) $ETag ) }sx;

my @stack = '<html><head><title>Title</title></head><body><h1>Lies.</h1>'
    . '<p><i>You</em>, my friend, have been told them.</p></body></html>';

$" = "\n\n";
while (@stack) {
    my $string = shift @stack // '';
    my @groups = $string =~ m/$element/g;
    print "@groups\n\n" if @groups;
    unshift @stack, map {
        s/^<($Name)(?:$S*$Attribute)*$S?>//; my $a = $1;
        s/<\/($Name)$S*>$//;                 my $b = $1;
        $a eq $b ? $_ : undef
    } @groups;
}

I would call this a good (okay, slow and overspecified, but still kinda neat) practical solution to a problem that’s theoretically unsolvable. That, incidentally, will be the theme of many of my upcoming posts, so stay tuned if you’re into it.

4 comments:

  1. Except the famous answer is about using a singular regexp to parse HTML. Of course you can craft a series of regexp and additional code to parse HTML, big deal?

    ReplyDelete
  2. @Anonymous:

    It is one regex, split only for convenience.

    ReplyDelete
  3. The "famous answer" predates Perl 5.10. In pre-5.10 Perl, the techniques Jon showed aren't available (unless you consider (??{ }) to be a true regexp construct)

    ReplyDelete
  4. Check out this work from:

    http://www.cs.sfu.ca/~cameron/REX.html

    I used it some time ago (circa 2000) in Perl scripts that parsed XML quite nicely and fast in memory.

    Sergio

    ReplyDelete