Repair malformed XML
spoulson
created: 2005-02-03 10:20:38
I'm using a program (by Microsoft, no less) that generates an extremely large (80MB) XML file. In some cases, this program is known to output malformed XML. Closing tags are missing in much of the file. There is a possibility this could be done with some tool out there, but I'd really like to see if Perl can do this quickly and easily with existing modules.

An example of the malformed XML looks like:

   
      
      
         
      
         
         
         1.0
         
         2/1/2005 6:41:30 PM
         approved
         False
   
Here, you can see that does not have a closing tag. While I do not have a DTD or Schema of this file, I believe I can make the assumption that encloses everything up to . This pattern of missing tags repeats over and over throughout the XML file for each file that it describes (there are 20,000+ files). But, there are some cases in the XML where the closing tag does appear where it should. The process needs to determine if the tag is missing. Heck, if any closing tag is missing.

I know XML::LibXML won't like it, because it must be well-formed. I imagine XML::Parser could do this, but I can't really visualize how to do it. Could someone please offer some wisdom?

Re: Repair malformed XML
created: 2005-02-03 10:33:28

I know this isn't the solution to your problem per se (and I anxiously await any good solution since I could find many a use for it!), but I know that if my company were to have an XML output like this, we would have so many customers complaining ...

I realise that MS's home user support is nearly non-existent. But if this is something you're doing at work (and I would bet that's the case), have you tried raising a stink internally to the point where your contact person with MS raises a stink with them? (If that's you, and if you're not a manager, I would get your manager's approval to go raise a stink - most people would be willing to let you, I would hope.) It's probably going to be cheaper than writing a fix-it tool.

Re: Repair malformed XML
created: 2005-02-03 10:46:39
Well, reporting missing closing tags is trivial. For each type of element, just count the number of opening tags and the number of closing tags. If they are equal, no closing tags are missing (assuming no opening tags are missing). Otherwise, the difference is the number of closing tags missing.
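
A minimal sketch of that count, for illustration only. The sub names count_tags and missing_closers are made up here, the whole document is assumed to fit in one string, and the regex assumes no '<' or '>' inside attribute values, comments or CDATA sections:

```perl
use strict;
use warnings;

# Count opening and closing tags per element name in a string of XML.
# Self-closing tags (<foo/>) are skipped; comments, processing
# instructions and declarations never match the name pattern.
sub count_tags {
    my ($xml) = @_;
    my (%open, %close);
    while ($xml =~ m{<(/?)([A-Za-z_][\w:.-]*)([^<>]*)>}g) {
        my ($slash, $name, $rest) = ($1, $2, $3);
        next if $rest =~ m{/\s*$};    # self-closing: balanced by itself
        if ($slash) { $close{$name}++ } else { $open{$name}++ }
    }
    return (\%open, \%close);
}

# Return a hash of element name => number of missing closing tags.
sub missing_closers {
    my ($xml) = @_;
    my ($open, $close) = count_tags($xml);
    my %missing;
    for my $name (keys %$open) {
        my $diff = $open->{$name} - ($close->{$name} || 0);
        $missing{$name} = $diff if $diff > 0;
    }
    return %missing;
}

my %m = missing_closers('<a><b>x<b>y</b></a>');
print "$_: $m{$_}\n" for sort keys %m;    # prints "b: 1"
```

This only says how many closing tags are missing per element type, not where they belong.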

As for repairing -- without a DTD, it's going to be heuristics. And I'm not going to suggest any heuristics based on a tiny sample (643 bytes out of 80 Mb, about 0.00077%) of the file.

Re^2: Repair malformed XML
created: 2005-02-03 10:58:00
If I reverse engineered a DTD, would my chances of earning my XML repair badge be better? What module is capable of validating against DTD to identify a dropped tag like this?
Re^3: Repair malformed XML
created: 2005-02-03 12:01:38
If I reverse engineered a DTD, would my chances of earning my XML repair badge be better?
Maybe. That will depend on the DTD. But how do you know that what you reverse engineer is correct? Or perhaps you reverse engineer a DTD (which may or may not be correct) and it allows unambiguous repairs. (That's not so far-fetched. Consider an HTML or XHTML document with some of the </EM> tags missing. It will not always be clear where to insert the missing tags, even if you assume they belong just before or after some other tag.)

One disadvantage of attempting a repair without knowing how to recognize a correct document is that you may end up with a document that is well-formed, or that even conforms to the DTD you have or reverse engineered, and still not know whether you ended up with the right document.

Consider a Perl program from which a quote is missing. You could write a "repair" program that notices a quote is missing and puts a quote back into the program. Now, if you just randomly inserted the quote in the program, you're likely to end up with a program that still doesn't compile. But for most programs that are missing a quote, there will be more than one place where the quote can be inserted and you still have a compilable program. Which one should your repair program pick? How does it know it's right?

Re^2: Repair malformed XML
created: 2005-02-03 11:23:26

I agree that this is largely a guess, but there is one relatively simple heuristic that might actually help this case. Well-formed XML documents may nest tags, but can't have an inner tag close after the enclosing tag. For example:

Some text  
Some text  

So, an algorithm that makes sure nested tags are closed before the enclosing tags is a good step, and if the sample above is representative such a step will likely go a long way toward solving the problem.

Anima Legato
.oO all things connect through the motion of the mind

Re^3: Repair malformed XML
created: 2005-02-03 11:49:40
Yeah, but with that heuristic, one could immediately close any opening tag that doesn't have a corresponding closing tag (hence promoting them to empty elements). Or, by the same token, simply remove opening tags that don't have a corresponding closing tag (eliminating the element). Or you keep a stack of elements (push on open; pop on close), and if you encounter a closing tag that doesn't belong to the element on top of your stack, keep popping and closing till you find the correct one (implicitly closing elements, like HTML's P, LI and TD elements).

Any one could be right. Or wrong. Or right sometimes, and wrong at other times. You end up with a document that is "well-formed". It may be correct, but it may not. You don't know. If you leave the document unmodified, any parser will tell you it's incorrect. That might even be a better situation.
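
To make the third option concrete, here is a rough sketch of the stack approach (push on open, pop on close, implicitly close until the closing tag matches). The sub name close_open_elements is made up for illustration, the input is assumed to fit in one string, and the tokenizer assumes no '<' or '>' inside attribute values, comments or CDATA sections:

```perl
use strict;
use warnings;

# Stack heuristic: push on open, pop on close; when a closing tag does not
# match the top of the stack, implicitly close elements until it does.
sub close_open_elements {
    my ($xml) = @_;
    my @stack;
    my $out = '';
    # Split into tags and the text between them, keeping the tags.
    for my $tok (split /(<[^<>]*>)/, $xml) {
        next if $tok eq '';
        if ($tok =~ m{^</([^\s>]+)\s*>$}) {                 # closing tag
            my $name = $1;
            if (grep { $_ eq $name } @stack) {
                # Close everything opened after the matching element.
                while ($stack[-1] ne $name) {
                    $out .= '</' . pop(@stack) . '>';
                }
                pop @stack;
                $out .= $tok;
            }
            # A closing tag with no opener on the stack is dropped.
        }
        elsif ($tok =~ m{^<([A-Za-z_][^\s/>]*)[^<>]*>$}) {  # opening or self-closing
            push @stack, $1 unless $tok =~ m{/\s*>$};
            $out .= $tok;
        }
        else {
            $out .= $tok;    # text, comments, declarations: copied as is
        }
    }
    # Close whatever is still open at end of input.
    $out .= '</' . pop(@stack) . '>' while @stack;
    return $out;
}

print close_open_elements('<a><b>x<b>y</b></a>'), "\n";
# prints <a><b>x<b>y</b></b></a>
```

The output is well-formed, but the second b ends up nested inside the first. If the author meant two sibling b elements, the repair is wrong, which is exactly the problem with guessing.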

Re^4: Repair malformed XML
created: 2005-02-03 12:22:36

Not quite. If the DTD tells you, for example, that element a may contain elements b, c, or d, and that b can contain e and f, then if it looks like element a contains one b and two e's, you can be pretty sure that the b was closed improperly (if at all), and the e's should be in b.

There are still many possibilities for confusion. But a heuristic that started with the DTD could do quite a good job. I'm not going to pretend it would be easy and/or fun ... but in theory the information may be there that could do a good job - and, if the DTD does not allow overlaps (such as a and b both allowing d's, so that the d can either be a child or a grandchild of a), you may even be able to do a perfect job.

Re^5: Repair malformed XML
created: 2005-02-03 12:38:14
Ah, I see. You mean (X)HTML, and 'a' is P, 'b' is SPAN, 'c' is CODE, 'd' is SMALL, 'e' is EM and 'f' is STRONG. So, now you encounter:
  

foo bar baz qux quux

Now, assuming tags aren't to be inserted inside words, I still can find five places to put in </SPAN>: before 'bar', between 'bar' and 'baz', after 'baz', between the two EM elements, and before the </P> tag.

Now, if you have a DTD that says that the only possible content of a 'b' is exactly two 'e's, you know where the missing closing tag should have been.

Note also that if you have a DTD where you can always unambiguously deduce where a missing closing tag should have gone, the closing tag is redundant - and if it were an SGML DTD instead of an XML DTD, the closing tag would have been optional. (And that would have solved the problem instantly - the document would be conforming.)

Re: Repair malformed XML
created: 2005-02-03 11:32:14
I would definitely give XML::LibXML a try. It has a nice command line tool, xmllint, which can work wonders if used correctly. On the other hand, if you want Perl, you should experiment with setting the recover flag of the XML::LibXML::Parser object to true. Although the manual states that it is for parsing HTML, it serves, as far as I can tell, just as well for parsing ill-formed XML.

The quick and dirty hack below could repair your badly formatted XML snippet (after adding the missing namespace declarations):

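use strict;
use warnings;
use XML::LibXML;

# Recover mode lets the parser continue past well-formedness errors.
my $parser = XML::LibXML->new();
$parser->recover(1);
my $doc = $parser->parse_file($ARGV[0]);
print $doc->toString(1);    # 1 = indent the output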
Note, however, that I am not entirely sure it always guesses right when adding the missing closing tags back, so I would not rely on this feature...

rg0now

Re^2: Repair malformed XML
created: 2005-02-03 11:55:00
I was unaware of the recover property. Your code example worked great on a test xml with a missing tag.

However, it appears I've reached a size limitation on the LibXML library. Both xmllint and the Perl code indicate problems parsing corrupt data:

my.xml:85: parser error : expected '>'
tentclasses>True

I've toyed with XML::Parser some more.  I've given it simple handlers to print the tags that are parsed, but XML::Parser croaks when it detects the missing tag, without first allowing a handler to override it.

Is there something I'm missing?

Re^3: Repair malformed XML
created: 2005-02-03 12:25:38

You don't say what version of perl you're using. My first attempt to use XML::Twig was with perl 5.6, and it died a horrible death ... simply upgrading perl to 5.8.1 was sufficient to handle the reading/writing of XML that I was doing with no other changes (same level of XML::Twig, my code unchanged). If you're not using 5.8 for XML handling, I highly suggest it.

Re^4: Repair malformed XML
created: 2005-02-03 13:23:30
Thanks for the suggestion. Yes, I am running ActiveState 5.8.6 on WinXP. I'll have a look at XML::Twig, as well.
Re^3: Repair malformed XML
created: 2005-02-03 12:38:38
I am a little lost here. You told us that the only problem you have with your XML is that it has some unclosed tags. XML::LibXML::Parser's recover flag will handle that, as the manual says:

"The recover mode helps to recover documents that are almost wellformed very efficiently. That is for example a document that forgets to close the document tag (or any other tag inside the document)."

Now, you seem to indicate that some tags in your XML are corrupt. Well, I do not really know how to handle that one...

Also, I do not think that you hit some obscure size limitation of XML::LibXML (you seem to get the error at the 85th input line).

Re^4: Repair malformed XML
created: 2005-02-03 12:55:29
Sorry if I was unclear. The XML data is not corrupt. It appears that LibXML cannot load an 80MB XML without corrupting its own data. When I search within the XML, I do not find the offending parser error on line 85, or anywhere in the file.

So, I believe it to be a size limitation that causes internal memory management issues. Why 85th line? Maybe a pointer wrapped and happened to clobber the 85th line. Who knows. :/

Re^2: Repair malformed XML
created: 2005-02-03 15:54:11
I stand corrected about the size limitation. Upon further testing, it is not the size, but the encoding. The XML file is unicode with encoding="iso-10646-ucs-2". If I convert to ASCII and set encoding="UTF-8", LibXML parses it fine.

Unfortunately, the output of the above script becomes mangled after a few thousand lines. It begins to output only the Text objects, and no tags, CDATA sections, etc. Strange.

While I haven't discovered a generalized and automated method that works, I've managed to get by with a simple procedural rule of inserting tags before if not already present. Then I convert back to Unicode and the XML can be parsed.
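
For anyone hitting the same encoding problem, the conversion step could be sketched with the core Encode module roughly as below. This is only an illustration, not the exact procedure used above: the sub name ucs2_to_utf8 is made up, the input is assumed to begin with a byte order mark, and UCS-2 is treated as UTF-16 (the two agree for characters in the Basic Multilingual Plane).

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Re-encode UCS-2/UTF-16 XML bytes as UTF-8 and update the XML declaration.
# Assumes the input starts with a byte order mark, which 'UTF-16' relies on
# to pick the byte order.
sub ucs2_to_utf8 {
    my ($bytes) = @_;
    my $text = decode('UTF-16', $bytes);
    $text =~ s/^\x{FEFF}//;    # drop a leftover BOM character, if any
    # Rewrite encoding="..." in the declaration, keeping the quote style.
    $text =~ s/(<\?xml[^>]*?encoding=)(["'])[^"']*\2/${1}${2}UTF-8${2}/;
    return encode('UTF-8', $text);
}

# Build a small UTF-16 sample (encode adds the BOM) and convert it.
my $in = encode('UTF-16', qq{<?xml version="1.0" encoding="iso-10646-ucs-2"?><a/>});
print ucs2_to_utf8($in), "\n";
# prints <?xml version="1.0" encoding="UTF-8"?><a/>
```

With an 80MB file this holds the whole document in memory a couple of times over, so reading and writing in chunks may be the better route.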

Re: Repair malformed XML
created: 2005-02-03 12:02:27
Like you, I don't have the DTD, but I'm not sure of your assumption. I think the indentation gives you the sense that should span until just before , but looking at how other o: tags work, I think this would be the proper closing: