5

I was chatting with a friend about my love/hate relationship with XML. He commented that "xml is broken primarily because parsers for recursive self-defined document are basically impossible to get right."

I've heard the critique that XML is very difficult to parse before, but my understanding is limited to difficulties around balancing memory usage (i.e. streaming) and the impedance mismatch between XML and most programming languages (such as not supporting basic primitives like an array).

What is unique about XML as a self-defining document that makes parsing difficult?

Indolering
    Since the only possible justification for the existence of XML is that it is easy to parse, I think it would be helpful if you could cite an example of it being “difficult” - other than your friend not liking it. There are many, many other disadvantages of XML: indeed, ease of parsing is about its only virtue. That’s why it would be interesting to know what is meant. – Martin Kochanski Dec 14 '20 at 08:20

1 Answer

-1

Thanks for posting your question! Many parsers rely heavily on regular expressions (regex). But regular expressions are meant for text matching and cannot count, so they cannot track arbitrarily deep nesting. Because of this limitation, it is quite difficult to balance the number of open and close tags with regex alone, and software engineers have to do some extra work to deal with it. You can read more about regular expressions at Mozilla Developer Network Docs and at this post on Medium.
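To make the "extra work" concrete, here is a minimal sketch in Python, not a real XML parser. A regex only recognises individual tags, and an explicit stack does the counting and name matching. The names `TAG` and `tags_balanced` are made up for illustration, and the sketch assumes simplified input: it ignores comments, CDATA sections, and attribute values that contain `<` or `>`.

```python
import re

# Recognises a start tag, an end tag, or a self-closing tag.
# Group 1: "/" for an end tag; group 2: the tag name; group 3: "/" if self-closing.
TAG = re.compile(r"<(/?)([A-Za-z_:][\w:.\-]*)[^<>]*?(/?)>")

def tags_balanced(text: str) -> bool:
    """Return True if every start tag is closed by an end tag of the same name,
    in properly nested order. Simplified: comments and CDATA are not handled."""
    stack = []  # names of the currently open elements, innermost last
    for match in TAG.finditer(text):
        closing, name, self_closing = match.groups()
        if closing:
            # An end tag must match the most recently opened element.
            if not stack or stack.pop() != name:
                return False
        elif not self_closing:
            stack.append(name)
    # Anything left on the stack was opened but never closed.
    return not stack
```

For instance, `tags_balanced("<a><b/></a>")` accepts the input, while `tags_balanced("<a><b></a></b>")` rejects it because `</a>` arrives while `b` is still the innermost open element. The stack is the part a plain regular expression cannot express.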

  • 3
    (I fail to see how discussing regular expressions sheds any light on XML parsing - the question doesn't mention them.) – greybeard Dec 13 '20 at 09:03
  • Believe it or not, regex is used intensively in parsers, except XML ones. – Abhigyan Kumar Dec 13 '20 at 12:36
  • 2
    I can see it(them?) used in lexer/scanner parts of compilers: figuring out tokens. Can you name applications to figure out syntax? – greybeard Dec 13 '20 at 12:38
  • @greybeard, I did not understand your question. – Abhigyan Kumar Dec 13 '20 at 12:42
  • I thought you could use regex to detect tags. Is it just counting the tags that adds the extra problem? – rus9384 Dec 13 '20 at 12:54
  • You need to count parentheses to parse arithmetic, but that doesn't prevent the use of regexes. The problem -- if there were one -- with tags is that the text of an end tag must match the text of its start tag. The tags can be recognised easily with a regex but the matching isn't even context-free. It's not really a problem for XML parsing because XML prohibits omitted tags. But it complicates parsing HTML and other SGML dialects because missing tags need to be imputed according to rules difficult or impossible to define in a CFG. None of this discussion is relevant to this question. – rici Dec 13 '20 at 15:23
  • So you are saying that XML is hard to parse because people use techniques that are not really applicable to parsing (i.e. regex) instead of standard applicable ones (i.e. CFG)? Sounds like a problem with people, not with XML. –  Dec 13 '20 at 21:14