Most performant matching of "any char"

Question

On https://www.emacswiki.org/emacs/MultilineRegexp one finds the hint to use

[\0-\377[:nonascii:]]*\n

instead of the standard

.*\n

to match any character up to a newline, in order to avoid a stack overflow on huge texts (37 KB). Is the overflow the only concern here, or does the former also match faster than the latter?

score 9 · Accepted Answer · answered Nov 21 '16 at 14:43

In Emacs's regexps, . does not match all characters. It is a synonym of [^\n]. So the reason for using [\0-\377[:nonascii:]] is when you want to match "any char, even a newline".
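To see the difference directly, here is a minimal sketch that can be evaluated in `*scratch*` or with `M-:`. The helper name `my-first-match` is made up for this illustration and is not part of Emacs.

```elisp
;; Hypothetical helper (not part of Emacs): return the text of the
;; first match of REGEXP in TEXT, or nil if there is no match.
(defun my-first-match (regexp text)
  (when (string-match regexp text)
    (match-string 0 text)))

;; `.' stops at the newline, exactly like `[^\n]'.
(my-first-match ".*" "one\ntwo")       ; => "one"
(my-first-match "[^\n]*" "one\ntwo")   ; => "one"

;; The bracket expression also accepts the newline, so it runs on
;; to the end of the string.
(my-first-match "[\0-\377[:nonascii:]]*" "one\ntwo")   ; => "one\ntwo"
```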

With respect to overflowing the stack, .*\n should be handled very efficiently, i.e. without backtracking and without eating up the stack. By contrast, [\0-\377[:nonascii:]]*\n is handled rather inefficiently by Emacs's regexp engine: it eats up a bit of the stack for every character matched, so on "huge" texts it will tend to overflow the stack.

Note that the emacswiki suggests [\0-\377[:nonascii:]]* and not [\0-\377[:nonascii:]]*\n.

answered Nov 21 '16 at 14:43

Stefan

Thanks for the clarification. However, regarding the stack overflow, are you sure that [\0-\377[:nonascii:]]*\n will cause an overflow? This is contrary to what the wiki claims. Is this because of the \n at the end? What use would a pattern like [\0-\377[:nonascii:]]* without an ending character be then? – Vroomfondel Nov 21 '16 at 15:37
Any regexp which matches "anything" will eat up stack space (with Emacs's regexp engine, I mean), and I don't see why [\0-\377[:nonascii:]]* would do so less than \\(.\\|\n\\)*. So I think the emacswiki is wrong on this one. – Stefan Nov 21 '16 at 15:46
Any way (or anyone) to authoritatively clarify on this issue? – Vroomfondel Nov 21 '16 at 16:01
@Vroomfondel test it and see. I can imagine that the regexp with | might need more backtracking, but whether it actually does depends on how it's compiled. – npostavs Nov 21 '16 at 20:05
+1. Very good to know. I've always used \\(.\\|\n\\)* and never even thought about [\0-\377[:nonascii:]]*. It's good to know about the latter, but it's even better to know that it doesn't add anything (so I'll stick to the one that is easier for me to read). – Drew Nov 21 '16 at 21:42
I find that (re-search-forward "\\(.\\|\n\\)*") on a large buffer gives "Stack overflow in regexp matcher", while (re-search-forward "[\0-\377[:nonascii:]]*") does not. It seems emacswiki was right. – npostavs Nov 23 '16 at 18:24
That is true only if the regexp ends with [\0-\377[:nonascii:]]* (which is rather unusual, since you might as well use point-max rather than search for it via such a regexp). For the curious: the crux of the matter is whether the set of chars that can match after the `*` is disjoint from the set of chars that can match within the `*`. If it is disjoint, then the regexp engine will skip recording intermediate steps, and hence avoid eating up stack space. So `.*\n` and `[^a]*a` don't consume the stack, whereas `.*a` does. – Stefan Nov 23 '16 at 18:30
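
Following the "test it and see" suggestion, the regexps discussed in these comments can be compared with a small harness. This is only a sketch: the helper name `my-try-regexp` and the test text (a long run of `x` followed by `a` and a newline) are assumptions made for the illustration, and whether a given regexp actually hits "Stack overflow in regexp matcher" depends on the Emacs version and on the size chosen.

```elisp
;; Hypothetical helper (not part of Emacs): search for REGEXP in a
;; temporary buffer holding SIZE copies of "x" followed by "a\n".
;; Return the length of the match, the symbol `no-match', or the
;; error message if the matcher signals an error.
(defun my-try-regexp (regexp size)
  (with-temp-buffer
    (insert (make-string size ?x) "a\n")
    (goto-char (point-min))
    (condition-case err
        (if (re-search-forward regexp nil t)
            (- (match-end 0) (match-beginning 0))
          'no-match)
      (error (error-message-string err)))))

;; Try each regexp from the discussion on a long single line.
;; Increase or decrease the size to probe the limit on your Emacs.
(dolist (re '(".*\n"                       ; chars after `*' disjoint from `.'
              "[^a]*a"                     ; disjoint as well
              ".*a"                        ; `a' can also match inside `.*'
              "\\(.\\|\n\\)*"              ; alternation inside the loop
              "[\0-\377[:nonascii:]]*"))   ; regexp ends with the loop
  (message "%S -> %S" re (my-try-regexp re 100000)))
```

A numeric result means the search completed and gives the match length. A string result is the error message from the matcher.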
