On https://www.emacswiki.org/emacs/MultilineRegexp one finds the hint to use
[\0-\377[:nonascii:]]*\n
instead of the standard
.*\n
to match any character up to a newline, in order to avoid a stack overflow on huge texts (37 KB). Is the overflow the only concern here, or does the former also match faster than the latter?
[...] `[\0-\377[:nonascii:]]*` would do so less than `\\(.\\|\n\\)*`. So I think the emacswiki is wrong on this one. – Stefan Nov 21 '16 at 15:46

`|` might need more backtracking, but whether it actually does depends on how it's compiled. – npostavs Nov 21 '16 at 20:05

[...] `\\(.\\|\n\\)*` and never even thought about `[\0-\377[:nonascii:]]*`. It's good to know about the latter, but it's even better to know that it doesn't add anything (so I'll stick to the one that is easier for me to read). – Drew Nov 21 '16 at 21:42

`(re-search-forward "\\(.\\|\n\\)*")` on a large buffer gives "Stack overflow in regexp matcher", while `(re-search-forward "[\0-\377[:nonascii:]]*")` does not. It seems emacswiki was right. – npostavs Nov 23 '16 at 18:24

[...] `[\0-\377[:nonascii:]]*` (which is rather unusual, since you might as well use `point-max` rather than search for it via such a regexp) (for the curious: the crux of the matter is whether the set of chars that can match after the `*` is disjoint from the set of chars that can match within the `*`. If it is disjoint, then the regexp engine will skip recording intermediate steps, and hence avoid eating up stack space. So `.*\n` and `[^a]*a` don't consume the stack, whereas `.*a` does). – Stefan Nov 23 '16 at 18:30
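For anyone who wants to try the comparison from this thread themselves, here is a minimal sketch. The helper names `my/regexp-probe` and `my/regexp-probe-demo` are made up for illustration, and the buffer sizes are arbitrary assumptions; whether a given pattern actually overflows will likely depend on the Emacs version and the size of the text, so no particular outcome is claimed here.

```elisp
;; Hypothetical helper, not part of Emacs.
(defun my/regexp-probe (regexp text)
  "Search forward for REGEXP in a temporary buffer containing TEXT.
Return the buffer position where the match ends, nil if nothing
matches, or the error message as a string if the matcher signals
an error such as \"Stack overflow in regexp matcher\"."
  (with-temp-buffer
    (insert text)
    (goto-char (point-min))
    (condition-case err
        ;; NOERROR only suppresses a failed search, not a matcher error.
        (re-search-forward regexp nil t)
      (error (error-message-string err)))))

;; Hypothetical demo; the sizes below are assumptions chosen to be "large".
(defun my/regexp-probe-demo ()
  "Probe the regexps discussed in the thread and return the results as a list."
  (let ((many-lines (mapconcat #'identity (make-list 100000 "filler text") "\n"))
        (one-long-line (concat (make-string 1000000 ?b) "a\n")))
    (list
     ;; The two "match anything, newlines included" forms.
     (my/regexp-probe "\\(.\\|\n\\)*" many-lines)
     (my/regexp-probe "[\0-\377[:nonascii:]]*" many-lines)
     ;; What follows the star is disjoint from what the star matches.
     (my/regexp-probe ".*\n" one-long-line)
     (my/regexp-probe "[^a]*a" one-long-line)
     ;; What follows the star can also be matched inside the star.
     (my/regexp-probe ".*a" one-long-line))))
```

Evaluating `(my/regexp-probe-demo)` returns one entry per pattern: a number means the search succeeded and ended at that position, a string is the error message from the matcher.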