
I am trying to match this pattern:

<.*>

against this string:

<a href="hello world">Hi Baby</a>

Now, there are several possible matches:

<a href="hello world"> is a match

<a href="hello world">Hi Baby</a>

is also a match.

However, that's very confusing. I thought regular expressions were matched with deterministic finite automata.

So I would imagine that the deterministic finite automaton would go through the string one character at a time. However, it would somehow have to branch: the first > could be the closing > in the pattern, or it could be matched by the . in the pattern.

So how does it decide?

In VB.NET, it seems that the match returned is the second one. That is why I have to replace the pattern with

<[^>]*>

if I want to match the first one (say, to remove all HTML tags from a string).

And why is that? What does VB.NET actually do to select the second string as the one that matches the pattern?

I've heard that VB.NET is "greedy": it matches the longest string that fits the pattern rather than the first one that works. So, is this inherently ambiguous, or is there a way to know how this is actually implemented?
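To make the difference concrete, here is a minimal sketch of the two patterns run against the same string. It uses Python's re module rather than VB.NET, on the assumption that both engines treat these two simple patterns the same way; the variable names are made up for illustration.

```python
import re

html = '<a href="hello world">Hi Baby</a>'

# Greedy: .* first runs to the end of the string, then backs off
# until the final '>' of the pattern can match (the last '>' here).
greedy = re.search(r'<.*>', html).group(0)

# Negated class: [^>]* cannot cross a '>', so the match has to
# end at the first '>' after the opening '<'.
tag_only = re.search(r'<[^>]*>', html).group(0)

# Removing every tag with the negated-class pattern leaves the text.
stripped = re.sub(r'<[^>]*>', '', html)

print(greedy)    # <a href="hello world">Hi Baby</a>
print(tag_only)  # <a href="hello world">
print(stripped)  # Hi Baby
```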

user4951
  • * is greedy, i.e. it tries to match as much as possible. *? is lazy and tries to match as little as possible. – CodesInChaos Nov 09 '15 at 08:32
  • Ah, in VB.NET? So the way Microsoft implements this is: if a character matches both . and >, presume it's . – user4951 Nov 09 '15 at 09:05
  • If you want to parse HTML, don't use regular expressions... http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 – Darsstar Nov 09 '15 at 09:14
  • Regular expressions are not ambiguous just because they match more than one string. Otherwise, by that logic, y* matching y and yy would be ambiguous. It is still deterministic because it doesn't give you all possible matches; it simply gives you one that fits. – Neil Nov 09 '15 at 09:17
  • @JimThio No. * matches as many repetitions as possible. . and > are completely irrelevant for this. – CodesInChaos Nov 09 '15 at 09:35
  • @JimThio This is true in most (all?) regular expression engines. Append ? to the quantifier for non-greedy matching. E.g. *? means 0 or more times, non-greedy. – Brandin Nov 09 '15 at 14:07
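The greedy versus lazy distinction described in the comments can be sketched as follows. This is Python's re module, used here as a stand-in on the assumption that .NET handles * and *? the same way for these patterns; the variable names are illustrative only.

```python
import re

html = '<a href="hello world">Hi Baby</a>'

# Greedy *: as many repetitions as possible, so one match spans everything.
greedy_matches = re.findall(r'<.*>', html)

# Lazy *?: as few repetitions as possible, so each tag is matched separately.
lazy_matches = re.findall(r'<.*?>', html)

# The same idea with the y* example: greedy takes both characters,
# lazy is satisfied with zero repetitions.
greedy_y = re.match(r'y*', 'yy').group(0)
lazy_y = re.match(r'y*?', 'yy').group(0)

print(greedy_matches)  # ['<a href="hello world">Hi Baby</a>']
print(lazy_matches)    # ['<a href="hello world">', '</a>']
print(repr(greedy_y))  # 'yy'
print(repr(lazy_y))    # ''
```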

1 Answer


There are two ways a regex is matched against a string: checking whether the entire string matches the pattern, or finding the first match of the pattern.

The first is often used for input validation.

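The second is used when parsing large portions of text to isolate interesting parts for further processing. It returns the first match it can find. With a greedy quantifier such as *, that first match extends as far as the rest of the pattern allows, which is why <.*> returns the whole string here.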
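The two modes can be sketched like this, using Python's re module as an assumed stand-in for the .NET engine: fullmatch plays the role of the whole-string check and search the role of the first-match lookup. The variable names are hypothetical.

```python
import re

tag = re.compile(r'<[^>]*>')
text = '<a href="hello world">Hi Baby</a>'

# Mode 1: does the ENTIRE string match the pattern? (input validation)
whole_text = tag.fullmatch(text)    # None: the string is more than one tag
whole_br = tag.fullmatch('<br>')    # a match: the string is exactly one tag

# Mode 2: find the first (leftmost) match somewhere in the string.
first = tag.search(text)

print(whole_text)           # None
print(whole_br.group(0))    # <br>
print(first.group(0))       # <a href="hello world">
```

Which mode you get depends on the function called, not on the pattern, so the same regex can give different answers for the same string.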

ratchet freak