15

I am trying to understand what is meant by "deterministic" in expressions such as "deterministic context-free grammar". (There are more deterministic "things" in this field.) I would appreciate an example more than the most elaborate explanation, if possible!

My primary source of confusion is from not being able to tell how this property of a grammar is different from (non-)ambiguity.

The closest I got to finding what it means is this quote from the paper by D. Knuth On the Translation of Languages from Left to Right:

Ginsburg and Greibach (1965) have defined the notion of a deterministic language; we show in Section V that these are precisely the languages for which there exists an LR(k) grammar

which becomes circular as soon as you get to Section V, because there it says that what an LR(k) parser can parse is a deterministic language...


Below is an example that I found to help me understand what "ambiguous" means; please take a look:

onewartwoearewe

Which can be parsed as one war two ear ewe or o new art woe are we - if a grammar allows that (say it has all the words I just listed).

What would I need to do to make this example language (non-)deterministic? (I could, for example, remove the word o from the grammar, to make the grammar not ambiguous).

Is the above language deterministic?

PS. The example is from the book Gödel, Escher, Bach: An Eternal Golden Braid.


Let's say, we define the grammar for the example language like so:

S -> A 'we' | A 'ewe'
A -> B | BA
B -> 'o' | 'new' | 'art' | 'woe' | 'are' | 'one' | 'war' | 'two' | 'ear'

By the argument about having to parse the whole string, does this grammar make the language non-deterministic?


let explode s =
  let rec exp i l =
    if i < 0 then l else exp (i - 1) (s.[i] :: l) in
  exp (String.length s - 1) [];;

let rec woe_parser s =
  match s with
  | 'w' :: 'e' :: [] -> true
  | 'e' :: 'w' :: 'e' :: [] -> true
  | 'o' :: x -> woe_parser x
  | 'n' :: 'e' :: 'w' :: x -> woe_parser x
  | 'a' :: 'r' :: 't' :: x -> woe_parser x
  | 'w' :: 'o' :: 'e' :: x -> woe_parser x
  | 'a' :: 'r' :: 'e' :: x -> woe_parser x
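  (* OCaml reports the next case as unused (a warning, or an error when
     warnings are treated as errors), because 'o' :: x above already matches
     every list starting with 'o'; this mirrors the o / one overlap in the
     grammar *)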
  | 'o' :: 'n' :: 'e' :: x -> woe_parser x
  | 'w' :: 'a' :: 'r' :: x -> woe_parser x
  | 't' :: 'w' :: 'o' :: x -> woe_parser x
  | 'e' :: 'a' :: 'r' :: x -> woe_parser x
  | _ -> false;;

woe_parser (explode "onewartwoearewe");;
- : bool = true

| Label   | Pattern      |
|---------+--------------|
| rule-01 | S -> A 'we'  |
| rule-02 | S -> A 'ewe' |
| rule-03 | A -> B       |
| rule-04 | A -> BA      |
| rule-05 | B -> 'o'     |
| rule-06 | B -> 'new'   |
| rule-07 | B -> 'art'   |
| rule-08 | B -> 'woe'   |
| rule-09 | B -> 'are'   |
| rule-10 | B -> 'one'   |
| rule-11 | B -> 'war'   |
| rule-12 | B -> 'two'   |
| rule-13 | B -> 'ear'   |
#+TBLFM: @2$1..@>$1='(format "rule-%02d" (1- @#));L

Generating =onewartwoearewe=

First way to generate:

| Input             | Rule    | Product           |
|-------------------+---------+-------------------|
| ''                | rule-01 | A'we'             |
| A'we'             | rule-04 | BA'we'            |
| BA'we'            | rule-05 | 'o'A'we'          |
| 'o'A'we'          | rule-04 | 'o'BA'we'         |
| 'o'BA'we'         | rule-06 | 'onew'A'we'       |
| 'onew'A'we'       | rule-04 | 'onew'BA'we'      |
| 'onew'BA'we'      | rule-07 | 'onewart'A'we'    |
| 'onewart'A'we'    | rule-04 | 'onewart'BA'we'   |
| 'onewart'BA'we'   | rule-08 | 'onewartwoe'A'we' |
| 'onewartwoe'A'we' | rule-03 | 'onewartwoe'B'we' |
| 'onewartwoe'B'we' | rule-09 | 'onewartwoearewe' |
|-------------------+---------+-------------------|
|                   |         | 'onewartwoearewe' |

Second way to generate:

| Input             | Rule    | Product           |
|-------------------+---------+-------------------|
| ''                | rule-02 | A'ewe'            |
| A'ewe'            | rule-04 | BA'ewe'           |
| BA'ewe'           | rule-10 | 'one'A'ewe'       |
| 'one'A'ewe'       | rule-04 | 'one'BA'ewe'      |
| 'one'BA'ewe'      | rule-11 | 'onewar'A'ewe'    |
| 'onewar'A'ewe'    | rule-04 | 'onewar'BA'ewe'   |
| 'onewar'BA'ewe'   | rule-12 | 'onewartwo'A'ewe' |
| 'onewartwo'A'ewe' | rule-03 | 'onewartwo'B'ewe' |
| 'onewartwo'B'ewe' | rule-13 | 'onewartwoearewe' |
|-------------------+---------+-------------------|
|                   |         | 'onewartwoearewe' |
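
As a rough cross-check of the two derivations above, here is a minimal OCaml sketch that counts how many parse trees the grammar gives a string. The names `b_words`, `endings`, `has_word_at` and `count_derivations` are illustrative and not part of the grammar itself. It assumes that, because `A -> B | BA` is right-recursive and the `B` words are all distinct, each way of splitting the string into `B` words plus an ending corresponds to exactly one parse tree.

```ocaml
(* Words allowed by B and the two endings allowed by S, copied from the grammar. *)
let b_words = ["o"; "new"; "art"; "woe"; "are"; "one"; "war"; "two"; "ear"]
let endings = ["we"; "ewe"]

(* [has_word_at s i w] is true when the word [w] occurs in [s] at index [i]. *)
let has_word_at s i w =
  let n = String.length w in
  i >= 0 && i + n <= String.length s && String.sub s i n = w

(* [count_derivations s] counts the ways to read [s] as one or more
   B words followed by 'we' or 'ewe'. *)
let count_derivations s =
  let len = String.length s in
  (* ways.(i) = number of ways to split the prefix of length i into B words *)
  let ways = Array.make (len + 1) 0 in
  ways.(0) <- 1;
  for i = 1 to len do
    List.iter
      (fun w ->
        let start = i - String.length w in
        if has_word_at s start w then ways.(i) <- ways.(i) + ways.(start))
      b_words
  done;
  (* A needs at least one B, so the part before the ending must be non-empty. *)
  List.fold_left
    (fun acc e ->
      let cut = len - String.length e in
      if cut > 0 && has_word_at s cut e then acc + ways.(cut) else acc)
    0 endings
```

Under these assumptions `count_derivations "onewartwoearewe"` evaluates to 2, matching the two tables, while a string such as `"owe"` gets a count of 1. A count above 1 for any string shows that the grammar is ambiguous; it says nothing about whether the language is deterministic.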
wvxvw
  • 1
    -1, since the question now makes little sense. First off, a string is not a language; strings are not ambiguous, unambiguous, deterministic or nondeterministic; they're just strings. The grammar you give does not generate the example string. I've not checked all 180 derivations to see whether there are duplicates, but in theory that's all you'd need to do to see whether the grammar is ambiguous. Sadly, the language can't be inherently ambiguous, since the language is finite, hence regular, hence accepted by a DPDA, hence deterministic. – Patrick87 Sep 25 '13 at 18:49
  • @Patrick87 eh? Where does it say that the string is the language? This string is an example product, and sure it is possible to generate using the given grammar. What makes you think otherwise? The string in question is exactly the case where two different sequences of rule applications produce the same string, thus the grammar is ambiguous, but if you remove some rules (for example, B -> 'o'), then it will no longer be ambiguous... – wvxvw Sep 25 '13 at 19:55
  • First off, can you please provide a derivation of the example string using the grammar? From your own question: "Is the above language deterministic?" You never name a language, just a string, generated by an infinitude of grammars, albeit not the one you propose. – Patrick87 Sep 25 '13 at 20:02
  • Can you write it in English? E.g., "Start with S. By the application of the rule S := ..., we get ..., ..." – Patrick87 Sep 25 '13 at 21:12
  • @Patrick87 I've added step-by step generation procedure, as well as I've realized I made a mistake in the grammar, which I've fixed. – wvxvw Sep 26 '13 at 09:16
  • Alright, now that you have a grammar that generates your example string, what can we say? First off, the grammar is very much ambiguous, since the example string has two derivations. However, the language of this grammar is the regular language $(o + new + ... + ear)^+(we + ewe)$. Since the language is regular, there is a DFA that accepts it. A DFA is a DPDA that doesn't use the stack. A regular language is deterministic context free, and therefore cannot be inherently ambiguous; i.e., there is an unambiguous CFG for this language; the CFG is one of the infinitude of other CFGs. – Patrick87 Sep 26 '13 at 17:04
  • @Patrick87 that regular expression wouldn't parse that language. The one that would looks like this: $(o|one|new|...|ear)+(we|ewe)$, which does require backtracking (a lot of it!), which means it needs a stack. So the further argument about equivalence to a DPDA is wrong. The reason for this grammar, though, was to determine whether, if it is required to parse the whole string, the grammar is non-deterministic. There is a reason I'm not saying "language": the language is not important to me (I can change it if the grammar doesn't have the desired properties). – wvxvw Sep 26 '13 at 21:16
  • In the theory of regular expressions, $+$ is equivalent to your | and means "union". In the theory of regular languages, $a^+b$ means "at least one a followed by a single b", whereas $a+b$ means "either a single a or a single b". In the theory of formal languages, all regular languages can be parsed by a DFA without any stack; no stack is required to parse a regular language. In the theory of regular languages, it is trivial to convert a DFA to a DPDA. The argument is not wrong; you simply misunderstand, or possibly have never studied the theory of languages and their automata. – Patrick87 Sep 26 '13 at 21:46
  • Just to be completely clear: there is no such thing as a regular language that "requires" backtracking. That thing simply does not exist. Does not compute. More generally: what problem are you trying to solve? I've answered the questions you ask, but it seems clear there's some other question that you'd like an answer to. The answer is: the language is regular, hence deterministic context-free, hence not inherently ambiguous; so no, the fact that you provide an ambiguous grammar means nothing of any particular interest whatsoever. – Patrick87 Sep 26 '13 at 21:48
  • @Patrick87 I'm studying it now, and that's why the questions. It is a lot easier to understand it by trying something practical, and sure, superscript plus isn't going to work in any programming language... so that's why I never saw it being used. What I've shown above is that in some cases there may be two different ways to parse the same string, and if, for example, you were to compile such a regular expression with a compiler that can validate it, you would get an error (such an expression would not compile, because the compiler would spot an ambiguity). – wvxvw Sep 26 '13 at 21:59
  • If a compiler failed to compile such a regular expression, it's a limitation of the compiler. There is definitely a DFA that could parse strings described by the regular expression. For sure, 100% of the time. There are standard algorithms for converting a regular expression to an NFA, and for converting an NFA to a DFA. – Patrick87 Sep 26 '13 at 22:02
  • @Patrick87 sigh... no, it's not the compiler's fault... in fact, non-validating compilers will cope with it just fine, because it would impose some order in which to apply the rules, and use them only in that order. (This is actually a bad compiler, a simpler one). The good compiler will refuse to compile it. But it's a pointless debate because you keep talking about the language, which is not important. All I want to know is what "deterministic" means when used with grammar, not the language (maybe it is a misnomer, and there aren't non-deterministic grammars? maybe it's only languages?) – wvxvw Sep 26 '13 at 22:08
  • I'm aware of no such thing as nondeterministic grammar. Perhaps I'm misunderstanding what you mean by compile? Unless I've made some error, I think there's a DFA with ~15 states that would accept exactly the language described by the regular expression. If a compiler refuses to generate that or some other DFA, how is it not a limitation? – Patrick87 Sep 26 '13 at 22:26
  • @Patrick87 the ML code I've posted (the one with pattern matching) will not compile as posted (the comment there explains why it won't). It would have 15 states, but the compiler detects that it can make more than one path through matched text, and that is just wrong. And... oh my... this is just so bad... there indeed is no such thing as nondeterministic grammar! I'm so sorry for wasting so much of your time. I didn't do it on purpose, honest! Now that I think of it, I don't know where I got this idea. Whoops :( – wvxvw Sep 26 '13 at 22:39
  • @Patrick87 and wvxvw: can you please clean up this comment thread? – Raphael Sep 14 '16 at 13:26
  • ("$L$ has an $LR(k)$ grammar" iff "$L$ is deterministic") isn't true if you take $k=0$ (but is true for $k>0$). – xavierm02 Jan 01 '17 at 15:10

5 Answers

9

A PDA is deterministic, hence a DPDA, iff for every reachable configuration of the automaton, there is at most one transition (i.e., at most one new configuration possible). If you have a PDA which can reach some configuration for which two or more unique transitions may be possible, you do not have a DPDA.

Example:

Consider the following family of PDAs with $Q = \{q_0, q_1, q_2\}$, $\Sigma = \Gamma = \{a, b\}$, $A = q_0$ and $\delta$ given by the following table:

q    e    s    q'   s'
--   --   --   --   --
q0   a    Z0   q1   aZ0
q0   a    Z0   q2   bZ0
...

These are nondeterministic PDAs because the initial configuration - q_0, Z0 - is reachable, and there are two valid transitions leading away from it if the input symbol is a. Anytime this PDA starts trying to process a string that begins with an a, there's a choice. Choice means nondeterministic.

Consider, instead, the following transition table:

q    e    s    q'   s'
--   --   --   --   --
q0   a    Z0   q1   aZ0
q0   b    Z0   q2   bZ0
q1   a    a    q0   aa
q1   a    b    q0   ab
q1   a    b    q2   aa
q2   b    a    q0   ba
q2   b    b    q0   bb
q2   b    a    q1   bb

You might be tempted to say this PDA is nondeterministic; after all, there are two valid transitions away from the configuration (q1, b(a+b)*Z0), for instance. However, since this configuration is not reachable by any path through the automaton, it doesn't count. The only reachable configurations are a subset of (q0, (a+b)*Z0), (q1, a(a+b)*Z0), and (q2, b(a+b)*Z0), and for each of these configurations, at most one transition is defined.

A CFL is deterministic iff it is the language of some DPDA.

A CFG is unambiguous if every string has at most one derivation tree (equivalently, at most one leftmost derivation) according to the CFG. Otherwise, the grammar is ambiguous. If you have a CFG and you can produce two different derivation trees for some string, you have an ambiguous grammar.

A CFL is inherently ambiguous iff it is not the language of any unambiguous CFG.

Note the following:

  • A deterministic CFL must be the language of some DPDA.
  • Every CFL is the language of infinitely many nondeterministic PDAs.
  • An inherently ambiguous CFL is not the language of any unambiguous CFG.
  • Every CFL is the language of infinitely many ambiguous CFGs.
  • An inherently ambiguous CFL cannot be deterministic.
  • A nondeterministic CFL may or may not be inherently ambiguous.
Patrick87
  • 1
    Wiki says a PDA is not necessarily deterministic (there may be a deterministic version and a non-deterministic one), but you could as well omit the first part of the sentence, it's not really contributing to what you are saying :/ But, again, this defines a deterministic language as an input language of deterministic something, and that something is called deterministic because it accepts deterministic language - it's like saying "the grass is green because green is the colour of the grass". It's true, but not helpful :( Please, an example would be more than precious! – wvxvw Sep 25 '13 at 07:10
  • @wvxvw: you are not reading this correctly. It says: "A PDA is deterministic if and only if every state/symbol/stacktop triple has only one next state." There's nothing in that definition about what language the automaton accepts. – Wandering Logic Sep 25 '13 at 13:13
  • 2
    @wvxvw The definition of deterministic PDA, or DPDA, which I give in no way, shape, or form relies on the definition of a deterministic context free language. I define DPDAs based only on properties of the automaton. I then define what a deterministic CFL is in terms of the definition of a DPDA. Please re-read the answer in light of these and Wandering Logic's comments and try to see whether this makes sense. I will endeavor to provide some brief examples. – Patrick87 Sep 25 '13 at 18:18
  • I think I finally understand... but I also think you have couple of typos there, don't you? Just to make clear, there isn't a configuration $q_1, b(a+b)$ in the table, perhaps you meant $q_2, b(a+b)$? The states probably have to be $Q={q_0,...q_2}$ or some such? Or, maybe I don't understand what you mean by "configuration". Shouldn't configuration include the stack and the current character? Also, is my interpretation correct? x+ - one or more x, (x)* - zero or more x? – wvxvw Sep 25 '13 at 19:16
  • @wvxvw Configuration refers to the current state and the current contents of the stack. x+ typically refers to "one or more of x", whereas x* typically refers to "zero or more of x"; I may use xx* in place of x+, since these are identical. – Patrick87 Sep 25 '13 at 19:59
8

Here are examples (from Wikipedia):

The language of even-length palindromes over the alphabet of 0 and 1 is a non-deterministic, but unambiguous language. A grammar for this language is $S \rightarrow 0S0 | 1S1|\varepsilon$. The language is non-deterministic because you need to look at the whole string to figure out where the middle is. The grammar is unambiguous because there is one and only one parse tree for each string in the language.
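
To make the "where is the middle?" intuition concrete, here is a minimal OCaml sketch of how a nondeterministic pushdown automaton for this language can be simulated by trying every guess. The function names `accepts`, `push` and `pop` are illustrative, and the sketch does not check that the input uses only 0 and 1.

```ocaml
(* Simulates a nondeterministic PDA for even-length palindromes.
   The stack is a list of characters with the top at the head. *)
let accepts s =
  let n = String.length s in
  (* Pop phase: the rest of the input must match the stack, top first,
     and both must run out at the same time. *)
  let rec pop i stack =
    match stack with
    | [] -> i = n
    | c :: rest -> i < n && s.[i] = c && pop (i + 1) rest
  in
  (* Push phase: at every position either guess "the middle is here" and
     switch to popping, or push the current symbol and keep reading.
     The || is the nondeterministic choice, explored by backtracking. *)
  let rec push i stack =
    pop i stack || (i < n && push (i + 1) (s.[i] :: stack))
  in
  push 0 []
```

On `"10011001"` the guess "middle after 2 symbols" matches `10|01` but fails because input remains, and only the guess after 4 symbols succeeds. A deterministic PDA has no way to try both guesses, which is the informal reason this language is not deterministic; the sketch illustrates that intuition and is not a proof.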

A context free language is deterministic if and only if there exists at least one deterministic push-down automaton that accepts that language. (There may also be lots of non-deterministic push-down automata that accept the language, and it would still be a deterministic language.) Essentially a deterministic push-down automaton is one where the machine transitions are deterministically based on the current state, the input symbol and the current topmost symbol of the stack. Deterministic here means that there is no more than one state transition for any state/input symbol/topmost stack symbol. If you have two or more next states for some state/input symbol/topmost stack symbol triple then the automaton is non-deterministic. (You would need to "guess" which transition to take in order to decide whether the automaton accepts or not.)
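
The per-triple condition can be checked mechanically on a transition table. The following is a minimal OCaml sketch; the record type `transition`, the functions `conflicts` and `is_deterministic`, and the two-row table `example` are all hypothetical names chosen for illustration. It assumes a table without ε-transitions; a full check would also have to reject an ε-move that competes with an input move for the same state and stack top.

```ocaml
(* One row of a transition table: in [state], reading [input] with [top] on
   the stack, go to [next] and replace the top of the stack by [push]. *)
type transition = {
  state : string;
  input : char;
  top : string;
  next : string;
  push : string;
}

(* Triples (state, input, stack top) that have more than one row. *)
let conflicts table =
  let keys = List.map (fun t -> (t.state, t.input, t.top)) table in
  List.filter
    (fun k -> List.length (List.filter (fun k' -> k' = k) keys) > 1)
    (List.sort_uniq compare keys)

(* Deterministic in the per-triple sense: no triple has two rows. *)
let is_deterministic table = conflicts table = []

(* Hypothetical table with two moves for the same triple. *)
let example = [
  { state = "q0"; input = 'a'; top = "Z0"; next = "q1"; push = "aZ0" };
  { state = "q0"; input = 'a'; top = "Z0"; next = "q2"; push = "bZ0" };
]
```

Here `conflicts example` evaluates to `[("q0", 'a', "Z0")]`, so `is_deterministic example` is `false`. Note that this only tells you about one automaton; a language is deterministic as soon as some automaton for it passes the check.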

What Knuth proved was that every LR(k) grammar has a deterministic pushdown automaton and that every deterministic pushdown automaton has an LR(k) grammar. So LR(k) grammars and deterministic pushdown automata can handle the same set of languages. But the set of languages that have a deterministic pushdown automaton that accepts them is (by definition) the deterministic languages. The argument isn't circular.

So deterministic language implies that there exists an unambiguous grammar. And we've shown an unambiguous grammar whose language has no deterministic pushdown automaton (and thus it is an unambiguous grammar that generates a non-deterministic language).

Are there context free languages for which no unambiguous grammar exists? It turns out there are. An example (again from Wikipedia) is the union of $\{a^nb^mc^md^n|n,m>0\}$ and $\{a^nb^nc^md^m|n,m>0\}$. Each of the sets individually is obviously context free and the union of context free languages is context free. Strings of the form $a^nb^nc^nd^n$ with $n>0$ are obviously in this language (in fact that's the intersection of the two languages) and Hopcroft and Ullman proved that no matter what grammar you come up with for the union language, there will be some string in the intersection set that has two different parse trees.
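
As a small illustration of the two sets and their intersection, here is a minimal OCaml sketch of membership tests. The helper names `runs`, `in_first` and `in_second` are illustrative only; the code checks membership and says nothing about parse trees.

```ocaml
(* Lengths of the a-, b-, c- and d-runs of [s], or None when [s] is not of
   the form a+ b+ c+ d+ (each letter at least once, in that order). *)
let runs s =
  let n = String.length s in
  (* [run ch i] is the first index at or after [i] that does not hold [ch]. *)
  let rec run ch i = if i < n && s.[i] = ch then run ch (i + 1) else i in
  let a = run 'a' 0 in
  let b = run 'b' a in
  let c = run 'c' b in
  let d = run 'd' c in
  if d = n && a > 0 && b > a && c > b && d > c
  then Some (a, b - a, c - b, d - c)
  else None

(* a^n b^m c^m d^n: outer runs match and inner runs match. *)
let in_first s =
  match runs s with Some (a, b, c, d) -> a = d && b = c | None -> false

(* a^n b^n c^m d^m: first two runs match and last two runs match. *)
let in_second s =
  match runs s with Some (a, b, c, d) -> a = b && c = d | None -> false
```

For example, `"abbccd"` is only in the first set, `"aabbcd"` is only in the second, and `"aabbccdd"` is in both, so it belongs to the intersection where the two parse trees arise.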

Wandering Logic
  • Can you please elaborate, why having to look at the whole string before determining the middle makes this language non-deterministic? I read another explanation of what "deterministic" is, and there it says that "if you don't need to backtrack when parsing, that language is deterministic". I don't see a need to backtrack to parse this language... – wvxvw Sep 25 '13 at 08:41
  • 1
    Consider the input string "10011001". The pushdown automaton does not know how long the string is until it gets to the end. When you get to the second 0 you need to make a choice: is this the 4-character string "1001", or a longer string that looks like "100????001" ? When you get to the fifth character you still don't know: is this the 8-character string "10011001" or a longer string that looks like "10011????11001" ? – Wandering Logic Sep 25 '13 at 11:27
  • A deterministic push-down automaton can only look at the current element of the string, the current state and the top element of the stack. At various times there are multiple choices for what the next state should be, and you can't figure out which one to choose without having more information. – Wandering Logic Sep 25 '13 at 11:29
  • 1. So, to rephrase your last comment: if you could look at any element of the stack (not just the topmost), you would then know what next state to enter? (This doesn't seem to be true... you'd still need to read the whole string first.) 2. Could I ask you to look at the example I added to my question and tell whether the last paragraph describes a non-deterministic language? I find the requirement of needing to read the whole string confusing / probably not being specific enough to explain what is meant by "deterministic". – wvxvw Sep 25 '13 at 12:46
  • 1
    The "parse the whole string" thing is not the definition of non-deterministic. It was just some intuition I was trying to add. Both @Patrick87 and I gave you the real definition of deterministic: From every state there is at most one next state. If a language has no unambiguous grammar it must be non-deterministic. I can't answer about your example without doing more work: you've shown an ambiguous grammar, but that's not what matters, you need to demonstrate that there is no unambiguous grammar if you want to show that the language is inherently ambiguous. – Wandering Logic Sep 25 '13 at 13:10
  • Well, then this thing which is meant to help does the opposite of helping :) If it's not important that you should read the whole string - why even mention it? I read the definition even before I asked the question, but it's not helping. What I need is the example and an explanation of why it is called deterministic. What I'd expect to read: "a DPDA cannot parse the language L because it would require X, which is not possible", instead it says "it would require to read the whole string", which is not only possible, but will certainly happen one way or another. – wvxvw Sep 25 '13 at 13:36
  • PS. I don't want to show that the language is inherently ambiguous. I want to show that it is (or is it?) non-deterministic. If a language can be either unambiguous or ambiguous while being non-deterministic, why would I even bother with ambiguity? The mechanism that will tell me whether the language is non-deterministic is exactly what I'm after (I'm not a theorist at all, I just need some tool to make sure that some embedded language can be parsed in linear time, so unless I can come up with a concrete procedure of ruling out "bad languages", I've achieved nothing...) – wvxvw Sep 25 '13 at 13:45
  • The problem is that determinism and unambiguous are both "there exist" definitions. That means that to prove non-determinism you have to show that all PDAs that accept the language are not deterministic. That's much harder than showing a single PDA that is deterministic. Likewise to show a language is ambiguous you have to show that all grammars that accept the language are ambiguous. That's much harder than showing a single grammar that is unambiguous. – Wandering Logic Sep 25 '13 at 13:46
  • If you want to demonstrate that a language is deterministic, then demonstrate either that you can construct a LR(k) grammar for it or a deterministic PDA for it. If you can't construct either of those, then the language is non-deterministic. – Wandering Logic Sep 25 '13 at 13:54
  • Bah... that is very very bad... given that I'm constructing the grammar, not the language, isn't there a hope to get the proof "cheaper"? Showing that the grammar is ambiguous is really cheap in this sense :( If there's no way around but to construct the functional parsers, that's more like mission impossible task. I.e. the goal is to gain confidence that the parser can be constructed before attempting to construct it. Proof by failure may take ages to accomplish... – wvxvw Sep 25 '13 at 14:13
  • Wait, if it is possible to show that there is single PDA that is deterministic, that is good. That is what I need! How do I do it? :) – wvxvw Sep 25 '13 at 14:19
  • If your input is a context free grammar then put it through an LR(1) parser generator. If you get a result it will be a deterministic PDA. If you don't get a result you will have to try a different grammar. See http://cs.stackexchange.com/questions/4888/is-there-any-way-to-distinguish-between-llk-and-lrk-grammar. – Wandering Logic Sep 25 '13 at 15:30
  • Nah, sorry, no can do. Generating parsers, no matter the complexity, is out of the question. It makes the task more time consuming than I have time available for it. There should be a reason why it is not possible to build a parser for some grammar. If I knew that reason, I'd be able to search just for the suspicious situations before I attempt building the parser. Otherwise I'm going to retire before I finish my job. :( – wvxvw Sep 25 '13 at 16:45
  • What I mean: think of this situation. Let's say I have a partially designed language which so far has 9 grammar rules. Let's say I need to add one more. Let's say I follow your advice and build an LR parser - I'll get a table of 10x10 pairs of rules. Now let's say it "didn't work" - and it means I don't know which one of the 100 pairs didn't work... now, if I don't have a concept of what "didn't work" means, I might keep trying to "get it to work" until the second coming, and not be done even by then. – wvxvw Sep 25 '13 at 17:11
  • @wvxvw: google "yacc" or "bison" or "parser generator". – Wandering Logic Sep 25 '13 at 19:54
  • Unfortunately, I have quite a bit of experience with those (you can also add wysent, ANTLR and couple more libraries to the list...) Even though I'm a hobbyist, I've got quite a bit of experience with these, and this is why I have very pessimistic time estimates. But the problem is different really. I can identify ambiguity in grammar w/o ever needing to build a parser - it is cheap and I make few mistakes when designing one. But if I can't predict a mistake before I create a parser - that would mean I cannot gradually refine the design. – wvxvw Sep 25 '13 at 20:30
  • 1
    @wvxvw If you're looking for a computational procedure, you're likely out of luck... according to http://en.wikipedia.org/wiki/List_of_undecidable_problems, it's undecidable whether a CFG is ambiguous, let alone whether its language is inherently ambiguous; it's also undecidable whether a CFG generates all strings. Given this, I seriously doubt it's decidable, much less efficient to decide, whether a CFG's language is a deterministic CFL. – Patrick87 Sep 25 '13 at 20:46
  • @Patrick87 wait... what if I find two terminals that have the same prefix, then I continue appending to them more terminals in such a way that the common prefix grows or remains the same, and then eventually run into two identical strings - doesn't this prove that the grammar is ambiguous? I don't need to solve for the general case, all I need is the mechanism that will let me avoid most of the problems before I even attempt to prove anything about the grammar, just as the first step, the way of thinking about how to construct it. – wvxvw Sep 25 '13 at 21:20
  • 1
    @wvxvw If you happen to be as fortunate as that, you're dealing with what we call a happy case, i.e., not one of the cases that makes this an undecidable problem. You can define heuristics that work for lots of happy cases and don't blow up on the rest, but they won't work on all the happy cases; if they did, you'd have a decider for the problem, which by our premise is impossible. – Patrick87 Sep 25 '13 at 21:25