I'm trying to create a simple tokenizer that transforms the following search expression (only part of it is shown) into tokens:

word1 near(1) word2

where word1 and word2 are some words and near(1) is a distance operator. The question is how this expression should be tokenized. I see two ways:

1. <WORD, word1> <WORD, near> <LPAREN> <NUMBER,1> <RPAREN> <WORD, word2>.
2. <WORD, word1> <NEAROP, 1> <WORD, word2>

But should I really try to recognize NEAR(\d+) during tokenization, or should I go the first way and handle the NEAR operator at the parser level, while building the parse tree?
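To make the two options concrete, here is a minimal sketch of both tokenizations in Python. The token names follow the ones above; the regular expressions, the `tokenize` helper, and the case-insensitive matching of `near` are assumptions made for illustration only, not part of any existing tokenizer.

```python
import re

# Option 1: generic tokens only; "near" is an ordinary WORD at this stage.
GENERIC = re.compile(
    r"\s*(?:(?P<NUMBER>\d+)|(?P<WORD>\w+)|(?P<LPAREN>\()|(?P<RPAREN>\)))"
)

# Option 2: the lexer itself recognizes near(<digits>) as one NEAROP token.
# The NEAROP alternative comes first so it wins over WORD when it applies.
WITH_NEAROP = re.compile(
    r"\s*(?:near\((?P<NEAROP>\d+)\)"
    r"|(?P<NUMBER>\d+)|(?P<WORD>\w+)|(?P<LPAREN>\()|(?P<RPAREN>\)))",
    re.IGNORECASE,
)


def tokenize(text, pattern):
    """Return a list of (kind, value) pairs using the given token pattern."""
    text = text.rstrip()
    tokens = []
    pos = 0
    while pos < len(text):
        match = pattern.match(text, pos)
        if match is None:
            raise ValueError(f"unexpected character at position {pos}")
        kind = match.lastgroup  # name of the alternative that matched
        tokens.append((kind, match.group(kind)))
        pos = match.end()
    return tokens


option1 = tokenize("word1 near(1) word2", GENERIC)
option2 = tokenize("word1 near(1) word2", WITH_NEAROP)
```

With the second pattern, input such as near(asd) does not match the NEAROP alternative and falls through to the generic WORD, LPAREN, WORD, RPAREN tokens.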

Yuval Filmus
Oleg
  • Up to you. If you want to allow expressions inside the parentheses, you probably want option (1), otherwise option (2) works just as well. – Yuval Filmus Jul 24 '13 at 15:49
  • Oh! I forgot to add: if there is something like word1 near (word2) word3, they are all words. I.e., expressions in parentheses are possible. So way 1, right? – Oleg Jul 24 '13 at 15:57
  • In that case, you have no choice. – Yuval Filmus Jul 24 '13 at 16:04

1 Answer

Since you indicate that the parameter of the near operator can be an arbitrary expression, it should be handled at the parser rather than at the lexer. Otherwise how would you handle things like near(x+y)?
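As a rough illustration of handling the operator in the parser, the sketch below consumes the generic token stream of option (1) and builds a small tree. The grammar `term (near "(" NUMBER ")" term)*`, the `parse` function, and the tuple-based tree shape are all assumptions chosen for brevity; a parser that accepts arbitrary expressions as the argument would replace the NUMBER step with a recursive call to its expression rule.

```python
def parse(tokens):
    """Parse `term (near "(" NUMBER ")" term)*` into nested tuples.

    `tokens` is a list of (kind, value) pairs, e.g. ("WORD", "word1").
    """
    pos = 0

    def peek(offset=0):
        index = pos + offset
        return tokens[index] if index < len(tokens) else (None, None)

    def expect(kind):
        nonlocal pos
        found_kind, value = peek()
        if found_kind != kind:
            raise SyntaxError(f"expected {kind}, got {found_kind}")
        pos += 1
        return value

    def at_near_operator():
        # "near" is only an operator when directly followed by "(" NUMBER.
        kind, value = peek()
        return (
            kind == "WORD"
            and value.lower() == "near"
            and peek(1)[0] == "LPAREN"
            and peek(2)[0] == "NUMBER"
        )

    left = ("WORD", expect("WORD"))
    while at_near_operator():
        expect("WORD")      # the word "near"
        expect("LPAREN")
        distance = int(expect("NUMBER"))
        expect("RPAREN")
        right = ("WORD", expect("WORD"))
        left = ("NEAR", distance, left, right)

    if pos != len(tokens):
        raise SyntaxError(f"unexpected token {peek()} at index {pos}")
    return left


tree = parse([
    ("WORD", "word1"), ("WORD", "near"), ("LPAREN", "("),
    ("NUMBER", "1"), ("RPAREN", ")"), ("WORD", "word2"),
])
```

Because the decision is made in the parser, the lexer stays generic and the meaning of near depends only on the tokens that follow it.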

Yuval Filmus
  • Thanks! I was going to ask you to add your comment as an answer so I could mark it as accepted. – Oleg Jul 26 '13 at 13:14
  • Wait... I don't plan to handle NEAR(x+y). If the input is near(\d+), where \d+ is not itself an expression, then it's a NEAROP; otherwise (if there is something else in the parentheses) NEAR(asd) is transformed to <WORD, near> <LPAREN> <WORD, asd> <RPAREN>. – Oleg Jul 26 '13 at 14:16