Tokenizer and complex operators

Question

I'm trying to create a simple tokenizer to transform the following search expression (only part shown) into tokens:

word1 near(1) word2

where word1 and word2 are some words and near(1) is a distance operator. The question is how this expression should be tokenized. I see two ways:

1. <WORD, word1> <WORD, near> <LPAREN> <NUMBER,1> <RPAREN> <WORD, word2>.
2. <WORD, word1> <NEAROP, 1> <WORD, word2>

But should I really try to recognize NEAR(\d+) during tokenization, or should I go the first way and handle the NEAR operator at the parser level, while building the parse tree?

Up to you. If you want to allow expressions inside the parentheses, you probably want option (1), otherwise option (2) works just as well. — Yuval Filmus, Jul 24 '13 at 15:49
oh! Forgot to add. If there is something like word1 near (word2) word3 - they all are words. I.e. expressions in parentheses are possible. So way 1, right? — Oleg, Jul 24 '13 at 15:57

score 1 · Accepted Answer · answered Jul 25 '13 at 08:41

Since you indicate that the parameter of the near operator can be an arbitrary expression, it should be handled at the parser rather than at the lexer. Otherwise how would you handle things like near(x+y)?
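As a rough illustration of handling near at the parser level, here is a minimal Python sketch. The token names, the `tokenize` and `parse` functions, and the tuple-shaped parse tree are all assumptions made for this example, not something from the question. The lexer emits only generic tokens (option 1), and the parser decides that the word near followed by a parenthesis is an operator. The sketch accepts only a number inside the parentheses; the `expect("NUMBER")` call is the place where a sub-expression parser could be substituted.

```python
import re

# Hypothetical token set: the lexer knows nothing about "near".
TOKEN_RE = re.compile(
    r"\s*(?:(?P<NUMBER>\d+)|(?P<WORD>\w+)|(?P<LPAREN>\()|(?P<RPAREN>\)))"
)

def tokenize(text):
    """Split text into (kind, value) pairs using option 1 from the question."""
    tokens, pos = [], 0
    text = text.rstrip()
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        if m is None:
            raise SyntaxError(f"unexpected character at position {pos}")
        kind = m.lastgroup
        tokens.append((kind, m.group(kind)))
        pos = m.end()
    return tokens

def parse(tokens):
    """Parse `WORD (near ( NUMBER ) WORD)*` into nested tuples."""
    pos = 0

    def peek(offset=0):
        i = pos + offset
        return tokens[i] if i < len(tokens) else (None, None)

    def expect(kind):
        nonlocal pos
        tok = peek()
        if tok[0] != kind:
            raise SyntaxError(f"expected {kind}, got {tok}")
        pos += 1
        return tok[1]

    def at_near():
        # "near" is an operator only when it is a WORD followed by "(".
        kind, value = peek()
        return kind == "WORD" and value.lower() == "near" and peek(1)[0] == "LPAREN"

    node = ("word", expect("WORD"))
    while at_near():
        expect("WORD")
        expect("LPAREN")
        distance = int(expect("NUMBER"))  # swap in an expression parser here
        expect("RPAREN")
        node = ("near", distance, node, ("word", expect("WORD")))
    if pos != len(tokens):
        raise SyntaxError(f"unexpected token {peek()}")
    return node

tree = parse(tokenize("word1 near(1) word2"))
```

With this split, the lexer stays trivial and every decision about what near means lives in one place in the parser.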

Yuval Filmus

Thanks! Was going to ask you to add your comment as an answer so I could mark it as accepted. – Oleg Jul 26 '13 at 13:14
Wait... I don't plan to handle NEAR(x+y). If the input is near(\d+), where \d+ is not an expression itself, then it's a NEAROP; otherwise (if there is something else in the parentheses), NEAR(asd) is transformed to <WORD, near>, LPAREN, <WORD, asd>, RPAREN. – Oleg Jul 26 '13 at 14:16
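The hybrid scheme described in this comment can be sketched at the lexer level as follows. This is only an illustration under assumptions: the function name `tokenize_hybrid` and the token names are hypothetical, and the sketch treats near as an operator only when it is written with no space before the parenthesis and with digits alone inside. Anything else falls through to the generic WORD, LPAREN, and RPAREN tokens.

```python
import re

# The NEAROP alternative is listed first so that near(<digits>) wins over
# the generic WORD rule. The names are hypothetical, chosen for this sketch.
TOKEN_RE = re.compile(
    r"\s*(?:"
    r"near\((?P<NEAROP>\d+)\)"   # near(1) becomes a single token, value "1"
    r"|(?P<NUMBER>\d+)"
    r"|(?P<WORD>\w+)"
    r"|(?P<LPAREN>\()"
    r"|(?P<RPAREN>\))"
    r")",
    re.IGNORECASE,
)

def tokenize_hybrid(text):
    """Return (kind, value) pairs, folding near(<digits>) into one NEAROP."""
    tokens, pos = [], 0
    text = text.rstrip()
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        if m is None:
            raise SyntaxError(f"unexpected character at position {pos}")
        kind = m.lastgroup
        tokens.append((kind, m.group(kind)))
        pos = m.end()
    return tokens

operator_form = tokenize_hybrid("word1 near(1) word2")
word_form = tokenize_hybrid("NEAR(asd)")
```

One consequence of this design is that near(1) can never be searched for as three literal words, since the lexer has already committed to the operator reading before the parser sees it.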
