Treetop Introductory Tutorial Part 9 of 10 -- Looking ahead
There is one way in which our Treetop grammar differs from the harried UN translator. Treetop grammars have a lookahead function, which allows them to peer a short way into the future, rather like Nicolas Cage in Next. We need that feature in order to handle the last case of our email list.
If we match a name in quotes, we know when to stop: when we reach the second quote. But when we are parsing the name outside the quotes, we actually want to stop when we reach the (optional) space before the open-angle-bracket. What we would like to do, in effect, is to push that (optional) space and open-angle-bracket back into the future, so the main parser routine (full_email_address) will pick it up properly.
So the rule we want is something like this:
Gather up all the characters that are neither blank nor an open-angle-bracket into a word.
If the word is going to be followed by optional spaces and an open-angle-bracket,
then stop.
Otherwise, parse the rest of the unquoted email name.
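Before looking at the grammar, the steps above can be sketched in plain Ruby. This is only an illustration of the idea, assuming Ruby's standard StringScanner; the helper name scan_unquoted_name is made up for this sketch and is not part of Treetop or of our program.

```ruby
require 'strscan'

# Hypothetical helper (not part of Treetop): reads an unquoted name and
# stops just before the optional spaces and the open-angle-bracket.
# Returns nil and rewinds if no '<' ever follows the words.
def scan_unquoted_name(scanner)
  start = scanner.pos
  loop do
    break unless scanner.scan(/[^ <]+/)   # gather one word
    # Peek ahead without consuming: optional spaces, then '<'.
    if scanner.check(/ *</)
      return scanner.string.byteslice(start, scanner.pos - start)
    end
    break unless scanner.scan(/ +/)       # otherwise consume spaces, keep going
  end
  scanner.pos = start                     # failure: undo everything we read
  nil
end

scanner = StringScanner.new('Charmain Lashunda <c.lashunda_mc@promero.com>')
name = scan_unquoted_name(scanner)
puts name                  # Charmain Lashunda
puts scanner.rest.inspect  # " <c.lashunda_mc@promero.com>"
```

Note that the space and the open-angle-bracket are still unread afterwards, which is exactly what the calling routine needs.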
We can represent this in Treetop as follows:
rule optional_email_name
  ('"' email_name '"'
  / unquoted_email_name &([ ]* '<')
  / '' ) {
    def email_name_text_value
      if self.terminal?
        ''
      else
        email_name.text_value
      end
    end
  }
end

rule unquoted_email_name
  unquoted_email_name_word (&([ ]* '<') / ([ ]+ unquoted_email_name))
end

rule unquoted_email_name_word
  (!(' '/'<') .)+ # (equivalent to: [^ <]+)
end
The grammar is getting a little bit more complicated here. Now you know why we left it until last. We can see there are two types of lookaheads in Treetop: Positive Lookaheads and Negative Lookaheads. Positive Lookaheads are indicated by an ampersand (&) and succeed if that pattern is in the immediate future. Negative Lookaheads are indicated by an exclamation mark (!) and succeed if that pattern is not in the immediate future.
So &([ ]* '<') will match optional spaces followed by an open-angle-bracket, but will not consume them, so they remain available to the calling routine, namely full_email_address. On the other hand, (!(' '/'<') .)+ says “match any character as long as it is not a space or an open-angle-bracket”. So it will stop when a space or an open-angle-bracket is about to happen.
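If you already know regular expressions, Ruby's own lookaheads behave in a similar way, and a small sketch may help. This is an analogy only: the regular expressions below are rough counterparts of the Treetop patterns, not what Treetop generates.

```ruby
text = 'Charmain Lashunda <c.lashunda_mc@promero.com>'

# Negative lookahead: (?![ <]). plays the role of !(' '/'<') .
# "take any character, as long as a space or '<' is not next".
word = text[/\A(?:(?![ <]).)+/]
puts word                   # Charmain

# Positive lookahead: (?= *<) plays the role of &([ ]* '<')
# "optional spaces and '<' must follow, but do not consume them".
m = text.match(/\A[^ <]+(?: +[^ <]+)*(?= *<)/)
puts m[0]                   # Charmain Lashunda
puts m.post_match.inspect   # " <c.lashunda_mc@promero.com>"
```

The match ends after the last word, and the space and open-angle-bracket are left over for whatever comes next.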
Now we have to figure out how to extract this. The principle is that the calling object (in this case full_email_address) should not have to change, to limit the damage.
Now we actually get to see a feature of Treetop that makes things simpler! We can add routines in the middle of our patterns, not just at the end. This means we can take the if out of our email_name_text_value method. Modify the grammar as follows:
rule optional_email_name
  '"' email_name '"' {
    def email_name_text_value
      email_name.text_value
    end
  }
  / unquoted_email_name &([ ]* '<') {
    def email_name_text_value
      unquoted_email_name.text_value
    end
  }
  / '' {
    def email_name_text_value
      ''
    end
  }
end
What we are saying is that, if the quoted email_name matches, email_name_text_value returns the text value from email_name; otherwise, if the unquoted_email_name matches, email_name_text_value returns the text value from unquoted_email_name; otherwise it returns an empty string.
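One way to picture this is as three small Ruby modules, one per alternative, each supplying its own email_name_text_value. The sketch below is a loose plain-Ruby analogy, not Treetop's generated code; Leaf and the three module names are invented for illustration.

```ruby
# Hypothetical stand-in for a parsed sub-node; not a Treetop class.
Leaf = Struct.new(:text_value)

# One module per alternative, each answering the same message.
module QuotedAlternative
  def email_name_text_value
    email_name.text_value
  end
end

module UnquotedAlternative
  def email_name_text_value
    unquoted_email_name.text_value
  end
end

module EmptyAlternative
  def email_name_text_value
    ''
  end
end

# Build one fake node per alternative and mix in the matching module.
quoted   = Struct.new(:email_name).new(Leaf.new('Jena L. Dovie')).extend(QuotedAlternative)
unquoted = Struct.new(:unquoted_email_name).new(Leaf.new('Charmain Lashunda')).extend(UnquotedAlternative)
empty    = Object.new.extend(EmptyAlternative)

# The caller sends the same message to every node and never needs an if.
values = [quoted, unquoted, empty].map(&:email_name_text_value)
p values   # ["Jena L. Dovie", "Charmain Lashunda", ""]
```

Whichever alternative matched, the caller just asks for email_name_text_value, which is why full_email_address can stay as it is.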
Guess what! Nothing needs to change in our Ruby program! That’s the benefit of limiting the effect of changes. Give it a try with our test string:
"Jena L. Dovie" <jdovie_qs@agora.bungi.com>, <marleen_df@acg-aos.com>; Charmain Lashunda <c.lashunda_mc@promero.com>; "Traci Shauna" <traci_shaunaxp@cs.com>
You should get something like:
I say yes! I understand!
You said the following email addresses:
jdovie_qs@agora.bungi.com, Jena L. Dovie
marleen_df@acg-aos.com
c.lashunda_mc@promero.com, Charmain Lashunda
traci_shaunaxp@cs.com, Traci Shauna
We have now completed the task we set out to do. We have written a simple Treetop grammar to parse our email list, and along the way we have learned about regular expressions, tail recursion, lookaheads and so much more.
After this, sad to say, you’re on your own. You will find Treetop quite counter-intuitive, unless you are naturally used to this type of grammar. The next tutorial will summarize what we’ve learned as well as provide more in-depth debugging techniques to help you write more complex grammars.