Treetop Introductory Tutorial Part 8 of 10 -- Chasing our tail
Our goal has always been to understand a list of email addresses. We would like to apply the full_email_address
rule over and over until our whole list is consumed. Can we? Yes, we can!
Remember, our parser is like the poor UN translator — it can only take things in the order it receives them. So how about if we think of our email list using the following rule:
followed by (maybe) some spaces
followed by either a comma or a semicolon
followed by (maybe) some more spaces
followed by another email list.
Basically, an email list is an email item followed by an email list (which is an email item followed by an email list, which is…). That’s the principle of Tail Recursion. Tail Recursion is an important part of grammars as it allows us to process a list when we don’t know at the start how many items there will be in it.
We can represent this in Treetop as follows:
grammar EmailList
rule email_list
full_email_address [ ]* (','/';') [ ]* email_list
end
rule full_email_address
optional_email_name [ ]* '<' email_address '>'
end
rule optional_email_name
'"' email_name '"' / ''
end
rule email_name
[^"]*
end
rule email_address
[^>]*
end
end
Notice that we put our new rule at the head of the list. The first rule of our grammar is called the root. It represents what, all together, we are looking for. Since we are looking for an email list, it’s appropriate that our root rule is email_list
.
You may be asking why we used (','/';')
instead of the simpler [,;]
. The reasons are rather difficult to explain, having to do with backtracking capability (the ability to pretend we never translated something). From your point of view, items you put in Regular Expressions will not generate error messages when they’re missing, whereas items in quotes will.
This is all very well and fine if we are processing an infinite email list. But email lists, like all good things, have to end some time. So we need to build in an way to end our email list. We already know how to use options, because we did that in the previous tutorial.
Modify your grammar as follows to make the remainder of the list optional:
rule email_list
full_email_address [ ]* optional_email_list
end
rule optional_email_list
(','/';') [ ]* email_list / ''
end
Now we need a way to access the list of email names. What do you think is the easiest way? Of course, you’re right! We add a Ruby routine to extract the email name and address, and then we append the return value from our recursive call to that list. Your code will look something like this:
rule email_list
full_email_address [ ]* optional_email_list {
# return an array of email pairs,
# each pair consisting of an (optionally empty) email name
# followed by an email address.
def list_of_email_pairs
# start out with the email name and address for the first email
email_list = [ [full_email_address.optional_email_name.email_name_text_value, full_email_address.email_address.text_value] ]
# unless we are the last pair,
unless optional_email_list.terminal?
# add in the pairs from the rest of the list
email_list += optional_email_list.email_list.list_of_email_pairs
end
email_list
end
}
end
The modifications to the email list program should be easy. Just use the each method to display the pairs according to our original intent:
puts "You said the following email addresses:"
result.list_of_email_pairs.each do |email_pair|
print " #{email_pair[1]}"
print ", #{email_pair[0]}" unless email_pair[0].empty?
print "\n"
end
Give it a spin with our first two full addresses together:
"Jena L. Dovie" <jdovie_qs@agora.bungi.com>, <marleen_df@acg-aos.com>
Works? Good! Okay, at this point, you have learned quite a bit about Treetop, and you could go on to write more of your own grammars with a little help from the (sparse) documentation.
However, there’s one email address we haven’t figured out how to parse: Charmain Lashunda <c.lashunda_mc@promero.com>
To parse that address we have to look at some of the advanced features of Treetop grammars. Read on, if you dare!