Tuesday, April 3, 2007

TAS Must Die, Chapter 6

I started working up the lexical definitions for TAS. To keep it simple, I started with the "comment" command:

<!-- #REM Globalization Labels -->

It looks a little like HTML since it is meant to be embedded in an HTML page. TAS is an incestuous mix of HTML and magic commands. TAS looks for "<!-- #" followed by a command - REM in this case. After the REM comes the free-form comment text. While you'd think lexing this would be the easiest thing evah, I found out that the lexical rule to recognize the comment text (.*) conflicted with other lexical rules as per the previous chapter of this trail of anguish. I tried all manner of hokery to get the lexer to work but nothing was remotely satisfying.

A friend of mine suggested that I needed a 'mode' that would let me distinguish between various ambiguous rules. Since I am creating a purely table-driven lexer, putting in a traditional mode (which in Tony-speak is an IF statement) didn't appeal to me.

My proposed solution is to allow the configuration file to contain multiple related lexical rule sets. Here's a representative set of rules:

comment: .*
rem: #REM
number: [0-9]+

The immediate problem is that, again as per the previous chapter, the 'comment' and 'number' rules are ambiguous. Here's one solution:

# lex0
rem, 1: #REM
number: [0-9]+

# lex1
comment, 0: .*

Now, the differences are:
1) There are two sections in the configuration, lex0 and lex1.
2) There is a "1" after the 'rem' token name.
3) There is a "0" after the 'comment' token name.

lex0 and lex1 are two separate lexical definitions. By default, lex0 is active. But when a rule is satisfied and it has an associated number, the default lexer is switched to the one with that number. So lex0 is processing and scans "#REM". In addition to returning the 'rem' token, the default lexer is set to 1. Now we process the comment text and produce the 'comment' token. Because of the associated 0, we change back to lex0. Now we have our mode but no explicit IF statement.
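
The switching described above can be sketched in a few lines of Python. This is only an illustration, not the actual TAS lexer: the LEXERS table layout, the scan function, the first-match-wins policy, and the skipping of unmatched characters are all assumptions made for the example.

```python
import re

# Rule sets keyed by lexer number. Each rule is (token, next_lexer, pattern);
# next_lexer is None when the rule does not switch lexers.
# (This table layout is an assumption made for the sketch.)
LEXERS = {
    0: [("rem", 1, r"#REM"), ("number", None, r"[0-9]+")],
    1: [("comment", 0, r".*")],
}

def scan(text, lexers=LEXERS, start=0):
    """Tokenize text, switching rule sets whenever a matched rule says so."""
    active, pos, tokens = start, 0, []
    while pos < len(text):
        for token, next_lexer, pattern in lexers[active]:
            m = re.compile(pattern).match(text, pos)
            if m and m.end() > pos:          # ignore empty matches
                tokens.append((token, m.group(0)))
                pos = m.end()
                if next_lexer is not None:   # the number after the token name
                    active = next_lexer
                break                        # assumption: first matching rule wins
        else:
            pos += 1  # assumption: silently skip input that no rule matches

    return tokens

print(scan("#REM Globalization Labels"))
# [('rem', '#REM'), ('comment', ' Globalization Labels')]
```

The only place the "mode" lives is the active variable, and it changes only because a table entry says so, which is the point of the scheme.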

Here's another way. Let's say for whatever reason we want the comment as individual characters.

# lex0
rem, 1: #REM
number: [0-9]+

# lex1
comment : .
end , 0: ;

We see the #REM, which causes lex1 to become the default rule-set. Now as we scan the input one character at a time we match 'comment' repeatedly. This shows that we stay in lex1 until the scanner is told to switch to another lexer. When we scan a semicolon (which matches both lex1 rules, but that's a detail for now) we return the 'end' token and switch back to lex0.

Another nice thing about this method is that, with its separate lexers, it is easy to create non-conflicting rules for inputs that have very different sections (a row-oriented report terminating in a summary page, for example).
