Sunday, March 25, 2007

TAS must die. Chapter 1.

I've been on an exciting journey of discovery lately. We use a proprietary web server at work called TAS. I suspect it is behaving very very badly. It supports a primitive scripting language which is sort of like QBASIC meets JSP. Except, there's none of the really good parts of either. Like subroutines or sessions. The scripting language is not difficult so I decided to write a translator for it. Then I could write a little servlet that pretended to be TAS. How hard could it be? (dun dun duuuunnnnnn)

Well, it's been several days now and I have not gotten very far. First I fiddled with javacc and antlr. These are java-based parser generators. They provide a banquet but I need a snack. Quite a learning curve. Since I'm doing this for fun on my own time, I decided to write my own lexer and yaccer. These have been favorite subjects of mine since I was in college.

Now, I am aware that people smarter than me have been turning syntax descriptions into code for 50 years. And they do it better. I don't care. It's challenging and I'm having fun.

Converting a language to code is usally done in two main steps - lexical analysis and parsing. The former deals with converting the input file to an equivalent series of tokens. The latter makes sure these tokens are in a recognized order and produces output so we can actually do something.

Let's start with the lexer.

The lexer is a program that ingests input (like a TAS iscript file) and spits out tokens. These tokens are nothing more that the bits of the input that can't be broken down any further without losing meaning. For example, if we're talking about useless animals, "cat" is a token, while "c", "a", and "t" aren't. "cat" represents an orange creature currently atop my monitor, the others are simply letters.

The normal implementation for a lexer is some variation on a state diagram. When we recognize the tokens of a language, we frequently have to handle things like "125.3e-7" or a variable name which could be described as 'one or more letters.' The state diagram is an easy and natural tool for this. The implementation of a state diagram is called a state machine.

Now, converting a real syntax description into a formalized mechanism like a state diagram isn't easy. Sometimes you have to handle inconvenient white space. Sometimes you have ambiguous inputs ("cat" and "cathode" start with the same letters. When you scan a "c" which do you select?) Handling such things as you're building a state machine is very difficult or maybe impossible. You just don't know what's going to happen next. There is a relaxed version of a state machine called a nondeterministic finite automaton, or NFA. Big words, but it means, "one input can lead to many totally valid states, simultaneously." It is possible though unpalatable to implement a NFA for real work. But the NFA is logically valid, is easy to construct, and with some effort we can transform it into the desirable deterministic finite automaton (DFA) once we're done adding new token descriptions.

Here's why it's good to build a NFA. Suppose you have some remarkably complex finite automata F, mercifully represented here by a single state.

Now, your instructions are to change the diagram to recognize zero or one of F's. The NFA provides us with the concept of an 'epsilon' transition. This is a transition that consumes no input whatsoever. That is, epsilon transitions let us make topology changes without immediately saddling us with the practical implications of these changes. Using epsilon transitions, making F optional is trivial:

We solve the problem by create new states X and Y, and sliding them between F and the surrounding diagram. Then we add epsilon transitions so we don't have to worry with the details of F - we're just trying to get the topology right. In reality, when we take the "a" transition from A, we're simultaneously in states X, F, and Y. So we can have one F using the A->X->F->Y->B path. Or we can skip F with the A->X->Y->B path. What if you wanted zero or more F's? Just add a new epsilon transition from Y to X in the above diagram.

Now you can recognize multiple Fs using this path: A->X->F->Y->X->F->Y->B. All this without fighting with the details of F.

Here's some Java code example to apply the 'zero or one' change. In this example, the NFA maintains entry and exit states. I always create new ones when making this sort of topology change since it's easy and accurate.

public void ZeroOrOne()
// X above
newEntry = new State();
// Y above
newExit = new State();
// X->Y via eps
abovenewEntry.addTransition(new Transition(EPSILON, newExit));
// X-> F via eps
newEntry.addTransition(new Transition(EPSILON, currentEntry));
// F->Y via eps
currentExit.addTransition(new Transition(EPSILON, newExit));
currentEntry = newEntry;
currentExit = newExit;

That's plenty for now. Next time we'll use a real example and remove the epsilons which will move us closer to a DFA.

No comments:

Post a Comment