Saturday, March 31, 2007

TAS Must Die, Chapter 4

Ok, here's what I'm working towards:

The 'hand-coded lexer generator' is a java program that, using no real configuration, produces a lexer, which in my world is a series of data and data structures and the bare-bones code to tranverse those data structures. In my case, I generate a lexer that can hopefully recognize everything you'd see in file of lexical rules.

'a lexer' is the lexer generated by the generator.

'input that a lexer is meant to process' is a list of lexical rules. It's the input I'd like the parser to consume, using the lexer to take it apart.

'Hand-coded parser' is a recursive-descent parser (with no backtracking) that consumes the file, using 'A lexer' to break the input into pieces.

'A lexer #2' is the output of the parser. It is a lexical generator that should recognize the rules in 'input that a lexer...' The point is that I'll be able to use this to lex an arbitrary file.

Here's the implementation plan:


To accomplish my true goal, I would write a lexical description of TAS's Iscript language and use 'a lexer #1' to break the input into tokens which a new program, an 'Isript Compiler' would ingest.

Here's how I'll use it once it all works:
Then the pcode will be compiled and interpreted, which is a whole new thing. I'm sure I'll write several chapters about it.

Of course, being unable to just do the job at hand, I'm going to sidetrack a little. In the first diagram, you see the 'hand-coded lexer.' This is odd since I'm writing a parser that should be able to produce that very same lexer. That is, with proper diligence, 'A lexer #1' and 'A lexer #2' should be functionally identical. At this point, the hand-coded lexer can be retired. It's called 'bootstrapping' and it makes my head hurt. It's a chicken-and-egg situation - i'm writing a lexer using the lexer.

A practical effect of boorstrapping the lexer is I'll have two allegedly identical lexers of different origin. I should be able to put large amounts of text through each, and they should respond exactly the same. If they don't, I have work to do.

Hiking with The Scouts

Went on a hike with the boy scouts today in Bernheim Forest, KY, today. We tried for 13+ miles but we didn't make it. We weren't ready. The first 6 miles were quite steep and slick from rain. By then I knew we weren't going to finish. The terrain ahead was as challenging as the terrain behind, the kids were slowing down. and my left knee was acting up. We took a detour and walked 4 miles on pavement to return to the car.

We had a nice time!

Thursday, March 29, 2007

TAS Must Die, Chapter 3

It isn't hard to generate a lexer that has some identical states in it. Identical states aren't wrong, but they are bulky and aesthetically offensive. Removing them will reduce the size of the FA and is a good idea in general.


It's pretty easy to see that states 1 and 4, 2 and 5, and 3 and 6 are functionally identical. It gets less easy when you generate an NFA that has 200 states in it.

So how do we recognize when states are functionally identical?

My first stab was equivalent to trying to reassemble a broken ceramic sculpture by trying all the pieces and parts to see if they fit. It occured to me that testing a universe of disparate states for sameness was not going to be effective. If you have A->x->Z and B->x->Q, you can't tell if A and B are identical because at some stage in the process you may or may not have tested Z and Q for sameness.

It occured to me that it's easier to prove states are different than it is to prove they're the same.
The algorithm below starts off with all the states in a large group. We iterator over the states in the group, querying each for a 'signature' of it's transitions and such. We create new groups and put all states with the same signature in the same group. The signatures are not dependent upon only the states' transitions, but also on the destination group of those transitions. That is, S1->x->s10 and S2 -> x -> S9 look different. But if s10 and s9 are both in group 6, then signature(S1) could be something like "x->G7" and signature(S2) could be "x->G7". That's a match. S1 and S2 could be identical.


make a group G
for each state S
{
G.add(s), where this method includes
s.groupMember = G
}

while(the number of groups keeps changing)
{
for each group g
{
Regroup the states in g into g1, g2, ... gn
based upon signature(state). Whenever
adding a state to a group, we must
be sure to set s.groupMember = owningGroup
}
}

// All groups now contain only redundent states.
For each group G
{
make a new uber state S as a copy of any
member state.
associate S with G
}

// Change the references to the old obsolete
// states with the shiny new uber-states.
For each group G
{
for each of G's uberstate's transitions T
{
T.getDestination = T.getDestinationState.
getGroupMember.
getUberstate
}
}

startState = startState.getGroupMember.getUberstate
end

String signature(state)
{
make a comparable representation of the transitions
and the group number of which the transition's
destination state is a member. If some state S has
transitions x->10, y->11, and z->12, and the groups
are grp#0 = (10 11) and grp#1 = (12), then the result
could be the (well delimited) "~x->0~~y->0~~z->1~"

Also add to the string if the state is stop state,
or any other identifying fact, such as the token
class to be produced by the state. For example
if two states are identical except one produces
a token class of A and the other produces a
token class of B, they aren't identical.

Two states that produce a different result can't
possibly be identical. Either they have different
inputs or the destinations fundamentally different.

Two states that do seem to be identical now could be
proven to be non-identical in a different iteration
of the loop.
}

Tuesday, March 27, 2007

TAS Must Die, Chapter 2

Having a graph that exactly describes all valid inputs is nice, but until it is in the right form, it's of dubious value. Sort of like having a phone book sorted by first name. Consider this NFA which matches abd, acd, ab, and abf.

Note the states with double-circles. These are special states called 'stop states.' Once we consume the entire input string or find ourselves without a valid transition options, if we're in a stop state, the input is valid. Otherwise it isn't. If we scanned "ac" we'd advance from 0 to 5 via epsilon, consume 'a' to get to 6, then consume 'c' to get to 7. Now we're out of input but state 7 isn't a stop state. So 'ac' isn't recognized by this NFA.

But we sort of cheated. If we scanned an 'a', we'd be stuck. We wouldn't know whether to go to state 1, 5, 9, or 12. If we implemented this, we'd have to be prepared to try them all. Ladies and germs, this sucks to do. It's hard to code and easy to get burned by ambiguities.

I'm going to talk about some transformations to convert this NFA into an equivalent NFA that's easier to implement.

We can solve this problem using the following rule: If state x has an epsilon transition to state y, copy all of y's transitions to x and remove x's epsilon to y. Further, if state y is a stop state, make state x a stop state.

Applying this rule to state 0's epsilon to state 1, we create the following NFA.



0's epsilon to 1 is gone, and 0 has assumed 1's 'a' transition to 2.

Now we apply the rule to 0's other epsilon transitions to arrive at this: 's6'



Our NFA is now free of epsilons, so we're closer to having a recognizer that is easy to implement.

However, while removing the epsilons, we've created two new problems:
States 1, 5, 9, and 12 have no inputs. There is no way to get to them. Unreachable states are totally useless and should be removed. This is easy on paper, not always so easy in the computer. Depending on your implementation language, you have to act with varying amounts of vigor. In C, you might keep track of the data structures and free() them if they become unused. I use Java, so once an instance has no references the garbage collector reclaims it. I like not having to worry about it, but then again I can't control the Java's garbage collector. I'm ok with the tradeoff.
So now the unreachable states are gone. There is still a serious problem. Note that state 0 has four transitions that recognize 'a' and take us to different states! Not good. When we scan an 'a', we can't know which transition to take. This is better than the epsilons, which are total wildcards, but not by a lot. Here's how to get rid of these ambiguous transitions:

For an ambiguous input x connecting A x T1, A x T2,... AxTn, create a new state B and connect it to A using x, so AxB. Connect B to all the destination states Tn with epsilons. Yes, we're inserting epsilons back in and thus we'll have to re-run the epsilon-removing code. I'm ok with that. It's certainly possible to do something fancy and create B without epsilons. My take is that code correctness is king. Using the epsilon remover requires less code and thus probably causes fewer bugs. So we're going to create state number 16, and let 0->a->16, and then connect 16 to 2, 6, 10, and 13 using epsilons.

Apply the algorithm to remove the epsilons at state 16... 16 will assume the outputs of 2, 6, 10, and 13, the epsilon transitions are deleted, and then these unreachable states are dropped from the graph.
Now the process repeats. Look at all those ambiguous 'b' transitions. We'll make a 17 and route them there, then epsilon to 3, 11, and 14. Happily, the 'c' transition is fine.
Notice that as we march on, we're applying goodness to the left side of the graph. It's like we're wringing the crap out of it. Which we are, of course. Removing 17's epsilons will drop 3, 11, and 14 out of the graph. Note that 11 has no outputs. It's a stop state. But the rule above says that 17 should also assume the stop state status as well as the outputs.
And we're done. There are no epsilons and no states have ambiguous transitions. Our original NFA was made to recognize abd, acd, ab, and abf. Let's see if it works now.
abd: 0->a->16->b->17->d->4, and 4's a stop state. Check.
acd: 0->a->16->c->7->d->8. 8's a stop state. Check.
ab: 0->a->16->b->17, and 17's a stop state (because we inherited that status from 11.)
abf: 0->a->16->b->17->f->15, stop. check.

The last transformation I employ is an identical state remover. If you have two states that have identical transitions and have the same stop states, those states are identical. One can be removed from the NFA with references to this state being replaced by references to the other. I invented a cool algorithm I'll share soon. (Yes, I realize some white-haired mathematician guy with frightening eyebrows probably invented it 50 years ago. But it is new to me!)

My current implementation features a method in the Nfa class where the epsilon remover, disambigufier, and identical-state combiner are run in a loop until the NFA quits changing. At this point the NFA has become a DFA and is easy to implement.

Sunday, March 25, 2007

TAS must die. Chapter 1.

I've been on an exciting journey of discovery lately. We use a proprietary web server at work called TAS. I suspect it is behaving very very badly. It supports a primitive scripting language which is sort of like QBASIC meets JSP. Except, there's none of the really good parts of either. Like subroutines or sessions. The scripting language is not difficult so I decided to write a translator for it. Then I could write a little servlet that pretended to be TAS. How hard could it be? (dun dun duuuunnnnnn)

Well, it's been several days now and I have not gotten very far. First I fiddled with javacc and antlr. These are java-based parser generators. They provide a banquet but I need a snack. Quite a learning curve. Since I'm doing this for fun on my own time, I decided to write my own lexer and yaccer. These have been favorite subjects of mine since I was in college.

Now, I am aware that people smarter than me have been turning syntax descriptions into code for 50 years. And they do it better. I don't care. It's challenging and I'm having fun.

Converting a language to code is usally done in two main steps - lexical analysis and parsing. The former deals with converting the input file to an equivalent series of tokens. The latter makes sure these tokens are in a recognized order and produces output so we can actually do something.

Let's start with the lexer.

The lexer is a program that ingests input (like a TAS iscript file) and spits out tokens. These tokens are nothing more that the bits of the input that can't be broken down any further without losing meaning. For example, if we're talking about useless animals, "cat" is a token, while "c", "a", and "t" aren't. "cat" represents an orange creature currently atop my monitor, the others are simply letters.

The normal implementation for a lexer is some variation on a state diagram. When we recognize the tokens of a language, we frequently have to handle things like "125.3e-7" or a variable name which could be described as 'one or more letters.' The state diagram is an easy and natural tool for this. The implementation of a state diagram is called a state machine.

Now, converting a real syntax description into a formalized mechanism like a state diagram isn't easy. Sometimes you have to handle inconvenient white space. Sometimes you have ambiguous inputs ("cat" and "cathode" start with the same letters. When you scan a "c" which do you select?) Handling such things as you're building a state machine is very difficult or maybe impossible. You just don't know what's going to happen next. There is a relaxed version of a state machine called a nondeterministic finite automaton, or NFA. Big words, but it means, "one input can lead to many totally valid states, simultaneously." It is possible though unpalatable to implement a NFA for real work. But the NFA is logically valid, is easy to construct, and with some effort we can transform it into the desirable deterministic finite automaton (DFA) once we're done adding new token descriptions.

Here's why it's good to build a NFA. Suppose you have some remarkably complex finite automata F, mercifully represented here by a single state.


Now, your instructions are to change the diagram to recognize zero or one of F's. The NFA provides us with the concept of an 'epsilon' transition. This is a transition that consumes no input whatsoever. That is, epsilon transitions let us make topology changes without immediately saddling us with the practical implications of these changes. Using epsilon transitions, making F optional is trivial:

We solve the problem by create new states X and Y, and sliding them between F and the surrounding diagram. Then we add epsilon transitions so we don't have to worry with the details of F - we're just trying to get the topology right. In reality, when we take the "a" transition from A, we're simultaneously in states X, F, and Y. So we can have one F using the A->X->F->Y->B path. Or we can skip F with the A->X->Y->B path. What if you wanted zero or more F's? Just add a new epsilon transition from Y to X in the above diagram.


Now you can recognize multiple Fs using this path: A->X->F->Y->X->F->Y->B. All this without fighting with the details of F.

Here's some Java code example to apply the 'zero or one' change. In this example, the NFA maintains entry and exit states. I always create new ones when making this sort of topology change since it's easy and accurate.

public void ZeroOrOne()
{
// X above
newEntry = new State();
// Y above
newExit = new State();
// X->Y via eps
abovenewEntry.addTransition(new Transition(EPSILON, newExit));
// X-> F via eps
newEntry.addTransition(new Transition(EPSILON, currentEntry));
// F->Y via eps
currentExit.addTransition(new Transition(EPSILON, newExit));
currentEntry = newEntry;
currentExit = newExit;
}


That's plenty for now. Next time we'll use a real example and remove the epsilons which will move us closer to a DFA.

Friday, March 23, 2007

The first posting

This is a posting so my new blog isn't totally empty.