Tuesday, March 27, 2007

TAS Must Die, Chapter 2

Having a graph that exactly describes all valid inputs is nice, but until it is in the right form, it's of dubious value. Sort of like having a phone book sorted by first name. Consider this NFA which matches abd, acd, ab, and abf.

Note the states with double-circles. These are special states called 'stop states.' Once we consume the entire input string or find ourselves without a valid transition options, if we're in a stop state, the input is valid. Otherwise it isn't. If we scanned "ac" we'd advance from 0 to 5 via epsilon, consume 'a' to get to 6, then consume 'c' to get to 7. Now we're out of input but state 7 isn't a stop state. So 'ac' isn't recognized by this NFA.

But we sort of cheated. If we scanned an 'a', we'd be stuck. We wouldn't know whether to go to state 1, 5, 9, or 12. If we implemented this, we'd have to be prepared to try them all. Ladies and germs, this sucks to do. It's hard to code and easy to get burned by ambiguities.

I'm going to talk about some transformations to convert this NFA into an equivalent NFA that's easier to implement.

We can solve this problem using the following rule: If state x has an epsilon transition to state y, copy all of y's transitions to x and remove x's epsilon to y. Further, if state y is a stop state, make state x a stop state.

Applying this rule to state 0's epsilon to state 1, we create the following NFA.



0's epsilon to 1 is gone, and 0 has assumed 1's 'a' transition to 2.

Now we apply the rule to 0's other epsilon transitions to arrive at this: 's6'



Our NFA is now free of epsilons, so we're closer to having a recognizer that is easy to implement.

However, while removing the epsilons, we've created two new problems:
States 1, 5, 9, and 12 have no inputs. There is no way to get to them. Unreachable states are totally useless and should be removed. This is easy on paper, not always so easy in the computer. Depending on your implementation language, you have to act with varying amounts of vigor. In C, you might keep track of the data structures and free() them if they become unused. I use Java, so once an instance has no references the garbage collector reclaims it. I like not having to worry about it, but then again I can't control the Java's garbage collector. I'm ok with the tradeoff.
So now the unreachable states are gone. There is still a serious problem. Note that state 0 has four transitions that recognize 'a' and take us to different states! Not good. When we scan an 'a', we can't know which transition to take. This is better than the epsilons, which are total wildcards, but not by a lot. Here's how to get rid of these ambiguous transitions:

For an ambiguous input x connecting A x T1, A x T2,... AxTn, create a new state B and connect it to A using x, so AxB. Connect B to all the destination states Tn with epsilons. Yes, we're inserting epsilons back in and thus we'll have to re-run the epsilon-removing code. I'm ok with that. It's certainly possible to do something fancy and create B without epsilons. My take is that code correctness is king. Using the epsilon remover requires less code and thus probably causes fewer bugs. So we're going to create state number 16, and let 0->a->16, and then connect 16 to 2, 6, 10, and 13 using epsilons.

Apply the algorithm to remove the epsilons at state 16... 16 will assume the outputs of 2, 6, 10, and 13, the epsilon transitions are deleted, and then these unreachable states are dropped from the graph.
Now the process repeats. Look at all those ambiguous 'b' transitions. We'll make a 17 and route them there, then epsilon to 3, 11, and 14. Happily, the 'c' transition is fine.
Notice that as we march on, we're applying goodness to the left side of the graph. It's like we're wringing the crap out of it. Which we are, of course. Removing 17's epsilons will drop 3, 11, and 14 out of the graph. Note that 11 has no outputs. It's a stop state. But the rule above says that 17 should also assume the stop state status as well as the outputs.
And we're done. There are no epsilons and no states have ambiguous transitions. Our original NFA was made to recognize abd, acd, ab, and abf. Let's see if it works now.
abd: 0->a->16->b->17->d->4, and 4's a stop state. Check.
acd: 0->a->16->c->7->d->8. 8's a stop state. Check.
ab: 0->a->16->b->17, and 17's a stop state (because we inherited that status from 11.)
abf: 0->a->16->b->17->f->15, stop. check.

The last transformation I employ is an identical state remover. If you have two states that have identical transitions and have the same stop states, those states are identical. One can be removed from the NFA with references to this state being replaced by references to the other. I invented a cool algorithm I'll share soon. (Yes, I realize some white-haired mathematician guy with frightening eyebrows probably invented it 50 years ago. But it is new to me!)

My current implementation features a method in the Nfa class where the epsilon remover, disambigufier, and identical-state combiner are run in a loop until the NFA quits changing. At this point the NFA has become a DFA and is easy to implement.

No comments:

Post a Comment