The technique I proposed previously for removing ambiguous transitions isn't wrong, per se, but it isn't complete, either. My parser was croaking on some test input. It was generating an endless number of new states. I finally figured out what it was.
In diagram 1, we see s0 which contains ambiguous references to two subgraphs which can be described as 'a+'. That is, from S0, we can go to S1 or S2 on an 'a'. We apply my l33t technique for removing such ambiguities to arrive at diagram 2. Now, we don't like epsilons, so we apply the 'remove epsilons' algorithm to arrive at diagram 3. Notice that the subgraph starting at state 3 is identical to the FA we started at in diagram 0. We'll never make any progess towards creating a DFA from this NFA.
In fact, there is little to do except give up. The rules are ambiguous and without end.
I amended my algorithm for removing ambiguous references to be, relating to the diagrams above:
"Once we create diagram 3, if two or more transitions point to the original states from S0, (S1 and S2 in this case) the rules are ambiguous and unending. Time to die."
This isn't an outlandish case, either. Define a few words like
CITY : [A-Z]+
HEXSTRING: ([0-9] | [A-F])+
and that's all it takes for the previous method to fail. Now, how to address this? There are a few ways. If you have enough control over your input, amend your rules so the ambiguity can't happen. For example, we could amend HEXSTRING to require a leading 'x'.
CITY : [A-Z]+
Looks familiar, eh?
Sometimes, we can't change the input.
Those are always a sequence of letters. We can generalize, however, so the lexer can build a DFA, and leave applying meaning to the more powerful parser:
Well, that was today's adventure. Hopefully I'll be able to conclude my testing phase soon.