Reviewing the last entry, the behavior of the two parallel structures on the right could be obtuse. Basically, each state owns a section of these arrays as delimited by begEdge and endEdge. Each row within that range contains a character and the state to go to if we see that character.
That's nice and compact. However, it does constitute a linear scan which could be relatively slow. It's probably immeasuable in any real case. Be that as it may, if the lexer needed to be faster we could replace the parallel edge and dest arrays, as well as the begEdge and endEdge arrays, with a gawd-awful sparse matrix. It would take up more space and probably not be very i18n-able. However, we'd be able to generate tokens about as fast as it could be done.
The table would be a nx256 int array where n is the number of states in the lexer. The 256 ints for a state would represent the transitions possible. The ASCII value of the input character would be an index into this array. The result would be the next state.
nextState = 0; // start state
while(nextState != -1)
input = getInput();
state = nextState;
nextState = sparse[state][input];
// 'state' is where we finished, and 'input' would
// be the unaccepted character that we'd re-ingest
// next time.
Not a lot of code there...
Another option is to go the other way, and say all this array stuff is rubbish. The lexer would include normal classes which we'd instanciate the first time the lexer was used. We could replace the four parallel arrays with a simple class and the two parallel edge arrays with a Map. Instead of having state IDs or Token IDs, etc, we'd use references to instances in our classes.