My 'go for the throat' coding approach, where I implement the big ideas at the expense of the details, is starting to catch up with me. For example, I currently don't have a good way to specify control characters (such as newline) in my rules. Using the common "\n" yields an NFA that recognizes an 'n'. The backslash works great for converting lexer tokens
"(" => OPEN_PAREN, "("
"\(" => CHAR, "("
but doesn't work so well for n and r.
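The fix is probably a small escape table in the rule parser. I haven't written it yet, but a Python sketch of the idea (names and structure made up for illustration) might look like this:

```python
# Hypothetical escape handling for rule patterns. Named escapes map to
# control characters; any other escaped character (e.g. "\(") is taken
# literally, which matches the current backslash behavior.
ESCAPES = {
    'n': '\n',   # newline
    'r': '\r',   # carriage return
    't': '\t',   # tab
    '\\': '\\',  # literal backslash
}

def unescape(pattern):
    """Turn a rule pattern like r"\n" into the characters the NFA should match."""
    out = []
    chars = iter(pattern)
    for c in chars:
        if c == '\\':
            nxt = next(chars, None)
            if nxt is None:
                raise ValueError("dangling backslash in pattern")
            out.append(ESCAPES.get(nxt, nxt))
        else:
            out.append(c)
    return ''.join(out)
```

With that in place, "\n" would feed an actual newline character into the NFA builder instead of the letter 'n'.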
Other issues involve the lexer and the file it processes. It doesn't handle EOF very gracefully when the last line of the input file isn't terminated with a newline. I suppose it does exactly what it should, from a logical point of view, but that diverges from what I want it to do, which is to insert a NEWLINE at the end of that line and then produce the EOF. So I have to wonder if I should rig the lexer to produce BOF (BOL stuff EOL)* EOF automatically, since that's what I would really like to see. I'm thinking I could feed the lexer either a raw input stream or a cooked input stream, with the latter inserting the file markers when they are absent.

BOL and EOL as explicit tokens give me a nice way of matching strictly-formatted files. The TAS Iscript stuff I intend to eventually parse includes commands that always start at the beginning of the line, at least at our installation. So BOL would give me a nice way to match this.
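The cooked stream could be a thin wrapper over the raw one. A Python sketch, with the markers represented as made-up sentinel strings just to show the shape:

```python
def cooked(raw_lines):
    """Wrap a raw line stream so downstream code always sees
    BOF (BOL chars EOL)* EOF, even when the last line of the
    file lacks a trailing newline."""
    yield 'BOF'
    for line in raw_lines:
        yield 'BOL'
        for ch in line.rstrip('\n'):
            yield ch
        yield 'EOL'  # emitted whether or not the '\n' was present
    yield 'EOF'
```

The point is that the lexer proper never has to special-case a missing final newline; the cooked stream guarantees every line is bracketed by BOL and EOL.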
I am gaining an appreciation for the realities of file I/O, control characters, when to ignore whitespace, etc.
It's getting to the point that I am considering adding another lexer that will be tasked with removing whitespace and inserting file markers. It would be enabled by configuration. When enabled, it would act as a filter on the real token stream.
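As a filter it would be almost trivial. Something like this Python sketch (token representation and names are hypothetical):

```python
def filter_tokens(tokens, drop_whitespace=True):
    """A second-stage 'lexer' acting as a configurable filter on the
    real token stream: when enabled, whitespace tokens are dropped
    before the parser ever sees them."""
    for kind, text in tokens:
        if drop_whitespace and kind == 'WHITESPACE':
            continue
        yield kind, text
```

Since it's just a generator over the real stream, turning it off via configuration means passing the tokens through untouched.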
I'm also flirting with the idea of adding a "not" operator which inverts FAs. In my tiny world, A -> not x -> B produces an FA where A has 255 transitions to B - every possibility except 'x'. Implementing this in the general case (where 'x' in the example could itself be an FA), however, is not straightforward and maybe not useful. More to come on this.
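For the single-character case it really is just brute force over the byte alphabet. A Python sketch (the transition-table representation is invented for illustration):

```python
def invert_char(transitions, src, ch, dst, alphabet_size=256):
    """Sketch of 'not x' for a single character: add a transition from
    src to dst on every byte except ch, giving the 255 transitions
    described above. Inverting an arbitrary sub-FA would instead mean
    determinizing it and complementing its accepting states, which is
    the part that isn't straightforward."""
    for b in range(alphabet_size):
        if chr(b) != ch:
            transitions.setdefault((src, chr(b)), set()).add(dst)
    return transitions
```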
Finally, I have implemented the multiple-lexer idea. I am not sure it is such a great idea, though I don't have a better alternative at this point. I did implement a switch in the lexer generator code to treat characters in a case-insensitive way. This means that instead of adding an 'x' to an FA, I add ('x'|'X'). Remember, if brute force doesn't work, you aren't using enough of it!
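The case-insensitive switch is another one-liner's worth of brute force. In Python it might look like this (again, the transition-table shape is made up):

```python
def add_char(transitions, src, ch, dst, case_insensitive=False):
    """Add a character transition to the FA. With the case-insensitive
    switch on, adding 'x' really adds ('x'|'X'): one transition per
    case variant of the letter."""
    variants = {ch}
    if case_insensitive and ch.isalpha():
        variants |= {ch.lower(), ch.upper()}
    for v in variants:
        transitions.setdefault((src, v), set()).add(dst)
    return transitions
```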