Sunday, April 29, 2007

TAS Must Die, Chapter 20

I made the changes to the Operator precedence parser as per my previous entry. Works great.

I decided that I probably had a cruft buildup in my Executable subclasses, so I started writing some junit tests. They aren't traditional low-level tests. I just couldn't build meaningful Executables without writing a lot of code or making the Executable trivial. So I ended up writing test classes that focused on testing an Executable class, but if other Executables had to be invoked, so be it.

Each junit test includes a piece of iscript that exercises some Executable. This is parsed, then compiled into Executables, which are then executed. For conditionals, I might have the iscript emit a 'true' or 'false' piece of text in the appropriate part of the if/else/endif. For expressions, I fetch the result from the global variable hashmap and compare it to an expected value.
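Here's roughly what one of these tests looks like. This is a minimal sketch: the IscriptRunner helper, its run() and getGlobals() methods, and the expected "7" are stand-ins I made up, not the real plumbing.

import java.util.HashMap;

import junit.framework.TestCase;

public class SetExecutableTest extends TestCase {

    // Parse a snippet of iscript, compile it to Executables, run them,
    // then check the result in the global variable hashmap.
    public void testArithmeticAssignment() throws Exception {
        IscriptRunner runner = new IscriptRunner();   // hypothetical helper
        runner.run("<!-- #SET NAME = BOB VALUE = 3 + 4 -->");

        HashMap globals = runner.getGlobals();
        assertEquals("7", globals.get("BOB"));        // expected value is a guess
    }
}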

So far I turned up a little crud, nothing too bad. I need to see how iscript handles weird cases, such as ("a"*3) or ("bob" && true). Once I know the proper behavior for the corner cases I'll be able to fix the rest of the junit errors.

Oh, and if you aren't using junit, you ought to be. It's a Sure Thing.

Saturday, April 28, 2007

TAS Must Die, Chapter 19

I expanded the operator precedence parser (OPP) to handle explicit types of IDs the right way. Then I corrected all the fudging I did with the datatypes. Suddenly the whole thing really works better. Quite satisfying.

I missed another syntax element, namely that a variable being set can also be indexed:

<!-- #SET NAME = BOB [ expression ] VALUE = expression -->

With a little agony, I was able to use the same OPP for this. The first gotcha was that by the time the recursive descent parser (RDP) has scanned BOB and [, we've read too much to satisfy the syntactic needs of the OPP. Fortunately I had built a pushback into my TokenStream class. So upon scanning "BOB[" in the RDP, I push BOB and [ back onto the stream and then invoke the OPP which scans a reasonable expression and works fine. The second gotcha was that I was using the --> token to know when the expression was over. While this works for

VALUE = expression -->

it doesn't work for

NAME = BOB [ expression ] VALUE...

Lovely, eh? I modified the OPP to accept an 'end token'. Now I pass --> or VALUE to the OPP as appropriate for the situation. This produces the right result. Even so, I don't like it - it doesn't feel like a bulls-eye.
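As an aside, the TokenStream pushback that makes the first fix possible is nothing fancy. A minimal sketch of the idea, with method names that are mine rather than necessarily what's in the code:

import java.util.Stack;

public class TokenStream {

    private final Lexer lexer;                    // the underlying token source
    private final Stack pushedBack = new Stack();

    public TokenStream(Lexer lexer) {
        this.lexer = lexer;
    }

    // Pushed-back tokens come out (most recent first) before anything new is scanned.
    public Token next() {
        if (!pushedBack.isEmpty()) {
            return (Token) pushedBack.pop();
        }
        return lexer.nextToken();
    }

    public void pushBack(Token t) {
        pushedBack.push(t);
    }
}

So after scanning BOB and [, the RDP pushes [ and then BOB back (last in, first out) and hands the stream to the OPP.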

Currently the OPP throws an exception if it scans a token which it can't find in its OP table. I believe I can simplify by using this event to inject a synthetic end-of-expression token and let the RDP handle any syntax error caused by the mystery token.

For example, the following input would cause an exception:

id1 + id2 XYZZY

Assuming XYZZY is not the end-of-expression token, what's key is that some parser notes the syntax error. This is the situation I hit last night: input like this WAS valid under the new use of the OPP, so I changed the parser to let me specify XYZZY as the end-of-expression token.

Now, upon reading a token that isn't in the OP tables (such as VALUE or -->), the OPP pushes that token back onto the stream and uses a totally fabricated 'end of expression' token in its place. The expression is then reduced as per normal, and the mystery token is left available to the RDP.
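In sketch form, the OPP's token fetch would look something like this (names are illustrative; END_OF_EXPRESSION is the fabricated token, and opTable stands in for whatever the real lookup is):

// Inside the OPP: fetch the next token, but if it isn't one the tables
// know about, push it back and pretend the expression just ended.
private Token nextExpressionToken(TokenStream stream) {
    Token t = stream.next();
    if (opTable.containsKey(t)) {
        return t;                        // a real operator or operand
    }
    stream.pushBack(t);                  // leave the mystery token for the RDP
    return Token.END_OF_EXPRESSION;      // synthetic end-of-expression marker
}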

Here are some possibilities:

<!-- #SET NAME = BOB [ a+b ] VALUE = 10 -->

While parsing BOB[a+b]..., the OPP scans VALUE, inserts the end-of-expression token, produces pcode for BOB[a+b], and pushes VALUE back onto the stream. Now the RDP resumes with VALUE. We're good. Then the OPP sees the 10, gets confused by -->, pushes it back, and injects the end-of-expression. 10 is valid. The --> is left to the RDP, which likes it.

<!-- #SET NAME = BOB [ a+b ] VAXUE = 10 --> (note the typo in VALUE)

After the OPP scans the ], it sees VAXUE, which the lexer has probably misinterpreted as a variable. The OPP checks its tables, finds it (!), and throws an error since a VARIABLE isn't allowed after a ]. Good.

<!-- #SET NAME = BOB [ a+b ] WRONGTOKEN = 10 -->

Someone typed a known token accidentally. The OPP sees it, doesn't have a table entry, injects the end-of-expression token, and pushes back WRONGTOKEN. BOB [ a+b ] is recognized normally since it is valid. WRONGTOKEN is then scanned by the RDP, which throws an error since it expects VALUE. So we're still good.

Stay tuned. This will let me unravel some of the unsatisfying crud I did last night.

Friday, April 27, 2007

TAS Must Die, Chapter 18 (or, why the STAY PUFT Marshmallow Man pwns me)

Every time I look, I find another stinky wart in my code, lol.

Instead of adding any type of factory or if/else/switch facility to allow me to create specific pcode instances, I made the operator precedence table very, very explicit. So when I reduce a '*' I know to generate a MultiplyPcode instance. Previously I generated a generic "binary" pcode with "*" as an attribute. But then getting from "*" to a representative class required a Factory or a switch().

Now, the explicit parser table works great, though adding new Redux/Pcode/Executable classes is tedious. But there's nothing to break in the approach, so once it works, it will work forever. But is it too explicit? Individual rows for every possible operator and ID type are great, totally accurate, and easy to use. But the table is getting unwieldy. The very incomplete table is already 15x15 or so. I can see it floating up to 21x21.

Currently, the table uses the stack's topmost operator on one axis and the input operator on the other. The intersection identifies the action to perform.

This works great until you get 6 operators with the same precedence. Like >, >=, ==, !=, <=, <. Now I'm adding a huge number of virtually identical cells to the table. Perhaps associating the operators with a numeric precedence would have allowed the table to be structured differently and a lot smaller. Instead of:


action = crossIndex(stackOp, inputOp)


where action is shift() or MultiplyReduce(), I could have done:


diffTypeOfAction = crossIndex(stackOp.precedence, inputOp.precedence)


And now if diffTypeOfAction is reduce(), you execute stackOp.reduce(). That extra step seems to allow the table to be about 1/4 the current size without any loss of functionality. Since my operators are all left-associative and of consistent precedence (there is no a>b>c>a type of precedence chain) I think I can create a reasonable set of abstract precedences.
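Here's the shape I have in mind. A rough sketch with made-up names; since everything is left-associative with a straight precedence order, the cross-index effectively collapses to a numeric comparison, and each operator carries its own reduction:

public class PrecedenceSketch {

    interface Reduction {
        void reduce();                 // e.g. MultiplyReduce for '*'
    }

    static class Op {
        final int precedence;          // abstract numeric precedence
        final Reduction reduction;     // the operator's own reduce action

        Op(int precedence, Reduction reduction) {
            this.precedence = precedence;
            this.reduction = reduction;
        }
    }

    // diffTypeOfAction = crossIndex(stackOp.precedence, inputOp.precedence),
    // and if it says reduce, run stackOp's own reduction.
    static void act(Op stackOp, Op inputOp) {
        if (stackOp.precedence >= inputOp.precedence) {
            stackOp.reduction.reduce();
        } else {
            shift(inputOp);
        }
    }

    static void shift(Op op) {
        // push op onto the operator stack
    }
}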

Making the table too explicit caused me to make another poor decision. I fudged ID-type tokens. Because I was tired of adding rows and columns to the precedence table one night, I slammed the Token's class to VARIABLE so I didn't have to mess with numeric or string constants. I knew I'd be revisiting it. But like 3 of the 4 Ghostbusters, I didn't know the form of The Destroyer. In my case, it's a bunch of Executables that don't know the type of their operands! D'oh!

Perhaps the Stay Puft marshmallow man would have been better.

Thursday, April 26, 2007

TAS Must Die, Chapter 17

I started working on code generation. For me, this is converting a list of pcodes to a list of executables, executables being classes that implement an interface called Executable. If you can imagine it, it requires implementors to define a method called "execute()".

Preliminary work on the assign, emit, and emitVariable executables went well but did point out flaws in my symbol management, or lack thereof. There are a few places where I 'forget' that some operands might be constants and these won't be found in the run-time datastructure of iscript variables. I could cheese out and check the values for a leading number or quote and figure it out that way. But fixing the problem is the right answer. This will reverberate all the way into the operator precedence parser, I fear, and ultimately cause me to add two more rows and columns to the precedence table.

It's probably just as well; my current mechanism, as I think about it, probably doesn't complain about having a constant as an l-value. I don't know what would happen if one slipped through.

Once I have the symbol issue resolved, I'll be sailing again.

Last night's main trauma was an issue caused by my BinaryOperatorPcode class. This class manages an operator, two operands, and the result of an operation. Think

result <- op1 operator op2

So I have this class which can display the pcode for any binary operator - plus, minus, divide, multiply, and, or, etc. It works great. The problem comes when I want to generate the Executable instance. But which Executable? It's one thing to print a string like "+" or "*". It's quite another to create an instance of a class that does the actual work.

So I have a pcode class with any one of 15 or so binary operators in it. How DO I determine the proper Executable? One way is to use some hideous switch() statement in the pcode... I hate these, and long if-else-if chains, for two reasons. First, such statements imply bad OO design. Sometimes they are necessary, but they can be hidden in a Factory or Builder of some sort. Second, with a little work, you frequently find that there's no need for the switch() at all. switch() is an easy symptom-fix.

I think that this is the case here. When in the operator parser, I know each operator specifically since I have to shift and reduce based upon each operator's specific precedence. Invoking the "BinaryOperatorRedux" class was convenient but ultimately unhelpful with regards to generating the very specific Executable instances.

I'll revamp my precedence table and remove all of the BinaryOperatorRedux references. I'll replace them with references to operator-specific reduction classes. Since operator manipulation is identical for all binary reductions, specific classes will almost certainly subclass an augmented BinaryOperatorRedux class, supplying a "protected Pcode getPcode()" method. And once I have an operator-specific pcode class I don't have to do any logic to figure out the appropriate Executable. MultiplyPcode will probably produce MultiplyExecutable, after all.

The cost of this is a larger precedence table and many more classes. But the classes are all trivial. Each operator-specific reduction class adds one 'run time' line of code to the system - the code that tells which Pcode class is proper. Each operator-specific pcode class again adds one 'run time' line of code - the code that tells which Executable is appropriate. Each operator-specific Executable class will add one 'run time' line of code to the system - the code that does the actual work.
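To make the 'one run-time line per class' idea concrete, the multiply chain might end up looking roughly like this (the getPcode() hook is the one mentioned above; the other base classes and method names are guesses):

// Reduction class: its one interesting line says which Pcode to build.
public class MultiplyRedux extends BinaryOperatorRedux {
    protected Pcode getPcode() {
        return new MultiplyPcode();
    }
}

// Pcode class: its one interesting line says which Executable to build.
public class MultiplyPcode extends BinaryOperatorPcode {
    public Executable toExecutable() {
        return new MultiplyExecutable();
    }
}

// Executable class: its one interesting line does the actual work.
public class MultiplyExecutable extends BinaryOperatorExecutable {
    protected Object apply(Object op1, Object op2) {
        return new Integer(((Integer) op1).intValue() * ((Integer) op2).intValue());
    }
}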

As I typed this, I became sure that I have a good answer. I'll add a fair number of one-liner classes and reduce run-time complexity by eliminating switch() statements.

Monday, April 23, 2007

TAS Must Die, Chapter 16

Made nice progress lately. The operator precedence parser is integrated now and works fine. Discovered I had forgotten that within the HTML, iscript allows you to embed iscript variables. For example, in:

my name is (*name*)

(*name*) will be replaced with the value of an iscript variable named "name."

I added the '(*' and '*)' tokens to the lexer file and built a new iscript lexer. Then I changed the iscript parser to recognize this and produce a slightly different 'emit' command. It worked the first time. /flex

Here's a snippet of my pcode, which is pretty darn experimental:

[screenshot of the generated pcode]

I'll probably start the servlet next. It shouldn't be any harder than anything else. :-/

I'll need a translator to convert the pcode to executable stuff, and probably a really lame caching mechanism. Then I should be able to see pages display.

I'll then be in a loop of:

1. parser doesn't handle X
2. lexer doesn't support the tokens
3. mod lexer
4. mod parser
5. view using servlet
6. goto 1 until exhausted

Friday, April 20, 2007

TAS Must Die, Chapter 15b

I implemented an operator precedence parser in a more object-oriented way. I am not sure if it's actually better than the common C implementation.

It currently doesn't have the mojo for IScript's syntax. But I could get there simply by defining precedence rules. It would probably take me about 20 minutes.

I ended up with 5 new notable classes and interfaces:

1) A ParserAction interface, instances of which perform shift, accept, reduce, and error processing. A traditional OPP includes a table of actions which are scalars. The scalars are used in some manner of switch or if/else construct to find the code snippet to execute. A more hardcore C implementation would use pointer-to-function. In my OO implementation, the table includes the action class to execute. This is analogous to C's pointer-to-function construct. I currently have 4 implementations of this interface. When the final parser is all done, it would be easy to have 15.

2) A map of input-Token-to-action. The key would be the input Token. The value would be an instance of ParserAction.

3) A Map of Token-to-#2. The key is a Token (taken from the top of the operator stack.) The value is the map described in #2 above. When the ParserAction says to shift, I use the input Token to find the appropriate #2 from this map, and I push it onto the operator stack.

4) A Stack of 'operators'. This is the biggest departure from the typical 'student' OPP implementation. In the trivial case, you push operator characters onto the stack. When it comes time to pop the stack and do something, you do a lookup to determine its index into the precedence array. Instead of pushing the operator character (like a "+") or a more OO Token, I push the entire map from #2 above. Now when it is time to decide how to treat an input Token that's an operator, I peek() the map on the stack, do a get() from it using the input Token as the key, and invoke the resulting ParserAction. There are no extraneous lookups in this implementation.

5) A Stack of IDs. When we scan an ID, push it onto this stack. When we reduce, we'll pull from this stack and sometimes push back a temp variable.

The code works well.
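Here's a very rough skeleton of how those five pieces hang together. The real classes are richer; Token and TokenStream are my existing classes, and the rest of the names are for illustration only:

import java.util.HashMap;
import java.util.Map;
import java.util.Stack;

public class OppSkeleton {

    public interface ParserAction {                     // #1
        void perform(OppSkeleton parser, Token input);
    }

    // #3: top-of-stack Token -> (#2: input Token -> ParserAction)
    private final Map actionTable = new HashMap();

    private final Stack operatorStack = new Stack();    // #4: holds #2 maps
    private final Stack idStack = new Stack();          // #5: IDs and temps, used by reduce actions

    private boolean accepted = false;

    public void parse(TokenStream in) {
        // ...prime operatorStack with the start-of-expression row...
        while (!accepted) {
            Token input = in.next();
            Map row = (Map) operatorStack.peek();       // peek the #2 map on top (#4)
            ParserAction action = (ParserAction) row.get(input);
            action.perform(this, input);                // shift, reduce, accept, or error
        }
    }

    // A shift pushes the input token's own row, found via the #3 map.
    void shift(Token input) {
        operatorStack.push(actionTable.get(input));
    }

    void accept() {
        accepted = true;
    }
}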

I didn't get the elegant lean code I was looking for. Perhaps I expect too much. I do have a very nice parser that doesn't feature some god-awful if/else construct or switch statement that's as long as your arm. The actual core of the implementation is perhaps 8 lines of code. The rest includes a lot of setters and getters and other 'non-breaking' code.

Wednesday, April 18, 2007

TAS Must Die, Chapter 15

While spelunking with Iscript syntax, I noticed that the expression on an IF statement is fairly robust with logicals and parentheses. Once you can have nested parentheses, you have to support a stack of some sort or you just can't parse it properly. I did the expression parsing in P1 using a recursive descent parser. I wasn't in the mood to remove the left recursion from Iscript's more complex conditional statements, so I decided to implement an operator precedence parser.

The typical OPP includes a table of precedences. This tells you if 1+3*5 should be treated as (1+3)*5 or 1+(3*5).

I've done a few prototypes now to refresh my memory. It occurred to me that I was implementing an algorithm in Java using the exact same architecture I used many years ago in C. And it was clunky. So I scrapped it - I decided that the table of precedences just doesn't work as well in Java as it does in C.

I'll likely replace the traditional grid with a class that is initialized with Token-on-stack and Token-on-input pairs. Each pair will resolve to a ParserAction class which guides the parser. Java doesn't have a very nice initialization mechanism so it will be annoying. But the part that does the actual parsing will be lean.

Monday, April 16, 2007

TAS Must Die, Chapter 14

I started creating the P2 parser this weekend. I took a delightfully Neanderthal approach, which was to write just enough code to dump the iscript tokens to sysout, find a cohesive batch of them (like the SET statement, for example), and code it up.

It didn't take long to get REM, SET, and INCLUDE mostly working. I suspect as I feed different iscript files through the parser I'll find different cases that the parser doesn't handle. I don't actually have any documentation for iscript. I have no clue as to the full extent of the supported syntax. That's made maintaining it for these last 2 years pretty invigorating at times.

What's interesting about this phase of the project is designing the pcode. I think of pcode as über assembly language. Each pcode statement will do a specific task and contain all the information to do so. For example, if I know a token is a number, I'll probably mark it as such in the pcode.

|set|someVariableName|314|

The pcode interpreter could figure out if 314 is an integer or a variable. But I've already done that in the lexer and parser. So why not have

|set|someVariableName|314,constant,integer|

I don't know if the pcode interpreter will need all the information, but if I have it I might as well provide it.

I'll probably start writing the Factory in parallel. The factory will accept a list of pcodes and return a list of executable classes. This will allow me to execute the pcode - and check my work.
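The factory itself should be boring. A rough sketch of the shape, leaving the interesting part - how each pcode maps to its Executable - unresolved for now:

import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

public class ExecutableFactory {

    // Walk the pcode list and hand back one Executable per pcode.
    public List build(List pcodes) {
        List executables = new ArrayList();
        for (Iterator it = pcodes.iterator(); it.hasNext();) {
            executables.add(createExecutable((Pcode) it.next()));
        }
        return executables;
    }

    protected Executable createExecutable(Pcode p) {
        // the pcode-to-Executable mapping isn't decided yet
        throw new UnsupportedOperationException("not written yet");
    }
}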

Sunday, April 15, 2007

Adventures in Anodizing Aluminum

My wife told me to get a hobby or die. So my friend and I are building a tube amp - a copy of the venerable Fender 5F1 circuit. First, it is amazing how involved it's been to convert that schematic to a real thing. There are the parts called out for by the schematic, but then the nuts, bolts, bushings, washers, wire, mounting hardware, you name it. It's been eye opening and educational which is the whole point of the project. I could buy a vintage amp for less than I spent making this one, lol.

Now, we're making everything that a couple of hacks can make. The control panel plate is the thing the guitar plug and volume knob are mounted to. We made ours from aluminum. It's what we had. But aluminum is soft and scratches easily. I could spray it with a Krylon enamel rattle-can, but that's boring and I have spent a LOT of time on this amp. I don't want to cheese out now. We decided to anodize it.

Aluminum rusts, just as steel does. That's why el-cheapo lawn furniture can turn your hands black. Anodization is a process where the aluminum gets a different type of rust than normal. The anodized surface is a thin coating that is almost as hard as diamond. Remarkably, the basic process is simple and as safe as anything that involves sulphuric acid and a battery charger.

It's a three-step process.

The first step is where we run current from the battery charger through a lead cathode (the - side of the charger), through the acid, through the aluminum parts to be anodized, and back to the anode (+ side) of the charger. This causes a layer of aluminum hydroxide to form on the aluminum parts.

The second step is where you drop the now-anodized aluminum part into a bath of common clothing dye. The aluminum hydroxide surface has microscopic pores that are open. The dye can sneak right into those pores.

The third step is where you boil the parts in water. The heat makes the pores close, trapping the dye and hardening the surface, which, magically, becomes a 'hydroxide monohydrate' which I do not pretend to understand.

All this, and you can't really over-do it. As you apply the current to the acid bath, the oxide layer starts forming on the parts. This layer does not conduct electricity well! As the layer gets thicker and thicker, it conducts electricity less and less. So the process is self-limiting. After a while, electricity can't flow, and it all just... stops. Crazy. The Lord provides.

Armed with our knowledge, we tried an experiment yesterday. We produced lovely un-anodized aluminum and burned our anode hangers in half. Funny, but really, I didn't expect total failure. I thought we'd get a mediocre result and have to make an adjustment. We did deviate from The Plan a little, so next time we'll follow the instructions better.

Here is an example of some parts anodized by a professional. Look around, you're probably surrounded by anodized aluminum. Stay tuned.

Saturday, April 14, 2007

TAS Must Die, Chapter 13

Here's where I'm heading. The square boxes below are programs. The ovals are lexers or other output files. The round-corner boxes are input files that will be consumed by lexers.

L1Gen is the hand-coded lexer generator. It produces L1, a lexer for lexer descriptions. L1 produces the tokens that allow P1 to read a lexer description and produce a lexer. L2 is a lexer that tokenizes iscript files. P2 is a parser that will use L2, ingest an iscript source file, and produce pcode that is the logical equivalent of the iscript source file. The TASMustDie servlet will ingest the pcode and produce HTML which is lobbed at the browser. I expect the servlet will be non-trivial and implement the Command and Factory design patterns. Basically, the servlet sucks in the pcode, rips through it by line, and invokes the Factory to produce Command instances. Now we have an executable version of the textual pcode.

Everything through and including L2 is working as far as I can tell. So now I should start writing P2 and designing the Pcode. Iscript isn't very tricky until we get to the point where we have to communicate with the Tandem. Then the syntax gets a little different. So I probably have more lexer work in my future. For all that, I don't foresee any difficulties. I mean, what could possibly go wrong??!

I'm taking a breather now. So far, the lexer generator is working. I'll probably create a pretty brutal test of it soon. This will involve creating an iscript lexer (L2 in the diagram) using L1Gen's lexer, and creating an alternative to L1Gen's lexer using the third iteration of a bootstrapped lexer. I can probably run 10,000 tokens through each one, easy. If the output streams aren't the same, there's a problem.

Friday, April 13, 2007

TAS Must Die, Chapter 12.

Just a quick note. Today I was able to bootstrap the lexer-generating parser. That is, the parser can now generate a lexer which it can use to generate an identical lexer. At one pass of the optimization phase, the lexer flies up to about 5,500 states then finally optimizes down to about 50. I have no idea why it's producing all those temporary states. The hand-coded lexer generator uses the same backend calls and doesn't seem to have any difficulties.

I suspect I'll be revisiting this at some inconvenient time.

Thursday, April 12, 2007

TAS Must Die, Chapter 11

More on design patterns. I have this graph of states, and I need to navigate it for multiple reasons. Sometimes I want to count the nodes. Sometimes I want to group them to remove redundancies. Sometimes I just want to generate a printable version. Or generate the lexer.

Say you want to print the state diagram. Normally what happens is that you write the navigation routine and right there in the middle of it, you put your "print" function. You get that working. Then you decide to count the items in the data structure. The navigation code you already have has the print stuff buried in the heart of it. That's no good for counting. So you copy/paste the navigation code and replace the "print" stuff with "count" stuff. About the 3rd time you do this, you realize it sucks and you want a different solution. In C, you'd pass a pointer-to-function as an argument. You'd navigate and then call the function to do the task-specific work. Pretty good stuff, but maybe a little fast-and-loose. As normal with C, you're at the mercy of the programmer's skill, mentality, and deadline.

Java has no pointer-to-function language structure. But using the Visitor design pattern we can cook up something that acts the same and has a little type checking in it. The wikipedia link is very good, but I'll paraphrase.

1. Create an interface called Visitor that has a visit(SomeObject) method, where SomeObject is germane to your problem domain. For this project, I had a graph of states, so my method was


public interface Visitor {
    void visit(State s);
}


2. To the class that manages the navigation data structures, add the following:


void accept(Visitor v) {...}


This is the class that manages navigation. For a binary tree of type Node, you might have


public class Node {
    // stuff
    public void accept(Visitor v) {
        if (leftChild != null) leftChild.accept(v);
        v.visit(this);
        if (rightChild != null) rightChild.accept(v);
    }
}


Rock solid. Now, accept() is totally divorced from what the Visitor is actually doing. Visitor is an interface so by definition there's no implementation and no way for accept() to make any assumptions. A particular Visitor could generate a printable Node, count the nodes, search for a match, anything.

In my case, since my graphs (and states) are managed by the Nfa class, I implemented the method there.

3. Create an implementation of Visitor. Add the function-specific code to the visit method.


public class CountVisitor implements Visitor {
    private int count = 0;
    public void visit(State s) {
        count++;
    }
    public int getCount() {
        return count;
    }
}


4. Add any auxiliary methods to the Visitor implementation as needed, such as getCount() above.

Here's an example from my project.



When I generate my lexer, it's important that my states be numbered starting at zero. For efficiency, the state id will be used as an index into an array! Given that I am creating and destroying hundreds of states as I convert the NFA to a DFA, the state IDs are all over the place. As I navigate my graph of states, the visit(State) method is called once for each state. The visit method simply adds a reference to the state to an ArrayList. Now I know all the states.
When I am done, I call the renumber() method and just iterate over the states, reassigning their IDs. To use the visitor, the following does the trick:

StateRenumbererVisitor v = new StateRenumbererVisitor();
nfa.accept(v);
v.renumber();
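For completeness, the renumbering visitor is about as simple as visitors get. A sketch (it assumes State has some kind of setId(); the real method name may differ):

import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

public class StateRenumbererVisitor implements Visitor {

    private final List states = new ArrayList();

    // Called once per state during navigation; just remember the state.
    public void visit(State s) {
        states.add(s);
    }

    // After navigation, hand out fresh IDs starting at zero.
    public void renumber() {
        int id = 0;
        for (Iterator it = states.iterator(); it.hasNext();) {
            ((State) it.next()).setId(id++);
        }
    }
}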

It can be hard to get your mind around this. Once you do it, however, you'll be hooked.

University of Kentucky

The family visited the University of KY today. My son is in the process of selecting a school. So we got an appointment for a group-tour/dog and pony show. I went to UK in the early 80's, so I figured I'd go down there and show him around.

Wrong.

The place is totally different now. I suppose I shouldn't be surprised after 25 years, but holy cow. So many more buildings, so much more everything. Plenty of green space still. The huge concrete area outside the Patterson tower is gone, replaced by landscaped green space. There's a new student center, even newer than the one they opened in 84. The old grilles are gone, replaced by chain restaurants. The dining area in the old student center is huge. The new library is magnificent, multiple floors, superbly tasteful architecture, open and airy. And the new engineering building is impressive in its own way - it looks like a modern take on an old factory. But with lots of glass. And there's a "gym" where no classes are held - it's strictly for students' health. It sports a 30' rock wall, a large indoor running track, all manner of weights, and a single room with four side-by-side full-size basketball courts. It's huge. There's also an outdoor track and an aquatic center. And the programs my son is interested in (some manner of business) are very highly rated.

Eye opening trip to my old stomping grounds. Man, I'm obsolete.

Wednesday, April 11, 2007

TAS Must Die, Chapter 10b

I figured I'd goof off with the Decorator pattern. It didn't take long to find a more satisfying solution to the problems outlined previously.

The gist of the Decorator pattern is that you have potentially several classes, each implementing some interface X. They each accept an X as an input stream. So you can daisy-chain them together as needed.

IntX t1 = new GetWords();   // GetWords implements IntX
IntX t2 = new Trim(t1);     // Trim implements IntX
IntX t3 = new Upper(t2);    // Upper implements IntX
loop(t3);

private void loop(IntX q)
{
    while (true)
    {
        System.out.println(q.read());
    }
}



So as we buzz in the while loop, we read from t3, t3 reads from t2, and t2 from t1. I don't know where t1 gets its input :-) Assume t1 returns " now". t2 trims it to "now". t3 converts it to "NOW". So the loop prints trimmed uppercase words. If you decided you needed to enhance this to produce German, you could create a new IntX decorator that sipped from t3 and performed the translation. A nice thing about this is that you can reorder the IntX's (sometimes it matters), add new ones, or remove some, without any negative impact on your code. Note that the loop accepts the IntX interface - it has absolutely no clue what's going on behind the scenes in "q". That's a good thing.

So, thumbs up for the decorator pattern.

Later I'll talk about another nice pattern, the Visitor. This decouples the navigation in the NFA from some action to do on those states.

TAS Must Die, Chapter 10

I started cleaning up some cruft I left behind in the code. Details, nothing worth mentioning.

Now I use the lexers I generate in two different places. One use is by P1, the parser that produces lexers as its output. P1 interprets common regular expressions and thus uses a lot of operators:

variable: [a-z]([a-z]|[0-9])*

for example. [, -, ], |, (, ), and * are all operators of some sort. Now sometimes you'll want to interpret one of these characters as a common boring character, not as an operator. In this case, I escape it with a backslash:

copyright: \(c\)

for example.

The other use of the lexer is just to read a bunch of iscript. I don't need a lot of escape-type processing. But it would be nice if the lexer ignored whitespace, devoured extraneous new lines, and handled windows newlines.

I've tried a few times to create a class that did both tasks well. It was a failure. They are pretty different. So I tried creating a base class that did very little except implement a pushback stack so I could unread tokens. It's helpful sometimes. I then subclassed two different TokenReaders, one for the parser and its crazy operators, and one for more mundane iscript processing.

My result, while it works, is not satisfying. It's clunky and brittle.

So combining it all was terrible, and subclassing was unsatisfying. Within the next few nights I'll implement a Decorator design pattern. The gist of this is that you have two classes that implement some interface. You wrap one object inside another. You query the latter in your application. It invokes the wrapped object as necessary, and applies some type of transformation or filter.

Monday, April 9, 2007

TAS Must Die, Chapter 9

All the bugs that I know of are gone now. I haven't taken the 'bootstrap' plunge yet. That being said, it's been a few days since I had to rebuild a base lexer using the L1Gen lexer program.

I'm lexing a large part of the TAS Iscript input files now. Now I'm at the crossroads of

1. Writing a parser for the tokens so I can convert iscript into something else. That's ultimately what this project is about. I previously hand-coded "P1", a recursive-descent parser that produces DFAs from the lexical rules. It was surprisingly easy. I'll probably do the same for the iscript parser. Easy Peasy.

2. Writing a parser generator. Then I could describe iscript's structure using BNF or something and have the parser be generated. This is another favorite topic for me. Recent experiments of turning BNF into a Greibach Normal Form have been very successful.

3. Finding a parser generator for Java, similar to yacc. Where's the fun in that? If I was doing this project as part of my job, I'd be all over it. But I'm not so I'm not.

4. Cleaning up what I have, because it will probably bite me anyway. For example, I still don't handle newlines and other control characters well. The subtlety of this surprises me. If I started all over, I'd probably start with control (and escaped) characters first.

5. Stopping. I've accomplished my first goal and learned a lot.

I'll probably bounce between #1 and #4, addressing weaknesses of the lexer generator as needed. And after a few visual observations, it's no good to have some manner of p-code being produced without doing anything with it, so I'll probably start building a servlet to interpret said p-code.

This has been quite a learning experience, I need time to breathe now.

Friday, April 6, 2007

TAS Must Die, Chapter 8

My 'go for the throat' coding approach, where I implement the big ideas at the expense of the details, is starting to catch up with me. For example, I currently don't have a good way to specify control characters (such as newline) in my rules. Using the common "\n" yields an Nfa that recognizes an 'n'. The backslash works great for converting lexer tokens

"(" => OPEN_PAREN, "("
"\(" => CHAR", "("

but doesn't work so well for n and r.

Other issues involve the lexer and the file it processes. It doesn't handle EOF very gracefully when the last line of the input file isn't terminated in a newline. I suppose it does exactly what it should, from a logical point of view, but that diverges from what I want it to do, which is to insert a NEWLINE at the end of that line, and then produce the EOF. So I have to wonder if I should rig the lexer to produce BOF (BOL stuff EOL)* EOF automatically since that's what I would really like to see. So I'm thinking I could feed the lexer a raw input stream and a cooked input stream, with the latter inserting the file markers when they are absent. BOL and EOL as explicit tokens give me a nice way of matching strictly-formatted files. The TAS Iscript stuff I intend to eventually parse includes commands that always start at the beginning of the line, at least at our installation. So BOL would give me a nice way to match this.

I am gaining an appreciation for the realities of file I/O, control characters, when to ignore whitespace, etc.

It's getting to the point that I am considering adding another lexer that will be tasked with removing whitespace and inserting file markers. It would be enabled by configuration. When enabled it would act as a filter on the real token stream.
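If I go that route, the filter would probably look something like this. A sketch only: the Token type constants, the Token(int) constructor, and the Lexer methods are assumptions, not real code.

// Wraps the real lexer, devours whitespace, and inserts the NEWLINE that an
// unterminated final line never had before letting the EOF through.
public class MarkerInsertingLexer {

    private final Lexer real;
    private Token pending = null;              // a held-back token we still owe
    private int lastType = Token.NEWLINE;      // start-of-file counts as a fresh line

    public MarkerInsertingLexer(Lexer real) {
        this.real = real;
    }

    public Token next() {
        if (pending != null) {
            Token held = pending;
            pending = null;
            return held;
        }
        Token t = real.next();
        while (t.getType() == Token.WHITESPACE) {
            t = real.next();                   // devour whitespace
        }
        if (t.getType() == Token.EOF && lastType != Token.NEWLINE) {
            pending = t;                       // hold the EOF for the next call
            lastType = Token.NEWLINE;
            return new Token(Token.NEWLINE);   // the newline the file forgot
        }
        lastType = t.getType();
        return t;
    }
}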

I'm also flirting with the idea of adding the "not" operator which inverts FAs. In my tiny world, A -> not x -> B produces a FA where A has 255 transitions to B - every possibility except 'x'. Implementing this in the general case (where 'x' in the example could be a FA), however, is not straightforward and maybe not useful. More to come on this.

Finally, I have implemented the multiple-lexer idea. I am not sure it is such a great idea though I don't have a better alternative at this point. I did implement a switch in the lexer generator code to treat characters in a case-insensitive way. This means that instead of adding a 'x' to a FA, I add ('x'|'X'). Remember, if brute force doesn't work, you aren't using enough of it!

Thursday, April 5, 2007

TAS Must Die, Chapter 7b

Reviewing the last entry, the behavior of the two parallel structures on the right could be obtuse. Basically, each state owns a section of these arrays as delimited by begEdge and endEdge. Each row within that range contains a character and the state to go to if we see that character.

That's nice and compact. However, it does constitute a linear scan, which could be relatively slow. It's probably immeasurable in any real case. Be that as it may, if the lexer needed to be faster we could replace the parallel edge and dest arrays, as well as the begEdge and endEdge arrays, with a gawd-awful sparse matrix. It would take up more space and probably not be very i18n-able. However, we'd be able to generate tokens about as fast as it could be done.

The table would be a nx256 int array where n is the number of states in the lexer. The 256 ints for a state would represent the transitions possible. The ASCII value of the input character would be an index into this array. The result would be the next state.

nextState = 0;   // start state
while (nextState != -1)
{
    input = getInput();
    state = nextState;
    nextState = sparse[state][input];
}

// 'state' is where we finished, and 'input' would
// be the unaccepted character that we'd re-ingest
// next time.



Not a lot of code there...

Another option is to go the other way, and say all this array stuff is rubbish. The lexer would include normal classes which we'd instantiate the first time the lexer was used. We could replace the four parallel arrays with a simple class and the two parallel edge arrays with a Map. Instead of having state IDs or Token IDs, etc., we'd use references to instances of our classes.
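A sketch of that object-flavored alternative, with invented names - each state owns its own transitions, keyed by input character, and nothing is an index into anything:

import java.util.HashMap;
import java.util.Map;

public class LexerState {

    private final Map transitions = new HashMap();   // Character -> LexerState
    private final Token tokenIfStop;                  // null if this isn't a stop state

    public LexerState(Token tokenIfStop) {
        this.tokenIfStop = tokenIfStop;
    }

    public void addTransition(char c, LexerState dest) {
        transitions.put(new Character(c), dest);
    }

    // Returns the next state, or null if this character isn't accepted here.
    public LexerState advance(char c) {
        return (LexerState) transitions.get(new Character(c));
    }

    public Token getTokenIfStop() {
        return tokenIfStop;
    }
}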

Wednesday, April 4, 2007

TAS Must Die, Chapter 7

While I'm procrastinating on making the final changes to the lexer, I thought I'd post the lexer's architecture. I wrote it in a frenzy the other day and I almost forgot how it works.


To the left is a variable named 'currentLexer'. This points to the start state of the currently active lexer. When we call the lexer for a token, 'currentLexer' tells us the state to begin with.

This refers directly to the four parallel arrays in the center of the diagram. Any row across them represents a single state in the lexer. 'LexerJump' will, upon successfully lexing a token, change the current lexer. This facility allows lexer designers to create additional lexers to help resolve lexical conflicts. If while scanning the input we can't make any progress, TokenString will tell us if we're in a stop state and which token should be returned. BegEdge and EndEdge identify the rows in the 'edge' tables that are associated with the current state. If the current state transitioned on a, b, and c, there would be 3 rows in the 'edge' tables. BegEdge would point to the first row and EndEdge would point to the last row.

To the right are two more parallel arrays. A single row represents an input character and the state we advance to if we see that character. EdgeInputs contains a character. The paired EdgeDests row tells us the state we advance to upon scanning that character.

Finally, the Tokens array at the bottom holds all the 'class'es that can be returned by the lexer. It is used by the TokenString array.

I've referred a few times to 'multiple lexers' and such. Excepting 'currentLexer' and 'LexerJump', the architecture really doesn't know about lexers at all. There are states, their inputs, and the tokens generated. So stacking several DFAs/lexers into these arrays works fine. The only thing we have to do is add a mechanism that lets us point to the first state of any DFA/lexer, and that's what 'currentLexer' and 'LexerJump' do. Previously, instead of a 'currentLexer' variable, I just started at state 0 and there was no concept of more than one lexer.
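As a summary of the above, the generated tables boil down to declarations shaped roughly like this (shapes only; the generated code has the real initializers, and the exact types may differ):

public class GeneratedLexerTables {

    int currentLexer;       // start state of the currently active lexer

    // one row across these four arrays per state
    int[] lexerJump;        // lexer to switch to after this state accepts, or -1
    int[] tokenString;      // index into tokens[] if this is a stop state, or -1
    int[] begEdge;          // first row in the edge tables for this state
    int[] endEdge;          // last row in the edge tables for this state

    // one row across these two arrays per transition
    char[] edgeInputs;      // the character that triggers the transition
    int[] edgeDests;        // the state we advance to on that character

    String[] tokens;        // the token 'class'es the lexer can return
}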

Tuesday, April 3, 2007

TAS Must Die, Chapter 6

I started working up the lexical definitions for TAS. To keep it simple, I started with the "comment" command:

<!-- #REM Globalization Labels -->

It looks a little like HTML since it is meant to be embedded in an HTML page. TAS is an incestuous mix of HTML and magic commands. TAS looks for "<!-- #" followed by a command - REM in this case. After the REM comes the free-form comment text. While you'd think lexing this would be the easiest thing evah, I found out that the lexical rule to recognize the comment text (.*) conflicted with other lexical rules as per the previous chapter of this trail of anguish. I tried all manner of hokery to get the lexer to work but nothing was remotely satisfying.

A friend of mine suggested that I needed a 'mode' that would let me distinguish between various ambiguous rules. Since I am creating a purely table-driven lexer, putting in a traditional mode (which in Tony-speak is an IF statement) didn't appeal to me.

My proposed solution is to allow the configuration file to contain multiple related lexical rule sets. Here's a representative set of rules:

comment: .*
rem: #REM
number: [0-9]+

The immediate problem is that, again as per the previous chapter, the 'comment' and 'number' rules are ambiguous. Here's one solution:

# lex0
rem, 1: #REM
number: [0-9]+

#lex1
comment, 0: .*

Now, the differences are:
1) There are two sections of the configuration, lex0 and lex1.
2) there is a "1" behind the 'rem' token
3) there is a "0" behind the 'comment' token

lex0 and lex1 are two separate lexical definitions. By default, lex0 will be active. But when a rule is satisfied and there is an associated number, the active lexer is switched. So lex0 is processing and scans "#REM". In addition to returning the "rem" token, the active lexer is set to 1. Now we process the comment and produce the 'comment' token. Because of the associated 0, we change back to lex0. Now we have our mode but no explicit if statement.
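In code, the switch amounts to very little. A self-contained sketch of the idea (names invented), where the ", 1" or ", 0" annotation from the configuration rides along with the rule:

public class LexerSwitchSketch {

    static final int NO_SWITCH = -1;

    private int currentLexer = 0;          // lex0 is active by default

    // Called when a rule is satisfied. nextLexer comes from the ", 1" or ", 0"
    // annotation in the configuration, or NO_SWITCH when there isn't one.
    String accept(String tokenName, int nextLexer) {
        if (nextLexer != NO_SWITCH) {
            currentLexer = nextLexer;      // e.g. #REM flips us into lex1
        }
        return tokenName;
    }
}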

Here's another way. Let's say for whatever reason we want the comment as individual characters.

# lex0
rem, 1: #REM
number: [0-9]+

#lex1
comment : .
end , 0: ;

We see the #REM which causes LEX1 to become the default rule-set. Now as we scan the input one token at a time we match 'comment' repeatedly. This shows that we stay in lex1 until the scanner is told to switch to another lexer. When we scan a semicolon (which matches both rules, but that's a detail for now) we return the 'end' token and switch back to lex0.

Another nice thing about this method is that with its separate lexers it is easy to create non-conflicting rules for inputs that have very different sections (a row-oriented report terminating in a summary page, for example.)

Sunday, April 1, 2007

TAS Must Die, Chapter 5 (or, nothing is as easy as it looks...)

The technique I proposed previously for removing ambiguous transitions isn't wrong, per se, but it isn't complete, either. My parser was croaking on some test input. It was generating an endless number of new states. I finally figured out what it was.



In diagram 1, we see s0, which contains ambiguous references to two subgraphs which can be described as 'a+'. That is, from S0, we can go to S1 or S2 on an 'a'. We apply my l33t technique for removing such ambiguities to arrive at diagram 2. Now, we don't like epsilons, so we apply the 'remove epsilons' algorithm to arrive at diagram 3. Notice that the subgraph starting at state 3 is identical to the FA we started with in diagram 1. We'll never make any progress towards creating a DFA from this NFA.

In fact, there is little to do except give up. The rules are ambiguous and without end.

I amended my algorithm for removing ambiguous references to be, relating to the diagrams above:

"Once we create diagram 3, if two or more transitions point to the original states from S0, (S1 and S2 in this case) the rules are ambiguous and unending. Time to die."

This isn't an outlandish case, either. Define a few words like

CITY : [A-Z]+
HEXSTRING: ([0-9] | [A-F])+

and that's all it takes for the previous method to fail. Now, how to address this? There are a few ways. If you have enough control over your input, amend your rules so the ambiguity can't happen. For example, we could amend HEXSTRING to require a leading 'x'.

CITY : [A-Z]+
HEXSTRING: x([0-9] | [A-F])+

Looks familiar, eh?

Sometimes, we can't change the input.

STREETNAME: [A-Z]+
CITY: [A-Z]+

Those are always a sequence of letters. We can generalize, however, so the lexer can build a DFA, and leave applying meaning to the more powerful parser:

CITY_OR_STREETNAME: [A-Z]+

or even

CHAR_STRING: [A-Z]+

Well, that was today's adventure. Hopefully I'll be able to conclude my testing phase soon.