Saturday, April 28, 2007

TAS Must Die, Chapter 19

I expanded the operator precedence parser (OPP) to handle explicit types of IDs the right way. Then I corrected all the fudging I did with the datatypes. Suddenly the whole thing really works better. Quite satisfying.

I missed another syntax element, namely that a variable being set can also be indexed:

<!-- #SET NAME = BOB [ expression ] VALUE = expression -->

With a little agony, I was able to use the same OPP for this. The first gotcha was that by the time the recursive descent parser (RDP) has scanned BOB and [, we've read too much to satisfy the syntactic needs of the OPP. Fortunately I had built a pushback into my TokenStream class. So upon scanning "BOB[" in the RDP, I push BOB and [ back onto the stream and then invoke the OPP which scans a reasonable expression and works fine. The second gotcha was that I was using the --> token to know when the expression was over. While this works for

VALUE = expression -->

it doesn't work for

NAME = BOB [ expression ] VALUE...

Lovely, eh? I modified the OPP to accept an 'end token'. Now I pass --> or VALUE to the OPP as appropriate for the situation. This produced the right result. Even so, I don't like it - it doesn't feel like a bulls-eye.
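To make the pushback trick concrete, here is a minimal sketch in Python. It is not my actual TokenStream class; the method names next and push_back and the plain-string tokens are assumptions made for illustration. The point is only that tokens pushed back in reverse order get replayed in their original order, so the OPP sees "BOB [" as if the RDP had never touched it.

```python
class TokenStream:
    """Hypothetical token stream with pushback (illustrative only)."""

    def __init__(self, tokens):
        self._tokens = iter(tokens)
        self._pushed = []  # stack of pushed-back tokens, last in, first out

    def next(self):
        # Pushed-back tokens are served before fresh input.
        if self._pushed:
            return self._pushed.pop()
        return next(self._tokens, None)  # None signals end of input

    def push_back(self, token):
        self._pushed.append(token)


stream = TokenStream(["BOB", "[", "a", "+", "b", "]", "VALUE"])

# The RDP has already scanned the name and the bracket...
name = stream.next()
bracket = stream.next()

# ...so it pushes them back in reverse order before invoking the OPP.
stream.push_back(bracket)
stream.push_back(name)

# The next reader sees the stream as if nothing had been consumed.
replayed = [stream.next() for _ in range(7)]
print(replayed)
```

Using a stack for the pushed-back tokens keeps the sketch short, at the cost of the caller having to remember to push in reverse order.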

Currently the OPP throws an exception if it scans a token which it can't find in its OP table. I believe I can simplify by using this event to inject a synthetic end-of-expression token and let the RDP handle any syntax error caused by the mystery token.

For example, the following input would cause an exception:

id1 + id2 XYZZY

That is, assuming XYZZY is not the end-of-expression token. What's key is that some parser notes the syntax error. This is the situation I ran into last night: the OPP rejected the token, but with the new use of the OPP the input WAS valid. That is why I changed the parser so I could specify XYZZY as the end-of-expression token.

With the new approach, upon reading a token that isn't in the OP tables (such as VALUE or -->), the OPP pushes that token back onto the stream and uses a totally fabricated 'end of expression' token instead. The expression is then reduced as per normal, and the mystery token is available to the RDP.
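Here is a rough sketch of that idea in Python. It is a simplified shunting-yard style parser, not my real OPP: the (kind, text) token tuples, the PRECEDENCE table, the ParseError class, and the postfix list standing in for pcode are all assumptions made for illustration. What it does show is the key move: an unknown token is pushed back and treated as the end of the expression instead of raising an exception.

```python
PRECEDENCE = {"+": 1, "*": 2}  # hypothetical OP table


class ParseError(Exception):
    pass


class TokenStream:
    """Hypothetical stream of (kind, text) tokens with pushback."""

    def __init__(self, tokens):
        self._stack = list(reversed(tokens))

    def next(self):
        return self._stack.pop() if self._stack else ("EOF", "")

    def push_back(self, token):
        self._stack.append(token)


def parse_expression(stream):
    """Return the expression in postfix order, stopping at any unknown token."""
    output, ops = [], []
    expect_operand = True
    while True:
        token = stream.next()
        kind, text = token
        if kind in ("ID", "NUM"):
            if not expect_operand:
                raise ParseError(f"operand {text!r} not allowed here")
            output.append(text)
            expect_operand = False
        elif kind == "OP" and text in PRECEDENCE:
            if expect_operand:
                raise ParseError(f"operator {text!r} not allowed here")
            while ops and ops[-1] != "[" and PRECEDENCE[ops[-1]] >= PRECEDENCE[text]:
                output.append(ops.pop())
            ops.append(text)
            expect_operand = True
        elif kind == "OP" and text == "[":
            if expect_operand:
                raise ParseError("'[' must follow an operand")
            ops.append("[")
            expect_operand = True
        elif kind == "OP" and text == "]":
            if expect_operand:
                raise ParseError("']' not allowed here")
            while ops and ops[-1] != "[":
                output.append(ops.pop())
            if not ops:
                raise ParseError("unmatched ']'")
            ops.pop()  # discard the matching '['
            output.append("INDEX")
            expect_operand = False
        else:
            # Not in the OP table: push it back for the RDP and act as if
            # a synthetic end-of-expression token had been read.
            stream.push_back(token)
            break
    if expect_operand:
        raise ParseError("expression ended early")
    while ops:
        op = ops.pop()
        if op == "[":
            raise ParseError("unmatched '['")
        output.append(op)
    return output


stream = TokenStream([
    ("ID", "BOB"), ("OP", "["), ("ID", "a"), ("OP", "+"), ("ID", "b"),
    ("OP", "]"), ("KW", "VALUE"), ("EQ", "="), ("NUM", "10"), ("END", "-->"),
])
target = parse_expression(stream)  # stops at VALUE, which is pushed back
keyword = stream.next()            # the RDP resumes with VALUE
equals = stream.next()
value = parse_expression(stream)   # stops at -->, which is pushed back
closer = stream.next()
print(target, keyword, value, closer)
```

Note that the caller no longer passes an end token at all. Whatever the OPP doesn't recognize simply lands back on the stream, and the RDP decides whether it is legal there.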

Here are some possibilities:

<!-- #SET NAME = BOB [ a+b ] VALUE = 10 -->

While parsing BOB[a+b]..., the OPP scans VALUE, inserts the end-of-expression token, produces pcode for BOB[a+b], and pushes VALUE back onto the stream. Now the RDP resumes with VALUE. We're good. Then the OPP sees the 10, gets confused by -->, pushes it back, and injects the end-of-expression token. 10 is valid. The --> is left to the RDP, which likes it.

<!-- #SET NAME = BOB [ a+b ] VAXUE = 10 --> (note the typo in VALUE)

After the OPP scans the ], it sees VAXUE, which the lexer has probably misinterpreted as a variable. The OPP checks its tables, finds it (!) and throws an error, since a VARIABLE isn't allowed after a ]. Good.

<!-- #SET NAME = BOB [ a+b ] WRONGTOKEN = 10 -->

Someone typed a known token accidentally. The OPP sees it, doesn't have a table entry, injects the end-of-expression token, and pushes WRONGTOKEN back onto the stream. BOB [ a+b ] is recognized normally since it is valid. WRONGTOKEN is then scanned by the RDP, which throws an error since it expects VALUE. So we're still good.

Stay tuned. This will let me unravel some of the unsatisfying crud I did last night.
