Project Description
OILexer is an LL parser generator for C#, aimed at simple language parsing for language enthusiasts. It does not use recursive descent or bottom-up parsing methods; instead, it employs a top-down deterministic model. OILexer is a portion of the Abstraction Project.

Update - June 8, 2011
The original concept of the objectified intermediate language has been restructured into a new form. It's now a framework within the Abstraction Project.

Update - January 17, 2011.
Funny thing about starting these kinds of projects: when I started this one, my understanding of lexical analysis and parsing was rudimentary at best. In working out how such tasks can be automated, you learn that you really have to know your stuff to make it work for every case. You also have to understand performance, and pretty much everything else about the area, before it works right. So as a small favor, don't take the hand-written lexer or parser as an indicator of how I would write one today. It's old code, but it works for what I need it to do.


The OILexer project uses the Objectified Intermediate Language (OIL) code generation framework, a part of the Abstraction Project, to transform the production rules and token definitions in your grammar into a deterministic parser for your language.

Limitations

The parser was started before the author was aware of multi-state parsers. As a result, parsers that allow conditional compilation arguments will be more difficult, or impossible, to create in this system.

Output

During compilation, OILexer produces a series of C# source files. For my own debugging purposes, the program pauses before closing so that the source used to generate the associated DLL file can be viewed and verified (or copied, if you want). The output location is %TEMP%\Oilexer\, where %TEMP% is an environment variable that points to the temporary files folder of the active user. On Windows XP, you should be able to navigate directly to that location in Windows Explorer by typing it into the address bar, if one is present.
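As a convenience, the output location described above can also be resolved from a script. The following is a minimal sketch in Python, not part of OILexer. The helper names `oilexer_output_dir` and `generated_sources` are hypothetical, and the `.cs` extension and the possibility of sub-folders are assumptions about how the generated files are laid out.

```python
import os
import tempfile
from pathlib import Path


def oilexer_output_dir(environ=os.environ):
    """Return the folder this page describes: %TEMP%\\Oilexer\\.

    Falls back to the platform's default temporary folder when the
    TEMP variable is not set (for example, outside Windows).
    """
    base = environ.get("TEMP") or tempfile.gettempdir()
    return Path(base) / "Oilexer"


def generated_sources(directory):
    """List C# source files under the given folder.

    Assumption: generated files use the conventional .cs extension.
    Sub-folders are searched too, since the layout is not documented here.
    """
    if not directory.is_dir():
        return []
    return sorted(directory.rglob("*.cs"))


if __name__ == "__main__":
    out = oilexer_output_dir()
    print(out)
    for source in generated_sources(out):
        print("  ", source.name)
```

Running the script prints the resolved folder, followed by any generated source files found in it.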

Status

For change set 67739-61893:
The build phase is being rewritten, and work on the finite automata for lexical analysis is underway. Standard recognizer-level state machines are possible; however, the state machine construct needs to be improved so that a deterministic machine can capture sub-expressions efficiently. The deterministic machine for the parse phase has been temporarily removed because of the sub-expression issues noted in the lexical phase. Once a solution for the lexer phase is found, the parser phase will be adapted to a variation of that theme to include recursive machines. Given that a deterministic infinite look-ahead machine is possible (as shown by change set 41287 and supplemental hand-written code), the same should be possible for a capturing validator.
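To illustrate what "recognizer-level" means here, the following is a minimal sketch in Python of a table-driven deterministic machine. It is not OILexer's generated code. The pattern (an identifier), the transition table, and the names `char_class` and `recognizes` are all assumptions made for the example.

```python
# Hypothetical machine for the pattern [A-Za-z_][A-Za-z0-9_]*
# State 0 is the start state; state 1 is the only accepting state.
TRANSITIONS = {
    (0, "letter"): 1,
    (1, "letter"): 1,
    (1, "digit"): 1,
}
ACCEPTING = {1}


def char_class(ch):
    """Map a character to the input class used by the transition table."""
    if ch.isascii() and (ch.isalpha() or ch == "_"):
        return "letter"
    if ch.isascii() and ch.isdigit():
        return "digit"
    return "other"


def recognizes(text):
    """Return True when the whole text matches the pattern.

    The machine tracks only its current state, so the result is a plain
    yes or no. Nothing records where a sub-expression began or ended.
    """
    state = 0
    for ch in text:
        state = TRANSITIONS.get((state, char_class(ch)))
        if state is None:  # no transition: the input is rejected
            return False
    return state in ACCEPTING


if __name__ == "__main__":
    for sample in ("abc1", "1abc", ""):
        print(repr(sample), recognizes(sample))
```

A machine of this kind answers only whether the input matches. Capturing sub-expressions requires extra bookkeeping on top of the state transitions, which is the part described above as unsolved.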

Previous, unsubmitted tests have shown that building a concrete syntax tree (CST) of the grammar is possible. Once a viable solution for simple captures is found, it should be possible to extract the necessary data associated with a given parse and attach it to the appropriate generated CST elements.

For change set 41287:
So far, the project produces a library for a properly formed grammar file, with limitations. The current area of focus is taking the unbounded look-ahead state machine for the language and building a proper parse graph from it, which is not yet possible. The Scanner for the project has a properly structured NextToken method; however, the state machines are not initialized by default. This will change once a fully functional parser is generated.

A state machine is created for each individual lexical pattern, and a flat deterministic view of the productions in the language is encoded.
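One common way to drive a scanner from one state machine per lexical pattern is to run each machine at the current position and keep the longest match. The sketch below, in Python, shows that general technique only. It is not the generated Scanner; `PatternMachine`, `next_token`, `tokenize`, and the two sample patterns are hypothetical, and whether OILexer resolves competing matches this way is an assumption.

```python
def classify(ch):
    """Map a character to the input class used by the transition tables."""
    if ch.isascii() and (ch.isalpha() or ch == "_"):
        return "letter"
    if ch.isascii() and ch.isdigit():
        return "digit"
    return "other"


class PatternMachine:
    """A deterministic machine for one lexical pattern (start state is 0)."""

    def __init__(self, name, transitions, accepting):
        self.name = name
        self.transitions = transitions
        self.accepting = accepting

    def match_length(self, text, start):
        """Length of the longest prefix of text[start:] this machine accepts."""
        state, pos, best = 0, start, 0
        while pos < len(text):
            state = self.transitions.get((state, classify(text[pos])))
            if state is None:
                break
            pos += 1
            if state in self.accepting:
                best = pos - start
        return best


# Hypothetical patterns: Identifier = letter (letter | digit)*, Number = digit+
MACHINES = [
    PatternMachine("Identifier",
                   {(0, "letter"): 1, (1, "letter"): 1, (1, "digit"): 1}, {1}),
    PatternMachine("Number", {(0, "digit"): 1, (1, "digit"): 1}, {1}),
]


def next_token(text, pos):
    """Return (name, lexeme, new_pos), or None at the end of the input."""
    while pos < len(text) and text[pos].isspace():
        pos += 1  # skip whitespace between tokens
    if pos == len(text):
        return None
    # Longest match wins; on a tie, the machine listed first wins.
    best = max(MACHINES, key=lambda m: m.match_length(text, pos))
    length = best.match_length(text, pos)
    if length == 0:
        raise ValueError(f"no pattern matches at position {pos}")
    return best.name, text[pos:pos + length], pos + length


def tokenize(text):
    """Collect (name, lexeme) pairs by calling next_token until the end."""
    tokens, pos = [], 0
    while True:
        token = next_token(text, pos)
        if token is None:
            return tokens
        name, lexeme, pos = token
        tokens.append((name, lexeme))
```

For example, `tokenize("x1 42")` yields an Identifier token for `x1` followed by a Number token for `42`.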

Included in the source is a mostly functional reimplementation of the Code Document Object Model (CodeDOM). It was originally meant to supplement the original model; however, because of the original model's structural limitations, it later aimed to replace it. Because of the initial focus on being a supplement, the original model is woven in throughout. A large portion of the code that converts from the Objectified Intermediate Language (OIL) to CodeDOM is no longer maintained, so there are likely errors in that portion of the code.

The parser portion of the compiler is hand-written and likely contains errors in its implementation. For grammar samples, please see the sample grammar page.

Last edited Jun 9 2011 at 4:15 AM by AlexanderMorou, version 18