Refresh: Checklist

Topics: Current Progress
Sep 1, 2009 at 12:14 AM
Edited Jul 6, 2010 at 2:20 PM

This is just a checklist for what I'm working on.

Since I recently decided to rewrite the Build code, the list is pretty long.  The refresh of the build side of the project is due to the scattered nature of the code; a lot has changed as I've gone along.  Some areas calculated feature-specific data that wasn't quite sufficient, so it was transformed further downstream; over time this creates a lot of junk code (and increases processing overhead).  It was an iterative process I used to compensate for not fully knowing the requirements of every aspect of the project up front.

Alter Character Range part of parser to include the simple character classes (checked are finished):

UppercaseLetter         -- :Lu: (letter, upper)
LowercaseLetter         -- :Ll: (letter, lower)
TitlecaseLetter         -- :Lt: (letter, titlecase)
ModifierLetter          -- :Lm: (letter, modifier)
OtherLetter             -- :Lo: (letter, other)
NonSpacingMark          -- :Mn: (mark, nonspacing)
SpacingCombiningMark    -- :Mc: (mark, spacing combining)
EnclosingMark           -- :Me: (mark, enclosing)
DecimalDigitNumber      -- :Nd: (number, decimal digit)
LetterNumber            -- :Nl: (number, letter)
OtherNumber             -- :No: (number, other)
SpaceSeparator          -- :Zs: (separator, space)
LineSeparator           -- :Zl: (separator, line)
ParagraphSeparator      -- :Zp: (separator, paragraph)
Control                 -- :Cc: (other, control)
Format                  -- :Cf: (other, format)
Surrogate               -- :Cs: (other, surrogate)
PrivateUse              -- :Co: (other, private use)
OtherNotAssigned        -- :Cn: (other, not assigned)
ConnectorPunctuation    -- :Pc: (punctuation, connector)
DashPunctuation         -- :Pd: (punctuation, dash)
OpenPunctuation         -- :Ps: (punctuation, open/start)
ClosePunctuation        -- :Pe: (punctuation, close/end)
InitialQuotePunctuation -- :Pi: (punctuation, initial quote)
FinalQuotePunctuation   -- :Pf: (punctuation, final quote)
OtherPunctuation        -- :Po: (punctuation, other)
MathSymbol              -- :Sm: (symbol, math)
CurrencySymbol          -- :Sc: (symbol, currency)
ModifierSymbol          -- :Sk: (symbol, modifier)
OtherSymbol             -- :So: (symbol, other)
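The two-letter codes above are the standard Unicode general-category abbreviations (exposed in .NET through System.Globalization.UnicodeCategory).  As a quick sanity check, here is the same mapping spot-checked in Python, whose unicodedata module reports identical codes:

```python
import unicodedata

# Spot-check that the two-letter codes in the checklist match what the
# Unicode character database reports for representative characters.
samples = {
    'A': 'Lu',   # UppercaseLetter
    'a': 'Ll',   # LowercaseLetter
    '5': 'Nd',   # DecimalDigitNumber
    '$': 'Sc',   # CurrencySymbol
    ' ': 'Zs',   # SpaceSeparator
    '_': 'Pc',   # ConnectorPunctuation
}
for ch, code in samples.items():
    assert unicodedata.category(ch) == code, (ch, code)
```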

Token and production rule elements now support ranged repeats instead of just the Kleene operators.  Supported mechanics are:

Token :=
ReferencedItem {min, max} | // Specific Min/Max
"LiteralItem" {, max} | // Up to max
[:Lu:] {min, } | // Min or more
@'a':caseInsensitiveA; {Fixed} ; // Fixed-length match
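For intuition, these four repeat forms correspond to the bounded quantifiers of ordinary regular expressions.  A quick sketch using Python's re module (the dictionary keys are just labels for the forms above, not the tool's own syntax):

```python
import re

# Each repeat form above, modeled as a regex bounded quantifier over 'a'.
forms = {
    "{min, max}": r"a{2,4}",  # between 2 and 4 repeats
    "{, max}":    r"a{0,4}",  # up to 4 repeats (zero allowed)
    "{min, }":    r"a{2,}",   # 2 or more repeats
    "{Fixed}":    r"a{3}",    # exactly 3 repeats
}
assert re.fullmatch(forms["{min, max}"], "aaa")
assert not re.fullmatch(forms["{min, max}"], "a")
assert re.fullmatch(forms["{, max}"], "")
assert re.fullmatch(forms["{min, }"], "a" * 10)
assert re.fullmatch(forms["{Fixed}"], "aaa")
assert not re.fullmatch(forms["{Fixed}"], "aaaa")
```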

I'm also considering subtraction, though I haven't decided on the form yet: either range subtraction, or regex subtraction and complements.  Features like this are useful for matching identifiers under the rule that they can't be a keyword, which would yield a grammar like:

Identifier := 
Subtract(IdentifierOrKeyword, Keywords) |
'@' IdentifierOrKeyword;

UnicodeEscapeSequence :=
"\\u" HexChar{4};

IdentifierOrKeyword :=
IdentifierStartCharacter+ IdentifierPartCharacter*;

IdentifierStartCharacter :=
LetterCharacter | '_' | UnicodeEscapeSequence;

IdentifierPartCharacter :=
LetterCharacter |
CombiningCharacter |
DecimalDigitCharacter |
ConnectingCharacter |
FormattingCharacter;

LetterCharacter := // per the C# spec: Lu, Ll, Lt, Lm, Lo, Nl
[:Lu:] | [:Ll:] | [:Lt:] | [:Lm:] | [:Lo:] | [:Nl:];

CombiningCharacter := // Mn, Mc
[:Mn:] | [:Mc:];

DecimalDigitCharacter := // Nd
[:Nd:];

ConnectingCharacter := // Pc
[:Pc:];

FormattingCharacter := // Cf
[:Cf:];
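At the match level, the effect of Subtract(IdentifierOrKeyword, Keywords) from the grammar above can be illustrated with plain set logic.  This Python sketch is illustrative only (the keyword set and helper name are hypothetical, and the real tool does the subtraction at the state-machine level rather than post-match):

```python
import re

# Hypothetical sketch: an identifier matcher that refuses exact keyword
# matches, while '@'-prefixed names are always identifiers.
KEYWORDS = {"if", "else", "while", "yield"}  # illustrative keyword set
IDENT = re.compile(r"@?[A-Za-z_][A-Za-z_0-9]*")

def is_identifier(text: str) -> bool:
    if not IDENT.fullmatch(text):
        return False
    if text.startswith("@"):
        return True
    return text not in KEYWORDS

assert is_identifier("yiel")        # prefix of a keyword is fine
assert not is_identifier("yield")   # exact keyword is excluded
assert is_identifier("@yield")      # '@' escape re-admits it
```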

The build process will be clearly defined into the following stages (checked are finished):

  1. ☑ Linking
  2. ☑ Expanding Templates
  3. ☑ Literal Look-up
  4. ☑ In-lining Tokens
  5. ☑ Token NFA Construction
  6. ☑ Token DFA Construction
  7. ☑ Rule NFA Construction
  8. ☑ Rule DFA Construction
  9. ☐ Call Tree Analysis
  10. ☐ Object Model Construction
    1. ☐ Token Captures
    2. ☐ Token Enumerations
    3. ☐ Rule Structure
  11. ☑ Writing Files
  12. ☑ Compiling

The stream analysis phase will utilize an interpreter to analyze sample files written in the language, which helps create a default set of states appropriate to the language and thereby decreases parse times.  It also doubles as a rough way to debug your language.  This has an initialization overhead, but since the data will be unfolded such that no calculation takes place (just object initialization), the overhead shouldn't approach that of the parser.  It has the added benefit of unifying state-path instances that would otherwise simply be replicated: looking up one path among tens of thousands would take longer than creating a new, lightweight (~8 bytes) path instance, so the look-up cost isn't justifiable on the parser's side, but the compiler can take the time it needs to do a full analysis.

Stage four (in-lining tokens) is the process by which tokens that refer to other tokens are merged into a single token.  This eliminates the state-machine dependency and produces a proper state machine for the parent token.  It also makes it easier to detect unused tokens: once in-lining is done, any token that was only referred to by other tokens is erased from the language, since its state machine can never enter the contextual awareness of the scanner.  Cyclic token references will cause a compile error.
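A rough sketch of that pass, under an assumed data model where a token body is a list of literal parts and references to other tokens (illustrative only; this is not the project's actual representation):

```python
def inline_tokens(tokens):
    """tokens: name -> list of parts; a part is ('lit', text) or ('ref', name).
    Returns a copy with every reference expanded; cycles are an error."""
    def expand(name, active):
        if name in active:
            raise ValueError("cyclic token reference involving %r" % name)
        out = []
        for kind, value in tokens[name]:
            if kind == 'ref':
                out.extend(expand(value, active | {name}))
            else:
                out.append((kind, value))
        return out
    return {name: expand(name, frozenset()) for name in tokens}

def prune_token_only(tokens, inlined, used_by_rules):
    """Drop tokens that were only ever referenced by other tokens: once
    in-lining is done, their state machines can never be reached."""
    referenced = {v for parts in tokens.values() for k, v in parts if k == 'ref'}
    return {n: p for n, p in inlined.items()
            if n in used_by_rules or n not in referenced}

tokens = {
    'Digit':  [('lit', '[0-9]')],
    'Number': [('ref', 'Digit'), ('lit', '+')],
}
flat = inline_tokens(tokens)
assert flat['Number'] == [('lit', '[0-9]'), ('lit', '+')]
# 'Digit' was only used by 'Number', so it disappears from the language.
assert 'Digit' not in prune_token_only(tokens, flat, used_by_rules={'Number'})
```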

The new state system I'm utilizing will be a unified model.  Both tokens and productions will use the same state base type as well as the same set-analysis features.  I've already completed the FiniteAutomataBitSet`1 class and verified that it works with 64-bit (ulong) and 32-bit (uint) slots.  Similar set grouping will be used to organize the unique elements of the language into 32- or 64-bit sets, avoiding the clutter, and potential errors, of replicated code.
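A minimal sketch of the slot-based idea, in Python for brevity (the real FiniteAutomataBitSet`1 is a .NET generic over the slot type; this is an analogy, not its API):

```python
class SlotBitSet:
    """Fixed-size bit set stored in 64-bit slots, in the spirit of the
    FiniteAutomataBitSet`1 class described above (illustrative sketch)."""
    SLOT_BITS = 64

    def __init__(self, size):
        self.size = size
        self.slots = [0] * ((size + self.SLOT_BITS - 1) // self.SLOT_BITS)

    def add(self, index):
        self.slots[index // self.SLOT_BITS] |= 1 << (index % self.SLOT_BITS)

    def __contains__(self, index):
        slot = self.slots[index // self.SLOT_BITS]
        return bool((slot >> (index % self.SLOT_BITS)) & 1)

    def union(self, other):
        result = SlotBitSet(self.size)
        result.slots = [a | b for a, b in zip(self.slots, other.slots)]
        return result

states = SlotBitSet(100)
states.add(3)
states.add(70)          # lands in the second 64-bit slot
assert 3 in states and 70 in states and 4 not in states
```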

Depending on how you call the program, stage 12 may or may not occur.  Writing the files locally to the directory of the project files will be an option; they'll be placed in a sub-folder of the relative path.  A word of warning: on recompilation, that sub-folder will be purged.

Capture tokens will probably change in this version: rather than all of them being string captures, I'll try to construct a proper sub-capture handler.  That way, if a token defines sub-captures (such as the parts of a number: IntegerPart, FloatPart, Exponent, data-type signifiers), it will capture them properly.  Since there's no code to handle matching, the sub-captures become more important in post-parse analysis.

Sep 6, 2009 at 5:36 PM
Edited Sep 7, 2009 at 2:23 PM


I decided that on top of the parser's existing functionality, I'd add a little bit more.  I added the aforementioned subtract command (parse only thus far), and I've also been adding the code necessary to parse the grammar definition pre-processors more in-depth.  This likely means I've created a whole slew of bugs on top of it, but there's not much I can do about that right now.

Notable changes are the use of #if inside normal production rules, as well as within the file itself.  Previously these were reserved for templates and their expansion; because of this, a few functional/semantic changes have occurred:

#ifdef/#ifndef/#elifdef within a template refers to whether a rule is defined.

#ifdef/#ifndef/#elifdef within the file or a production rule refers to whether a compilation symbol is defined.  These allow for grouped checks, such as #ifdef A || B and #ifndef A && B; the logic works as you'd expect: the first passes if A or B is defined, the second passes if neither A nor B is defined.
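The grouped checks can be read as distributing the defined-ness (or not-defined-ness) test over the operator.  A small sketch of that semantics (illustrative only, not the tool's actual condition parser):

```python
def _split(expr):
    # A grouped check joins symbol names with a single '&&' or '||'.
    if '&&' in expr:
        return all, [s.strip() for s in expr.split('&&')]
    return any, [s.strip() for s in expr.split('||')]

def ifdef(expr, defined):
    combine, symbols = _split(expr)
    return combine(s in defined for s in symbols)

def ifndef(expr, defined):
    combine, symbols = _split(expr)
    return combine(s not in defined for s in symbols)

assert ifdef("A || B", {"A"})           # passes: A is defined
assert not ifdef("A || B", set())
assert ifndef("A && B", set())          # passes: neither A nor B defined
assert not ifndef("A && B", {"A"})
```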

Pre-processors beyond templates allow conditional arguments to have values assigned to them, unlike C#.  The only values allowed are strings.  Their purpose is pretty simple: if you're writing both a pre-processor and a normal parser, you might want to segment the code so the same files are used, but different compilation arguments yield different output.  An example is the C# sample (which I'm rewriting from the language specification).  The lexical grammar for C# doesn't concern itself with whether the file is properly formed; all it cares about is whether it can parse the file's tokens (and skipped sections, but that's another matter).  The C# parser, on the other hand, needs awareness of all the tokens; it might also do well to introduce a white-space token that's unbound to the grammar, so white-space doesn't need to be expressly stated at every opportunity, whereas the C# pre-processor does better to require explicit definition of white-space and where it can appear.

It's also useful for conditional versioning, where you introduce new functionality into your grammar and weave it in through conditional compilation arguments.  Another use is feature-based variation: if you wanted a full C# grammar, but also a second version of it without unsafe-context awareness, well, you get the idea.

So far the pre-processor expansion to the grammar works; it's something I've been meaning to add from the start, but I never got around to it.  There'll be bugs, but that's what bug reports are for.


Just fixed a bug that's been bothering me for over a year: the parser's line reporting has always been somewhat off, so I decided to take the time to investigate why.  It turns out there were two areas where it was incorrect: parsing whitespace and comments.

Visual Studio nicely allows me to change line ending format, so I simply adjusted it to use Cr, Lf, or CrLf until I found the cause of the problem.

It turned out to be due to an overly complex way of flushing the scanner's buffer.  The next time I have a nearly functional parser compiler and can extract the parse graph, I'm going to rewrite the full grammar definition parser in a heartbeat.

Sep 18, 2009 at 4:21 AM
Edited Sep 18, 2009 at 4:38 AM

Took a while longer than expected.

Constructed the generalized DFA system mentioned before, and I've found that expression subtraction is slow, very slow.

An example of the Identifier state tree from a simpler version of the identifier can be found here.

As you can see, it's very large, and that's just a simplified identifier.  The full thing would be ten or more times larger due to the ranges it covers.  In the example given, state numbers enclosed in '{' and '}' are edge states; states in '[' and ']' are not.


IdentifierChar := [A-Z_a-z];

Identifier := '@' IdentifierChar+ (IdentifierChar | [0-9])* |
    Subtract(IdentifierChar+ (IdentifierChar | [0-9])*, Keywords);

Above is an example of the identifier pattern used.  Subtract effectively ensures that any edge of the removed portion is not an edge in the resulting expression; thus, "yiel" would be an identifier while "yield" would not.
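That behavior matches the standard product construction for language difference, where a product state accepts if and only if it accepts in the primary machine and not in the subtracted one.  A sketch of that construction (not necessarily how the tool implements Subtract):

```python
def subtract(dfa_a, dfa_b, alphabet):
    """Language difference A - B by product construction.
    A DFA is (start, transitions {(state, ch): state}, accepting-set);
    missing transitions fall into an implicit dead state."""
    start_a, trans_a, acc_a = dfa_a
    start_b, trans_b, acc_b = dfa_b
    start = (start_a, start_b)
    trans, acc = {}, set()
    todo, seen = [start], {start}
    while todo:
        a, b = pair = todo.pop()
        if a in acc_a and b not in acc_b:   # accept in A, reject in B
            acc.add(pair)
        for ch in alphabet:
            nxt = (trans_a.get((a, ch), 'dead'), trans_b.get((b, ch), 'dead'))
            trans[(pair, ch)] = nxt
            if nxt not in seen:
                seen.add(nxt)
                todo.append(nxt)
    return start, trans, acc

def accepts(dfa, text):
    state, trans, acc = dfa
    for ch in text:
        state = trans.get((state, ch))
    return state in acc

# A: one or more 'a's; B: exactly "aa".  A - B drops only "aa".
a_plus = (0, {(0, 'a'): 1, (1, 'a'): 1}, {1})
just_aa = (0, {(0, 'a'): 1, (1, 'a'): 2}, {2})
diff = subtract(a_plus, just_aa, "a")
assert accepts(diff, "a") and accepts(diff, "aaa")
assert not accepts(diff, "aa") and not accepts(diff, "")
```

The exponential blow-up the post describes comes from this same idea: the result's state space is a product of the two machines' state spaces, which grows quickly over large character ranges.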


Feb 21, 2010 at 5:27 PM

Five months on, and I have to report that my job has prevented me from working much on this.

The good news is, after working on it a few days I was able to write a simple test that constructs a state machine that is Unicode-category aware.  Rather than merely selecting the character ranges applicable to a Unicode category, it obtains the category from the .NET implementation itself; this ensures that tons of junk code is not created.

To further this goal, it is also aware of partial categories: when a category is mostly covered except for some specified range, a negative assertion is placed on that range.
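The notion of a partial category can be sketched by splitting a category's code points into those inside and outside a candidate range.  Python's unicodedata reports the same general categories as .NET, so this analogy (BMP-only for speed; the function name is illustrative) shows why ASCII 0-9 is only a partial cover of :Nd::

```python
import unicodedata

def category_coverage(category, lo, hi):
    """Count how many code points of a Unicode general category fall
    inside [lo, hi] versus outside it (BMP-only scan for brevity)."""
    inside = outside = 0
    for cp in range(0x10000):
        if unicodedata.category(chr(cp)) == category:
            if lo <= cp <= hi:
                inside += 1
            else:
                outside += 1
    return inside, outside

# 'Nd' (decimal digits) is only partially covered by ASCII 0-9, so a
# matcher for just that range needs a negative assertion on the rest.
inside, outside = category_coverage('Nd', 0x30, 0x39)
assert inside == 10 and outside > 0
```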

Given a few test cases, I should be able to construct a proper capturing system from a state machine or two.  With what I learn there, I'll continue my efforts to automate the process.

Jul 7, 2010 at 4:56 AM
Edited Jul 7, 2010 at 4:57 AM
Progress is going well.  The past few months have been fairly stagnant due to personal reasons, but beyond that, I do have some marginal news to report.

Unicode support for tokens is here; it works, and works well.  Regular-language set subtraction also works, with one caveat: the subtracted sub-expression must be wholly contained by the primary expression.

Before, I was working on what I called a 'stream state', which effectively created a new state for each permutation of the language as the parser advanced through the file it was parsing.  An interesting concept, fully functional, but slow.  So I've changed my focus to something a little different: preprocessing, analyzing the call points of each rule and the associated look-ahead work that would be necessary.  It'll involve analyzing points of commonality: if one rule calls two others, which then each call a fourth, the look-ahead logic can be minimized so something parsed once is parsed only once.  Determination will involve unifying all of the state machines in the call graph and discerning which machines are active at any given point in time.  Simple idea; whether it works or not remains to be seen.
Jul 20, 2010 at 7:13 PM
Edited Jul 20, 2010 at 7:20 PM
Reworked a large portion of code related to the program's display.  Constructed proper syntax display algorithms.  Fixed the character Unicode escape sequence so that \uWXYZ parses as intended instead of as \uWXYY.  Reworked left recursion detection: now, no matter how deep a cyclic reference sits within another rule's description, if the rules prior to that reference are potentially empty, the cyclic reference is detected properly.

As an example:

A ::= B C A | D;

D ::= 'd';

B ::= F G H;

C ::= I J K;

F ::= 'F'?;

G ::= 'G'?;

H ::= 'H'?;

I ::= 'I'?;

J ::= 'J'?;

K ::= 'K'?;

All of the members leading up to the cyclic reference of A are potentially optional.
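The check above can be modeled in two steps: compute which rules are nullable, then walk the leftmost symbols of each alternative, continuing past a symbol only while everything before it can derive empty.  A sketch under an assumed grammar encoding (a symbol is a (name, optional) pair; names not defined in the grammar are terminals; here B and C are collapsed directly to their optional terminals):

```python
def compute_nullable(grammar):
    """A rule is nullable if some alternative consists entirely of
    optional symbols or nullable rules."""
    nullable, changed = set(), True
    while changed:
        changed = False
        for rule, alts in grammar.items():
            if rule not in nullable and any(
                    all(opt or name in nullable for name, opt in alt)
                    for alt in alts):
                nullable.add(rule)
                changed = True
    return nullable

def is_left_recursive(grammar, rule):
    nullable = compute_nullable(grammar)
    reachable, todo = set(), [rule]
    while todo:
        for alt in grammar.get(todo.pop(), []):
            for name, opt in alt:
                if name in grammar and name not in reachable:
                    reachable.add(name)
                    todo.append(name)
                if not (opt or name in nullable):
                    break  # this symbol must consume input; stop the walk
    return rule in reachable

# The example above: B and C reduce to optional terminals, so the
# reference to A in "A ::= B C A" is effectively leftmost.
g = {
    'A': [[('B', False), ('C', False), ('A', False)], [('D', False)]],
    'D': [[('d', False)]],
    'B': [[('f', True), ('g', True), ('h', True)]],
    'C': [[('i', True), ('j', True), ('k', True)]],
}
assert is_left_recursive(g, 'A')
assert not is_left_recursive(g, 'D')
```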