LRSTAR: LR(*) parser generator


  Home Downloads Feedback Comparison History Theory Papers Doc. Contact  
  Installation LRSTAR DFA Definitions  

DFA 9.0 User's Manual

Section 1. Introduction


DFA generates compressed-matrix table-driven lexers, which are very fast and small. DFA lexers compile very fast.

Section 2. Command-Line Syntax.

DFA is a command-line (or console-mode) program. The command-line syntax is:

dfa <grammar> [<option>]...

dfa   is the program name.
<grammar>   is the input grammar, (e.g. 'calc' or 'calc.lgr'). Must have a filetype of '.lgr'.
[<option>]...   means zero or more program options.

An <option> must start with a '/' or '-'. '/' means yes. '-' means no (e.g. '-c' means don't list the conflicts).

Here are some examples:

dfa ansic
dfa ansic /v /c /d /st
dfa ansic /v /m -g -st

Here is a typical make.bat file for building a parser (on Windows systems):

..\..\bin\dfa calc /v=2 /d -st -g

Section 3. Program Options

This chapter contains a more in depth explanation of the DFA program options when creating a lexical analyzer. In this case, your input file name should have a filetype of 'lgr'. Here is the list of options.

DFA 1.0.000 Copyright 2018 Paul B Mann.
|   dfa <grammar> [/<option>...]
|   c           0    Conflict report (all conflicts)
|   d           0    Debug, print input file on screen
|   g           0    Grammar listing for the lexical grammar
|   m           0    Minimize lexer-table size
|   st          0    State machine listing (for lexer)
|   sto         0    Optimized state machine listing (for lexer)
|   tab         3    Tab setting in lexical grammar (2,3,4)
|   v           1    Verbose mode for lexical grammar
|   w           1    Warning messages on screen

'c' option: Conflict report for all conflicts.

The 'c' option tells DFA to report shift-reduce conflicts also. Usually only the reduce-reduce conflicts are reported in the "*.lgr.conflicts.txt" file because these are real errors. The shift-reduce conflicts are always resolved by choosing the shift action. The default is 'c=0' (no shift-reduce conflict reporting).

'd' option: Debug option.

The 'g' option tells DFA to put a defined constant in the lexer, which will make the lexer print the input file on the screen.

'g' option: Create a grammar listing file.

The 'g' option tells DFA to generate a grammar file: ?.lgr.grammar.txt.

'm' option: Minimize lexer table size.

The 'm' option tells DFA to minimize the size of the lexer tables.

'st' option: State machine listing.

The 'st' option tells DFA to list the original unoptimized state machine in a file. This state machine is the one that you might want to look at when trying to figure out why your grammar has a conflict. This state machine listing can be very large. The optimized state machine is much smaller, but the state numbers in the optimized one do not correspond to the state numbers in the conflict report.

'sto' option: State machine optimized listing.

The 'sto' option tells DFA to list the optimized state machine.

'tab' option: Tab setting in lexical grammar.

The 'tab' option tells DFA what the tab setting is for the input lexical grammar. This is necessary because indentation is important and the wrong indentation can cause a syntax error in the lexical grammar. The default is 'tab=3'.

'v' option: Verbose mode.

The 'v' option tells DFA to display more information on the screen. The default is 'v=0'. 'v' can be 0, 1, or 2.

'w' option: Warnings messages option.

The 'w' option tells DFA to display warning messages on screen. The warning messages are always listed in the 'warnings.txt' file.

Section 4. Output Files from Lexer Generation.

DFA always creates five output files when analyzing a grammar. For example, if your lexical-grammar is calc.lgr, you will get these five output files:

calc.lgr.conflicts.txt   - A conflicts file, showing the conflicts in your grammar.
calc.lgr.grammar.txt   - A grammar file, showing your grammar and the rule numbers.
calc.lgr.log.txt   - A log file, showing a log of what happened.
calc.lgr.states.txt   - A state-machine file, showing the lexer state machine.
calc.lgr.warnings.txt   - A file with warning messages.

Section 5. Lexical-Grammar Symbols

Lexical-Grammar Symbols

Lexical-grammar symbols are the "words" of your grammar which define the tokens of your language. Here is a list of the different types of symbols allowed in a lexical grammar:

Alpha Symbol        letter, digit, any, neol, not_asterisk
Lexical Symbol      <identifier>, <number>, <eof>, <string>
Ignore Symbol       [whitespace], [ommentline], [commentblock]
Literal Symbol      'program', 'if', 'else', 'while', ' ', '->', '"'
Integer (0..255)    0, 127, 32, 255, 26, 10, 13
Escape Symbol       \n, \z, \t, \r, \b, \f

Note 1. ''' is a single quote and '\'' is not valid. This makes for better readability. A backslash is '\', a double quote is '"'.

Note 2. You create your own escape symbols. There are no predefined escape symbols. You may include escape symbols that are not used in your grammar and they will be ignored (no error message given).

Note 3. Ignore symbols, e.g. [whitespace] and [comment], are bypassed and not given to the parser.

Section 6. Lexical-Grammar Syntax

Grammar Operators and Punctuators

Grammar operator and punctuators are used to further define the syntax of your computer language. Here is a list of the different types of operators and punctuators allowed in a grammar:

Rule Notation

'=>'	means "outputs" or "returns" a token number or defined constant.
'->'	means "is a" (for a rule which defines a token).

'|'	means "alternate choice" (another character or range (e.g. 'a'..'z')).
'..'	means "up to and including" (a range of characters, ('A'..'Z')).
'-'	means "subtract" the next character or range from the set being defined.

EBNF Operators

'+'	means "one or more occurrences".
'*'	means "zero or more occurrences".
'?'	means "optional occurrence".

'('	means "group start".
')'	means "group end".

'|'	means "or" (only used for character sets, not for rules).


Comments are allowed in the same style as C++. Line comments start with '//' and end with the end-of- line character, for example:

// C++ Grammar.
// by John Smith.
// in August 2007.

Block comments start with '/*' and end with '*/'. These cannot be nested (one inside of another). Here is an example:

/* Removed 10-15-04 NKP
CreateStmt -> CREATE TableName ColStuffList ';'
-> CREATE TableName UniqueStuff ';' // Added 09-15-04 JKL

Section 7. Lexical-Grammar Sections

Lexical grammars (.lgr files) have 2 sections:

  • 1. List of tokens and their return codes.
  • 2. Definition of tokens and character sets.

NOTE 1: If you are using DFA only, to create a lexer, you should put defined constants in the ".lgr" file. See the "calc_lexer" project which does this. The whole .lgr file would look like this:

/* Lexical Grammar */
   <eof>             EOFCODE        
   <identifier>      IDENTIFIER
   <integer>         INTEGER

   'else'            ELSE
   'endif'           ENDIF
   'if'              IF
   'program'         PROGRAM
   'then'            THEN

   '!='              NOTEQ
   '('               LP
   ')'               RP
   '*'               MUL
   '+'               PLUS
   '-'               MINUS
   '/'               DIV
   ';'               SEMI
   '='               EQ
   '=='              EQTO
   '{'               LB
   '}'               RB

   <identifier>         -> letter (letter|digit)*
   <integer>            -> digit+
   <eof>                -> \z

   [spaces]             -> (\t | \n | ' ')+
   [commentline]        -> '/' '/' neol*
   [commentblock]       -> '/' '*' na* '*'+ (nans na* '*'+)* '/'

   letter               -> 'a'..'z' | 'A'..'Z' | '_'
   digit                -> '0'..'9'

   any                  -> 0..255 - \z       // any character except EOF
   na                   -> any - '*'         // not asterisk
   nans                 -> any - '*' - '/'   // not asterisk not slash
   neol                 -> any - \n          // not end of line

   \t                   ->  9                // tab
   \n                   -> 10                // newline
   \f                   -> 12                // form feed
   \r                   -> 13                // return
   \z                   -> 26                // end of file
/* End of Grammar. */

NOTE 2: If you are using LRSTAR, it will create the list of tokens in a ".lex" file. Here is a list of tokens created by LRSTAR:

/* calc lexical symbols (generated by LRSTAR). */

   <identifier>            1
   <integer>               2
   <eof>                   3

   'program'              10
   'if'                   15
   'endif'                16
   'then'                 19
   'else'                 20

   '=='                    4
   '!='                    5
   '+'                     6
   '-'                     7
   '*'                     8
   '/'                     9
   '{'                    11
   '}'                    12
   '='                    13
   ';'                    14
   '('                    17
   ')'                    18

And here is the ".lgr" file that you create:

   <identifier>         -> letter (letter|digit)*
   <integer>            -> digit+
   <eof>                -> \z

   [spaces]             -> (\t | \n | ' ')+
   [commentline]        -> '/' '/' neol*
   [commentblock]       -> '/' '*' na* '*'+ (nans na* '*'+)* '/'

   letter               -> 'a'..'z' | 'A'..'Z' | '_'
   digit                -> '0'..'9'

   any                  -> 0..255 - \z       // any character except EOF
   na                   -> any - '*'         // not asterisk
   nans                 -> any - '*' - '/'   // not asterisk not slash
   neol                 -> any - \n          // not end of line

   \t                   ->  9                // tab
   \n                   -> 10                // newline
   \f                   -> 12                // form feed
   \r                   -> 13                // return
   \z                   -> 26                // end of file

If using LRSTAR and you want to specify defined constants for use by the parser, you should put them in the ".grm" file as follow:

/* calc parser grammar. */

   IDENTIFIER   <identifier>     =>     lookup();
   INTEGER      <integer>        =>     lookup();
   EOF          <eof>            ;

   PROGRAM      'program';
   IF           'if';                 
   ENDIF        'endif';
   THEN         'then';
   ELSE         'else';

   EQ           '==';
   NE           '!=';
   PLUS         '+';
   MINUS        '-';
   MUL          '*';
   DIV          '/';
   LCB          '{';
   RCB          '}';
   ASSIGN       '=';
   SEMICOLON    ';';
   LPAR         '(';
   RPAR         ')';
  (c) Copyright Paul B Mann 2018.  All rights reserved.