LRStar: LR(*) parser generator

An LRTEC software product

  Home Testimonials Comparison Papers Installation LRStar DFA Definitions Contact  

DFA 9.0 User's Manual

Section 1. Introduction


DFA generates compressed-matrix table-driven lexers, which are very fast and small. DFA lexers compile very fast.

Section 2. Command-Line Syntax.

DFA is a command-line (or console-mode) program. The command-line syntax is:

dfa <grammar> [<option>]...

dfa   is the program name.
<grammar>   is the input grammar, (e.g. 'calc' or 'calc.lgr'). Must have a filetype of '.lgr'.
[<option>]...   means zero or more program options.

An <option> must start with a '/' or '-'. '/' means yes. '-' means no (e.g. '-c' means don't list the conflicts).

Here are some examples:

dfa ansic
dfa ansic /v /c /d /st
dfa ansic /v /m -g -st

Here is a typical make.bat file for building a parser (on Windows systems):

..\..\bin\dfa calc /v=2 /d -st -g

Section 3. Program Options

This chapter contains a more in depth explanation of the DFA program options when creating a lexical analyzer. In this case, your input file name should have a filetype of 'lgr'. Here is the list of options.

DFA 1.0.000 Copyright 2018 LRTEC.
|   dfa <grammar> [/<option>...]
|   c           0    Conflict report (all conflicts)
|   g           0    Grammar listing for the lexical grammar
|   m           1    Minimize lexer-table size
|   st          0    State machine listing (for lexer)
|   sto         0    Optimized state machine listing (for lexer)
|   tab         3    Tab setting in lexical grammar (2,3,4)
|   v           1    Verbose mode for lexical grammar

'c' option: Conflict report for all conflicts.

The 'c' option tells DFA to report shift-reduce conflicts also. Usually only the reduce-reduce conflicts are reported in the "*.lgr.conflicts.txt" file because these are real errors. The shift-reduce conflicts are usually resolved by choosing the shift action. The default is 'c=0' (no shift-reduce conflict reporting).

'g' option: Create a grammar listing file.

The 'g' option tells DFA to generate a grammar file: ?.lgr.grammar.txt.

'm' option: Minimize lexer table size.

The 'm' option tells DFA to minimize the size of the lexer tables.

'st' option: State machine listing.

The 'st' option tells DFA to list the original unoptimized state machine in a file. This state machine is the one that you might want to look at when trying to figure out why your grammar has a conflict. This state machine listing can be very large. The optimized state machine is much smaller, but the state numbers in the optimized one do not correspond to the state numbers in the conflict report.

'sto' option: State machine optimized listing.

The 'sto' option tells DFA to list the optimized state machine.

'tab' option: Tab setting in lexical grammar.

The 'tab' option tells DFA what the tab setting is for the input lexical grammar. This is necessary because indentation is important and the wrong indentation can cause a syntax error in the lexical grammar. The default is 'tab=3'.

'v' option: Verbose mode.

The 'v' option tells DFA to display more information on the screen. The default is 'v=0'. 'v' can be 0, 1, or 2.

Section 4. Output Files from Lexer Generation.

DFA always creates five output files when analyzing a grammar. For example, if your lexical-grammar is calc.lgr, you will get these five output files:

calc.lgr.conflicts.txt   - A conflicts file, showing the conflicts in your grammar.
calc.lgr.grammar.txt   - A grammar file, showing your grammar and the rule numbers.
calc.lgr.log.txt   - A log file, showing a log of what happened.
calc.lgr.states.txt   - A state-machine file, showing the lexer state machine.
calc.lgr.warnings.txt   - A file with warning messages.

Section 5. Lexical-Grammar Symbols

Lexical-Grammar Symbols

Lexical-grammar symbols are the "words" of your grammar which define the tokens of your language. Here is a list of the different types of symbols allowed in a lexical grammar:

Alpha Symbol        letter, digit, any, neol, not_asterisk
Lexical Symbol      <identifier>, <number>, <eof>, <string>
Ignore Symbol       [whitespace], [ommentline], [commentblock]
Literal Symbol      'program', 'if', 'else', 'while', ' ', '->', '"'
Integer (0..255)    0, 127, 32, 255, 26, 10, 13
Escape Symbol       \n, \z, \t, \r, \b, \f

Note 1. ''' is a single quote and '\'' is not valid. This makes for better readability. A backslash is '\', a double quote is '"'.

Note 2. You create your own escape symbols. There are no predefined escape symbols. You may include escape symbols that are not used in your grammar and they will be ignored (no error message given).

Note 3. Ignore symbols, e.g. [whitespace] and [comment], are bypassed and not given to the parser.

Section 6. Lexical-Grammar Syntax

Grammar Operators and Punctuators

Grammar operator and punctuators are used to further define the syntax of your computer language. Here is a list of the different types of operators and punctuators allowed in a grammar:

Rule Notation

'=>'	means "outputs" or "returns" a token number or defined constant.
'->'	means "is a" (for a rule which defines a token).

'|'	means "alternate choice" (another character or range (e.g. 'a'..'z')).
'..'	means "up to and including" (a range of characters, ('A'..'Z')).
'-'	means "subtract" the next character or range from the set being defined.

EBNF Operators

'+'	means "one or more occurrences".
'*'	means "zero or more occurrences".
'?'	means "optional occurrence".

'('	means "group start".
')'	means "group end".

'|'	means "or" (only used for character sets, not for rules).


Comments are allowed in the same style as C++. Line comments start with '//' and end with the end-of- line character, for example:

// C++ Grammar.
// by John Smith.
// in August 2007.

Block comments start with '/*' and end with '*/'. These cannot be nested (one inside of another). Here is an example:

/* Removed 10-15-04 NKP
CreateStmt -> CREATE TableName ColStuffList ';'
-> CREATE TableName UniqueStuff ';' // Added 09-15-04 JKL

Section 7. Lexical-Grammar Sections

Lexical grammars (.lgr files) have 2 sections:

  • 1. List of tokens and their return codes.
  • 2. Definition of tokens and character sets.

NOTE 1: If you are using LRStar, it will create the list of tokens and in a ".lex" file. DFA reads the ".lex" file first and then reads the ".lgr" file that you create.

NOTE 2: If you are NOT using LRStar, you must create the list of tokens in a ".lex" file, OR put the list of tokens at the top of your ".lgr" file.

List of tokens created by LRStar

/* calc lexical symbols (generated by LRStar). */

   <identifier>            1
   <integer>               2
   <eof>                   3

   'program'              10
   'if'                   15
   'endif'                16
   'then'                 19
   'else'                 20

   '=='                    4
   '!='                    5
   '+'                     6
   '-'                     7
   '*'                     8
   '/'                     9
   '{'                    11
   '}'                    12
   '='                    13
   ';'                    14
   '('                    17
   ')'                    18

/* End of calc lexical symbols. */

Defined constants specified in the parser grammar

If using LRStar and you want to specify defined constants, you would put them in the ".grm" file and NOT in the ".lgr" file, as follow:

/* calc parser grammar. */

   T_IDENTIFIER   <identifier>     =>     lookup();
   T_INTEGER      <integer>        =>     lookup();
   T_EOF          <eof>            ;

   K_PROGRAM      'program';
   K_IF           'if';                 
   K_ENDIF        'endif';
   K_THEN         'then';
   K_ELSE         'else';

   T_EQ           '==';
   T_NE           '!=';
   T_PLUS         '+';
   T_MINUS        '-';
   T_MUL          '*';
   T_DIV          '/';
   T_LCB          '{';
   T_RCB          '}';
   T_ASSIGN       '=';
   T_SEMICOLON    ';';
   T_LPAR         '(';
   T_RPAR         ')';

Token Definitions (in the ".lgr" file)

Here is a sample ".lgr" file:

/* Calc Lexical Grammar. */

/* Token definitions */
   <identifier>         -> letter (letter|digit)*
   <integer>            -> digit+
   <eof>                -> \z

   [spaces]             -> (\t | \n | ' ')+
   [commentline]        -> '/' '/' neol*
   [commentblock]       -> '/' '*' na* '*'+ (nans na* '*'+)* '/'

   letter               -> 'a'..'z' | 'A'..'Z' | '_'
   digit                -> '0'..'9'

   any                  -> 0..255 - \z       // any character except EOF
   na                   -> any - '*'         // not asterisk
   nans                 -> any - '*' - '/'   // not asterisk not slash
   neol                 -> any - \n          // not end of line

   \t                   ->  9                // tab
   \n                   -> 10                // newline
   \f                   -> 12                // form feed
   \r                   -> 13                // return
   \z                   -> 26                // end of file
/* End of Grammar. */
  (c) Copyright LRTEC 2017.  All rights reserved.