LRSTAR: LR(*) parser generator

A.M.D.G.

 
  Home Downloads Feedback Comparison History Theory Papers Doc. Contact  
  Installation LRSTAR DFA Definitions  
 

LRSTAR 9.0 User's Manual

Section 1. Introduction

Introduction.

LRSTAR is more advanced than YACC and Bison.  It offers a 1) cleaner grammar notation, 2) faster parsing speed, and 3) more powerful parsing operations (building a symbol table and an AST).  For simple parsing applications, you specify an action operator following by an action name, (=> goal_ ).  Here is an example:

Goal     -> Program... <eof>                   => goal_
	 			
Program  -> 'program' PgmName '{' Stmt... '}'  => program_(2)
	 			
PgmName  -> <identifier>		
	 			
Stmt     -> Target '=' Exp ';'                 => store_
         -> 'if' RelExp Then 'endif'           => if_
         -> 'if' RelExp Else 'endif'           => if_
         -> 'if' RelExp Then2 Else2 'endif'    => if_
	 			
Target   -> <identifier>                       => target_(1)
	 			
RelExp   -> Exp '==' Exp                       => eq_
         -> Exp '!=' Exp                       => ne_
	 			
Exp      -> Exp '+' Exp                        => add_
         -> Exp '-' Exp	                       => sub_
         -> <integer>                          => int_(1)
         -> <identifier>                       => ident_(1)
         -> '(' Exp ')'
	 			
Then     -> 'then' Stmt...                     => then_
Else     -> 'else' Stmt...                     => else_
Then2    -> 'then' Stmt...                     => then2_
Else2    -> 'else' Stmt...                     => else2_

The action operator (=>) means call this action at parsing time. Parse actions are called as rules are recognized in a bottom-up order.  This is good for simple parsing applications, however, a much more powerful action operator is available with LRSTAR.

The action operator (*>) means build an AST during parsing and after parsing, call this action while traversing the AST.  Consequently, AST actions are called in a top-down order.  This makes life much easier for more complex parsing applications, such as compiler front ends.  Here is the more powerful grammar:

Goal     -> Program... <eof>                   *> goal_
	 			
Program  -> 'program' PgmName '{' Stmt... '}'  *> program_(2)
	 			
PgmName  -> <identifier>		
	 			
Stmt     -> Target '=' Exp ';'                 *> store_
         -> 'if' RelExp Then 'endif'           *> if_
         -> 'if' RelExp Else 'endif'           *> if_
         -> 'if' RelExp Then2 Else2 'endif'    *> if_
	 			
Target   -> <identifier>                       *> target_(1)
	 			
RelExp   -> Exp '==' Exp                       *> eq_
         -> Exp '!=' Exp                       *> ne_
	 			
Exp      -> Exp '+' Exp                        *> add_
         -> Exp '-' Exp	                       *> sub_
         -> <integer>                          *> int_(1)
         -> <identifier>                       *> ident_(1)
         -> '(' Exp ')'		
	 			
Then     -> 'then' Stmt...                     *> then_
Else     -> 'else' Stmt...                     *> else_
Then2    -> 'then' Stmt...                     *> then2_
Else2    -> 'else' Stmt...                     *> else2_

Section 2. Command-Line Syntax.

LRSTAR is a command-line (or console-mode) program.  The command-line syntax is:

lrstar <grammar> [/<option>]...

lrstar   is the program name.
<grammar>   is the input grammar, (e.g. 'calc' or 'calc.grm'). Must have a filetype of ".grm".
[/<option>]...   means zero or more program options.

Here are some examples:

lrstar ansic
lrstar ansic /v=2 /o	-nac -exp
lrstar ansic /g /k=2 /st -ast -sym 

Here is a typical make.bat file for building a parser (on Windows systems):

..\..\bin\lrstar c /g /csr /crr /o /v=2 /d

Section 3. Program Options

If you type 'lrstar' without any arguments, you will get a display showing the program options:

LRSTAR 9.0.000 Copyright 2018 Paul B Mann
|
|   LR(*) PARSER GENERATOR
|
|   lrstar <grammar> [/<option>...]
|
|   OPTION  DEFAULT  DESCRIPTION
|   ast         1    Use AST operators in grammar
|   ca          0    Conflict analysis for each conflict
|   crr         1    Conflict report for RR conflicts
|   csr         1    Conflict report for SR conflicts
|   d           0    Debug option for generated parsers (1,2)
|   exp         1    Expecting list for syntax errors in generated parsers
|   g           1    Grammar output file (1,2)
|   k           1    k lookaheads for LR(k) parsing 
|   m           0    Minimum parser-table size
|   nac         1    Node actions activated in parser
|   o           0    Optimize parser speed 
|   st          0    State machine listing (1,2)
|   sym         1    Symbol table created by parser
|   tab         3    Tab setting in grammar (2,3,4)
|   v           1    Verbose mode (1,2)
|   w           1    Show warning messages on screen
|_

'ast' option: Build an AST.

The 'ast' option tells LRSTAR to generate parser-tables for building an abstract-syntax tree (AST). If not selected, you will get a syntax checker only..

'ca' option: Show a conflict analysis for each conflict.

The 'ca' option tells LRSTAR to show a tract back type of explanation for the conflicts. .

'crr' option: List reduce-reduce conflicts.

The 'crr' option tells LRSTAR to list conflicts for reduce-reduce conflicts. .

'csr' option: List shift-reduce conflicts.

The 'csr' option tells LRSTAR to list conflicts for shift-reduce conflicts. .

'd' option: Debug option.

Set to 1, 2, or 3, displays information that is useful for seeing what is going on during the parsing process. .

'exp' option: Output expecting code.

into the generated parser which will list the expected terminal symbols when a syntax error is encountered. .

'g' option: Output a grammar listing.

The 'g' option tells LRSTAR to output a file with a complete grammar listing in a formatted style. The output file is named ?.grm.grammar.txt.. 'g=1' creates a list with symbol numbers included. 'g=2' creates a grammar listing without the symbol numbers. This is useful for reformatting a grammar.

'k' option: Number of lookaheads for LR(k) parsing.

The 'k' option tells LRSTAR to set k = number of lookaheads for doing LR(k) parsing. The default is 'k=1', for which the generated parser does LR(1) parsing. For 'k=2' or greater, the generated parser does nondeterministic LR(k) parsing.

'm' option: Minimize parser-table size.

The 'm' option tells LRSTAR to try several different orderings of the values in the parser tables and then generate the smallest parser tables, based on the data obtained. It will try 2 different orderings for the B matrix, 4 for the T matrix, 4 for the N matrix, and 2 for the R matrix. Usually, you can expect a 15% decrease in size by using the 'm' option. The default is 'm=0'.

'nac' option: Node actions processing.

The 'nac' option is transfered to the parser header, which tells the parser to do a tree traversal and call the node actions, as defined in the grammar.

'o' option: Optimize parser speed.

The 'o' option tells LRSTAR to do chain-reduction elimination. This increase in speed is built into the parser tables. The default is 'o=0' (no optimization).

'st' option: Output a state-machine listing.

The 'st' option tells LRSTAR to list the state machine in the ?.grm.states.txt file.

'sym' option: Symbol table creation in the parser.

Tells the parser to create a symbol table. 99% of the time you will want a symbol table. 1% of the time you may want to do some debugging or do a speed test. In this case you may specify option '-sym' to avoid creating a symbol table. Without a symbol table, you will not be able to do any context sensitive parsing, which is necessary for the C/C++ language.

'tab' option: Tab setting for a grammar.

The 'tab' option tells LRSTAR how many spaces to substitute for tab characters in a grammar. Spacing is important in LRSTAR grammars. 'tab' may be set to 2, 3, or 4. The default is 'tab=3'.

'v' option: Verbose mode screen display.

The 'v' option tells LRSTAR to display more information on the screen. The default is 'v=1'. v cam be set to 0, 1 or 2.

'w' option: Warning messages.

The 'w' option tells LRSTAR to display the warning messages on the screen. The default is 'w=1'. w cam be set to 0 or 1. '-w' sets it to 0.


Section 4. Output Files.

LRSTAR creates six output files when analyzing a grammar. For example, if your grammar is calc.grm, then you will get these output files:

calc.grm.conflicts.txt - A conflicts file, showing the conflicts in your grammar.
calc.grm.grammar.txt - A grammar file, showing your grammar and the rule numbers.
calc.grm.log.txt - A log file, showing a log of what happened.
calc.grm.states.txt - A state-machine file, showing the parser state machine.
calc.grm.warnings.txt - A warnings file, showing current warning message for your grammar.
calc.lex - A list of tokens and their token numbers, to be read by DFA.


Section 6. Grammar Symbols

Grammar Symbols

Grammar symbols are the "words" of your grammar which define the syntax of your computer language. Here is a list of the different types of symbols allowed in a grammar:

Alpha Symbol        Goal, IfStatement, while, switch, case_statement, target_
Lexical Symbol      <identifier>, <number>, <eof>, <string>
Semantic Symbol     {typedef}, {program_name}, {function_name}
Literal Symbol      'program', 'if', 'else', 'while', '\'true\'', '\"true\"'
String Symbol       "hello world\n", "if then else" "\tprogram\n"
Integer             1, 256, 65535, 32767, 1000000

Reserved Grammar Symbols

Only three grammar symbols are reserved and have a special meaning:

<eof>

The end-of-file terminal symbol (i.e. <eof>). For example:

Goal   -> Statements <eof>

<error>

Used as a terminal, for an input symbol that causes a syntax error. This can be used to pass over symbols that you do not care about.

<error>, [<error>]...

<keyword>

Used as a terminal, to represent all keywords of your language (i.e. all keywords that are not reserved). A reserved keyword is specified as: FORMAT^ at the top of your grammar.

<keyword>, [<identifier>|<keyword>]/','...

Grammar Operators and Punctuators

Grammar operator and punctuators are used to further define the syntax of your computer language. Here is a list of the different types of operators and punctuators allowed in a grammar:

Rule Notation

'->'     "is a" (starts the right side of a rule).
'~>'     "is a" (same as '->', but reverse the order in the AST).

':'      "is a" (starts the right side of a rule, same as '->').
'|'      "alternate choice" for another rule.
';'      "end of a group of rules for a nonterminal" (';' is optional).

Action Operators

Action operators are placed at the end of a rule. Only one action operator may be placed at the end of a rule.

'=>'     "parse action" (call this function at parsing time).
'+>'     "make a node" (make a node in the AST).
'*>'     "make a node" and "call a node action" (during AST traverrsal).

'=+>'    "parse action" and "make a node".
'=*>'    "parse action", "make a node" and "call a node action".

EBNF Operators

'...'    "one or more occurrences".
		   
'('      "group start".
')'      "group end".
		   
'['      "optional start".
']'      "optional end".
		   
'|'      "or", inside of a group.
'/'      "one or more occurrences, separated by ..."
		   
'~'      "reverse the order in the AST".
'~..'    the same as '...' but reverse the order in the AST. 

Alternate EBNF Operators

'+'      "one or more occurrences".
'*'      "zero or more occurrences".
'?'      "optional occurrence".

Operator Precedence Notation

'{'      "operator precedence start".
'}'      "operator precedence end", e.g. { '+' '-' }.

'<<'     "left associative",  used after a list of operators.
'>>'     "right associative", used after a list of operators.

Reserved Word Designator

'^'      "reserved word", placed after a keyword being declared.

Note, the reserved-word indicator '^' should be placed after the keyword when being declared at the top of your grammar. This is only useful when you are using the reserved symbol, <keyword>, in your grammar. If you use <keyword> in your grammar, LRSTAR will create a nonterminal for you, <keyword>, that is defined to be a complete list of all the keywords found in your grammar EXCEPT for those keywords you have designated to be reserved with the '^' operator. Most computer languages will not need this feature.

Here are some useful combinations of EBNF notation:

(A|B|C)         A or B or C.
(A|B|C)...      one or more occurrence of A or B or C. 
[A|B|C]...      zero or more occurrences of A or B or C.

[x]...          zero or more occurrences of x.
[x...]          zero or more occurrences of x (same as above).

Color/','...    a comma separated list of Colors. 
(X|Y|Z)/';'...  a list of X or Y or Z separated by semi-colons. 
[X|Y|Z]/';'...  an optional list separated by semi-colons.

(A|B|C)~..      one or more occurrence of A or B or C, reverse order in AST.
[A|B|C]~..      zero or more occurrences of A or B or C, reverse order in AST.

(X|Y|Z)/';'~..  a list of X or Y or Z separated by semi-colons, reverse order.
[X|Y|Z]/';'~..  an optional list separated by semi-colons, reverse order.

Here is a sample of EBNF for some SQL rules:

SelectClause -> SELECT [ALL|DIST] SelectList FromClause 
                   [Where|Groupby|Having|Intersect|Orderby]...

Where        -> WHERE SearchCond 
Groupby      -> GROUP BY ColumnList 
ColumnList   -> ColumnSpec /','... 
Having       -> HAVING SearchCond 
Intersect    -> INTERSECT StmtName
             -> MINUS	StmtName
             -> UNION	StmtName
             -> UNION ALL StmtName

Orderby      -> ORDER BY ColumnSpecList
             -> FOR UPDATE OF ColList

Comments

Comments are allowed in the same style as C++. Line comments start with '//' and end with the end-of-line character, for example:

// C++ Grammar.
// by John Smith.
// in August 2007.

Block comments start with '/*' and end with '*/'. These cannot be nested (one inside of another). Here is an example:

/* Removed 10-15-04 NKP
CreateStmt -> CREATE TableName ColStuffList ';'
-> CREATE TableName UniqueStuff ';' // Added 09-15-04 JKL
*/

Nonterminal Symbols

Nonterminal symbols are the those symbols which are defined to be a sequence of other symbols. Nonterminals may be composed of letters, numbers and underscores ('_'). The first character cannot be a number. Here is a list of valid nonterminal symbols:

Start, IfStatement, For_loop, Expression1, Expression_2

Terminal Symbols

Terminal symbols are those symbols of the grammar that are pre-defined, such as keywords, identifiers, numbers, operators, punctuators, character strings, etc. Most of these symbols are defined in the lexical grammar and recognized by the lexer. It is this separation of the parser and lexer that allows one to define a grammar without lots of ambiguity. The concept of a separate lexer also presents a coherent set of primitive symbols to someone learning the language.

<identifier>, '=', for, while, '&&', <eof>, <string>, {typedef}, {function_name}.

Keywords

Keywords are composed of letters, numbers and underscores ('_'). The first character must be a letter. Keywords may be specified as literal symbols using single quotes (e.g. 'while'), but this is not necessary unless they contain some special character (e.g. '<html>'). Here is a sample showing some keywords:

Type        -> char | int | short | unsigned 
MetaStart   -> '<META'
MetaItem    -> 'HTTP-EQUIV' '=' Value
            ->	NAME	'=' Value
            ->	CONTENT	'=' Value

Literal Symbols

Literal symbols are symbols that begin and end with a single quote ('). Any character can be specified within single quotes, except end-of-line, tab, space, single quote ('), double quote (") and backslash (\). To have these unallowed characters in a literal, they must be preceded by a backslash (\), for example: '\'', '\"', '\\', etc. Here is an example:

PostfixExp  -> PrimaryExp
            -> PostfixExp '[' Exp ']'
            -> PostfixExp '(' [Exp] ')'
            -> PostfixExp '->' Name

Lexical Symbols

Lexical symbols are those symbols of the grammar such as <identifier>, <number>, <string>, etc. These have no literal form. For example, all words made up of letters, such as WinMain, CloseFile, MsgBox, etc. are usually represented in a grammar by <identifier>. These are tokens recognized by the lexer. Lexical symbols begin with '<' and end with '>'. Inside the angle brackets, you may use letters, numbers and underscores. Here are some examples:

Parameter   -> <identifier>
            -> <integer>
            -> <hexdigits>
            -> <string>

Semantic Symbols

Semantic symbols begin with '{' and end with '}', (e.g. {typedef}). They are variable symbols of the grammar that are defined by the lexer to be one terminal (e.g. <identifier>) and need to be changed later into another terminal (e.g. {typedef}). For example, a function name, "main", is first recognized to be an <identifier>, but later it may need to become a {function_name}. Another classical case is an <identifier> that is defined to be a 'typedef' in the C language. Here is a grammar segment that solves the 'typedef' problem:

<identifier> => lookup ();

Declaration  ->         VarDecl1 /','... ';'
             -> typedef VarDecl2 /','... ';'

VarDecl1     -> Type... <identifier>
VarDecl2     -> Type... <identifier>   => typedefidentifier_ (2, {typedef})

Type         -> SimpleType...
             -> Type '*'

SimpleType   -> char
             -> int
             -> short
             -> unsigned
             -> {typedef}

Here is a difficult input statement that will be parsed correctly without any errors:

typedef unsigned int uint, uint* uintptr;

In this line, when 'uint' is encountered it will be recognized as an <identifier> by the lexer. At the comma (,) 'uint' will be changed to a {typedef} by the parse action, 'defterm'. At the second encounter with 'uint', it will be recognized as a {typedef} by the symbol-table lookup function. Therefore the parsing is handled correctly (with no ambiguity in the grammar).


Section 7. Parser-Grammar Sections

Parser grammars (.grm files) have 4 sections:

  • 1. Defined Constants (optional),
  • 2. Token Declarations (optional).
  • 3. Operator Precedence (optional).
  • 4. Syntax Rules.

1. Defined Constants (optional)

Defined constants can be defined at the top of the grammar. They can have a numeric value or be associated with terminal symbols. These will be useful ins your grammar as arguments to parsing actions and useful in your source code as well. Let's take a look at some examples:

// Defined constants 
   NUMBER_OF_KEYWORDS   36;
   VERSION               3;
   BUILD               138;

2. Token Declarations (optional)

Token declarations are specified at the top of a grammar. Any processing that needs to be done is specified here. '=>' indicates "call this function". For example, all <identifier>s need to be looked up in the symbol table. The symbol-table index number is put on the parse stack and later transferred to a node in the AST. Here is an example:

// Tokens being stored in the symbol table

   <identifier>      =>	lookup();
   <integer>         =>	lookup();
   <string>          =>	lookup();

If you want to assign defined constants to the tokens, you would do it like this:

// Tokens being stored in the symbol table

   T_IDENTIFIER    <identifier>      =>	lookup();
   T_INTEGER       <integer>         =>	lookup();
   T_STRING        <string>          =>	lookup();

In the above examples, <identifier>, <integer> and <string> are input symbols that we want to be put into the symbol table by calling the function, lookup(). We only have to list those tokens which need to be processed by a token action.

3. Operator Precedence (optional)

The Operator Precedence section allows one to define which operators have priority over others. For example, when we are parsing expressions, such as 1+2*3, we want the 2*3 to be calculated first. We specify this operator precedence as follows:

{ '+' '-' } <<  // lower priority
{ '*' '/' } <<  // higher priority

The operators with higher priority are listed below ones with lower priority. The '<<' means left associative and '>>' means right associative. A case of right associativity would be the exponential operator '^', for example 2^2^3 would be calculated as 2^8, instead of 4^3. We want the 2^3 to be computed first. This operator usually has higher priority than the '*' and '/' operators. The operator precedence would look like this:

{ '+' '-' } << // left associative
{ '*' '/' } << // left associative
{ '^'     } >> // right associative

4. Syntax Rules

A rule is a nonterminal symbol followed by a '->' and zero or more symbols. Any null production (empty rule) should be listed first for readability, but this is not strictly enforced. Rules have this format:

Nonterminal ('->' (Nonterminal | Terminal) )... 

Goal Symbol or Start Symbol

The first rule of a grammar must define the goal symbol. The right side of the goal production contains one symbol or one expression and the end-of-file symbol (e.g. <eof>). The end-of-file symbol cannot be used anywhere else in the syntax-rules section of the grammar. Here are some examples of start definitions:

Start -> Input <eof>
Goal	-> [CompilationUnit]... <eof> 

Nonterminals

A nonterminal symbol definition may contain many rules. Here is an example taken from a C grammar.

Stmt -> ';'
     -> AssignmentExp ';'
     -> goto Identifier ';'
     -> return [Exp] ';'
     -> if '(' Exp ')' Stmt
     -> if '(' Exp ')' Stmt else Stmt
     -> switch '(' Exp ')' '{' Cases Default '}'
     -> while '(' Exp ')' Stmt
     -> do Stmt while '(' Exp ')' ';'
     -> for '(' Exp1 ';' Exp2 ';' Exp3 ')' Stmt

Actions

Any rule may have an action attached to it (at the end of the rule). There are 5 types of actions:

=>    Call a parse-action.
+>    Make a node in the AST.
*>    Make a node and call a node-action (during AST traversal).
=+>   Call a parse-action and make a node in the AST.
=*>   Call a parse-action, make a node and call a node-action.

=> Parse action

Parse actions are done by the parser when a rule is recognized. When the last symbol in a rule is accepted, the parser makes a reduction and calls the function indicated by the '=>' operator. Here is the syntax:

'=>' FunctionName '(' [Arg /','...] ')'

Arg may be any one of:

Terminal symbol   (e.g. <identifier>)
Integer   (e.g. 100)
Defined Constant   (e.g. INTEGER)
Literal   (e.g. ':')
String   (e.g. "hello world\n")

+> Make a node in the AST

Make-a-node operation occurs when the parser makes a reduction of a rule. At this time, the parser creates a new parent node and attaches any children nodes to it as may occur in the rule being recognized. Here is an example of how to do expressions, taken from the 'calc.grm':

Exp   -> Primary
      -> Exp '+' Exp	   +> add
      -> Exp '-' Exp	   +> sub
      -> Exp '*' Exp	   +> mul
      -> Exp '/' Exp	   +> div

*> Make a node and call a node action

A Make-a-node operation occurs when the parser makes a reduction of a rule. At this time, the parser creates a new node and attaches the children nodes found in the right side of the rule. Here is the syntax:

*> NodeName
*> NodeName '(' ')'
*> NodeName '(' Arg /','... ')'

Args may be specified, as many as you want. Arg may be any one of:

Terminal symbol   (e.g. <identifier>)
Integer   (e.g. 100)
Defined Constant   (e.g. INTEGER)
Literal   (e.g. ':')
String   (e.g. "hello world\n")

=+> Call a parse action and make a node

This is the combination of two operations, which is more concise than having a => and a +> for the rule. In this case (=+>) the parse action name will be the same as the node name.

=*> Call a parse action , make a node, and call a node action

This is the combination of three operations. Some situations may require this kind of thing. Usually, the parse action name will be the same as the node name and the node-action name will be the same as the node name.

Assignment  ~> Target '=' Exp ';'   =*> assignment()  

In the above case we will need a parse action called 'assignment', and a node action called 'assignment'. These must be supplied in the "calcactions.cpp" source file. See sample project 'calc'.


Section 8. Useful Examples

Example 1. Delayed Typing Of <identifier>s

There are some occasions in which it is very useful to delay the typing of symbols until the whole rule is recognized. This allows one to avoid conflicts in the grammar or avoid writing code that does a lookahead to resolve a conflict at parsing time. Here is a classical example often presented as a reason why you need infinite lookahead to resolve this ambiguity in the C language:

ExternalDef  -> FuncDecl
             -> FuncDef

FuncDecl     -> Type FuncDeclName '(' Arg/','... ')' ';'
FuncDef      -> Type FuncDefName  '(' Arg/','... ')' '{' Stmts '}'

FuncDeclName -> <identifier>             => funcdecl_(1)
FuncDefName  -> <identifier>             => funcdef_(1)

In the above grammar there is a reduce-reduce conflict. At the input symbol '(' the parser cannot decide whether <identifier> is a FuncDeclName or a FuncDefName. In fact, it cannot make this decision until it sees the ';' or the '{'. We can get rid of the ambiguity (conflict) by re-writing the grammar this way:

ExternalDef  -> FuncDecl
             -> FuncDef

FuncDecl     -> Type <identifier> '(' Arg/','... ')' ';'
FuncDef      -> Type <identifier> '(' Arg/','... ')' '{' Stmts '}'

In this new grammar, there is no conflict, however, there is no parse actions for either case. We must delay the action and perform it later during traversal of the AST. This is not a problem, however, it may seem strange to you. You can learn to do it this way. I think it's better than doing a lookahead for the ';' or the '{'. We need to incorporate the appropriate node actions in the grammar as follows:

ExternalDef  -> FuncDecl
             -> FuncDef

FuncDecl  -> Type <identifier> '(' Arg/','... ')' ';'            *> funcdecl_(2)
FuncDef   -> Type <identifier> '(' Arg/','... ')' '{' Stmts '}'  *> funcdef_(2)

Now, we will have two different nodes in the AST, one called 'funcdecl_' and another called, 'funcdef_'. The <identifier> will be attached to the node via its symbol-table-index. At this time we can do appropriate processing of the names. This method avoids the lookahead problem.

 
  (c) Copyright Paul B Mann 2018.  All rights reserved.