LRSTAR: LR(*) parser generator

A.M.D.G.

 
  Home Downloads Feedback Comparison History Theory Papers Doc. Contact  
  Installation LRSTAR DFA Definitions  
 

Definitions

abstract-syntax tree

An abstract-syntax tree is a tree structure which contains only the meaningful elements from the input language. Punctuation that is not meaningful, such as comma's and parentheses, are not part of an abstract-syntax tree. The structure of the tree is defined by the placement of node names in the grammar. Here is a grammar, an input statement and its corresponding AST.

{ '+' } <<
{ '*' } <<

Goal     -> [Stmt]... <eof>

Stmt     -> Exp ';'               +> eval
         -> Target '=' Exp ';'    +> assign 

Target   -> <identifier>          +> address

Exp      -> Exp '+' Exp           +> add
         -> Exp '*' Exp           +> mul 
         -> <number>              +> number
         -> <identifier>          +> ident
         -> '(' Exp ')'
      
y = 2 * (x + 1);

assign
+ address (y)
+ mul
  + number (2)
  + add
    + ident (x)
    + number (1)

AST.

Abstract-syntax-tree.

accessor symbol.

An accessor symbol is the token which permits entry into a state. Each state only has one accessor symbol, unless the 'o' option is used. Several states may have the same accessor, but each state only has one accessor. The terminal accessor numbers are positive and the nonterminal accessor numbers are negative.

back substitution.

The process of replacing every occurrence of a nonterminal symbol in a grammar with its associated productions. This may increase the number of productions. It can be used to remove all null productions from a grammar. It can also be used for eliminating conflicts in a grammar. See the following.

Goal      -> Statement <eof> 

Statement -> Var '=' Exp ; 

Exp       -> Var
          -> Var '+' Var
          -> Var '-' Var

Grammar after backsubstituting Exp:

Goal      -> Statement <eof> 

Statement -> Var '=' Var ';'
          -> Var '=' Var '+' Var ';'
          -> Var '=' Var '-' Var ';'

backtracking.

Backtracking is the process of undoing parsing actions. A nondeterministic parsing algorithm uses backtracking when a wrong choice is made in a state that has more than one transition choice for a token. GLR parsers avoid backtracking by creating more than one parse stack (a parse tree instead of a parse stack) and keep these in memory until later when analyzing the AST. Eventually the wrong parse tree branches will be deleted.

binary.

Binary refers to data in machine readable format rather than human readable format.

binary search.

A process of searching a table for a particular value by repeatedly partitioning the remaining search area into two halves by looking at the center value and determining if it is less or greater than the search value. The table of values must be sorted for this to work.

chain reductions.

Chain reductions are those consecutive unit reductions of the form (a <= b) that the parser makes, but have no associated action with them and cause a waste of time. Languages that have a large number of expression operators that are defined without using operator precedence notation have a lot of chain reductions. These can be eliminated by specifying operator precedence at the top of a grammar and using ambiguous expressions of the form:

Exp    -> Exp '+' Exp
       -> Exp '-' Exp
       -> Exp '*' Exp
       -> Exp '/' Exp

Here is a grammar that will generate a parser that has useless chain reductions:

Stmt   -> LeftSide '=' Exp ';' 

Exp    -> Term
       -> Exp '+' Term
       -> Exp '-' Term

Term   -> Factor
       -> Term '*' Factor
       -> Term '/' Factor

Factor -> <number>
       -> <identifier>
       -> '(' Exp ')'

If the input line is, "y = 100;", the parser makes 3 reductions:

Factor <= <number> 
Term   <= Factor 
Exp    <= Term

If the parser generator does chain-reduction elimination, the generated parser only does one reduction as follows:

Exp    <= <number>

closure.

The process of generating from the kernel set of items (in a state), all possible items that may exist in that state. This is part of the state building process done by a parser generator.

command language.

A language used to "talk to" a computer or command one. A language used in an interactive way with a machine.

compiler.

A program which translates source code written in a high level language (C, Pascal, BASIC, etc.) into an executable file suitable for execution on a particular machine.

conflict.

An ambiguity in a grammar. If, starting from the goal symbol, there are two or more ways to derive a pattern, and it is not clear how to distinguish one from another by using one symbol of lookahead, then there is a conflict. The parser generator must resolve this problem if the parsing algorithm is to be deterministic (without choices). There two types of conflicts: shift-reduce and reduce-reduce. In the case of the shift-reduce, precedence is given to the shift operation (choosing the longer rule) instead of the reduce operation. In the case of a reduce-reduce conflict, precedence is given to the earlier production (rule) in the grammar.

default reduction.

A default reduction in a state is the reduction to be taken if there is no transition (of shift action) for the current input token . Most states have only one reduction, the default reduction. Those states that have multiple reductions, may also have a default reduction. Those states that don't have any reductions are called error states.

derived.

A derived attribute is one that is passed up from a lower node in a tree. A number such as 10 may have a derived attribute of integer, whereas 10L has a derived attribute of long. This is the opposite of an inherited attribute. Inherited attributes are passed from higher nodes down to lower nodes. If 10 were an argument to a function that required a char type then 10 would inherit a char attribute.

disambiguating rules.

Those rules placed at the top of the grammar that specify operator precedence. It is easier to specify expressions using the ambiguous grammar notation shown below, than it is to use the unambiguous notation also shown below.

{ '+' '-' } << /* disambiguating	*/
{ '*' '/' } << /* rules.	*/

Exp    -> Exp '+' Exp
       -> Exp '-' Exp
       -> Exp '*' Exp
       -> Exp '/' Exp
       -> <number>
       -> <identifier>
       -> '(' Exp ')'

Unambiguous grammar:

Exp    -> Term
       -> Exp '+' Term
       -> Exp '-' Term

Term   -> Factor
       -> Term '*' Factor
       -> Term '/' Factor

Factor -> <number>
       -> <identifier>
       -> '(' Exp ')'

error recovery.

Error recovery is the process of recovering from a syntax error. When the parser encounters an error in the input, control is given to an error recovery function whose job is to fix the error or skip over it and find a way to resume parsing without creating erroneous error messages.

error state.

An error state is a state that has no default reduction. When the parser arrives at this state, if there is no matching transition for the input token, it must declare an error.

grammar.

A list of rules that specify a pattern or acceptable sequence of input symbols. A definition of the form (syntax) of a language.

head symbol.

The symbol being defined. A nonterminal on the left of an arrow. The left side of a production.

inherited.

An inherited attribute is one that comes from a higher node. In LR parsing, which is bottom-up parsing, inherited attributes cannot be assigned in the first pass. That is why people build an abstract syntax tree. So that after the parsing is finished, one can traverse the tree (AST) and assign attributes. (See derived).

interpreter.

A program which executes source code written in a high level language (C, Pascal, Ada). An interpreter does not produce object code or an executable file. It may generate intermediate code and execute that. Interpreters are desirable because they can offer debugging techniques superior to compilers. They have the source code readily available in memory. The translation time is less because they usually don't do much optimization on the code.

keyword.

Those words in a language that have special meaning in the language and cannot be used as variable names. FORTRAN and PL/I are exceptions, and allow keywords to be used as variable names. In the C language, some keywords are: if, else, while, for, switch, continue, break.

LALR.

LALR is a term taken from computer science parsing theory which means Look Ahead Left Right. It refers to a class of grammars that belongs in the LALR(1) category of descriptive power, greater than SLR(1) but not quite as powerful as LR(1).

leaf node.

An element in a tree that has no connections to lower elements. A node without children. See node.

left recursion.

Left recursion is defined in a grammar by a rule that contains the head symbol as the first tail symbol. Left recursion is the desired way to specify repetition in an LR grammar. For example:

IdentifierList -> <identifier>
               -> IdentifierList ',' <identifier>

lexer.

A pattern recognizer, that recognizes character patterns instead of symbol patterns. A lexer recognizes the tokens of a language and defines the starting point and end of each token. It usually skips over blanks, newlines, and comments.

lexical.

Pertaining to characters. A lexical analyzer is the same thing as a lexer. A lexical grammar is a grammar that defines the character patterns for the tokens of a language, such as numbers, identifiers, strings, etc.

lookahead.

A symbol used by the parser to determine whether to shift (goto a new state) or reduce (make a reduction and goto a new state). A lookahead set for each state is created by the parser generator.

node.

A element in a tree that has connections to zero or more child elements and a connection to one parent node. The root node has no parent node.

nonterminal.

A symbol of a grammar that represents other symbols of the grammar. A nonterminal is a symbol that is defined in a grammar. It may be defined to be zero or more terminals or nonterminals.

null.

A null production is a rule that has an empty right part. For example:

OptionalList ->

parser.

A parser is a pattern recognizer. A program capable of accepting input that conforms to acceptable patterns as defined by a set of rules (grammar).

production.

A rule in a grammar. A production contains an optional head symbol, an arrow (->) and zero or more tail symbols. For example:

ForStmt -> for '(' [Exp] ';' [Exp] ';' [Exp] ')' Stmt

recursion.

Repetition of itself. In reference to grammars, recursion means that a sequence of symbols is repeatable for an indefinite number of times. See left recursion, right recursion.

reduce action.

A reduce action in a parser is the act of recognizing that a rule has been recognized and then reducing the list of symbols that make up the right side of the rule (on the parse stack) to one symbol, the head symbol for that rule.

reduce-reduce.

A type of conflict in which two or more rules have identical right sides but different head symbols. The parser generator cannot decide which head symbol to choose based on one symbol of lookahead.

reduction.

A reduce action. See reduce.

right recursion.

Right recursion is defined in a grammar by a rule in which the head symbol is used as the last tail symbol. This specifies repetition of itself in a way that postpones all of the reductions until the last one is seen. Then all of the reductions for this rule are done in a reverse order. Sometimes this is desirable, for example:

Factor   -> Primary
         -> Primary '**' Factor

This operator precedence notation and ambiguous rule accomplishes the same thing:

{ '**' } >>

Exp -> Exp '**' Exp

root.

The top or highest node in a tree. The origin. All the elements in a tree can be visited by traversing downward from the root.

semantics.

Semantics refers to meaning rather than syntax. Semantic information usually means information that is found in the symbol table. Some people like to do GLR parsing without building a symbol table during the first pass and then after the AST is built, create the symbol table. Then use the symbol table to do a semantic analysis of the AST to remove erroneous branches (which were created by ambiguity in the grammar). LRSTAR parsers build the symbol table during the parsing phase and thus have semantic analysis going on during the parsing (thereby using less memory and providing faster parsers).

shift action.

A shift action in a parser is one that accepts the input symbol read by the lexer, puts it on the parse stack, and does a transition to a new state.

shift-reduce action.

A shift-reduce action is a parsing action that combines a shift action and a reduce action into one. This is possible when the symbol being accepted is the last one of a rule. The use of shift-reduce actions permits the elimination of reduce-only states, which usually make up about 35% of the total number. This also provides faster parsing. LRSTAR generated parsers have shift-reduce actions.

shift-reduce conflict.

A shift-reduce conflict is a conflict between a shift action and a reduce action. The parser generator cannot decide whether to pursue the recognition of an incompleted rule or accept another rule that has been completed. If, based on one symbol of lookahead, the situation is undecidable, the shift action will be given priority and a shift-reduce conflict will be reported (unless the 'rr' option was specified).

skeleton.

LRSTAR 9.0 does NOT use skeleton files anymore. They were a source of frustration. Instead, LRSTAR generates only the two header files for the parser (e.g. cparser.h and cparsertables.h). The cparser.cpp file is not generated and not read-only. It is the same file for all parsing projects, much easier to work with now, for debugging and modifying.

SLR.

Simple LR. A class of grammars with less descriptive power than LALR grammars. Also a state type in which the parser generator can resolve inconsistencies by looking at the FIRST and FOLLOW sets, rather than the LALR(1) lookahead sets. See books on compiler theory for more information.

state.

A situation which has a set of valid inputs and certain actions associated with them, either shift and make a transition to a new state or reduce and fall back to a previous state. A number of states make up a finite-state machine.

syntax.

Syntax refers to the form of a language. It refers to the correct order of symbols or correct sequence. Syntax does not pertain to meaning. See semantics.

tail.

A terminal or nonterminal symbol that appears on the right side of a rule.

terminal.

The terminal symbols are all those symbols that are undefined in a grammar. They represent themselves. They are the primitive elements of the language. The leaf nodes of the abstract syntax tree. The input tokens coming from the lexer.

token.

A symbol defined by the lexer. A terminal symbol. The primitive symbols of a grammar (undefined in a grammar). For example, <identifier>, <integer>, <string>, 'if', 'else', 'while', '=', '+', '*' and ';' are all tokens of the C language.

token action

An action or function that does special processing for a token before it gets seen by the parser, such as storing of the token in a symbol table. A token action such as a symbol-table lookup may change the terminal number of the token (change it from an <identifier> to a {typedef}).

transition.

A transition is an action of the parser to go to the next state, based in the input token. The next state is determined by looking at the transition matrix. The current state indicates which row of the matrix and the input token indicates which column in the row. The number found at that location is either the next state (if positive) or the reduce action (if negative). This is the simple textbook answer. LRSTAR uses 4 matrix tables instead of one. This gives the best solution: fast parsers and small tables.

tree.

A tree is a data structure which facilitates visiting all the elements in the tree in several different orders. It facilitates restructuring or removing parts of the tree (optimizations). It is a very desirable structure when dealing with language processing.

 
  (c) Copyright Paul B Mann 2018.  All rights reserved.