> What’s New ? <

Parsing

What’s the difference between a ‘tokenizer’, ‘lexer’ and ‘parser’?

https://www.quora.com/Parsers-What-is-the-difference-between-YACC-and-ANTLR

A tokenizer just splits text into smaller units, such as words. Tokenizing into letters, syllables, sentences, etc. is also possible.

A lexer does the same, but also attaches extra information to each token. If we tokenize into words, a lexer would attach tags like number, word, punctuation, etc.

A parser usually consumes the output of a lexer and constructs a parse tree.

I tried to give a generic answer which targets both natural and programming languages.
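As a rough worked example (illustrative only), consider the input string "total = 4 + 20":

  Tokenizer: "total", "=", "4", "+", "20"
  Lexer:     ("total", identifier), ("=", operator), ("4", number), ("+", operator), ("20", number)
  Parser:    an assignment node whose left child is the identifier "total" and whose right
             child is an addition node over the numbers 4 and 20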

What is YACC and how do we use YACC tools?

The name YACC is an acronym for “Yet Another Compiler Compiler.” YACC is a specialized compiler that generates source code for part of a compiler you’re trying to construct. Specifically, YACC is a parser generator, a software tool that helps automate a portion of compiler development, namely building the parser. It generates source code for an LALR (look-ahead LR) parser.

YACC accepts as input a domain-specific language that consists primarily of grammar rules and actions to be performed when those grammar rules are recognized. As output, YACC typically generates C source code, which then becomes part of the source code for the compiler you’re constructing.
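As a minimal sketch of what that input looks like (assuming a NUMBER token supplied by the lexer and the default integer semantic values), a grammar rule is paired with a C action; $$ is the value of the rule being reduced and $1, $3 are the values of the matched symbols:

%token NUMBER

%left '+'                /* '+' is left associative */

%%

expr
    : expr '+' expr   { $$ = $1 + $3; }   /* action run when this rule is reduced */
    | NUMBER          { $$ = $1; }
    ;

%%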

GNU Bison

Bison is a more modern GNU incarnation of the classic Unix YACC parser generator.

GNU Bison, commonly known as Bison, is a parser generator that is part of the GNU Project. Bison reads a specification in Bison syntax (described as “machine-readable BNF”[3]), warns about any parsing ambiguities, and generates a parser that reads sequences of tokens and decides whether the sequence conforms to the syntax specified by the grammar.

The generated parsers are portable: they do not require any specific compilers. Bison by default generates LALR(1) parsers but it can also generate canonical LR, IELR(1) and GLR parsers.[4]

In POSIX mode, Bison is compatible with Yacc, but also has several extensions over this earlier program, including:

  • Generation of counterexamples for conflicts

  • Location tracking (e.g., file, line, column)

  • Rich and internationalizable syntax error messages in the generated parsers

  • Customizable syntax error generation

  • Reentrant parsers

  • Push parsers, with autocompletion

  • Support for named references

  • Several types of reports (graphical, XML) on the generated parser

  • Support for several programming languages (C, C++, D, or Java)
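Several of these extensions are requested through declarations in the grammar file or command-line options; a hedged sampling is shown below (exact spellings vary somewhat between Bison versions), and counterexample generation is requested on the command line with -Wcounterexamples, as in the warning shown later in this section.

%define api.pure full        /* reentrant parser: no global yylval/yychar */
%locations                   /* track file/line/column in @$ and @N */
%define parse.error verbose  /* richer syntax error messages */
%define api.push-pull push   /* generate a push parser (yypush_parse) */
%glr-parser                  /* generate a GLR parser instead of LALR(1) */
%language "c++"              /* emit a C++ parser class instead of C */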

Flex, an automatic lexical analyser generator, is often used with Bison to tokenise input data and provide Bison with tokens.[5]

Bison was originally written by Robert Corbett in 1985.[1] Later, in 1989, Robert Corbett released another parser generator named Berkeley Yacc. Bison was made Yacc-compatible by Richard Stallman.[6]

Bison is free software and is available under the GNU General Public License, with an exception (discussed below) allowing its generated code to be used without triggering the copyleft requirements of the licence.

Features

Counterexample generation

One delicate issue with LR parser generators is the resolution of conflicts (shift/reduce and reduce/reduce conflicts). With many LR parser generators, resolving conflicts requires the analysis of the parser automaton, which demands some expertise from the user.

To aid the user in understanding conflicts more intuitively, Bison can instead automatically generate counterexamples. For ambiguous grammars, Bison can often even produce counterexamples that show that the grammar is ambiguous.

For instance, on a grammar suffering from the infamous dangling else problem, Bison reports

doc/if-then-else.y: warning: shift/reduce conflict on token "else" [-Wcounterexamples]
  Example: "if" expr "then" "if" expr "then" stmt • "else" stmt
  Shift derivation
    if_stmt
    ↳ "if" expr "then" stmt
                        ↳ if_stmt
                           ↳ "if" expr "then" stmt • "else" stmt
  Example: "if" expr "then" "if" expr "then" stmt • "else" stmt
  Reduce derivation
    if_stmt
    ↳ "if" expr "then" stmt                        "else" stmt
                        ↳ if_stmt
                           ↳ "if" expr "then" stmt •
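A minimal grammar that reproduces this conflict looks roughly like the following (token and nonterminal names are chosen to mirror the report; this is a sketch, not Bison's actual doc/if-then-else.y):

%token IF "if" THEN "then" ELSE "else" EXPR "expr" STMT "stmt"

%%

stmt
    : if_stmt
    | "stmt"          /* stand-in for any other kind of statement */
    ;

if_stmt
    : "if" "expr" "then" stmt
    | "if" "expr" "then" stmt "else" stmt
    ;

%%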

Reentrancy

Reentrancy is a feature which has been added to Bison and does not exist in Yacc.

Normally, Bison generates a parser which is not reentrant. In order to achieve reentrancy, the declaration %define api.pure must be used. More details on Bison reentrancy can be found in the Bison manual.[7]

Output languages

Bison can generate code for C, C++, D and Java.[8]

To use a Bison-generated parser from other languages, a language binding tool such as SWIG can be used.

License and distribution of generated code

Because Bison generates source code that in turn gets added to the source code of other software projects, it raises some simple but interesting copyright questions.

A GPL-compatible license is not required

The code generated by Bison includes significant amounts of code from the Bison project itself. The Bison package is distributed under the terms of the GNU General Public License (GPL) but an exception has been added so that the GPL does not apply to output.[9][10]

Earlier releases of Bison stipulated that parts of its output were also licensed under the GPL, due to the inclusion of the yyparse() function from the original source code in the output.

Distribution of packages using Bison

Free software projects that use Bison may have a choice of whether to distribute the source code which their project feeds into Bison, or the resulting C code output by Bison. Either is sufficient for a recipient to be able to compile the project source code. However, distributing only the input carries the minor inconvenience that recipients must have a compatible copy of Bison installed so that they can generate the necessary C code when compiling the project. Distributing only the output C code creates the problem that it is very difficult for recipients to modify the parser, since this code was written neither by a human nor for humans; its purpose is to be fed directly into a C compiler.

These problems can be avoided by distributing both the input files and the generated code. Most people will compile using the generated code, no different from any other software package, but anyone who wants to modify the parser component can modify the input files first and re-generate the generated files before compiling. Projects distributing both usually do not have the generated files in their revision control systems. The files are only generated when making a release.

Some licenses, such as the GPL, require that the source code be in “the preferred form of the work for making modifications to it”. GPL’d projects using Bison must thus distribute the files which are the input for Bison. Of course, they can also include the generated files.

Use

Because Bison was written as a replacement for Yacc, and is largely compatible, the code from a lot of projects using Bison could equally be fed into Yacc. This makes it difficult to determine if a project “uses” Bison-specific source code or not. In many cases, the “use” of Bison could be trivially replaced by the equivalent use of Yacc or one of its other derivatives.

Bison has features not found in Yacc, so some projects can be truly said to “use” Bison, since Yacc would not suffice.

The following is a list of projects which are known to “use” Bison in the looser sense: they use free software development tools and distribute code which is intended to be fed into Bison or a Bison-compatible package.

  • Bash shell uses a yacc grammar for parsing the command input.

  • Bison’s own grammar parser is generated by Bison.[11]

  • CMake uses several Bison grammars.[12]

  • GCC started out using Bison, but switched to a hand-written recursive-descent parser for C++ in 2004 (version 3.4),[13] and for C and Objective-C in 2006 (version 4.1)[14]

  • The Go programming language (GC) used Bison, but switched to a hand-written scanner and parser in version 1.5.[15]

  • LilyPond requires Bison to generate its parser.[16]

  • MySQL[17]

  • GNU Octave uses a Bison-generated parser.[18]

  • Perl 5 uses a Bison-generated parser starting in 5.10.[19]

  • The PHP programming language (Zend Parser).

  • PostgreSQL[20]

  • Ruby MRI, the reference implementation of the Ruby programming language, relies on a Bison grammar.[21]

  • syslog-ng uses several Bison grammars assembled together.[22]

A complete reentrant parser example


The following example shows how to use Bison and flex to write a simple calculator program (only addition and multiplication) and a program for creating an abstract syntax tree. The next two files provide the definition and implementation of the syntax tree functions.

/*
 * Expression.h
 * Definition of the structure used to build the syntax tree.
 */
#ifndef __EXPRESSION_H__
#define __EXPRESSION_H__

/**
 * @brief The operation type
 */
typedef enum tagEOperationType
{
    eVALUE,
    eMULTIPLY,
    eADD
} EOperationType;

/**
 * @brief The expression structure
 */
typedef struct tagSExpression
{
    EOperationType type; /**< type of operation */

    int value; /**< valid only when type is eVALUE */
    struct tagSExpression *left; /**< left side of the tree */
    struct tagSExpression *right; /**< right side of the tree */
} SExpression;

/**
 * @brief It creates an identifier
 * @param value The number value
 * @return The expression or NULL in case of no memory
 */
SExpression *createNumber(int value);

/**
 * @brief It creates an operation
 * @param type The operation type
 * @param left The left operand
 * @param right The right operand
 * @return The expression or NULL in case of no memory
 */
SExpression *createOperation(EOperationType type, SExpression *left, SExpression *right);

/**
 * @brief Deletes an expression
 * @param b The expression
 */
void deleteExpression(SExpression *b);

#endif /* __EXPRESSION_H__ */

/*
 * Expression.c
 * Implementation of functions used to build the syntax tree.
 */

#include "Expression.h"

#include <stdlib.h>

/**
 * @brief Allocates space for expression
 * @return The expression or NULL if not enough memory
 */
static SExpression *allocateExpression()
{
    SExpression *b = (SExpression *)malloc(sizeof(SExpression));

    if (b == NULL)
        return NULL;

    b->type = eVALUE;
    b->value = 0;

    b->left = NULL;
    b->right = NULL;

    return b;
}

SExpression *createNumber(int value)
{
    SExpression *b = allocateExpression();

    if (b == NULL)
        return NULL;

    b->type = eVALUE;
    b->value = value;

    return b;
}

SExpression *createOperation(EOperationType type, SExpression *left, SExpression *right)
{
    SExpression *b = allocateExpression();

    if (b == NULL)
        return NULL;

    b->type = type;
    b->left = left;
    b->right = right;

    return b;
}

void deleteExpression(SExpression *b)
{
    if (b == NULL)
        return;

    deleteExpression(b->left);
    deleteExpression(b->right);

    free(b);
}

The tokens needed by the Bison parser will be generated using flex.

%{

/*
 * Lexer.l file
 * To generate the lexical analyzer run: "flex Lexer.l"
 */

#include "Expression.h"
#include "Parser.h"

#include <stdio.h>

%}

%option outfile="Lexer.c" header-file="Lexer.h"
%option warn nodefault

%option reentrant noyywrap never-interactive nounistd
%option bison-bridge

%%

[ \r\n\t]*   { continue; /* Skip blanks. */ }
[0-9]+       { sscanf(yytext, "%d", &yylval->value); return TOKEN_NUMBER; }

"*"          { return TOKEN_STAR; }
"+"          { return TOKEN_PLUS; }
"("          { return TOKEN_LPAREN; }
")"          { return TOKEN_RPAREN; }

.            { continue; /* Ignore unexpected characters. */}

%%

int yyerror(SExpression **expression, yyscan_t scanner, const char *msg) {
    fprintf(stderr, "Error: %s\n", msg);
    return 0;
}

The names of the tokens are typically neutral: "TOKEN_PLUS" and "TOKEN_STAR", not "TOKEN_ADD" and "TOKEN_MULTIPLY". For instance, if we were to support the unary "+" (as in "+1"), it would be wrong to name this "+" "TOKEN_ADD". Similarly, in a language such as C, "int *ptr" denotes the definition of a pointer, not a product: it would be wrong to name this "*" "TOKEN_MULTIPLY".

Since the tokens are provided by flex, we must provide the means for the parser and the lexer to communicate.[23] The data type used for communication, YYSTYPE, is set using the Bison %union declaration.

Since this sample uses the reentrant version of both flex and yacc, we are forced to provide parameters for the yylex function when it is called from yyparse.[23] This is done through the Bison %lex-param and %parse-param declarations.[24]

%{

/*
 * Parser.y file
 * To generate the parser run: "bison Parser.y"
 */

#include "Expression.h"
#include "Parser.h"
#include "Lexer.h"

// reference the implementation provided in Lexer.l
int yyerror(SExpression **expression, yyscan_t scanner, const char *msg);

%}

%code requires {
  typedef void* yyscan_t;
}

%output  "Parser.c"
%defines "Parser.h"

%define api.pure
%lex-param   { yyscan_t scanner }
%parse-param { SExpression **expression }
%parse-param { yyscan_t scanner }

%union {
    int value;
    SExpression *expression;
}

%token TOKEN_LPAREN   "("
%token TOKEN_RPAREN   ")"
%token TOKEN_PLUS     "+"
%token TOKEN_STAR     "*"
%token <value> TOKEN_NUMBER "number"

%type <expression> expr

/* Precedence (increasing) and associativity:
   a+b+c is (a+b)+c: left associativity
   a+b*c is a+(b*c): the precedence of "*" is higher than that of "+". */
%left "+"
%left "*"

%%

input
    : expr { *expression = $1; }
    ;

expr
    : expr[L] "+" expr[R] { $$ = createOperation( eADD, $L, $R ); }
    | expr[L] "*" expr[R] { $$ = createOperation( eMULTIPLY, $L, $R ); }
    | "(" expr[E] ")"     { $$ = $E; }
    | "number"            { $$ = createNumber($1); }
    ;

%%

The code needed to obtain the syntax tree using the parser generated by Bison and the scanner generated by flex is the following.

/*
 * main.c file
 */

#include "Expression.h"
#include "Parser.h"
#include "Lexer.h"

#include <stdio.h>

int yyparse(SExpression **expression, yyscan_t scanner);

SExpression *getAST(const char *expr)
{
    SExpression *expression;
    yyscan_t scanner;
    YY_BUFFER_STATE state;

    if (yylex_init(&scanner)) {
        /* could not initialize */
        return NULL;
    }

    state = yy_scan_string(expr, scanner);

    if (yyparse(&expression, scanner)) {
        /* error parsing */
        return NULL;
    }

    yy_delete_buffer(state, scanner);

    yylex_destroy(scanner);

    return expression;
}

int evaluate(SExpression *e)
{
    switch (e->type) {
        case eVALUE:
            return e->value;
        case eMULTIPLY:
            return evaluate(e->left) * evaluate(e->right);
        case eADD:
            return evaluate(e->left) + evaluate(e->right);
        default:
            /* should not be here */
            return 0;
    }
}

int main(void)
{
    char test[] = " 4 + 2*10 + 3*( 5 + 1 )";
    SExpression *e = getAST(test);
    int result = evaluate(e);
    printf("Result of '%s' is %d\n", test, result);
    deleteExpression(e);
    return 0;
}

A simple makefile to build the project is the following.

# Makefile

FILES = Lexer.c Parser.c Expression.c main.c
CC = g++
CFLAGS = -g -ansi

test: $(FILES)
       $(CC) $(CFLAGS) $(FILES) -o test

Lexer.c: Lexer.l
       flex Lexer.l

Parser.c: Parser.y Lexer.c
       bison Parser.y

clean:
       rm -f *.o *~ Lexer.c Lexer.h Parser.c Parser.h test

See also

  • Berkeley Yacc (byacc) – another free software Yacc replacement sharing the same author as GNU Bison

  • ANTLR (ANother Tool for Language Recognition) – another open-source parser generator

The use of YACC follows this general sequence:

  1. Choose or define a language for which you want to build a compiler. For this example, let’s say we want to construct a fictitious ZXJ compiler in C, with the help of YACC. For your language, you’ll need all the details of keywords (reserved words), operators, how identifiers are formed, how constants are formed, all the operator precedence and associativity rules, and all the grammar rules of the language.

  2. Write a lexical analyzer in C for the ZXJ language. The lexical analyzer breaks the input into tokens of the language. You could do this by hand, or use Lex or Flex to generate a lexical analyzer. (Lex and Flex are to lexical analyzers what YACC and Bison are to parsers.)

  3. Using YACC syntax, write grammar rules for the ZXJ language (a bare-bones skeleton of such a grammar file is sketched after this list). Write actions associated with any rules that require them. For example, when a grammar rule that involves an add operation between two operands is recognized, you’ll need an action for that rule that generates the appropriate intermediate representation of that operation. If your grammar contains any conflicts, YACC will tell you about them.

  4. Feed the YACC code you wrote in step 3 into the YACC program. The output will be C source code: the parser for your ZXJ compiler.

  5. In C, write the back end of your ZXJ compiler (e.g., code generator, optimizer). You’ll also need to implement a symbol table and support for whatever intermediate representation you will be using.

  6. Collect the output of YACC in step 4 (i.e., the C source code generated by YACC), the lexical analyzer source code you created in step 2, and all the code from step 5 into a project, and build it into an executable program. That final executable is your ZXJ compiler.
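For orientation, the overall shape of a YACC/Bison input file is three sections separated by %%. The ZXJ-specific content below is of course hypothetical; this is only a skeleton, and the lexer (yylex) would come from step 2.

%{
/* Declarations section: C prologue shared with the actions below. */
#include <stdio.h>
int yylex(void);                 /* provided by the lexer built in step 2 */
void yyerror(const char *msg);
%}

%token IDENTIFIER NUMBER         /* token kinds returned by the lexer */

%%
/* Grammar rules and their actions. */
program
    : /* empty */
    | program statement
    ;

statement
    : IDENTIFIER '=' NUMBER ';'  { /* emit intermediate code for an assignment here */ }
    ;
%%

/* User code section: helper functions, possibly main(). */
void yyerror(const char *msg)
{
    fprintf(stderr, "parse error: %s\n", msg);
}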

There are many other details involved in the process, but this should give you a general idea of the flow. Many other tools are available to help construct compilers, and, as its name suggests, YACC was not the first “compiler compiler” tool.

Life After Parsing

http://www.semdesigns.com/Products/DMS/LifeAfterParsing.html?Site=Quora

Got My Grammar… uh, now what?

It is astonishing how often people think that the key to building a tool to process a computer (or domain-specific) language is to get a parser.

It’s true… in the same sense that playing poker is all about putting the ante into the pot. If you believe that, the other poker players are going to eat you alive. Yes, you have to ante. No, that’s not where the game is played.

Only the most trivial computer languages are easy to analyze and transform. The compiler community has been working on this problem for 50 years, and one ignores what they have learned at one’s peril. They tell us the following technologies are needed to realistically process computer languages:

  • Lexing and Parsing, … of course. But there are some really nasty problems here, including preprocessors.

  • Representation Capture: A way to represent and capture program structure

  • Symbol Tables: A mapping between identifiers used in the program, and their meaning (and scope).

  • Inference Methods: A way to understand the meaning and consequences of the written code:

    • Simple Fact Collection: Inferences that are easy to extract from the program structure.

    • Control Flow Analysis: Data about the order in which activities occur in a program.

    • Data Flow Analysis: Data about how information flows from one part of the program to another.

    • Symbolic Reasoning: Methods to compute complex consequences of semantics and various information flows.

  • Pattern Matching and Transformation Capability: Means to find program points that look like places of interest, as well as means to convert one program representation instance into others, to provide optimizations or translation to another representation.

  • Pretty-Printing: If the representation is modified, one needs to be able to regenerate valid source code from it.

  • Multi-module and multi-lingual Integration: Inclusion of other programs in the same or another notation to allow one to process systems of code.

Parsing and Program Analysis using only Regular Expressions is hopeless

Many people start by trying to use regular expressions to parse source code. Regular expressions are useful for scanning many kinds of strings with simple structure, but they cannot handle the nested structures typical of a program. This Stack Overflow answer hints at how badly regexes fail for parsing. They are simply not an answer for parsing.
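The core of the problem is nesting: a (true, finite) regular expression cannot check that brackets balance to arbitrary depth, while a counter, a stack, or recursion handles it trivially; grammars capture exactly this kind of structure. A minimal C sketch of the balanced-parentheses check that regexes cannot express:

#include <stdbool.h>

/* Returns true if every '(' in s is matched by a later ')'.
   A finite regular expression cannot express this for unbounded depth;
   a counter (equivalently, a stack or recursion) handles it trivially. */
static bool parens_balanced(const char *s)
{
    int depth = 0;
    for (; *s; ++s) {
        if (*s == '(')
            ++depth;
        else if (*s == ')' && --depth < 0)
            return false;   /* closing paren with no matching opener */
    }
    return depth == 0;      /* no unclosed openers remain */
}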

Parsing is hard … but just a distraction

Tools like YACC, JavaCC, ANTLR, and variants are all about parsing and a little bit about AST building. It is a big effort to choose the “right” (lexer and) parser generator, find or create a grammar, wrestle it into a form acceptable to the parser generator (lookaheads? ambiguities? context sensitivities?), and decide how to build a tree. Then there’s the often surprisingly big struggle to make the parser match what the compilers really accept, as opposed to what everybody knows to be true without any further thought (Hey, C source code doesn’t have any complications, it’s a simple language, right? Can’t be that hard… oops, what’s a trigraph?). Putting together a working parser is enough work that people tend to focus on just getting that right, paying zero attention to what is needed after that. And that makes them lose sight of the full problem.

But success in parsing and tree building is such a small part of the problem that a victory really means very little. Climbing 11,000 feet to the foothills of Everest requires only basic technology: some camping gear and good hiking boots. Any determined clod can get this equipment and climb the foothills. To climb to the peak, it is clear that one needs a completely different set of very sophisticated technology (oxygen tanks are pretty different from hiking boots) and a much different climbing process. Those that attempt the remaining climb with just the basic equipment simply die. Those that have just a parser never really get a useful tool.

This insight led us to build the DMS Software Reengineering Toolkit, to provide as much as possible of that advanced technology. Semantic Designs has spent over 17 years building and enhancing that technology, and proving it on real languages such as COBOL, Java and C++, so that would-be tool builders don’t have to reinvent the infrastructure, and can get on with building the tool they really wanted. (We don’t expect the enhancement process to ever really stop, or even slow down much.) Along the way, we found that even the parsing process could be significantly enhanced.

Climbing Mt. Everest for Tool Builders

Let us consider these additional mechanisms further. First, what drives the need for this? Our programming languages, like our computers, used to be smaller and simpler: FORTRAN66, COBOL, RPG, BASIC, and K&R C ran on small machines, and were all written as ASCII text in a small number of files. Lots of complications have occurred since then:

  • Big, complex languages: C99, Enterprise COBOL, Ada2005, Java7, C#4, C++11, Fortran2005, HTML5, each in a variety of dialects… even C isn’t small (check out GCC and Microsoft’s variants), and Perl is an amazing mess. Language committees can’t resist adding new goodies; there tends to be a “me too” race (Java and C# seem to co-evolve furiously). And companies providing compilers have a vested interest in adding features to create customer lock-ins.

  • Values in a wide variety of forms and precision: IEEE Floating point, decimal numbers, infinite precision rationals, …

  • Preprocessors, with include files, macros, and the ability to conditionally include/not include program fragments

  • Locale-specific character sets, Unicode in several flavors, EBCDIC

  • Embeddings of language fragments into other languages (JavaScript and PHP in HTML, SQL in COBOL, …)

  • Generics, reflection, and metaprogramming are used more frequently to construct software systems

Let’s examine the necessary tool-supporting mechanisms in a bit more detail. What are they, and why are they needed? (This discussion parallels the opening list above):

Lexing and Parsing (still necessary!)

Lexical Capture: It is classic to perform lexical analysis of the source code, e.g., to break up the stream of input characters into tokens that represent the conceptual elements of a program: identifiers, numbers, keywords, comments, strings, even whitespace. That idea still generally holds. It is also classic to build a hand-coded lexer, or to use Flex to accomplish this, processing the ASCII characters in the files that made up all the languages we knew and loved and that, in fact, still dominate Western programming. But for highly evolved legacy languages, and modern languages, you need to process possibly many character representations (Katakana in comments in C, anyone?) including Unicode, convert numbers to floating point without losing precision, capture numbers with possibly arbitrarily large precision, handle keywords that might be identifiers (PL/1 is famous: every keyword can be an identifier!), etc.

DMS provides libraries and tools to handle all of these issues. The DMS lexer handles full Unicode, has support for opening nested streams (to handle include files), etc., and can easily build multi-mode lexers to handle the different lexing needs in the possibly different parts of the language. DMS’s lexers also capture comments, which are extremely useful in program analysis and fundamental for program transformation; users will reject perfectly transformed code that has lost their comments. Similarly, DMS will capture lexical details, such as the radix of numbers and string quote style; programmers reject modified code in which these are lost or changed, too. We don’t know of other production lexer generators that have any or even most of these properties out of the box. (If your foundations aren’t right, how are you going to construct a large building?)

Parsing: It is surprising that the choice of parser is often made without regard to the language being parsed; often it is some parser generator whose major property is that it is convenient to download (e.g., Bison, ANTLR, …). If your language requires a lot of lookahead, an LL(1) parser is going to be a very painful choice; if your language has ambiguities, a parser generator that can only produce one parse will be trouble (consider parsing C++). But when parsing real languages, it is hard to know what issues the parser faces until you are done building the parser, because you don’t know what the grammar will be until it can basically parse everything. And the manuals lie about what is legal. What this suggests is that you should simply use a very strong parsing technology to avoid as much of this pain as you can.

DMS uses a GLR parser, which means it handles arbitrary lookahead and ambiguous grammars with aplomb. Arbitrary semantic reduction conditions allow DMS parsers to eliminate many ambiguous choices while parsing, or the ambiguities can be retained and eliminated later using semantic information. You write a context-free grammar; DMS’s GLR parser can parse it. So in general parsing is hard; with DMS, parsing is a lot easier.

DMS parsers can be configured to collect preprocessor definitions and conditionals without expanding them. This makes it possible to reason about the code as seen by the programmer, with all the preprocessor directives in place. No other tool known to us can do this.

Representation Capture: Real tools require more than just “parsing”; the structure of the program must be captured to enable further processing. Usually an Abstract Syntax Tree (AST) is needed. Typical parser generators force the grammar engineer not only to specify the grammar, but also to explicitly specify how to procedurally build the AST. This essentially (more than) doubles the work, and it makes maintaining the grammar hard because any grammar change requires changes to the AST-building code as well. In contrast, DMS automatically builds a syntax tree, either a concrete tree (“CST”, mainly for debugging, containing all the language tokens), or what amounts to an AST (for production, in which non-value-carrying terminals are eliminated, useless unary productions are removed, and lists are formed). Changes to the grammar automatically cause the CST/AST to change correspondingly. And the tree is always correct with respect to the grammar. It includes precise source line information (file/line/column) as well as comments passed to it by the lexer, attached to the appropriate tree nodes, and it can include preprocessor directives.

Symbol Tables: A mapping between identifiers used in the program, and their meaning (and scope). It is only the grammar which is context-free in most programming languages; the meaning of identifiers is usually context (“scope”) and semantics dependent. Generally, to analyze, reason about, or transform programs one must know what each identifier instance represents, and where and how that identifier is defined. This is traditionally done with a symbol table that maps code regions to sets of identifiers and their definitions. For real languages a symbol table is often relatively complex, because different code regions implicitly designate different sets of identifiers (e.g., local scopes, lexical scopes, namespaces, parameter names, object slots, etc.). So the symbol table must model the relationship of the various “scopes” to enable symbols to be correctly found. Such a symbol table has to be constructed from the program representation; the grammar gives no clue about this.
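As a generic illustration of the idea (this is a plain C sketch, not DMS's symbol table machinery), nested scopes can be modeled as a chain of per-scope symbol lists that are searched innermost-first:

#include <stdlib.h>
#include <string.h>

/* One identifier definition inside a scope (illustrative only). */
typedef struct Symbol {
    const char *name;
    void *definition;            /* type info, declaration node, ... */
    struct Symbol *next;
} Symbol;

/* A scope is a list of symbols plus a link to the enclosing scope. */
typedef struct Scope {
    Symbol *symbols;
    struct Scope *parent;
} Scope;

/* Enter a new nested scope. */
static Scope *pushScope(Scope *parent)
{
    Scope *s = (Scope *)calloc(1, sizeof(Scope));
    if (s != NULL)
        s->parent = parent;
    return s;
}

/* Record a definition in the current scope. */
static void define(Scope *scope, const char *name, void *definition)
{
    Symbol *sym = (Symbol *)malloc(sizeof(Symbol));
    if (sym == NULL)
        return;
    sym->name = name;
    sym->definition = definition;
    sym->next = scope->symbols;
    scope->symbols = sym;
}

/* Look a name up in the current scope, then in the enclosing scopes. */
static Symbol *lookup(Scope *scope, const char *name)
{
    for (; scope != NULL; scope = scope->parent)
        for (Symbol *sym = scope->symbols; sym != NULL; sym = sym->next)
            if (strcmp(sym->name, name) == 0)
                return sym;
    return NULL;
}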

DMS provides a symbol table structure with local mappings (“symbol spaces”) of identifiers to definitions to model scopes. These provide multiple-inheritance support, and have proven strong enough to handle C++ and the many other languages that DMS can presently process. DMS has a library of routines for symbol-definition insertion and fast hashed symbol lookup with overload resolution that can automatically navigate the symbol “spaces” and find the proper definition. DMS provides means to help build symbol tables (see “attribute grammars”, below). No parser generator system known to us provides symbol table structure support or help in building symbol tables; the language engineer typically has to write all of this machinery herself.

Inference Methods: A way to understand the meaning and consequences of the written code, e.g., program analysis. Every program analysis or modification task starts with understanding the code, which requires collections of various kinds of facts about the code structures and how code elements interact. We call such determination of facts “inference”. There are inferences that are easy to extract from the program structure, e.g., the AST. One can programmatically traverse an AST with recursive function calls, and almost any parser generator will provide programmatic access to the nodes in an AST, but this is inconvenient for a large or complex grammar. DMS of course provides such programmatic access, but more importantly provides a variety of other inference mechanisms. (Parser generators never provide any capability for this; the language engineer has to build it all by herself):

Simple Fact Collection: One way to collect information is attribute grammars, which enable easy computation of arbitrary properties of syntax instances as annotations on the grammar rules themselves (a domain-specific language). This makes it very easy to collect complexity metrics. A more interesting example is collection of symbol table data. Since scopes for identifier names tend to follow the program structure, and this structure is defined by the grammar, there are almost always grammar rules which define lexical scopes. So attribute grammars are a nearly ideal way to collect the symbol table information. Coupled with the symbol-insertion capability of the symbol table library, attribute grammars make it relatively easy to build symbol tables from parse trees.
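For the simplest kind of fact, a plain recursive walk over the tree suffices. Using the SExpression type from the calculator example earlier in these notes, counting operator nodes (a toy "metric") looks like this; an attribute grammar expresses the same computation declaratively on the grammar rules instead:

#include "Expression.h"   /* SExpression from the calculator example above */

/* Count the operator (non-leaf) nodes in the syntax tree. */
static int countOperations(const SExpression *e)
{
    if (e == NULL || e->type == eVALUE)
        return 0;
    return 1 + countOperations(e->left) + countOperations(e->right);
}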

Control Flow Analysis: To reason about most (especially procedural) languages, one needs to know in what order and under what conditions actions occur. This requires control flow analysis. DMS implements this using a language-engineer-specified attribute grammar analyzer that understands the control flow semantics of the language syntax, and a library that constructs a control flow graph from information collected by that analyzer. The resulting control flow information encodes a “flowchart” of the code in a function or method, including blocks of actions, conditional branches and exception branches, and operations that occur in unspecified (“parallel”/fork-join) order, and provides information such as dominators and control dependences. Additional support allows one to structure this control flow graph into regions, even if the original program is full of unstructured blocks (e.g., using uncontrolled “goto”). DMS also provides tools to build call graphs (even in the face of indirect pointers, as found in C, and implicitly in OO programs as “objects”, which are really just pointers), using data flow analysis and points-to information (see below).

Data Flow Analysis: How information flows through the program is fundamental to understanding it. What is needed is knowledge about how data flows from one point in the code to another. DMS provides a variety of support for computing data flow information, using built-in iterative solvers operating over the (possibly parallel) control flow graph, operating on arbitrary domains (including bit vectors), using identifier access and update information as determined by another language-engineer-specified attribute grammar. Using these solvers, DMS can determine reaching definitions and use-def chains, and consequently points-to analysis. DMS has been used to compute global points-to analysis for very large code bases. Using the explicitly captured data flow information, it is relatively easy to implement forward and backward program slice tracing.

Symbolic Reasoning: More sophisticated analyzers are useful to reason about code, including methods to compute complex consequences of semantics and various information flows. At present, DMS provides abstract interpretation to compute symbolic range information for values of variables in terms of other program variables. We have recently added Binary Decision Diagram support (and an extension to handle “finite domains”) to allow fast reasoning about complex boolean conditions over discrete choices (e.g., enums), especially of interest in reasoning about preprocessor conditionals.

Pattern Matching and Transformation Capability: It is useful to be able to match arbitrary program fragments, or to generate arbitrary program fragments. DMS provides a variety of support for this:

Surface-syntax pattern matching: DMS does this with surface-syntax patterns, in which one writes code fragments in the target notation with placeholders. DMS can match such patterns against ASTs it has parsed, binding placeholders to corresponding subtrees. All one has to do to get this capability is define a parser to DMS. The patterns can also rely entirely or partly on custom-coded tree traversals.

Associative/Commutative matching: Patterns can match on associative and commutative operators that are declared in the grammar. (No other available program transformation system known to us does this.)

Pattern-directed code generation: The same patterns can be used to generate ASTs that instantiate the patterns with pre-supplied subtrees, allowing one to build arbitrarily complicated tree fragments easily, and to compose them into even larger trees.

Source to source transformations: One can also write optimizations or transforms on code, using rewrite rules consisting of pattern pairs, one pattern for matching and one for replacement when a match is found.
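In miniature, the effect of such a rewrite rule can be shown by hand on the SExpression tree from the calculator example: a rule like "constant + constant rewrites to constant" becomes a match on the tree shape plus construction of the replacement node. DMS states this declaratively as a pattern pair; the C below is only a sketch of the effect.

#include "Expression.h"   /* SExpression, createNumber, deleteExpression */

/* Fold "number + number" and "number * number" subtrees into a single
   number node, bottom-up: a hand-coded stand-in for a rewrite rule. */
SExpression *foldConstants(SExpression *e)
{
    if (e == NULL || e->type == eVALUE)
        return e;

    e->left  = foldConstants(e->left);
    e->right = foldConstants(e->right);

    if (e->left != NULL && e->right != NULL &&
        e->left->type == eVALUE && e->right->type == eVALUE) {
        int v = (e->type == eADD) ? e->left->value + e->right->value
                                  : e->left->value * e->right->value;
        SExpression *folded = createNumber(v);
        if (folded == NULL)
            return e;            /* out of memory: keep the original tree */
        deleteExpression(e);     /* frees this node and its two leaf children */
        return folded;
    }
    return e;
}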

Cross language translation: Since rewrite rules can use surface syntax of one language in the match-pattern, and surface syntax of another in the replace-pattern, one can straightforwardly write language-to-language transformations.

Data flow pattern matching: Nearing release, we are adding the ability to match patterns over dataflows, enabling a semantic matching capability. This is extremely useful in recognizing abstract code idioms that are scattered about in real implementation code. It can recognize variants produced by alternative design choices whose presence is only implicit in the code, as well as identifying the specific design choice configuration that enables a match.

Pretty-Printing: If code transformations are applied to improve the code, one needs to be able to regenerate valid source code from the modified representations. DMS provides this ability in the form of prettyprinting rules, which specify how the AST should be regenerated in terms of its original layout (“fidelity printing”) or in terms of nested, indented text boxes (“pretty printing”). These capabilities can also be used to show smaller subtrees. The prettyprinter regenerates accurate numerical (especially floating point or infinite precision) values, lexical formatting information (string quotes, radix, character escape codes, etc.), as well as comments, preprocessor macro declarations, includes, ifdefs and macro calls.
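In miniature again, regenerating source text from the calculator's SExpression tree is a recursive print that re-inserts operators and any parentheses required by precedence; fidelity printing would additionally reuse the captured original layout and comments. A sketch:

#include "Expression.h"
#include <stdio.h>

/* Regenerate an expression's source text; parenthesize additions that
   appear under a multiplication so that precedence is preserved. */
static void printExpression(const SExpression *e, int parentIsProduct)
{
    if (e == NULL)
        return;
    switch (e->type) {
        case eVALUE:
            printf("%d", e->value);
            break;
        case eADD:
            if (parentIsProduct) printf("(");
            printExpression(e->left, 0);
            printf(" + ");
            printExpression(e->right, 0);
            if (parentIsProduct) printf(")");
            break;
        case eMULTIPLY:
            printExpression(e->left, 1);
            printf(" * ");
            printExpression(e->right, 1);
            break;
    }
}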

Multi-module and multi-lingual Integration: Real software systems are not coded entirely in one language (Java? What, no SQL or HTML?). As a practical matter to process large systems of code, a good tool must be able to parse all the sublanguages and tie information together across the elements. DMS can host multiple language front ends/symbol tables/flow analyzers/prettyprinters simultaneously.

Bottom line: A parser simply isn’t enough to do serious work. DMS is designed to be enough. The payoff is the ability to build real, useful, and robust tools with modest resources and time, as opposed to the common tool failure of “succeeding in just building a parser”.

Bonus: Pretested Language FrontEnds

DMS has many predefined language front ends which have been developed and honed over the last decade. These front ends include grammars, prettyprinters, and, to varying degrees, symbol tables, flow analysis, etc. For real languages such as COBOL, Java, C# and C++, you are much better off getting one (or more) of SD’s tested language front ends than trying to define the grammar, let alone the full front end, yourself, or attempting to bend another one that hasn’t been used for serious tasks.

Double Bonus: Faster answers through Parallel foundations

Programs are getting bigger (millions of lines of code), and people (programmers and their managers) want answers fast. One way to get faster answers is to use parallelism. Cycles are an engineer’s best friend.

All of the DMS machinery is implemented in a parallel programming language, PARLANSE which is designed to handle multicore irregular (“task”) parallel computation. The DMS implementation provides for thread-safe access and update to all the structures to enable multiple analyses and transformations to occur. This allows DMS to scalably process very large systems of code, using the multiple CPUs available in virtually every desktop workstation that programmers have. We have not seen any other parser generator, program analysis, or program transformation tool that provides this capability.

It should be obvious that defining a full set of tools for processing languages is difficult. Defining such a set with parallel foundations is something a language engineer would not consider. But it is available built into DMS.