How to build Lexical Analyzer (or "Lexer") for a
specific subset of the C programming language
Problem Statement:
You are required to design and implement a simple lexical analyzer for a basic
programming language that
consists of the following elements:
1. Keywords: if, else, while, return, int, float
2. Identifiers: Sequences of letters and digits that begin with a letter.
3. Operators: +, -, *, /, =, ==, !=, >, <, >=, <=
4. Delimiters: (, ), {, }, ;
5. Constants: Integer and floating-point numbers.
6. Comments: Single-line comments that begin with //.
Your program may be written in C, C++, Java or any other programming language.
Solution:
Code:
Explanation:
The main goal of this assignment was to build a Lexical Analyzer (or "Lexer") for a
specific subset of the C programming language. Essentially, the lexer acts as the first
step for a compiler. Its job is to read through a source code file, character by character,
and group those characters into meaningful chunks called "tokens." For example, it
needs to distinguish between a reserved keyword like while, a variable name like
user_count, or a mathematical operator like +. I also had to make sure the program
could handle numbers correctly, including more complex ones like decimals or those
using scientific notation (like 1.2E+5).
Implementation and Logic
Initially, the logic was designed using a manual approach where I had to keep track of
every single "state" the program was in. If the program saw a letter, it moved to an
"identifier" state; if it saw a digit, it moved to a "number" state. This was done using a
system called a Deterministic Finite Automaton (DFA). Later, I moved this logic over to
Flex, which is a professional tool specifically made for generating lexers. Instead of
writing long "if-else" chains or switch statements, Flex allowed me to use simple
patterns (regular expressions) to define what each token looks like. This made the code
much shorter and much more reliable.
How it Works
When the program runs, it scans the input file and matches the text against the rules I
defined. Every time it finds a match, it prints out the "Class" (the type of token) and the
"Lexeme" ( the actual text found). If it runs into a character it doesn't recognize,like a
random symbol that shouldn't be there,it labels it as an "Error" and keeps going. I also
made sure that the program ignores things that don't matter to a compiler, like extra
spaces or comments, so they don't get in the way of the actual logic.
Symbol Table and Results:
As a final step, every time the lexer finds a new variable name (an identifier), it saves it
into a "Symbol Table." This table keeps track of the name of the variable and the line
number where it was first seen. This is useful because, in a real compiler, we would
need this list later to make sure variables are being used correctly. The final output of
the project is a clean list of all tokens found in the input file, followed by a summary of all
the identifiers that were stored during the process.
Combined DFA:
The diagram represents a Unified Deterministic Finite Automaton (DFA) designed to
tokenize source code by transitioning between states based on input characters. It
effectively categorizes the input into five primary token classes: Keywords/Identifiers,
Constants, Operators, Delimiters, and Comments .
1. Identifiers & Keywords: Starting with a letter, the DFA moves to ID_START. It
continues to loop on letters or digits until an other character is encountered,
reaching the IDENTIFIER state. A lookup table is then used to differentiate
between reserved Keywords (e.g., if, while) and user-defined names .
2. Constants (Integer/Float): The DFA handles numeric values through states INT,
DOT, and FRAC. It supports scientific notation via the EXP (Exponent) branch,
ensuring that both simple integers and complex floating-point numbers are
recognized as CONSTANTS.
3. Operators & Relational Ops: Single-character operators (like +, -, *) transition
directly to the OPERATOR state. Multi-character relational operators (like <=, !=)
are handled via the R_START path, which checks for an optional = to finalize a
RELOP token.
4. Delimiters: Fixed characters such as (, ), {, }, and ; are recognized
immediately from the Start state and categorized as DELIMITERS.
5. Comments: When a / is followed by another /, the DFA enters a
COMMENT_START loop. It consumes all characters until a newline (\n) is
reached, at which point the token is discarded as per the assignment logic.
6. Error Handling: Any character that does not fit a defined transition (labeled
invalid) leads to the ERR state, triggering an error message with the
character's position.








0 Comments