How to build Lexical Analyzer (or "Lexer") for a
specific subset of the C programming language

Problem Statement:

You are required to design and implement a simple lexical analyzer for a basic

programming language that

consists of the following elements:

1. Keywords: if, else, while, return, int, float

2. Identifiers: Sequences of letters and digits that begin with a letter.

3. Operators: +, -, *, /, =, ==, !=, >, <, >=, <=

4. Delimiters: (, ), {, }, ;

5. Constants: Integer and floating-point numbers.

6. Comments: Single-line comments that begin with //.

Your program may be written in C, C++, Java or any other programming language.

Solution:

Code:

Explanation:

The main goal of this assignment was to build a Lexical Analyzer (or "Lexer") for a

specific subset of the C programming language. Essentially, the lexer acts as the first

step for a compiler. Its job is to read through a source code file, character by character,

and group those characters into meaningful chunks called "tokens." For example, it

needs to distinguish between a reserved keyword like while, a variable name like

user_count, or a mathematical operator like +. I also had to make sure the program

could handle numbers correctly, including more complex ones like decimals or those

using scientific notation (like 1.2E+5).

Implementation and Logic

Initially, the logic was designed using a manual approach where I had to keep track of

every single "state" the program was in. If the program saw a letter, it moved to an

"identifier" state; if it saw a digit, it moved to a "number" state. This was done using a

system called a Deterministic Finite Automaton (DFA). Later, I moved this logic over to

Flex, which is a professional tool specifically made for generating lexers. Instead of

writing long "if-else" chains or switch statements, Flex allowed me to use simple

patterns (regular expressions) to define what each token looks like. This made the code

much shorter and much more reliable.

How it Works

When the program runs, it scans the input file and matches the text against the rules I

defined. Every time it finds a match, it prints out the "Class" (the type of token) and the

"Lexeme" ( the actual text found). If it runs into a character it doesn't recognize,like a

random symbol that shouldn't be there,it labels it as an "Error" and keeps going. I also

made sure that the program ignores things that don't matter to a compiler, like extra

spaces or comments, so they don't get in the way of the actual logic.

Symbol Table and Results:

As a final step, every time the lexer finds a new variable name (an identifier), it saves it

into a "Symbol Table." This table keeps track of the name of the variable and the line

number where it was first seen. This is useful because, in a real compiler, we would

need this list later to make sure variables are being used correctly. The final output of

the project is a clean list of all tokens found in the input file, followed by a summary of all

the identifiers that were stored during the process.

Combined DFA:

The diagram represents a Unified Deterministic Finite Automaton (DFA) designed to

tokenize source code by transitioning between states based on input characters. It

effectively categorizes the input into five primary token classes: Keywords/Identifiers,

Constants, Operators, Delimiters, and Comments .

1. Identifiers & Keywords: Starting with a letter, the DFA moves to ID_START. It

continues to loop on letters or digits until an other character is encountered,

reaching the IDENTIFIER state. A lookup table is then used to differentiate

between reserved Keywords (e.g., if, while) and user-defined names .

2. Constants (Integer/Float): The DFA handles numeric values through states INT,

DOT, and FRAC. It supports scientific notation via the EXP (Exponent) branch,

ensuring that both simple integers and complex floating-point numbers are

recognized as CONSTANTS.

3. Operators & Relational Ops: Single-character operators (like +, -, *) transition

directly to the OPERATOR state. Multi-character relational operators (like <=, !=)

are handled via the R_START path, which checks for an optional = to finalize a

RELOP token.

4. Delimiters: Fixed characters such as (, ), {, }, and ; are recognized

immediately from the Start state and categorized as DELIMITERS.

5. Comments: When a / is followed by another /, the DFA enters a

COMMENT_START loop. It consumes all characters until a newline (\n) is

reached, at which point the token is discarded as per the assignment logic.

6. Error Handling: Any character that does not fit a defined transition (labeled

invalid) leads to the ERR state, triggering an error message with the

character's position.

How to build Lexical Analyzer (or "Lexer") for a specific subset of the C programming language

How to build Lexical Analyzer (or "Lexer") for a
specific subset of the C programming language

Solution:

Posted by Admin

Post a Comment

0 Comments

Popular Posts

Gaming giants in talks with Immutable to launch token: Web3 Gamer

, The Devil Takes Bitcoin, a wild history of Mt. Gox and Silk Road

Worldcoin’s less ‘dystopian,’ more cypherpunk rival: Billions Network

Categories

Recent Posts

Featured Post

Who was Mt. Gox ? How he lost worth Billion dollars Bitcoins?

SEO

Menu Footer Widget

Contact form

Advertisement

How to build Lexical Analyzer (or "Lexer") for a specific subset of the C programming language

How to build Lexical Analyzer (or "Lexer") for aspecific subset of the C programming language

Solution:

Posted by Admin

You may like these posts

Post a Comment

0 Comments

Social Plugin

Popular Posts

Gaming giants in talks with Immutable to launch token: Web3 Gamer

, The Devil Takes Bitcoin, a wild history of Mt. Gox and Silk Road

Worldcoin’s less ‘dystopian,’ more cypherpunk rival: Billions Network

Categories

Recent Posts

Featured Post

Who was Mt. Gox ? How he lost worth Billion dollars Bitcoins?

SEO

Menu Footer Widget

Contact form

How to build Lexical Analyzer (or "Lexer") for a
specific subset of the C programming language