This content originally appeared on DEV Community and was authored by Sarun Tapee
To implement a lexical parser, we will use a tool called Fast Lexical Analyzer Generator (flex or lex) to perform pattern-matching on text.
You can install the tool using the following command
sudo apt update
sudo apt upgrade
sudo apt install flex
First, we need to create a file to feed to the lex program, and then it will generate a C language file. You can name the file any name you like. I will go with parser.l
. The syntax of the file is as follows.
%{
// section 1
%}
%%
// section 2
%%
// section 3 (optional)
Section 1 begins with %{
and ends with %}
. In this section, you can write any C code.
Section 2 begins with %%
and ends with %%
. In this section, you will write a regular expression to match the string input.
Section 3 is optional. You can write a main
function here if you want to create a program.
Example
Let’s make our parser file to parse the input string and identify if the input is a string, integer, or double format.
%{
extern "C" int yylex();
#include <stdio.h>
#define MAX_STR_SIZE 512
char lex_buffer[MAX_STR_SIZE];
%}
%%
\n {
return 0;
}
0|-?[1-9][0-9]* {
return 1;
}
-?[0-9]*\.[0-9]+ {
return 2;
}
[a-zA-Z0-9_]+ {
return 3;
}
[ ] {
// ignore space
}
. {
// catch any character
printf("ignore special character: %s\n", yytext);
}
%%
int main(void) {
int code;
while (1) {
printf("input -> ");
fgets(lex_buffer, sizeof(lex_buffer), stdin);
yy_scan_string(lex_buffer);
code = yylex();
while (code != 0) {
printf("code: %d, text: %s, len: %d\n", code, yytext, yyleng);
code = yylex();
}
}
}
The API of the Lex tool is shown in the following table.
API | desc |
---|---|
Function | |
void yy_scan_string(char *buffer) | feed input string to lex |
int yylex() | match a pattern that is defined as a regular expression in section 2 and return an integer value |
Global Variable | |
yytext | points to the string that is just being processed by yylex() |
yyleng | hold length of yytext |
Code Explanation
In section 1, we have to declare the yylex()
function due to an issue with the lex tool. Then define lex_buffer
to hold a string input.
In section 2, we define a regular expression to tokenize the string input. The return type should be an integer or void if you want to ignore the matching string. In our code, we define 6 rules for the string matching since the body of code will be run when we call yylex
function, we can put any logic in here (e.g., output the message in the last rule).
The rule that we define is very simple. Starting from a new line \n
, which will identify the end of input if there are more characters following a new line, we just ignore it. The second rule is to match an integer. The third rule is to match doubles. The fourth rule is to match an identifier (e.g., “this_is_variable_1”, “anotherVariable2”). The fifth rule is to ignore space so that we leave the body empty. And the last rule is to catch all the sequences that do not match the rules above it.
Lastly, in section 3, we implement the main
function to loop forever to take an input from stdin
and store it in the global variable lex_buffer
. Then we call yy_scan_string
to set the input string to lex parser. Finally, we keep calling yylex
to process the lex_buffer
until we find the new line character.
Compliation
Generate C source code by using the lex tool as follows.
lex parser.l
You will get a lex.yy.c
file, then compile the source file to an executable.
g++ -o exe lex.yy.c -lfl
-lfl
flag is used to link the lex library. If there is a compilation error, go fix your parser.l
not lex.yy.c
file
Then you can test our program.
./exe
input -> hello42 12 4.2&@ >_<
code: 3, text: hello42, len: 7
code: 1, text: 12, len: 2
code: 2, text: 4.2, len: 3
ignore special character: &
ignore special character: @
ignore special character: >
code: 3, text: _, len: 1
ignore special character: <
input ->
Matching Algorithm
It is good to know how the lex tool matches the input string. It will try to match as long as possible for each rule. If the string matches with more than one rule, the rule listed first will win.
This content originally appeared on DEV Community and was authored by Sarun Tapee