This content originally appeared on DEV Community and was authored by Developer Service
Title: Write a Programming Language in a Weekend (Seriously) With Python
Subtitle: Build a toy language from scratch and understand lexing, parsing, and interpreting — all in plain Python.
Introduction
Ever dreamt of creating your own programming language, but figured that was something only compiler geeks or professors could pull off?
Think again.
In this article, you’ll learn how to write your own toy programming language in a single weekend, using nothing but Python and a bit of brainpower. No compilers, no scary grammar tools, just regular Python code, a few regular-expression (re) patterns, and a dose of curiosity.
You won’t be building the next JavaScript or Rust (yet), but you will build a working interpreter that can understand code like this:
let x = 10;
print(x * 2 + 1);
And the best part? You’ll understand how it works at every stage: converting text into tokens, building an Abstract Syntax Tree (AST), and walking that tree to evaluate results. It’s like writing a mini-brain for your language, and it’s deeply satisfying.
Let’s get started. Your language awaits.
The full source code is available at: https://github.com/nunombispo/ProgrammingLanguage-Article
Step 1: Design Your Language
Before we write a single line of Python code for our new language interpreter, we need to answer a simple question:
What kind of language are we building?
We’re not aiming to replace Python or create a full-fledged compiler. Our goal is to create a simple, interpreted, expression-based language that supports:
- Variable declarations using let
- Basic arithmetic (+, -, *, /)
- A built-in print() function
- Script-style execution (no functions or conditionals, at least not yet)
Creating a language takes four stages: source code, tokenizer, parser (AST), and interpreter (output).
In this first step, we look at the source code, that is, the syntax our language will accept.
Syntax Design
Here’s the minimal syntax we’ll support:
let x = 5;
let y = x + 10;
print(y);
In English, this means:
- Declare a variable x and set it to 5
- Declare another variable y and set it to x + 10
- Print the value of y
Each statement ends with a semicolon (;), similar to JavaScript or C.
Grammar Overview
To build a parser later, we’ll need a rough idea of the grammar. Here’s a simplified version:
program ::= statement*
statement ::= "let" IDENTIFIER "=" expression ";"
| "print" "(" expression ")" ";"
expression ::= term (("+" | "-") term)*
term ::= factor (("*" | "/") factor)*
factor ::= NUMBER | IDENTIFIER | "(" expression ")"
This grammar:
- Is written in EBNF-style notation (Extended Backus-Naur Form)
- Defines how statements and expressions are structured
- Handles operator precedence (i.e., * and / are evaluated before + and -)
- Supports grouping with parentheses
Don’t worry if this looks unfamiliar. We’ll break this down step-by-step as we build the tokenizer, parser, and interpreter.
Just keep in mind that this grammar defines the whole structure of our language: a program is a sequence of let and print statements, and each statement is built from arithmetic expressions.
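To see what "handles operator precedence" means in practice, here is a small sketch that uses Python's own built-in ast module as a stand-in. Python's grammar layers sums over products in the same way, so the trees it builds have the shape our grammar should produce. This only illustrates the idea; it is not part of the toy language's code.

```python
import ast

# "1 + 2 * 3": the multiplication should sit deeper in the tree than the addition.
root = ast.parse("1 + 2 * 3", mode="eval").body
print(type(root.op).__name__)        # Add
print(type(root.right.op).__name__)  # Mult

# Parentheses override precedence: now the addition is the deeper node.
grouped = ast.parse("(1 + 2) * 3", mode="eval").body
print(type(grouped.op).__name__)       # Mult
print(type(grouped.left.op).__name__)  # Add
```

The deeper a node sits in the tree, the earlier it is evaluated, which is why the term rule (for * and /) is nested inside the expression rule (for + and -).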
Step 2: Tokenizer (Lexer)
Now that we’ve defined our language’s syntax, it’s time to build the first real component: a tokenizer, also known as a lexer.
Recall the four stages of creating a language: source code, tokenizer, parser (AST), and interpreter (output).
In this second step, we look at the tokenizer.
What Is a Tokenizer?
A tokenizer breaks your source code (plain text) into a sequence of meaningful tokens, small labelled pieces like keywords, identifiers, numbers, and symbols.
For example, given this line of code:
let x = 5 + 2;
The tokenizer should return something like:
[
('LET', 'let'),
('IDENT', 'x'),
('EQUALS', '='),
('NUMBER', '5'),
('PLUS', '+'),
('NUMBER', '2'),
('SEMICOLON', ';')
]
These tokens make it easier for the parser (in step 3) to understand what’s going on.
Building the Tokenizer in Python
We’ll use Python’s built-in re (regular expressions) module to match patterns for each token type.
Let’s define the token types and write a simple lexer:
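import re

# Define token types and regex patterns.
# Order matters: keywords are listed before IDENT so they are tried first.
# The trailing \b stops a keyword from matching the start of a longer
# identifier such as "letter" or "printer".
TOKEN_TYPES = [
    ('LET',       r'let\b'),
    ('PRINT',     r'print\b'),
    ('NUMBER',    r'\d+'),
    ('IDENT',     r'[a-zA-Z_][a-zA-Z0-9_]*'),
    ('EQUALS',    r'='),
    ('PLUS',      r'\+'),
    ('MINUS',     r'-'),
    ('TIMES',     r'\*'),
    ('DIVIDE',    r'/'),
    ('LPAREN',    r'\('),
    ('RPAREN',    r'\)'),
    ('SEMICOLON', r';'),
    ('SKIP',      r'[ \t]+'),  # ignore spaces and tabs
    ('NEWLINE',   r'\n'),
]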
Now let’s write the function to match and extract these tokens:
def tokenize(code):
    tokens = []
    index = 0
    while index < len(code):
        match = None
        # Try each pattern in order at the current position; the first match wins.
        for token_type, pattern in TOKEN_TYPES:
            regex = re.compile(pattern)
            match = regex.match(code, index)
            if match:
                text = match.group(0)
                # Whitespace and newlines are consumed but not emitted.
                if token_type != 'SKIP' and token_type != 'NEWLINE':
                    tokens.append((token_type, text))
                index = match.end(0)
                break
        if not match:
            raise SyntaxError(f'Unexpected character: {code[index]}')
    return tokens
Example
Let’s test it:
code = "let x = 5 + 2;"
print(tokenize(code))
Output:
[('LET', 'let'), ('IDENT', 'x'), ('EQUALS', '='), ('NUMBER', '5'), ('PLUS', '+'), ('NUMBER', '2'), ('SEMICOLON', ';')]
You’ve got a working tokenizer!
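As a variation, the same token table can be folded into one combined pattern with named groups, so the regex engine scans the input in a single pass. This is a sketch of that common idiom, not the article's implementation. The names TOKEN_SPEC, MASTER, tokenize_single_pass, and the catch-all MISMATCH entry are assumptions introduced here, and the keyword patterns use a \b word boundary so that an identifier like letter is not split into let and ter.

```python
import re

# Same token table as before, plus a catch-all for anything unrecognised.
TOKEN_SPEC = [
    ("LET", r"let\b"),
    ("PRINT", r"print\b"),
    ("NUMBER", r"\d+"),
    ("IDENT", r"[a-zA-Z_][a-zA-Z0-9_]*"),
    ("EQUALS", r"="),
    ("PLUS", r"\+"),
    ("MINUS", r"-"),
    ("TIMES", r"\*"),
    ("DIVIDE", r"/"),
    ("LPAREN", r"\("),
    ("RPAREN", r"\)"),
    ("SEMICOLON", r";"),
    ("SKIP", r"[ \t]+"),
    ("NEWLINE", r"\n"),
    ("MISMATCH", r"."),  # must stay last: matches any other single character
]

# One alternation of named groups; alternatives are tried left to right.
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))

def tokenize_single_pass(code):
    tokens = []
    for m in MASTER.finditer(code):
        kind = m.lastgroup  # name of the group that matched
        if kind in ("SKIP", "NEWLINE"):
            continue
        if kind == "MISMATCH":
            raise SyntaxError(f"Unexpected character: {m.group()}")
        tokens.append((kind, m.group()))
    return tokens

print(tokenize_single_pass("let letter = 5;"))
# [('LET', 'let'), ('IDENT', 'letter'), ('EQUALS', '='), ('NUMBER', '5'), ('SEMICOLON', ';')]
```

Both versions should produce the same token list for valid input; the loop-based version is easier to step through, while this one avoids re-trying every pattern by hand.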
Step 3: Building a Parser (AST Generator)
Now that we can tokenize our code, it’s time to make sense of those tokens. This is where the parser comes in.
Recall the four stages of creating a language: source code, tokenizer, parser (AST), and interpreter (output).
In this third step, we look at the parser and the AST it produces.
What Is a Parser?
A parser reads the list of tokens and builds an Abstract Syntax Tree (AST), which is a structured, hierarchical representation of the code.
Take this input:
let x = 5 + 2;
The tokenizer gives us:
[('LET', 'let'), ('IDENT', 'x'), ('EQUALS', '='), ('NUMBER', '5'), ('PLUS', '+'), ('NUMBER', '2'), ('SEMICOLON', ';')]
The parser turns this into an AST like:
[
    LetStatement(
        name="x",
        value=BinaryOp(
            left=Number(value=5),
            op="+",
            right=Number(value=2)
        )
    )
]
Let’s build that.
Define AST Nodes
We’ll define a few Python classes to represent different AST node types:
class Number:
    def __init__(self, value):
        self.value = int(value)

    def __repr__(self):
        return f"Number(value={self.value})"


class Identifier:
    def __init__(self, name):
        self.name = name

    def __repr__(self):
        return f"Identifier(name={self.name})"


class BinaryOp:
    def __init__(self, left, op, right):
        self.left = left
        self.op = op
        self.right = right

    def __repr__(self):
        return f"BinaryOp(left={self.left}, op={self.op}, right={self.right})"


class LetStatement:
    def __init__(self, name, value):
        self.name = name
        self.value = value

    def __repr__(self):
        return f"LetStatement(name={self.name}, value={self.value})"


class PrintStatement:
    def __init__(self, expr):
        self.expr = expr

    def __repr__(self):
        return f"PrintStatement(expr={self.expr})"
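If the hand-written __init__ and __repr__ methods feel repetitive, the same node types could be sketched with the standard-library dataclasses module. This is an alternative, not what the article's code uses, and it behaves slightly differently: the generated repr quotes strings, and Number no longer converts its argument with int(), so the parser would have to pass an integer.

```python
from dataclasses import dataclass

@dataclass
class Number:
    value: int

@dataclass
class Identifier:
    name: str

@dataclass
class BinaryOp:
    left: object   # any expression node
    op: str
    right: object  # any expression node

@dataclass
class LetStatement:
    name: str
    value: object  # any expression node

@dataclass
class PrintStatement:
    expr: object   # any expression node

node = LetStatement("x", BinaryOp(Number(5), "+", Number(2)))
print(node)
# LetStatement(name='x', value=BinaryOp(left=Number(value=5), op='+', right=Number(value=2)))
```

Dataclasses also generate __eq__, so two trees with the same shape compare equal, which is handy when writing tests for a parser.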
Create the Parser Class
We’ll make a simple recursive descent parser that consumes tokens one by one and builds AST nodes.
class Parser:
    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def current(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else ('EOF', '')

    def eat(self, token_type):
        if self.current()[0] == token_type:
            self.pos += 1
        else:
            raise SyntaxError(f'Expected {token_type}, got {self.current()}')

    def parse(self):
        statements = []
        while self.current()[0] != 'EOF':
            if self.current()[0] == 'LET':
                statements.append(self.parse_let())
            elif self.current()[0] == 'PRINT':
                statements.append(self.parse_print())
            else:
                raise SyntaxError(f'Unexpected token: {self.current()}')
        return statements
Parse let and print Statements
    # Methods of the Parser class (continued)
    def parse_let(self):
        self.eat('LET')
        name = self.current()[1]
        self.eat('IDENT')
        self.eat('EQUALS')
        expr = self.parse_expression()
        self.eat('SEMICOLON')
        return LetStatement(name, expr)

    def parse_print(self):
        self.eat('PRINT')
        self.eat('LPAREN')
        expr = self.parse_expression()
        self.eat('RPAREN')
        self.eat('SEMICOLON')
        return PrintStatement(expr)
Parse Expressions (with Operator Precedence)
    # Methods of the Parser class (continued)
    def parse_expression(self):
        node = self.parse_term()
        while self.current()[0] in ('PLUS', 'MINUS'):
            op = self.current()[1]
            self.eat(self.current()[0])
            right = self.parse_term()
            node = BinaryOp(node, op, right)
        return node

    def parse_term(self):
        node = self.parse_factor()
        while self.current()[0] in ('TIMES', 'DIVIDE'):
            op = self.current()[1]
            self.eat(self.current()[0])
            right = self.parse_factor()
            node = BinaryOp(node, op, right)
        return node

    def parse_factor(self):
        token_type, token_value = self.current()
        if token_type == 'NUMBER':
            self.eat('NUMBER')
            return Number(token_value)
        elif token_type == 'IDENT':
            self.eat('IDENT')
            return Identifier(token_value)
        elif token_type == 'LPAREN':
            self.eat('LPAREN')
            expr = self.parse_expression()
            self.eat('RPAREN')
            return expr
        else:
            raise SyntaxError(f'Unexpected factor: {self.current()}')
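To make the effect of these three methods easier to see, here is a stripped-down, standalone sketch of the same expression/term/factor layering. It is not the Parser class above: the functions parse_expr, parse_term, and parse_factor are hypothetical stand-ins that work on a plain list of strings and return nested tuples instead of AST nodes. It shows that the while loops produce left-associative trees and that * binds tighter than +.

```python
def parse_expr(tokens, pos=0):
    """expression ::= term (("+" | "-") term)*"""
    node, pos = parse_term(tokens, pos)
    while pos < len(tokens) and tokens[pos] in ("+", "-"):
        op = tokens[pos]
        right, pos = parse_term(tokens, pos + 1)
        node = (node, op, right)  # fold to the left: earlier operators nest deeper
    return node, pos

def parse_term(tokens, pos):
    """term ::= factor (("*" | "/") factor)*"""
    node, pos = parse_factor(tokens, pos)
    while pos < len(tokens) and tokens[pos] in ("*", "/"):
        op = tokens[pos]
        right, pos = parse_factor(tokens, pos + 1)
        node = (node, op, right)
    return node, pos

def parse_factor(tokens, pos):
    """factor ::= NUMBER | IDENTIFIER | "(" expression ")" """
    if pos >= len(tokens):
        raise SyntaxError("Unexpected end of input")
    tok = tokens[pos]
    if tok == "(":
        node, pos = parse_expr(tokens, pos + 1)
        if pos >= len(tokens) or tokens[pos] != ")":
            raise SyntaxError("Expected )")
        return node, pos + 1
    if tok.isdigit():
        return int(tok), pos + 1
    if tok.isidentifier():
        return tok, pos + 1
    raise SyntaxError(f"Unexpected factor: {tok}")

print(parse_expr("10 - 4 - 3".split())[0])     # ((10, '-', 4), '-', 3)
print(parse_expr("1 + 2 * 3".split())[0])      # (1, '+', (2, '*', 3))
print(parse_expr("( 1 + 2 ) * 3".split())[0])  # ((1, '+', 2), '*', 3)
```

Left associativity matters for subtraction and division: 10 - 4 - 3 must group as (10 - 4) - 3, which gives 3, rather than 10 - (4 - 3), which gives 9.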
Test It
code = """
let x = 5 + 2;
print(x);
"""
from pprint import pprint
from tokenizer import tokenize
tokens = tokenize(code)
parser = Parser(tokens)
ast = parser.parse()
pprint(ast)
You should see a structured tree of LetStatement and PrintStatement nodes, like this:
[LetStatement(name=x, value=BinaryOp(left=Number(value=5), op=+, right=Number(value=2))),
 PrintStatement(expr=Identifier(name=x))]
Let’s beautify it for readability:
[
    LetStatement(
        name="x",
        value=BinaryOp(
            left=Number(value=5),
            op="+",
            right=Number(value=2)
        )
    ),
    PrintStatement(
        expr=Identifier(name="x")
    )
]
This is exactly what your interpreter will need next.
Step 4: Evaluating the AST (Running Your Language)
You’ve built a tokenizer and a parser that gives you an abstract syntax tree (AST). Now it’s time to execute that tree, just like a real programming language does.
Recall the four stages of creating a language: source code, tokenizer, parser (AST), and interpreter (output).
In this fourth and final step, we look at the interpreter and its output.
Interpreter Basics
An interpreter is a component that:
- Walks the AST.
- Evaluates each node.
- Keeps track of variables (in memory).
- Produces side effects (like printing output).
The Environment
We need a place to store variable values:
class Environment:
    def __init__(self):
        self.vars = {}

    def set_var(self, name, value):
        self.vars[name] = value

    def get_var(self, name):
        if name in self.vars:
            return self.vars[name]
        raise NameError(f"Variable '{name}' not defined")
The Interpreter
We’ll walk through each statement and expression recursively.
class Interpreter:
    def __init__(self):
        self.env = Environment()

    def eval(self, node):
        if isinstance(node, Number):
            return node.value
        elif isinstance(node, Identifier):
            return self.env.get_var(node.name)
        elif isinstance(node, BinaryOp):
            left = self.eval(node.left)
            right = self.eval(node.right)
            if node.op == '+':
                return left + right
            elif node.op == '-':
                return left - right
            elif node.op == '*':
                return left * right
            elif node.op == '/':
                return left // right  # integer division
            else:
                raise RuntimeError(f"Unknown operator: {node.op}")
        elif isinstance(node, LetStatement):
            value = self.eval(node.value)
            self.env.set_var(node.name, value)
        elif isinstance(node, PrintStatement):
            value = self.eval(node.expr)
            print(value)
        else:
            raise RuntimeError(f"Unknown node: {node}")
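The chain of if/elif branches for operators could also be written as a lookup table. The sketch below shows that idea in isolation, using the standard-library operator module. It works on nested tuples and a plain dict rather than the AST classes and Environment above, and the names OPS and evaluate are assumptions made for this illustration only.

```python
import operator

# Map each operator symbol to the function that implements it.
OPS = {
    "+": operator.add,
    "-": operator.sub,
    "*": operator.mul,
    "/": operator.floordiv,  # same integer division as `//`
}

def evaluate(node, env):
    """Evaluate an int, a variable name (str), or a (left, op, right) tuple."""
    if isinstance(node, int):
        return node
    if isinstance(node, str):
        if node not in env:
            raise NameError(f"Variable '{node}' not defined")
        return env[node]
    left, op, right = node
    if op not in OPS:
        raise RuntimeError(f"Unknown operator: {op}")
    return OPS[op](evaluate(left, env), evaluate(right, env))

env = {"a": 10}
print(evaluate(("a", "+", (20, "*", 2)), env))  # 50
print(evaluate((7, "/", 2), env))               # 3
print(evaluate((-7, "/", 2), env))              # -4
```

Note the last line: Python's // rounds toward negative infinity, so -7 / 2 in the toy language yields -4, not -3 as it would in C or JavaScript integer arithmetic. That is a design choice worth knowing about before adding negative numbers to the language.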
Running It All Together
code = """
let a = 10;
let b = a + 20 * 2;
print(b);
"""
tokens = tokenize(code)
pprint(tokens)
parser = Parser(tokens)
ast = parser.parse()
pprint(ast)
interpreter = Interpreter()
for stmt in ast:
    interpreter.eval(stmt)
Output
First, it will output the tokens:
[('LET', 'let'),
('IDENT', 'a'),
('EQUALS', '='),
('NUMBER', '10'),
('SEMICOLON', ';'),
('LET', 'let'),
('IDENT', 'b'),
('EQUALS', '='),
('IDENT', 'a'),
('PLUS', '+'),
('NUMBER', '20'),
('TIMES', '*'),
('NUMBER', '2'),
('SEMICOLON', ';'),
('PRINT', 'print'),
('LPAREN', '('),
('IDENT', 'b'),
('RPAREN', ')'),
('SEMICOLON', ';')]
Then the AST:
[LetStatement(name=a, value=Number(value=10)),
LetStatement(name=b, value=BinaryOp(left=Identifier(name=a), op=+, right=BinaryOp(left=Number(value=20), op=*, right=Number(value=2)))),
PrintStatement(expr=Identifier(name=b))]
And finally the output:
50
And there it is.
Your language interpreted and executed code written in a custom syntax.
In a weekend. With Python.
What’s Next?
Here are a few ideas to expand your language:
- Add if statements and comparison operators (==, <, >)
- Add functions with arguments and return values
- Create a REPL (Read-Eval-Print Loop) for interactive coding
- Build a small standard library (e.g., input(), len(), etc.)
- Export your language as a CLI tool or package
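As a starting point for the REPL idea, here is a minimal sketch of the loop itself. It is deliberately decoupled from the interpreter: run_line is a hypothetical callback that you would implement by tokenizing, parsing, and evaluating one line with a shared Interpreter instance. The read and write parameters are also assumptions, added so the loop can be tested without a terminal.

```python
def repl(run_line, read=input, write=print):
    """Read lines until EOF or 'exit'/'quit', passing each one to run_line."""
    while True:
        try:
            line = read("> ")
        except EOFError:
            break  # Ctrl-D (or end of piped input) ends the session
        if line.strip() in ("exit", "quit"):
            break
        if not line.strip():
            continue  # ignore blank lines
        try:
            run_line(line)
        except (SyntaxError, NameError, RuntimeError) as error:
            # Report the problem and keep the session alive.
            write(f"Error: {error}")
```

Catching SyntaxError, NameError, and RuntimeError matches the exceptions raised by the tokenizer, parser, environment, and interpreter in this article, so a typo at the prompt prints a message instead of ending the session.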
Conclusion
Building your own programming language might sound intimidating, but now you’ve done it.
You’ve walked through every piece of the puzzle using pure Python.
This is just the beginning. Language design is a deep, fascinating field.
But you’ve proven you can go from zero to interpreter in a weekend.
Now go forth and build something weird, fun, and 100% yours.