Separate lexing/tokenization phases
Most of the documentation in parsy assumes that when you call Parser.parse() you will pass a string, and will get back your final parsed, constructed object (of whatever type you desire).
A more classical approach to parsing is that you first have a lexing/tokenization phase, the result of which is a simple list of tokens. These tokens could be strings, or other objects.
You then have a separate parsing phase that consumes this list of tokens, and produces your final object, which is very often a tree-like structure or other complex object.
Parsy can actually work with either approach. Further, for the split lexing/parsing approach, parsy can be used either to implement the lexer, or the parser, or both! The following examples use parsy to do both lexing and parsing.
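To make the split concrete, here is what the two phases look like in plain Python, with no parsy involved. This is an illustrative sketch only: the lex and parse functions are invented for this example and are not part of parsy's API.

```python
# A minimal, parsy-free sketch of the two-phase approach: the lexer turns
# a string into a flat list of tokens, and the parser consumes that list
# to build a structured result.

def lex(code):
    # Tokenize "fd 1 bk 2" into ['fd', 1, 'bk', 2]; digit runs become ints.
    tokens = []
    for word in code.split():
        tokens.append(int(word) if word.isdigit() else word)
    return tokens

def parse(tokens):
    # Consume (command, number) pairs and build (name, argument) tuples.
    pairs = []
    for i in range(0, len(tokens), 2):
        pairs.append((tokens[i], tokens[i + 1]))
    return pairs

print(parse(lex("fd 1 bk 2")))  # [('fd', 1), ('bk', 2)]
```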
Turtle Logo
For our first example, we’ll do a very stripped down Turtle Logo parser. First, the lexer:
"""
Stripped down logo lexer, for tokenizing Turtle Logo programs like:
fd 1
bk 2
rt 90
etc.
"""
from parsy import eof, regex, seq, string, string_from, whitespace
command = string_from("fd", "bk", "rt", "lt")
number = regex(r'[0-9]+').map(int)
optional_whitespace = regex(r'\s*')
eol = string("\n")
line = seq(optional_whitespace >> command,
           whitespace >> number,
           (eof | eol | (whitespace >> eol)).result("\n"))
flatten_list = lambda ls: sum(ls, [])
lexer = line.many().map(flatten_list)
We are not interested in whitespace, so our lexer removes it all, apart from newlines. We can now parse a program into the tokens we are interested in:
>>> l = lexer.parse("fd 1\nbk 2")
>>> l
['fd', 1, '\n', 'bk', 2, '\n']
The line parser produces a list, so after applying many, which also produces a list, we applied a level of flattening so that we end up with a simple list of tokens. We also chose to convert the parameters to integers while we were at it, so in this case our list of tokens is not a list of strings, but is heterogeneous.
The next step is the parser. We create some classes to represent different commands, and then use parsy again to create a parser which is very simple because this is a very limited language:
from parsy import generate, match_item, test_item
class Command:
    def __init__(self, parameter):
        self.parameter = parameter

    def __repr__(self):
        return "{0}({1})".format(self.__class__.__name__, self.parameter)
class Forward(Command):
    pass

class Backward(Command):
    pass

class Right(Command):
    pass

class Left(Command):
    pass
commands = {
    'fd': Forward,
    'bk': Backward,
    'rt': Right,
    'lt': Left,
}
@generate
def statement():
    cmd_name = yield test_item(lambda i: i in commands.keys(), "command")
    parameter = yield test_item(lambda i: isinstance(i, int), "number")
    yield match_item('\n')
    return commands[cmd_name](parameter)
program = statement.many()
To use it, we pass the list of tokens generated above into program.parse:
>>> program.parse(l)
[Forward(1), Backward(2)]
In a real implementation, we could then have execute methods on the Command subclasses if we wanted to implement an interpreter, for example.
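One way such an interpreter might look is sketched below. Everything here is hypothetical: the TurtleState class, its fields, and the run function are invented for illustration, and parsy plays no part at this stage.

```python
# Hypothetical interpreter sketch: each Command subclass gets an execute()
# method that mutates a simple turtle state. TurtleState and run() are
# invented names, not part of parsy or the example above.

class TurtleState:
    def __init__(self):
        self.distance = 0   # net distance travelled forwards
        self.heading = 0    # current heading, in degrees

class Command:
    def __init__(self, parameter):
        self.parameter = parameter

class Forward(Command):
    def execute(self, state):
        state.distance += self.parameter

class Backward(Command):
    def execute(self, state):
        state.distance -= self.parameter

class Right(Command):
    def execute(self, state):
        state.heading = (state.heading + self.parameter) % 360

class Left(Command):
    def execute(self, state):
        state.heading = (state.heading - self.parameter) % 360

def run(program):
    # Execute a parsed program (a list of Command instances) in order.
    state = TurtleState()
    for command in program:
        command.execute(state)
    return state

state = run([Forward(1), Backward(2), Right(90)])
print(state.distance, state.heading)  # -1 90
```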
Calculator
Our second example illustrates lexing and then parsing a sequence of mathematical operations, e.g. “1 + 2 * (3 - 4.5)”, with precedence.
In this case, while doing the parsing stage, instead of building up an AST of objects representing the operations, the parser actually evaluates the expression.
from parsy import digit, generate, match_item, regex, string, success, test_item
def lexer(code):
    whitespace = regex(r'\s*')
    integer = digit.at_least(1).concat().map(int)
    float_ = (
        digit.many() + string('.').result(['.']) + digit.many()
    ).concat().map(float)
    parser = whitespace >> ((
        float_ | integer | regex(r'[()*/+-]')
    ) << whitespace).many()
    return parser.parse(code)
def eval_tokens(tokens):
    # This function parses and evaluates at the same time.
    lparen = match_item('(')
    rparen = match_item(')')

    @generate
    def additive():
        res = yield multiplicative
        sign = match_item('+') | match_item('-')
        while True:
            operation = yield sign | success('')
            if not operation:
                break
            operand = yield multiplicative
            if operation == '+':
                res += operand
            elif operation == '-':
                res -= operand
        return res

    @generate
    def multiplicative():
        res = yield simple
        op = match_item('*') | match_item('/')
        while True:
            operation = yield op | success('')
            if not operation:
                break
            operand = yield simple
            if operation == '*':
                res *= operand
            elif operation == '/':
                res /= operand
        return res

    @generate
    def number():
        sign = yield match_item('+') | match_item('-') | success('+')
        value = yield test_item(
            lambda x: isinstance(x, (int, float)), 'number')
        return value if sign == '+' else -value

    expr = additive
    simple = (lparen >> expr << rparen) | number

    return expr.parse(tokens)
def simple_eval(expr):
    return eval_tokens(lexer(expr))

if __name__ == '__main__':
    print(simple_eval(input()))
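The generator-based parsers above implement classic precedence climbing: additive defers to multiplicative, which defers to simple. To show the same structure without parsy, here is a plain-Python sketch over an already-lexed token list (illustrative only; unary +/- handling is omitted for brevity):

```python
# Plain-Python sketch of the precedence structure used above: additive()
# calls multiplicative(), which calls simple(). It walks an already-lexed
# token list directly; unary +/- is omitted for brevity.

def eval_tokens_plain(tokens):
    pos = 0  # current position in the token list

    def simple():
        nonlocal pos
        if tokens[pos] == '(':
            pos += 1              # consume '('
            value = additive()
            pos += 1              # consume ')'
            return value
        value = tokens[pos]       # a number produced by the lexer
        pos += 1
        return value

    def multiplicative():
        nonlocal pos
        res = simple()
        while pos < len(tokens) and tokens[pos] in ('*', '/'):
            op = tokens[pos]
            pos += 1
            operand = simple()
            res = res * operand if op == '*' else res / operand
        return res

    def additive():
        nonlocal pos
        res = multiplicative()
        while pos < len(tokens) and tokens[pos] in ('+', '-'):
            op = tokens[pos]
            pos += 1
            operand = multiplicative()
            res = res + operand if op == '+' else res - operand
        return res

    return additive()

print(eval_tokens_plain([1, '+', 2, '*', '(', 3, '-', 4.5, ')']))  # -2.0
```

Because multiplication binds at the inner level, “2 * (3 - 4.5)” is evaluated before the addition, giving -2.0 rather than a left-to-right result.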