Tutorial
First install parsy, and check that the documentation you are reading matches the version you just installed.
Building an ISO 8601 parser
In this tutorial, we are going to gradually build a parser for a subset of an
ISO 8601 date. Specifically, we want to handle dates that look like this:
2017-09-25
.
A problem of this size could admittedly be solved fairly easily with regexes. But very quickly regexes don’t scale, especially when it comes to getting the parsed data out, and for this tutorial we need to start with a simple example.
With parsy, you start by breaking the problem down into the smallest components. So we need first to match the 4 digit year at the beginning.
There are various ways we can do this, but a regex works nicely, and
regex()
is a built-in primitive of the parsy library:
>>> from parsy import regex
>>> year = regex(r"[0-9]{4}")
(For those who don’t know regular expressions, the regex [0-9]{4}
means
“match any character from 0123456789 exactly 4 times”.)
This has produced a Parser
object which has various methods. We can
immediately check that it works using the Parser.parse()
method:
>>> year.parse("2017")
'2017'
>>> year.parse("abc")
ParseError: expected '[0-9]{4}' at 0:0
Notice first of all that a parser consumes input (the value we pass to
parse
), and it produces an output. In the case of regex
, the produced
output is the string that was matched, but this doesn’t have to be the case for
all parsers.
If there is no match, it raises a ParseError
.
Notice as well that the Parser.parse()
method expects to consume all the
input, so if there are extra characters at the end, even if it is just
whitespace, parsing will fail with a message saying it expected EOF (End Of
File/Data):
>>> year.parse("2017 ")
ParseError: expected 'EOF' at 0:4
You can use Parser.parse_partial()
if you want to just keep parsing as far
as possible and not throw an exception.
To parse the data, we need to parse months, days, and the dash symbol, so we’ll add those:
>>> from parsy import string
>>> month = regex("[0-9]{2}")
>>> day = regex("[0-9]{2}")
>>> dash = string("-")
We’ve added use of the string()
primitive here, that matches just the
string passed in, and returns that string.
Next we need to combine these parsers into something that will parse the whole
date. The simplest way is to use the Parser.then()
method:
>>> fulldate = year.then(dash).then(month).then(dash).then(day)
The then
method returns a new parser that requires the first parser to
succeed, followed by the second parser (the argument to the method).
We could also write this using the >> operator which
does the same thing as Parser.then()
:
>>> fulldate = year >> dash >> month >> dash >> day
This parser has some problems which we need to address, but it is already useful as a basic validator:
>>> fulldate.parse("2017-xx")
ParseError: expected '[0-9]{2}' at 0:5
>>> fulldate.parse("2017-01")
ParseError: expected '-' at 0:7
>>> fulldate.parse("2017-02-01")
'01'
If the parse doesn’t succeed, we’ll get ParseError
, otherwise it is valid
(at least as far as the basic syntax checks we’ve added).
The first problem with this parser is that it doesn’t return a very useful
value. Due to the way that Parser.then()
works, when it combines two
parsers to produce a larger one, the value from the first parser is discarded,
and the value returned by the second parser is the overall return value. So, we
end up getting only the ‘day’ component as the result of our parse. We really
want the year, month and day packaged up nicely, and converted to integers.
A second problem is that our error messages are not very friendly.
Our first attempt at fixing these might be to use the + operator instead
of then
. This operator is defined to combine the results of the two parsers
using the normal plus operator, which will work fine on strings:
>>> fulldate = year + dash + month + dash + day
>>> fulldate.parse("2017-02-01")
'2017-02-01'
However, it won’t help us if we want to split our data up into a set of integers.
Our first step should actually be to work on the year, month and day components
using Parser.map()
, which allows us to convert the strings to other
objects - in our case we want integers.
We can also use the Parser.desc()
method to give nicer error messages, so
our components now look this this:
>>> year = regex("[0-9]{4}").map(int).desc("4 digit year")
>>> month = regex("[0-9]{2}").map(int).desc("2 digit month")
>>> day = regex("[0-9]{2}").map(int).desc("2 digit day")
We get better error messages now:
>>> year.then(dash).then(month).parse("2017-xx")
ParseError: expected '2 digit month' at 0:5
Notice that the map
and desc
methods, like all similar methods on
Parser
, return new parser objects - they do not modify the existing one.
This allows us to build up parsers with a ‘fluent’ interface, and avoid problems
caused by mutating objects.
However, we still need a way to package up the year, month and day as separate values.
The seq()
combinator provides one easy way to do that. It takes the
sequence of parsers that are passed in as arguments, and returns a parser that
runs each parser in order and combines their results into a list:
>>> from parsy import seq
>>> fulldate = seq(year, dash, month, dash, day)
>>> fulldate.parse("2017-01-02")
[2017, '-', 1, '-', 2]
Now, we don’t need those dashes, so we can eliminate them using the >> operator or << operator:
>>> fulldate = seq(year << dash, month << dash, day)
>>> fulldate.parse("2017-01-02")
[2017, 1, 2]
At this point, we could also convert this to a date object if we wanted using
Parser.combine()
, which passes the produced sequence to another function
using *args
syntax.
>>> from datetime import date
>>> fulldate = seq(year << dash, month << dash, day).combine(date)
This works because the positional argument order of date
matches the order
of the values parsed i.e. (year, month, day).
A slightly more readable and flexible version would use the keyword argument
version of seq()
, followed by Parser.combine_dict()
. Putting
everything together for our final solution:
from datetime import date
from parsy import regex, seq, string
year = regex("[0-9]{4}").map(int).desc("4 digit year")
month = regex("[0-9]{2}").map(int).desc("2 digit month")
day = regex("[0-9]{2}").map(int).desc("2 digit day")
dash = string("-")
fulldate = seq(
year=year << dash,
month=month << dash,
day=day,
).combine_dict(date)
Breaking that down:
for clarity, and to allow us test separately, we have defined individual parsers for the YYYY, MM and DD components.
the
seq
call produces a parser that parses the year, month and day components in order, discarding the dashes, to produce a dictionary like this:{ "year": 2017, "month": 1, "day": 2, }
when we chain the
combine_dict
call, we have a parser that passes this dictionary to thedate
constructor using**kwargs
syntax, so we end up callingdate(year=2017, month=1, day=2)
So now it does exactly what we want:
>>> fulldate.parse("2017-02-01")
datetime.date(2017, 2, 1)
Using previously parsed values
Now, sometimes we might want to do more complex logic with the values that are collected as parse results, and do so while we are still parsing.
To continue our example, the above parser has a problem that it will raise an exception if the day and month values are not valid. We’d like to be able to check this, and produce a parse error instead, which will make our parser play better with others if we want to use it to build something bigger.
Also, in ISO8601, strictly speaking you can just write the year, or the year and
the month, and leave off the other parts. We’d like to handle that by returning
a tuple for the result, and None
for the missing data.
To do this, we need to allow the parse to continue if the later components (with their leading dashes) are missing - that is, we need to express optional components, and we need a way to be able to test earlier values while in the middle of parsing, to see if we should continue looking for another component.
The Parser.bind()
method provides one way to do it (yay monads!).
Unfortunately, it gets ugly pretty fast, and in Python we don’t have Haskell’s
do
notation to tidy it up. But thankfully we can use generators and the
yield
keyword to great effect.
We use a generator function and convert it into a parser by using the
generate()
decorator. The idea is that you yield
every parser that you
want to run, and receive the result of that parser as the value of the yield
expression. You can then put parsers together using any logic you like, and
finally return the value.
An equivalent parser to the one above can be written like this:
from parsy import generate
@generate
def fulldate():
y = yield year
yield dash # implicit skip, since we do nothing with the value
m = yield month
yield dash
d = yield day
return date(y, m, d)
Notice how this follows the previous definition of fulldate
using seq
with keyword arguments. It’s more verbose than before, but provides a good
starting point for our next set of requirements.
First of all, we need to express optional components - that is we need to be able to handle missing dashes, and return what we’ve got so far rather than failing the whole parse.
Parser
has a set of methods that convert parsers into ones that allow
multiples of the parser - including Parser.many()
, Parser.times()
,
Parser.at_most()
and Parser.at_least()
. There is also
Parser.optional()
which allows matching zero times (in which case the
parser will return the default value specified or None
otherwise),
or exactly once - just what we need in this case.
We also need to do checking on the month and the day. We’ll take a shortcut and
use the built-in datetime.date
class to do the validation for us. However,
rather than allow exceptions to be raised, we convert the exception into a
parsing failure.
from parsy import fail, generate
optional_dash = dash.optional()
@generate
def full_or_partial_date():
d = None
m = None
y = yield year
dash1 = yield optional_dash
if dash1 is not None:
m = yield month
dash2 = yield optional_dash
if dash2 is not None:
d = yield day
if m is not None:
if m < 1 or m > 12:
return fail("month must be in 1..12")
if d is not None:
try:
datetime.date(y, m, d)
except ValueError as e:
return fail(e.args[0])
return (y, m, d)
This works now works as expected:
>>> full_or_partial_date.parse("2017-02")
(2017, 2, None)
>>> full_or_partial_date.parse("2017-02-29")
ParseError: expected 'day is out of range for month' at 0:10
We could of course use a custom object in the final line to return a more convenient data type, if wanted.
Alternatives and backtracking
Suppose we are using our date parser to scrape dates off articles on a web site. We then discover that for recently published articles, instead of printing a timestamp, they write “X days ago”.
We want to parse this, and we’ll use a timedelta object to represent the value (to easily distinguish it from other values and consume it later). We can write a parser for this using tools we’ve seen already:
>>> days_ago = regex("[0-9]+").map(lambda d: timedelta(days=-int(d))) << string(" days ago")
>>> days_ago.parse("5 days ago")
datetime.timedelta(-5)
Now we need to combine it with our date parser, and allow either to succeed. This is done using the | operator, as follows:
>>> flexi_date = full_or_partial_date | days_ago
>>> flexi_date.parse("2012-01-05")
(2012, 1, 5)
>>> flexi_date.parse("2 days ago")
datetime.timedelta(-2)
Notice that you still get good error messages from the appropriate parser, depending on which parser got furthest before returning a failure:
>>> flexi_date.parse("2012-")
ParseError: expected '2 digit month' at 0:5
>>> flexi_date.parse("2 years ago")
ParseError: expected ' days ago' at 0:1
When using backtracking, you need to understand that backtracking to the other option only occurs if the first parser fails. So, for example:
>>> a = string("a")
>>> ab = string("ab")
>>> c = string("c")
>>> a_or_ab_and_c = ((a | ab) + c)
>>> a_or_ab_and_c.parse("ac")
'ac'
>>> a_or_ab_and_c.parse("abc")
ParseError: expected 'c' at 0:1
The parse fails because the a
parser succeeds, and so the ab
parser is
never tried. This is different from most regular expression engines, where
backtracking is done over the whole regex by default.
In this case we can get the parse to succeed by switching the order:
>>> ((ab | a) + c).parse("abc")
'abc'
>>> ((ab | a) + c).parse("ac")
'ac'
We could also fix it like this:
>>> ((a + c) | (ab + c)).parse("abc")
'abc'
Custom data structures
In the example shown so far, the result of parsing has been a native Python data type, such as a integer, string, datetime or tuple. In some cases that is enough, but very quickly you will find that for your parse result to be useful, you will need to use custom data structures (rather than ending up with nested lists etc.)
For defining custom data structures, you can use any method you like (e.g. simple classes). We suggest dataclasses (stdlib), attrs or pydantic. You can also use namedtuple for simple cases.
For combining parsed data into these data structures, you can:
Use
Parser.map()
,Parser.combine()
andParser.combine_dict()
, often in conjunction withseq()
.See the SQL SELECT example for an example of this approach.
Use the
@generate
decorator as above, and manually call the data structure constructor with the pieces, as infull_date
orfull_or_partial_date
above, but with your own data structure instead of a tuple or datetime in the final line.
Learn more
For further topics, see the table of contents for the rest of the documentation that should enable you to build parsers for your needs.