Parser methods, operators and combinators

Parser methods

Parser objects are returned by any of the built-in parsing primitives (see Parsing primitives). They can be used and manipulated as described below.

class parsy.Parser

__init__(wrapped_fn): This is a low-level function for creating new parsers. It is used internally but is rarely needed by users of the parsy library. It should be passed a parsing function, which takes two arguments - a string/list to be parsed and the current index into that string/list - and returns a Result object, as described in Creating new Parser instances.

The following methods are for actually using the parsers that you have created:

parse(string_or_list)

Attempts to parse the given string (or list). If the parse is successful and consumes the entire string, the result is returned - otherwise, a ParseError is raised.

Instead of passing a string, you can in fact pass a list of tokens. Almost all the examples assume strings for simplicity. Some of the primitives are also clearly string specific, and a few of the combinators (such as Parser.concat()) are string specific, but most of the rest of the library will work with tokens just as well. See Separate lexing/tokenization phases for more information.

parse_partial(string_or_list): Similar to parse, except that it does not require the entire string (or list) to be consumed. Returns a tuple of (result, remainder), where remainder is the part of the string (or list) that was left over.

The following methods are essentially combinators that produce new parsers from the existing one. They are provided as methods on Parser for convenience. More combinators are documented below.

desc(string)

Adds a description to the parser, which is used in the error message if parsing fails.

>>> year = regex(r'[0-9]{4}').desc('4 digit year')
>>> year.parse('123')
ParseError: expected 4 digit year at 0:0

then(other_parser)

Returns a parser which, if the initial parser succeeds, will continue parsing with other_parser. This will produce the value produced by other_parser.

>>> string('x').then(string('y')).parse('xy')
'y'

See also >> operator.

many()

Returns a parser that expects the initial parser 0 or more times, and produces a list of the results. Note that this parser does not fail if nothing matches, but instead consumes nothing and produces an empty list.

>>> parser = regex(r'[a-z]').many()
>>> parser.parse('')
[]
>>> parser.parse('abc')
['a', 'b', 'c']

times(min[, max=min]): Returns a parser that expects the initial parser at least min times, and at most max times, and produces a list of the results. If only one argument is given, the parser is expected exactly that number of times.

at_most(n): Returns a parser that expects the initial parser at most n times, and produces a list of the results.

at_least(n): Returns a parser that expects the initial parser at least n times, and produces a list of the results.

until(other_parser[, min=0, max=inf, consume_other=False])

Returns a parser that expects the initial parser followed by other_parser. The initial parser is expected at least min times and at most max times. By default, it does not consume other_parser and it produces a list of the results excluding other_parser. If consume_other is True then other_parser is consumed and its result is included in the list of results.

>>> seq(string('A').until(string('B')), string('BC')).parse('AAABC')
[['A', 'A', 'A'], 'BC']
>>> string('A').until(string('B')).then(string('BC')).parse('AAABC')
'BC'
>>> string('A').until(string('BC'), consume_other=True).parse('AAABC')
['A', 'A', 'A', 'BC']

Added in version 2.0.

optional(default=None)

Returns a parser that expects the initial parser zero or one times, and produces the given default value if there is no match. If no default value is given, None is used.

>>> string('A').optional().parse('A')
'A'
>>> string('A').optional().parse('')
None
>>> string('A').optional('Oops').parse('')
'Oops'

map(map_function)

Returns a parser that transforms the produced value of the initial parser with map_function.

>>> regex(r'[0-9]+').map(int).parse('1234')
1234

This is the simplest way to convert parsed strings into the data types that you need. See also combine() and combine_dict() below.

combine(combine_fn)

Returns a parser that transforms the produced values of the initial parser with combine_fn, passing the arguments using *args syntax.

Where the current parser produces an iterable of values, this can be a more convenient way to combine them than map().

Example 1 - the argument order of our callable already matches:

>>> from datetime import date
>>> yyyymmdd = seq(regex(r'[0-9]{4}').map(int),
...                regex(r'[0-9]{2}').map(int),
...                regex(r'[0-9]{2}').map(int)).combine(date)
>>> yyyymmdd.parse('20140506')
datetime.date(2014, 5, 6)

Example 2 - the argument order of our callable doesn’t match, and we need to adjust a parameter, so we can fix it using a lambda.

>>> ddmmyy = regex(r'[0-9]{2}').map(int).times(3).combine(
...                lambda d, m, y: date(2000 + y, m, d))
>>> ddmmyy.parse('060514')
datetime.date(2014, 5, 6)

The equivalent lambda to use with map would be lambda res: date(2000 + res[2], res[1], res[0]), which is less readable. The version with combine also ensures that exactly 3 items are generated by the previous parser, otherwise you get a TypeError.

combine_dict(fn)

Returns a parser that transforms the value produced by the initial parser using the supplied function/callable, passing the arguments using the **kwargs syntax.

The value produced by the initial parser must be a mapping/dictionary from names to values, or a list of two-tuples, or something else that can be passed to the dict constructor.

If None is present as a key in the dictionary it will be removed before passing to fn, as will all keys starting with _.

Motivation:

For building complex objects, this can be more convenient, flexible and readable than map() or combine(), because by avoiding positional arguments we can avoid a dependence on the order of components in the string being parsed and in the argument order of callables being used. It is especially designed to be used in conjunction with seq() and tag().

We can make use of the **kwargs version of seq() to produce a very readable definition:

>>> ddmmyyyy = seq(
...     day=regex(r'[0-9]{2}').map(int),
...     month=regex(r'[0-9]{2}').map(int),
...     year=regex(r'[0-9]{4}').map(int),
... ).combine_dict(date)
>>> ddmmyyyy.parse('04052003')
datetime.date(2003, 5, 4)

(If that is hard to understand, use a Python REPL, and examine the result of the parse call if you remove the combine_dict call).

Here we used datetime.date which accepts keyword arguments. For your own parsing needs you will often use custom data types. You can create these however you like, but we suggest dataclasses (stdlib), attrs or pydantic. You can also use namedtuple for simple cases.

The following example shows the use of _ as a prefix to remove elements you are not interested in, and the use of namedtuple to create a simple data-structure.

>>> from collections import namedtuple
>>> Pair = namedtuple('Pair', ['name', 'value'])
>>> name = regex("[A-Za-z]+")
>>> int_value = regex("[0-9]+").map(int)
>>> bool_value = string("true").result(True) | string("false").result(False)
>>> pair = seq(
...    name=name,
...    __eq=string('='),
...    value=int_value | bool_value,
...    __sc=string(';'),
... ).combine_dict(Pair)
>>> pair.parse("foo=123;")
Pair(name='foo', value=123)
>>> pair.parse("BAR=true;")
Pair(name='BAR', value=True)

You could also use << or >> for the unwanted parts (but in some cases this is less convenient):

>>> pair = seq(
...    name=name << string('='),
...    value=(int_value | bool_value) << string(';')
... ).combine_dict(Pair)

Changed in version 1.2: Allow lists as well as dicts to be consumed, and filter out None.

Changed in version 1.3: Stripping of args starting with _

tag(name)

Returns a parser that wraps the produced value of the initial parser in a 2-tuple containing (name, value). This provides a very simple way to label parsed components. For example:

>>> day = regex(r'[0-9]+').map(int)
>>> month = string_from("January", "February", "March", "April", "May",
...                     "June", "July", "August", "September", "October",
...                     "November", "December")
>>> day.parse("10")
10
>>> day.tag("day").parse("10")
('day', 10)

>>> seq(day.tag("day") << whitespace,
...     month.tag("month")
...     ).parse("10 September")
[('day', 10), ('month', 'September')]

It also works well when combined with .map(dict) to get a dictionary of values:

>>> seq(day.tag("day") << whitespace,
...     month.tag("month")
...     ).map(dict).parse("10 September")
{'day': 10, 'month': 'September'}

… and with combine_dict() to build other objects.

Usually it is better to use seq() with keyword arguments if you want to produce a dictionary.

concat()

Returns a parser that concatenates together (as a string) the previously produced values. Usually used after many() and similar methods that produce multiple values.

>>> letter.at_least(1).parse("hello")
['h', 'e', 'l', 'l', 'o']
>>> letter.at_least(1).concat().parse("hello")
'hello'

result(val)

Returns a parser that, if the initial parser succeeds, always produces val.

>>> string('foo').result(42).parse('foo')
42

should_fail(description)

Returns a parser that fails when the initial parser succeeds, and succeeds when the initial parser fails (consuming no input). A description must be passed which is used in parse failure messages.

This is essentially a negative lookahead:

>>> p = letter << string(" ").should_fail("not space")
>>> p.parse('A')
'A'
>>> p.parse('A ')
ParseError: expected 'not space' at 0:1

It is also useful for implementing things like parsing repeatedly until a marker:

>>> (string(";").should_fail("not ;") >> letter).many().concat().parse_partial('ABC;')
('ABC', ';')

bind(fn)

Returns a parser which, if the initial parser is successful, passes the result to fn, and continues with the parser returned from fn. This is the monadic binding operation.

Here is an example that implements Hollerith constants:

from parsy import regex, string, any_char

hollerith = (regex(r'[0-9]+').map(int) << string('H')).bind(
    lambda num: any_char.times(num).concat()
)

The first parser (regex(r'[0-9]+').map(int) << string('H')) will consume something like "8H" and produce the integer 8. Via bind, we then pass that value as num into the lambda, which can use it to consume more input.

However, since we don’t have Haskell’s do notation in Python, for longer examples this is quite awkward. Instead, you should look at Generating a parser, which provides a much nicer syntax for those cases where you would have needed do notation in Parsec.

Also, the methods map(), combine() and combine_dict(), which all use bind internally, are often more convenient ways to chain functions where you are doing transformations but not consuming more input.

sep_by(sep, min=0, max=inf)

Like Parser.times(), this returns a new parser that repeats the initial parser and collects the results in a list, but in this case separated by the parser sep (whose return value is discarded). By default it repeats with no limit, but minimum and maximum values can be supplied.

>>> csv = letter.at_least(1).concat().sep_by(string(","))
>>> csv.parse("abc,def")
['abc', 'def']

mark()

Returns a parser that wraps the initial parser’s result in a value containing column and line information of the match, as well as the original value. The new value is a 3-tuple:

((start_row, start_column),
 original_value,
 (end_row, end_column))

This is useful for being able to report problems with parsing more accurately, especially if you are using parsy as a lexer and want subsequent parsing of the token stream to be able to report original positions in error messages etc.

Parser operators

This section describes operators that you can use on Parser objects to build new parsers.

`|` operator

parser | other_parser

Returns a parser that tries parser and, if it fails, backtracks and tries other_parser. These can be chained together.

The resulting parser will produce the value produced by the first successful parser.

>>> parser = string('x') | string('y') | string('z')
>>> parser.parse('x')
'x'
>>> parser.parse('y')
'y'
>>> parser.parse('z')
'z'

Note that other_parser is only tried if parser itself fails, in which case it is tried from the position where parser started. other_parser is not used when parser succeeds and a later component of the overall parser fails. This means that the order of the operands matters - for example:

>>> ((string('A') | string('AB')) + string('C')).parse('ABC')
ParseError: expected 'C' at 0:1
>>> ((string('AB') | string('A')) + string('C')).parse('ABC')
'ABC'
>>> ((string('AB') | string('A')) + string('C')).parse('AC')
'AC'

`<<` operator

parser << other_parser

The same as parser.skip(other_parser) - see Parser.skip().

(Hint - the arrows point at the important parser!)

>>> (string('x') << string('y')).parse('xy')
'x'

`>>` operator

parser >> other_parser

The same as parser.then(other_parser) - see Parser.then().

(Hint - the arrows point at the important parser!)

>>> (string('x') >> string('y')).parse('xy')
'y'

`+` operator

parser1 + parser2

Requires both parsers to match in order, and adds the two results together using the + operator. This will only work if the results support the plus operator (e.g. strings and lists):

>>> (string("x") + regex("[0-9]")).parse("x1")
'x1'

>>> (string("x").many() + regex("[0-9]").map(int).many()).parse("xx123")
['x', 'x', 1, 2, 3]

The plus operator is a convenient shortcut for:

>>> seq(parser1, parser2).combine(lambda a, b: a + b)

`*` operator

parser1 * number

This is a shortcut for doing Parser.times():

>>> (string("x") * 3).parse("xxx")
['x', 'x', 'x']

You can also set both upper and lower bounds by multiplying by a range:

>>> (string("x") * range(0, 3)).parse("xxx")
ParseError: expected EOF at 0:2

(Note the normal semantics of range are respected - the second number is an exclusive upper bound, not inclusive).

Parser combinators

parsy.alt(*parsers)

Creates a parser from the passed in argument list of alternative parsers, which are tried in order, moving to the next one if the current one fails, as per the | operator - in other words, it matches any one of the alternative parsers.

Example using *args syntax to pass a list of parsers that have been generated by mapping string() over a list of characters:

>>> hexdigit = alt(*map(string, "0123456789abcdef"))

(In this case you would be better off using char_from())

Note that the order of arguments matters, as described in | operator.

parsy.seq(*parsers, **kw_parsers)

Creates a parser that runs a sequence of parsers in order and combines their results in a list.

>>> x_bottles_of_y_on_the_z = \
...    seq(regex(r"[0-9]+").map(int) << string(" bottles of "),
...        regex(r"\S+") << string(" on the "),
...        regex(r"\S+")
...        )
>>> x_bottles_of_y_on_the_z.parse("99 bottles of beer on the wall")
[99, 'beer', 'wall']

You can also use seq() with keyword arguments instead of positional arguments. In this case, the produced value is a dictionary of the individual values, rather than a sequence. This can make the produced value easier to consume.

>>> name = seq(first_name=regex(r"\S+") << whitespace,
...            last_name=regex(r"\S+"))
>>> name.parse("Jane Smith")
{'first_name': 'Jane', 'last_name': 'Smith'}

Changed in version 1.1: Added **kwargs option.

Note

As an alternative, see Parser.tag() for a way of labelling parsed components and producing dictionaries.

Other combinators

Parsy does not try to include every possible combinator - there is no reason why you cannot create your own for your needs using the built-in combinators and primitives. If you find something that is very generic and would be very useful to have as a built-in, please submit it as a PR!
