This tutorial introduces you to the basics of lexy. Our goal is to parse a simple configuration file for a software package.

A sample input file can look like this:

package.config
name    = lexy
version = 0.0.0
authors = ["Jonathan Müller"]

And we want to parse it into the following C++ data structure using lexy:

PackageConfig
struct PackageVersion
{
    int major;
    int minor;
    int patch;
};

struct PackageConfig
{
    std::string              name;
    PackageVersion           version;
    std::vector<std::string> authors;
};

The final source code can be found at examples/tutorial.cpp.

If anything in the tutorial could be improved (and there is probably a lot), please raise an issue or — even better — create a PR. Thank you!

Overview

To parse something we need to do three things.

  1. We define the grammar. In C++, a grammar is contained in a namespace, usually called grammar.

    It contains one or more productions, which are empty structs with a rule member. Each production corresponds to one function of the generated recursive descent parser. Each produces a single value of a user-controlled type. Here, we could imagine at minimum a production for parsing a PackageVersion and another one for parsing a PackageConfig.

    The rule of a production does all the heavy lifting. It describes what is valid input and what is not, determines how many characters are consumed from the input, and produces zero or more values. All values are then combined into the single result of the production using a separately specified callback.

  2. We create an Input object. It contains the concrete input we want to parse. The library provides different kinds of inputs, from the simple lexy::string_input which acts like a std::string_view, to the complex lexy::shell, which provides an interactive REPL.

    In this step, we also specify the encoding of the input. This can be plain old ASCII, some Unicode encoding like UTF-8, or bytes as opposed to text. The encoding controls the behavior of many rules as it determines what valid code points are.

    When reading input from a file, we may also need to specify a given endianness or let the library figure it out using a byte-order mark.

  3. Once we have a grammar and input, we can parse it by calling lexy::parse. This will parse the input according to the rules of the grammar and convert it into the specified type. If an error occurs, it will invoke a callback we have specified, passing it detailed error information. We can then either print it immediately, or store the error in some custom diagnostic object.

    We can also choose to simply validate the input using lexy::validate. Then we don’t convert it to a value and only log error messages if it is ill-formed.
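For illustration, validation alone could look like the following sketch. It assumes the file input and error callback from the structure shown below, and that lexy::validate mirrors the lexy::parse interface; treat the header name and the boolean check as assumptions.

Validation only (a sketch)
#include <lexy/validate.hpp> // lexy::validate (assumed header)

auto validated = lexy::validate<grammar::config>(file, report_error_callback);
if (!validated) // assumed to be contextually convertible to bool, like parse results
{
    // The input was ill-formed; errors were reported via the callback.
}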

As such, the general structure of the source code is as follows.

examples/tutorial.cpp
#include <string>
#include <vector>

(1)
struct PackageVersion {  };
struct PackageConfig {  };

//=== grammar ===//
#include <lexy/dsl.hpp> // lexy::dsl::*

namespace grammar (2)
{
    

    struct config {  };
}

//=== parsing ===//
#include <lexy/input/file.hpp> // lexy::read_file
#include <lexy/parse.hpp>      // lexy::parse

int main()
{
    auto file = lexy::read_file<lexy::utf8_encoding>(filename); (3)
    if (!file)
    {  }

    auto result = lexy::parse<grammar::config>(file, report_error_callback); (4)
    if (!result) (5)
    {  }

    if (result.has_value()) (6)
    {
        PackageConfig config = result.value();
        
    }
}
  1. The user code that defines the C++ data structures. It does not need to know anything about lexy.

  2. The grammar. It contains multiple productions, but the entry production is grammar::config. This is the production we’re parsing.

  3. We want to read the config from a file, so we use lexy::read_file. We specify that the file uses UTF-8 as the input encoding. Reading a file can fail, so we need to handle that (not shown here).

  4. Then we can parse our entry production using lexy::parse. We give it our file input and a callback to invoke on errors (not shown here).

  5. If parsing produced any errors, handle them somehow.

  6. If parsing was able to give us a value (i.e. the input was either well-formed or lexy could recover from all errors that happened), use that value. If you don’t care about error recovery, you just need the !result check.

The rest of the tutorial will only focus on the rules and productions, as that is the interesting part of the library. Refer to the documentation for further details on the surrounding infrastructure.

Again, the full final source code can be found at examples/tutorial.cpp.

Parsing the package name

We will create a separate production for each of the fields (name, version, authors). Let’s start with the production for the name, as that is the simplest one.

Package name
name = lexy

Here, we’re only concerned with the part after the equal sign, so the lexy in the example above. A package name follows the same rules as a C++ identifier, except that leading underscores are not allowed. As a regex, a name is described by [a-zA-Z][a-zA-Z_0-9]*, so one alpha character, followed by zero or more alphanumeric characters or underscores.

How can we express this as a lexy rule?

Every rule is defined in the namespace lexy::dsl. As this is rather lengthy, it is a good idea to use a namespace alias to shorten it.

The namespace alias
namespace grammar
{
    namespace dsl = lexy::dsl; (1)
}
  1. A convenience alias, so we can write dsl::foo instead of lexy::dsl::foo when defining the grammar.

Luckily for us, there are predefined rules for the various ASCII classifications. One of those is the rule dsl::ascii::alpha: this rule matches one of a-zA-Z and consumes it from the input. We can put it in a production and parse it:

The dsl::ascii::alpha rule (godbolt)
struct alpha (1)
{
    static constexpr auto rule = dsl::ascii::alpha; (2)
};
  1. The production that contains the rule.

  2. The rule itself; it is a static constant.

Likewise, dsl::ascii::alnum matches one of a-zA-Z0-9. To match a single underscore, we can use dsl::lit_c<'_'>. The latter rule matches and consumes the specified character.

All three of these rules are so-called tokens: they are the fundamental, atomic parse units of the input. Tokens play an essential role in parsing, as we’ll see, because the library can easily check whether a token matches at a given position.

Of course, here we don’t want a single alpha(numeric) character or underscore, we want one alpha character followed by zero or more alphanumeric characters or underscores. For that, we need to combine rules.

The simplest way to combine rules is the sequence rule. The sequence rule matches one rule after the other in the specified order. It is implemented using an overloaded operator+:

The sequence rule (godbolt)
// Match an alpha character, followed by an alphanumeric character, followed by a literal c.
dsl::ascii::alpha + dsl::ascii::alnum + dsl::lit_c<'_'>

The sequence rule is alright, but it is static. How can we match a dynamic number of alphanumeric characters after the initial alpha character? For that, we can use the while rule. The while rule takes a rule and matches it as often as possible.

The while rule (godbolt)
// Match an alpha character, followed by zero or more alphanumeric characters.
dsl::ascii::alpha + dsl::while_(dsl::ascii::alnum)

The while rule is different from all other rules we’ve seen: it needs to decide whether it should match again or be done. If the argument is a token, that decision is easy: just try to match the token (remember: that can be done very efficiently). If it matched, the loop runs another iteration. Otherwise, it backtracks to the previous position and is done.

Let’s consider a more complex token to see how it works: LEXY_LIT("ab"). This one is equivalent to dsl::lit_c<'a'> + dsl::lit_c<'b'> (match a then b), but it is a single token, not a sequence of tokens. If you have a C++20 compiler, you can write it as dsl::lit<"ab"> without using a macro.

Parsing dsl::while_(LEXY_LIT("ab"))
ababa
^ start, try to match ab

ababa
--^ that worked, try to match it again

ababa
----^ that worked, try to match it again

ababa
-----^ that did not work, we're missing a `b`, backtrack!

ababa
----^ done, next character on input is `a`

Don’t worry about backtracking. The library will only do it when you’ve explicitly requested it, or when it is efficient like here.

Back to our problem at hand: we’re almost there now! All we need is to allow the underscore as well as an alphanumeric character in the while loop.

For that, we can use the alternative rule, which matches one of the given tokens. It does that by trying to match each token in order. If that works, great. Otherwise, it rewinds the input (backtracking) and tries the next token, and so on. Remember, for tokens this is efficient. As the alternative rule matches exactly one token, it is also considered to be a token itself (although it’s strictly speaking a combination of tokens).

In the DSL, the alternative rule is implemented using operator/ (read "or"). With all that, we can finally write our first production:

The name production (godbolt)
struct name
{
    // Match an alpha character, followed by zero or more alphanumeric characters or underscores.
    static constexpr auto rule
        = dsl::ascii::alpha + dsl::while_(dsl::ascii::alnum / dsl::lit_c<'_'>);
};

If we have an alternative rule of literals, as in LEXY_LIT("abc") / LEXY_LIT("ab") / LEXY_LIT("b"), it can be parsed without any backtracking. This is done by constructing a trie at compile-time and looking for the input in there.

The production is now almost done. We can use lexy::validate() to give it some input and raise an error if it does not match the rule, or we can use lexy::match() to just give us a true/false result. But we want to lexy::parse() it and get a std::string. To implement that, we need to do two things.
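For example, a plain match check could look like this sketch (lexy::match and lexy::zstring_input used as I understand them; treat the header names as assumptions):

Matching only (a sketch)
#include <lexy/input/string_input.hpp> // lexy::zstring_input
#include <lexy/match.hpp>              // lexy::match (assumed header)

// Treat a null-terminated string as parse input.
auto input = lexy::zstring_input("lexy");

// true if the rule of grammar::name matches at the beginning of the input.
bool ok = lexy::match<grammar::name>(input);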

First, we need to remember everything we’ve just matched by the rule, so we can convert that into the std::string later on. This is done using dsl::capture(). This rule takes another rule as input and parses it. However, it is also the first rule that produces a value: When parsing a dsl::capture() rule, we get a lexy::lexeme (basically a std::string_view) that views all the input the rule has matched. This is exactly what we then want to turn into our std::string.

Second, we need to specify what value our production should return when it’s parsed. When we lexy::parse() a production, we parse the rule of the production. As we have just seen, this can produce one or more values, like the lexy::lexeme. All those values are then forwarded to a callback which constructs the result of the parse operation.

A callback is just a function object (so a class with operator()) that also has a return_type typedef. We can easily build one using the utility function lexy::callback<T>() which takes one or more lambdas and creates a callback that returns a T. A callback is added to a production using a static constexpr auto value member.
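As a standalone illustration (this snippet is hypothetical and not part of our grammar), a callback with two overloads could look like this:

// A sketch: a callback returning int, with one lambda per supported argument list.
// (lexy::callback lives in <lexy/callback.hpp> in the version assumed here.)
constexpr auto to_int = lexy::callback<int>(
    [](int value) { return value; },                       // invoked with an int: forward it
    [](auto lexeme) {                                      // invoked with a lexeme: use its length
        return static_cast<int>(lexeme.size());
    });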

So we need to wrap our rule in dsl::capture(), so we actually get a value for our callback, and then add a callback that takes the lexeme and converts it into a std::string which is the final result of parsing the production.

The name production with capture() and value
struct name
{
    // Match an alpha character, followed by zero or more alphanumeric characters or underscores.
    // Captures it all into a lexeme.
    static constexpr auto rule
        = dsl::capture(dsl::ascii::alpha + dsl::while_(dsl::ascii::alnum / dsl::lit_c<'_'>));

    // The final value of this production is a std::string we've created from the lexeme.
    static constexpr auto value
        = lexy::callback<std::string>([](auto lexeme) { return std::string(lexeme.begin(), lexeme.end()); });
};

To finish it up, there are two things we can improve. First, converting a lexy::lexeme to a std::string is an incredibly common thing to do, so the library provides the callback lexy::as_string<std::string> for it. Second, the rule definition has become somewhat unreadable, as it’s one big expression. We can use an immediately invoked lambda to improve that.

The final name production (godbolt)
struct name
{
    // Match an alpha character, followed by zero or more alphanumeric characters or underscores.
    // Captures it all into a lexeme.
    static constexpr auto rule = [] {
        auto lead_char     = dsl::ascii::alpha;
        auto trailing_char = dsl::ascii::alnum / dsl::lit_c<'_'>;

        return dsl::capture(lead_char + dsl::while_(trailing_char));
    }();

    // The final value of this production is a std::string we've created from the lexeme.
    static constexpr auto value = lexy::as_string<std::string>;
};

If we now parse the name production, we will get a std::string. First field done, let’s move on to the next one.

Parsing the package version

The next field is the version.

Package version
version = 0.0.0

Again, we’re only concerned with the value after the equal sign for now. It consists of three numbers separated by dots, where a number is a non-empty sequence of digits.

The token dsl::ascii::digit matches one digit 0-9. To match an arbitrary amount of digits, we can again use the while rule. However, this would also allow zero digits, which we don’t want. So instead we use dsl::while_one(dsl::ascii::digit), which is equivalent to dsl::ascii::digit + dsl::while_(dsl::ascii::digit): it needs at least one digit, and then zero or more.

Digits
// Match one or more digits.
dsl::while_one(dsl::ascii::digit)

Matching one or more digits is common, so there is a predefined rule (token actually): dsl::digits. It takes an optional template parameter to specify the base, for example dsl::digits<dsl::octal> would only match 0-7, whereas dsl::digits<dsl::hex_upper> would match 0-9A-F. If we don’t specify a base, it defaults to dsl::decimal.

The digits token (godbolt)
// Match one or more decimal digits.
dsl::digits<>

dsl::digits<> actually provides a couple of additional features over the plain dsl::while_one() version. For example, we could prevent leading zeroes or allow an optional digit separator. None of that is needed here, however.
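Just for illustration, those variants could look like this (a sketch; I’m assuming .sep() accepts an arbitrary literal token):

// Forbid leading zeroes, e.g. reject `007`.
constexpr auto canonical_digits = dsl::digits<>.no_leading_zero();

// Allow an optional digit separator, e.g. accept `1'000'000`.
constexpr auto separated_digits = dsl::digits<>.sep(dsl::lit_c<'\''>);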

Just like with the name production, neither dsl::digits<> nor dsl::while_one() actually produce a value when parsed. To get the actual integer represented by the digits, we can do the same thing as we did before: Use dsl::capture(dsl::digits<>) to match digits and get a lexy::lexeme, then use a callback that takes the lexeme and converts it into an int. However, this approach does not work due to the possibility of integer overflow: dsl::digits<> matches an arbitrarily long sequence of digits, but only a subset of those are valid ints. lexy considers integer overflow a parse error, which can only be raised by a rule.

So instead we can use the dsl::integer<T>() rule. Just like dsl::capture(), it takes another rule and matches it. The matched digits are then captured, not as a lexy::lexeme, but converted into the specified integer T.

While doing the conversion, dsl::integer ignores any character that is not a digit, so you can use it even if you have digit separators in your rule. What is or is not a digit, as well as the base used for conversion, is again determined using the policy classes dsl::decimal, dsl::octal, and so on. You can specify them manually using dsl::integer<int, dsl::decimal>(my_digit_rule), but if your digit rule is dsl::digits<>, the base is detected automatically.

The following sample production matches a single int using dsl::integer and dsl::digits.

The integer rule (godbolt)
struct integer
{
    // Matches one or more decimal digits, then converts those into an `int`.
    static constexpr auto rule = dsl::integer<int>(dsl::digits<>);

    // The rule produces a single value, the parsed `int`.
    // We simply forward that one to use as the result of parsing the `integer` production.
    static constexpr auto value = lexy::forward<int>;
};

Now we can just use the integer rule and put it in sequence together with dsl::lit_c<'.'> to match the three numbers separated by dots. If we match a sequence of rules where some produce values, all values are preserved and forwarded to the callback in the same order. The dsl::lit_c rule does not produce any values, so our callback will be invoked with three values: the ints from each dsl::integer rule. We then use a callback that takes those three integers and constructs the PackageVersion as the result.

The version production
struct version
{
    // Match three integers separated by dots.
    static constexpr auto rule = []{
        auto number = dsl::integer<int>(dsl::digits<>);
        auto dot    = dsl::lit_c<'.'>;

        // Each number rule produces an int, each dot rule produces nothing.
        return number + dot + number + dot + number;
    }();

    // Construct a PackageVersion as the result of the production.
    static constexpr auto value
      = lexy::callback<PackageVersion>([](int a, int b, int c) {
            // a is the result of the first number rule, b of the second, c of the third.
            return PackageVersion{a, b, c};
        });
};

We can again clean this up a bit. lexy predefines dsl::period to match a '.' character, which looks cleaner than dsl::lit_c<'.'>. Constructing a type from arguments is also a common callback, so it is provided as lexy::construct<T>, which does T(args...) if that compiles and T{args...} otherwise.

The final version production (godbolt)
struct version
{
    // Match three integers separated by dots.
    static constexpr auto rule = []{
        auto number = dsl::integer<int>(dsl::digits<>);
        auto dot    = dsl::period;

        return number + dot + number + dot + number;
    }();

    // Construct a PackageVersion as the result of the production.
    static constexpr auto value = lexy::construct<PackageVersion>;
};

We can now use this production to parse PackageVersion.

Extending the version field

Let’s stick with the version production a bit and extend it. We also want to allow the special version number unreleased as an alternate spelling for 0.0.0.

Parsing unreleased is easy: just use the LEXY_LIT("unreleased") token:

Adding unreleased support
struct version
{
    static constexpr auto rule = []{
        auto number = dsl::integer<int>(dsl::digits<>);
        auto dot    = dsl::period;
        auto dot_version = number + dot + number + dot + number; (1)

        auto unreleased = LEXY_LIT("unreleased");

        return ???; (2)
    }();
};
  1. For convenience, we put the previous rule in a variable dot_version.

  2. What do we put here?

But how can we parse either unreleased or dot_version?

We’ve already seen the alternative rule /, which allowed us to parse one of the specified tokens. However, number + dot + number + dot + number is not a token, so we can’t use /. And this is a good thing!

If we were able to write dot_version / unreleased, this might lead to arbitrary backtracking. In particular, rules can have arbitrary side-effects that then might happen unnecessarily. So lexy strictly limits backtracking.

What we need here is a special branch rule. This is a rule that has an associated condition. If the condition matches, the branch is taken and will be parsed without further backtracking. If the condition doesn’t match, the parsing algorithm looks for another alternative to take. Matching a condition uses a special, efficient implementation, so backtracking over it is acceptable.

Every token is also a branch, and many simple rules such as a sequence of tokens are also branches. The same is true for dsl::capture() if it captures a token or branch. Then the argument is the branch condition, which is only really captured once the branch has been taken.

And even if you have a rule that isn’t a branch, don’t worry, there is a way to turn an arbitrary rule into a branch. We just need to give it a condition, which is another branch rule (usually a token). This can be done using operator>>: condition >> rule. This will check whether condition matches, and take the branch parsing rule if it does. Once the algorithm starts parsing rule it has already committed and will never backtrack.

The alternative rule / requires only tokens, but it has a big sister: the choice rule |. This requires branches as arguments and parses the first branch whose condition matches.

The choice rule
// In C++, this has the operator precedence we want, which worked out nicely.
condition1 >> rule1 | condition2 >> rule2 | ...

Such a choice corresponds to the following pseudo-code.

Manual implementation of choice
if (match(input, condition1)) (1)
  parse(input, rule1); (2)
else if (match(input, condition2))
  parse(input, rule2);
  1. If we match a condition, we take the branch. Of course, this requires backtracking if the condition did not match.

  2. When the condition did match, the input is not rewound and we can continue with the rule. If any errors occur now, it’s too late — we’ve committed to this branch and issue an error.

Note that we will not backtrack after a branch condition has been matched, no matter what! This is illustrated in the following example, where we use dsl::while_() with a branch.

Parsing dsl::while_(dsl::lit_c<'a'> >> dsl::lit_c<'b'> + dsl::lit_c<'c'>)
abcabcabd
^ start, try to match the condition

abcabcabd
-^ condition matched, we take the branch

abcabcabd
---^ branch matched, try to match condition of the next iteration

abcabcabd
----^ condition matched, we take the branch

abcabcabd
------^ branch matched, try to match condition of the next iteration

abcabcabd
-------^ condition matched, we take the branch

abcabcabd
--------^ error: expected `c` not `d`, however we no longer backtrack - branch was taken

With the choice rule, we can now parse unreleased or dot_version. As unreleased is a token, it is already a branch. But dot_version isn’t, so we need to give it a condition. Something like this does not work:

unreleased or dot_version, first attempt
struct version
{
    static constexpr auto rule = []{
        auto number = dsl::integer<int>(dsl::digits<>);
        auto dot    = dsl::period;

        auto dot_version = number + dot + number + dot + number;
        auto dot_version_condition = dsl::digit<>; (1)

        auto unreleased = LEXY_LIT("unreleased");

        return unreleased | dot_version_condition >> dot_version; (2)
    }();
};
  1. We only want to parse dot_version if we have a decimal digit, which is checked by dsl::digit<>.

  2. A choice of the two branches.

If we have an input like 1.2.3, we first try to match unreleased. This fails, so we try to match the condition of the second branch. dsl::digit<> matches, so we take the branch. However, dsl::digit<> consumes the digit! What is left once we try to parse dot_version is only .2.3, which is wrong.

We need to check for a digit without consuming it. This can be done with dsl::peek().

unreleased or dot_version, second attempt
struct version
{
    static constexpr auto rule = []{
        auto number = dsl::integer<int>(dsl::digits<>);
        auto dot    = dsl::period;

        auto dot_version = number + dot + number + dot + number;
        auto dot_version_condition = dsl::peek(dsl::digit<>); (1)

        auto unreleased = LEXY_LIT("unreleased");

        return unreleased | dot_version_condition >> dot_version; (2)
    }();
};
  1. We only want to parse dot_version if we have a decimal digit, which is checked by dsl::digit<>. dsl::peek() is a branch that matches the rule without consuming it.

  2. A choice of the two branches.

This works, but we can do better. Remember that the choice tries each branch strictly in order. So once it’s clear that it isn’t unreleased, it has to be dot_version (or an error). This means that as the condition of dot_version, we can just use a branch that is always taken. This branch is called dsl::else_.

unreleased or dot_version, third attempt
struct version
{
    static constexpr auto rule = []{
        auto number = dsl::integer<int>(dsl::digits<>);
        auto dot    = dsl::period;
        auto dot_version = number + dot + number + dot + number;

        auto unreleased = LEXY_LIT("unreleased");

        return unreleased | dsl::else_ >> dot_version;
    }();
};

Now that we’re successfully matching the input, we just need to produce a correct PackageVersion. Let’s consider the values produced by the choice rule. If our input is a version number like 1.2.3, we’re producing three ints, just as before. But if our input is unreleased, we’re not producing any values.

There are three things we can do.

The first solution is to simply add a default constructor to PackageVersion. If we parse unreleased, the lexy::construct<PackageVersion> callback will be invoked with zero arguments, which will itself invoke the default constructor of PackageVersion.

The second solution is to write a callback that has two overloads. The first one takes three ints and forwards them to the PackageVersion. The second one takes no arguments and creates a 0.0.0 PackageVersion manually.

Overloaded callback for the version production (godbolt)
struct version
{
    static constexpr auto rule = []{
        auto number = dsl::integer<int>(dsl::digits<>);
        auto dot    = dsl::lit_c<'.'>;
        auto dot_version = number + dot + number + dot + number;

        auto unreleased = LEXY_LIT("unreleased");

        return unreleased | dsl::else_ >> dot_version;
    }();

    // An overloaded callback.
    static constexpr auto value
      = lexy::callback<PackageVersion>(
            [](int a, int b, int c) { (1)
                // a is the result of the first number rule, b of the second, c of the third.
                return PackageVersion{a, b, c};
            },
            [] { (2)
                return PackageVersion{0, 0, 0};
            }
        );
};
  1. This callback will be invoked when we parse dot_version.

  2. This callback will be invoked when we parse unreleased.

The third solution is to produce three ints even if we take the unreleased branch. This can be done with the dsl::value_c<Constant> rule. It will accept any input without consuming anything, but it will always produce a value: the specified Constant. So we extend the unreleased branch to produce three zeroes once we take the branch:

Using dsl::value_c for the version production (godbolt)
struct version
{
    static constexpr auto rule = []{
        auto number = dsl::integer<int>(dsl::digits<>);
        auto dot    = dsl::lit_c<'.'>;
        auto dot_version = number + dot + number + dot + number;

        auto unreleased
          = LEXY_LIT("unreleased") >> dsl::value_c<0> + dsl::value_c<0> + dsl::value_c<0>; (1)

        return unreleased | dsl::else_ >> dot_version;
    }();

    static constexpr auto value = lexy::construct<PackageVersion>; (2)
};
  1. Produce the three zeroes.

  2. This callback will always be invoked with three integers.

To illustrate as many rules as possible, I’ve decided to just stick with this last solution. Your preference may vary, of course.

Parsing one package author

Before we go and parse the list of authors, we need to parse an individual one.

Package author
authors = ["Jonathan Müller"]

One author is just a quoted string.

We can easily parse it using the tools we’ve already covered:

String parsing, first attempt
struct author
{
    // Match zero or more code points ("characters") surrounded by quotation marks.
    // We capture the content without the quotes.
    static constexpr auto rule
      = dsl::lit_c<'"'> + dsl::capture(dsl::while_(dsl::code_point)) + dsl::lit_c<'"'>;

    // Convert the captured lexeme into a std::string.
    static constexpr auto value = lexy::as_string<std::string>;
};

However, this attempt does not quite work. First of all, we don’t want arbitrary code points in our string. It shouldn’t contain characters like line breaks. More importantly, the rule can never succeed.

The while rule uses the branch condition to determine whether or not it should try another iteration. Here, our argument is the token dsl::code_point, so the entire token is used as the condition. We repeat as long as we match code points, and this includes the closing " character.

If we had the equivalent regex ".*", it would just work fine. The regex star operator only repeats the rule as often as is necessary to make the pattern work.

Such "magic" is not done in lexy. It does exactly what you say it should do.

To fix this, we need a branch condition. We only want to match code points while we don’t have the closing ". For that, we can use dsl::peek_not(), which checks whether a rule would not match at the input without consuming anything.

String parsing, second attempt (godbolt)
struct author
{
    // Match zero or more code points ("characters") surrounded by quotation marks.
    // We capture the content without the quotes.
    static constexpr auto rule
      = dsl::lit_c<'"'>
        + dsl::capture(dsl::while_(dsl::peek_not(dsl::lit_c<'"'>) >> dsl::code_point))
        + dsl::lit_c<'"'>;

    // Convert the captured lexeme into a std::string.
    static constexpr auto value = lexy::as_string<std::string>;
};

While this works, it is not as efficient as it could be: To determine whether we should parse another character, we peek for the closing quote in the input. If we find it, we’re done with the loop, but the quote has not been consumed yet. Immediately afterwards, we have to match it a second time to actually consume it.

It’s also not quite as compact as I would like.

Luckily, parsing a quoted string is a common problem, so there is a predefined function in the library. We can use dsl::quoted(dsl::code_point) to match zero or more code points surrounded by quotes. The closing " is used as the condition to detect the end of the string, like we’ve just implemented, only more efficiently.

dsl::quoted() works differently than the other rules we’ve seen so far. Every rule that produces a value, like dsl::capture() or dsl::integer, produces only a single value. dsl::quoted(), on the other hand, can produce arbitrarily many values, for example one per iteration. As such, the values are not all collected as a parameter pack and forwarded to a callback; instead, a sink is used.

A sink is a callback that can be invoked multiple times. Every time it is invoked, all arguments are somehow added to an internal value, which is retrieved by calling .finish(). This allows building a container or std::string. If we write dsl::quoted(dsl::code_point), the sink will be invoked with the captured code point in each iteration.

String parsing, third attempt (godbolt)
struct author
{
    // Match zero or more code points ("characters") surrounded by quotation marks.
    static constexpr auto rule = dsl::quoted(dsl::code_point);       (1)

    // Add each captured code point to a std::string.
    static constexpr auto value                                       (2)
      = lexy::sink<std::string>([](std::string& result, auto lexeme) (3)
                                {
                                    result.append(lexeme.begin(), lexeme.end());
                                });
};
  1. We want code points surrounded by quotes. dsl::code_point is a pattern, so it will automatically be captured for us (as if by dsl::capture()) in each iteration.

  2. To provide a sink we use ::value just as before.

  3. lexy::sink creates a sink for us. It constructs an empty std::string and then invokes the lambda with each captured lexeme. We then append that to the string.

dsl::quoted() isn’t actually a function, but a function object. In the library, dsl::quoted() is defined as follows:

constexpr auto quoted = dsl::delimited(dsl::lit_c<'"'>);

You can use dsl::delimited() to define your own delimiters by giving it a pattern; you then give the resulting object the rule that is being delimited.
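For example, a hypothetical single-quoted string (not needed for our grammar) could be defined like this:

// A sketch: single quotes as both delimiters, analogous to dsl::quoted.
constexpr auto single_quoted = dsl::delimited(dsl::lit_c<'\''>);

// single_quoted(dsl::code_point) would then match zero or more code points
// surrounded by single quotes.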

Constructing a std::string by repeatedly appending a lexy::lexeme is a common use case, so we can also use lexy::as_string<std::string> for it. lexy::as_string is not just a callback that will construct a string from one argument, but also a sink that will repeatedly append the arguments to the string.

We also haven’t forbidden input such as "First line\nSecond line", where \n is a literal line break inside the string. To do that, we need to prevent certain code points from occurring in our string. We can do that using the minus rule implemented as operator-. a - b matches a but only succeeds if b did not match the input a just matched. With that, we can "subtract" certain character classes from our token.

String parsing, fourth attempt (godbolt)
struct author
{
    // Match zero or more non-control code points ("characters") surrounded by quotation marks.
    static constexpr auto rule = dsl::quoted(dsl::code_point - dsl::ascii::control);

    // Construct a string from the quoted content.
    static constexpr auto value = lexy::as_string<std::string>;
};

Here, we’ve prevented all control characters from occurring inside the string.

But what if we want to include a control character in the author’s name (however unlikely)? Or more importantly, how do we get a " in our string? dsl::quoted() will end once it reaches the final ".

For that, we need escape sequences. They can be conveniently defined using another rule and passed to dsl::quoted() as the second argument.

String parsing, final attempt (godbolt)
struct author
{
    // Match zero or more non-control code points ("characters") surrounded by quotation marks.
    // We allow `\"`, as well as `\u` and `\U` as escape sequences.
    static constexpr auto rule = [] {
        auto cp     = dsl::code_point - dsl::ascii::control;
        auto escape = dsl::backslash_escape                                (1)
                          .lit_c<'"'>()                                    (2)
                          .rule(dsl::lit_c<'u'> >> dsl::code_point_id<4>)  (3)
                          .rule(dsl::lit_c<'U'> >> dsl::code_point_id<8>);

        return dsl::quoted(cp, escape);
    }();

    // Construct a UTF-8 string from the quoted content.
    static constexpr auto value = lexy::as_string<std::string, lexy::utf8_encoding>; (4)
};
  1. We use \ as the escape character using dsl::backslash_escape. Alternatively, we could have used dsl::escape(dsl::lit_c<'\\'>).

  2. We want \" to mean ". Using .lit_c<'"'>() is equivalent to .rule(dsl::lit_c<'"'> >> dsl::value_c<'"'>). Whenever we encounter a " after the \, we produce the literal constant value ", which will be added to our sink.

  3. These two lines define \uXXXX and \UXXXXXXXX to specify character codes. dsl::code_point_id<N> is just a convenience for a dsl::integer rule that parses a code point using N hex digits.

  4. The \u and \U rules produce a lexy::code_point. lexy::as_string can only convert it back into a string if we tell it the encoding we want. So we add lexy::utf8_encoding as the second optional argument to enable that.

Parsing the package authors

Now we know how to parse one author, but the field can take a list of authors surrounded by square brackets.

Package author
authors = ["Jonathan Müller"]

Before you try writing something with dsl::while_(): that won’t actually work. The reason is that dsl::while_() does not work with rules that produce values, as dsl::while_() does not use a sink. Instead we need to use dsl::list(rule, sep). This matches a (non-empty) list of rule separated by sep.

The list rule (godbolt)
struct integer_list
{
    // Match a (non-empty) list of integers separated by commas.
    static constexpr auto rule = dsl::list(dsl::integer<int>(dsl::digits<>),
                                           dsl::sep(dsl::comma)); (1)

    // Add them all to a std::vector<int>.
    static constexpr auto value = lexy::as_list<std::vector<int>>; (2)
};
  1. dsl::comma is just dsl::lit_c<','>. We wrap it in dsl::sep() to indicate that this is a normal separator that is required between each item.

  2. The list will pass each value to the sink. Here, we’ve used lexy::as_list, which repeatedly calls .push_back().

How does the list know when to repeat an item? In general, this would require a branch whose condition will determine that. Here we don’t need a branch, as our separator is dsl::sep(). As this separator can only occur between items, we’re done with the list if we didn’t match a separator after our item.

If we wanted to use dsl::trailing_sep(), which allows an optional trailing separator, this is no longer possible. Then we need to add a condition to our list item, like dsl::peek(dsl::digit<>).
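Such a condition could look like this sketch, a variant of the integer_list production above:

struct trailing_integer_list
{
    // The dsl::peek(dsl::digit<>) condition tells the list whether another item follows,
    // since the (possibly trailing) separator can no longer decide that.
    static constexpr auto rule
        = dsl::list(dsl::peek(dsl::digit<>) >> dsl::integer<int>(dsl::digits<>),
                    dsl::trailing_sep(dsl::comma));

    static constexpr auto value = lexy::as_list<std::vector<int>>;
};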

Using dsl::list(), implementing an author_list production is pretty straightforward. Our list item is dsl::p<author>. This rule parses the specified production and produces the value of that production. Here, the value is a std::string, and we add it to our std::vector<std::string>.

The author_list production
struct author_list
{
    // Match a comma separated (non-empty) list of authors surrounded by square brackets.
    static constexpr auto rule
      = dsl::lit_c<'['> + dsl::list(dsl::p<author>, dsl::sep(dsl::comma)) + dsl::lit_c<']'>;

    // Collect all authors into a std::vector.
    static constexpr auto value = lexy::as_list<std::vector<std::string>>;
};

If we wanted to use dsl::trailing_sep() or even no separator, we would need a branch. Luckily, dsl::p is a branch if the rule of the production is a branch, and dsl::quoted() is a branch whose condition is the initial ". As such, dsl::p<author> is a branch already.

Surrounding things with some sort of brackets is also quite common. As such, the library provides dsl::brackets() to define a pair of opening and closing brackets, which can then be applied to a rule. dsl::square_bracketed, defined as dsl::brackets(dsl::lit_c<'['>, dsl::lit_c<']'>), is already predefined, so we can use it.

Writing dsl::square_bracketed(rule) will match the rule surrounded by square brackets. For the specific case of dsl::list(), we can also use dsl::square_bracketed.list(item, sep) instead. This has the additional advantage that the closing bracket will be used as branch condition for the list item.

The final author_list production (godbolt)
struct author_list
{
    // Match a comma separated (non-empty) list of authors surrounded by square brackets.
    static constexpr auto rule
        = dsl::square_bracketed.list(dsl::p<author>, dsl::sep(dsl::comma));

    // Collect all authors into a std::vector.
    static constexpr auto value = lexy::as_list<std::vector<std::string>>;
};

To recap all the implicit branch conditions:

  • Using dsl::sep() as list separator does not require a branch to parse a list. The separator itself is used to determine whether or not we need another list item. If we wanted to use dsl::trailing_sep() or no list separator, we would need a branch.

  • The dsl::p rule is a branch if the production’s rule is a branch.

  • dsl::quoted() is a branch that uses the initial quotation mark as condition. The same is true for every dsl::delimited().

  • dsl::square_bracketed() is a branch that uses the initial opening square bracket as condition. The same is true for every dsl::bracketed().

  • Using dsl::square_bracketed.list(...) never requires a branch condition in the list item. The list is considered done once we reach the closing square bracket, similar to the way dsl::quoted() worked. The same is true for every dsl::bracketed().

So while lexy requires a branch every time it needs to make a decision, in many situations the branches can be hidden away. This is thanks to utility rules such as dsl::delimited() and dsl::bracketed(). There is also dsl::terminated(), which works just like dsl::bracketed(), but has only a closing "bracket", not an opening one.
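A sketch of dsl::terminated(), assuming it mirrors the dsl::brackets() interface shown above (the semicolon-terminated list is a made-up example):

// Match a comma-separated list of integers terminated by a semicolon, e.g. `1, 2, 3;`.
// The terminator serves as the branch condition that ends the list.
constexpr auto terminated_list
    = dsl::terminated(dsl::semicolon).list(dsl::integer<int>(dsl::digits<>), dsl::sep(dsl::comma));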

Parsing the package config

We can now put everything together and parse our config:

The config production
struct config
{
    static constexpr auto rule = []{
        auto make_field = [](auto name, auto rule) {              (1)
            return name + dsl::lit_c<'='> + rule + dsl::newline;  (2)
        };

        auto name_field    = make_field(LEXY_LIT("name"), dsl::p<name>); (3)
        auto version_field = make_field(LEXY_LIT("version"), dsl::p<version>);
        auto authors_field
            = make_field(LEXY_LIT("authors"), dsl::p<author_list>);

        return name_field + version_field + authors_field; (4)
    }();

    static constexpr auto value = lexy::construct<PackageConfig>; (5)
};
  1. We define a little helper function that builds a rule that parses a field given its name and value.

  2. Each field consists of the name, an equal sign, the value rule, and a newline matched by the dsl::newline token.

  3. Define each field using the productions we’ve built above.

  4. Match them all in order.

  5. Construct the package config from the resulting std::string, PackageVersion and std::vector<std::string>.

This works!

We can now almost parse the sample input I’ve given above:

package.config
name=lexy
version=0.0.0
authors=["Jonathan Müller"]

We don’t support whitespace between the elements. We want to support ASCII blank characters (space and tab) surrounding the equal sign and the brackets and comma of the author list. This can be done either manually or automatically.

Manual whitespace skipping

To do this manually, we can use dsl::whitespace(dsl::ascii::blank). This rule, like dsl::while_(), matches zero or more occurrences of the given rule (but internally it is treated as whitespace, not actual text). We then insert it wherever we need to skip whitespace.

// Define whitespace globally for convenience.
constexpr auto ws = dsl::whitespace(dsl::ascii::blank);

struct config
{
    static constexpr auto rule = []{
        auto make_field = [](auto name, auto rule) {
            // Skip whitespace surrounding the equal sign and before the newline.
            return name + ws + dsl::lit_c<'='> + ws + rule + ws + dsl::newline;
        };

        
    }();
};

// Likewise, add it to the author_list production.

Automatic whitespace skipping

Skipping whitespace manually is a good idea when whitespace is only needed in a couple of places or you’re copying a grammar that already specifies whitespace. Here, however, it just adds extra noise to the rule.

So instead we can instruct lexy to skip whitespace automatically for us. We just need to tell the library what whitespace is, and it will automatically skip it after every token it parses. Remember, tokens are things like LEXY_LIT("name") or dsl::lit_c<'='>: precisely the places where we inserted ws in the example above!

struct config
{
    static constexpr auto whitespace = dsl::ascii::blank; (1)

    static constexpr auto rule = [] {  } (); (2)
    static constexpr auto value = lexy::construct<PackageConfig>;
};
  1. Define what whitespace is for our grammar.

  2. Nothing needs to change in any of the rules here!

We enable whitespace by adding a whitespace member to the root production, i.e. the production we’re actually parsing. And that’s all: now lexy will skip whitespace after every token of our grammar.

This is a bit much, however. For example, the following now parses:

name    = le   x  y
version = 0.  0  .0
authors = ["Jonathan Müller"]

The name production consists of a sequence of tokens like dsl::ascii::alpha. lexy will skip whitespace after every one of them. Likewise, it will skip whitespace after the dsl::period and dsl::digits of the version production.

So we need to disable whitespace skipping there. Conceptually, the name and version production should be treated just like tokens: we don’t want whitespace inside of them, but only skip it afterwards. We can get that behavior by inheriting them from lexy::token_production:

struct name : lexy::token_production
{
    
};

struct version : lexy::token_production
{
    
};

// Other productions unchanged.

Now when we parse the name and version field, lexy disables whitespace skipping for the tokens inside the productions, and will only skip spaces afterwards.

Note that we don’t need to do the same for the author production. While it is also a production that should be treated as a single token (a string literal), whitespace skipping inside of dsl::quoted() is disabled automatically for us. So " Jonathan Müller" will always include the leading spaces.

To recap, to enable automatic whitespace skipping, we just need to do the following:

  • Add a static constexpr auto whitespace member that defines whitespace to our root production config.

  • Disable whitespace skipping inside the name and version field by inheriting the productions from lexy::token_production.

Now we can parse the package config shown in the beginning of the tutorial!

Arbitrary ordering of fields

One final feature we might want to support is parsing fields in arbitrary order. This can be done with the dsl::combination() rule, which matches each of the specified rules exactly once, but in any order. The values of each rule are passed to a sink, to prevent an exponential number of template instantiations. This is a problem though: how can we know which value should be assigned to which member of our PackageConfig?

We can specify a given member using LEXY_MEM(name) = rule. This says that the value produced by rule should be assigned to a member named name. The lexy::as_aggregate<T> sink then constructs a T object and processes all member assignments, in whatever order they might occur.

The final config production
struct config
{
    static constexpr auto whitespace = dsl::ascii::blank;

    static constexpr auto rule = [] {
        auto make_field = [](auto name, auto rule) {
            return name >> dsl::lit_c<'='> + rule + dsl::newline; (1)
        };

        auto name_field    = make_field(LEXY_LIT("name"), LEXY_MEM(name) = dsl::p<name>); (2)
        auto version_field
            = make_field(LEXY_LIT("version"), LEXY_MEM(version) = dsl::p<version>);
        auto authors_field
            = make_field(LEXY_LIT("authors"), LEXY_MEM(authors) = dsl::p<author_list>);

        return dsl::combination(name_field, version_field, authors_field); (3)
    }();

    static constexpr auto value = lexy::as_aggregate<PackageConfig>; (4)
};
  1. dsl::combination() requires a branch condition to know which rule to parse. Luckily, we can use the name of the field for that.

  2. Each rule now contains the assignment to the appropriate member.

  3. Instead of a sequence, we now have dsl::combination().

  4. We use lexy::as_aggregate<PackageConfig> as our sink.

This will match each field exactly once, but in any order.

Error handling

Our parser now handles all well-formed input, but what about wrong input?

Parsing the entire input

The first thing you might notice is that you can freely append stuff at the end of the config file.

package.config
name    = lexy
version = 0.0.0
authors = ["Jonathan Müller"]
Hello World!
asdfjlagnlwefhjlaghlhl

The reason for that is simple: when we parse a production, we only consume as much input as necessary for it and don’t look at anything else. To prevent that, we need to use dsl::eof. This token only matches when we’re at the end of the input.

Preventing trailing input
struct config
{
    static constexpr auto rule = [] {
        

        return dsl::combination(name_field, version_field, authors_field)
                + dsl::eof;
    }();
};

Note that this does not allow trailing newlines, as we’ve required EOF immediately after all the fields. To fix that, we can manually instruct lexy to skip any whitespace character, not just blanks.

Allowing trailing newlines
struct config
{
    static constexpr auto rule = [] {
        

        return dsl::combination(name_field, version_field, authors_field)
                + dsl::whitespace(dsl::ascii::space) + dsl::eof;
    }();
};

Error messages

When the parsing algorithm fails to parse something, parsing stops and an error is raised. This error is given to the error callback passed as the second argument to lexy::parse() and lexy::validate(). The callback is invoked with two arguments. The first is a lexy::error_context<Production, Input>, which contains contextual information like the name and location of the production that failed. The second is a lexy::error<Reader, Tag>. It is always associated with a location, but can have additional information depending on the Tag.
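A minimal error callback could look like this sketch of the two-argument interface just described; the accessors are assumptions, see lexy_ex::report_error for a complete implementation:

#include <cstdio>

constexpr auto my_report_error
    = lexy::callback<void>([](const auto& context, const auto& error) {
          // context.production() is assumed to return the failing production's name.
          std::fprintf(stderr, "error while parsing '%s'\n", context.production());
          (void)error; // inspect error.position() and the tag-specific accessors as needed
      });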

lexy::error<Reader, lexy::expected_literal>

A lexy::expected_literal error is raised when we’ve instructed the parse algorithm to parse a literal sequence of characters, but it couldn’t match them. It contains information about the expected literal and the position and character at which matching failed.

lexy::error<Reader, lexy::expected_char_class>

A lexy::expected_char_class error is raised when we’ve instructed the parse algorithm to parse one of a specified set of characters, but it couldn’t match any of those. It contains a user-friendly name of the character class.

lexy::error<Reader, Tag>

Otherwise, it is a generic error. The Tag is an empty class that can be given a message, which the error reports. It is raised for example by a choice where no branch has matched.

In the full source code found at examples/tutorial.cpp, the error callback is lexy_ex::report_error. This callback is not part of the library proper, but can be copied and adapted for your own needs. It simply formats the error nicely and prints it to stderr.

By default, the error messages are pretty good. You can try various malformed inputs and see what the library reports. Some example error messages are given below.

Name that starts with an underscore.
error: while parsing name
     |
 1: 8| name = _lexy
     |        ^ expected 'ASCII.alpha' character
Missing version number
error: while parsing version
     |
 2:11| version = 0.0
     |           ~~~^ expected '.'
Author name not quoted.
error: while parsing author_list
     |
 3:12| authors = [Jonathan Müller]
     |            ^ expected '"'

Specifying custom error tags

However, some generic errors are a bit confusing if you haven’t written the grammar. For example, if you write a string literal that contains a control character, you get the generic minus failure error message. Luckily, the minus rule is actually a token and every token has a .error member. This allows specifying the error that will be reported if the token didn’t match.

author production with a custom error
struct author
{
    struct invalid_character (1)
    {
        static constexpr auto name = "invalid string character"; (2)
    };

    static constexpr auto rule = [] {
        auto cp = (dsl::code_point - dsl::ascii::control).error<invalid_character>; (3)

        
    }();

    
};
  1. The tag that will be associated with the error.

  2. We override the default message (which would be author::invalid_character) to the more friendly invalid string character.

  3. We specify that on token failure, we want a generic error with the given tag.

Likewise, if we specify the same field twice we get the generic combination duplicate error message. Additionally, if we add an unknown field we get the generic exhausted_choice error. Both issues can be improved by specifying custom tags in our dsl::combination() call.

config production with tagged dsl::combination()
struct config
{
    struct unknown_field (1)
    {
        static constexpr auto name = "unknown config field"; (2)
    };
    struct duplicate_field (1)
    {
        static constexpr auto name = "duplicate config field"; (2)
    };

    static constexpr auto rule = [] {
        

        auto combination = dsl::combination(name_field, version_field, authors_field)
                               .missing_error<unknown_field>.duplicate_error<duplicate_field>; (3)
        return combination + dsl::whitespace(dsl::ascii::space) + dsl::eof;
    }();
};
  1. Define the tags.

  2. Override the default message, which is the type name.

  3. Specify the error on failure. The missing error is the one triggered when no field condition matched, the duplicate one if we had a field twice.

Now an invalid string character is reported as invalid string character and a duplicated config field as duplicate config field:

Missing closing string delimiter
error: while parsing author
     |
 3:28| authors = ["Jonathan Müller]
     |              ~~~~~~~~~~~~~~~^ invalid string character
Duplicate config field error
error: while parsing config
     |
 1: 1| name = lexy
     | ^ beginning here
     |
 3: 1| version = 0.0.0
     | ^^^^^^^^^^^^^^^ duplicate config field

Using dsl::require() and dsl::prevent() to handle common mistakes

There are more error messages that could be improved. For example, when you have a name like my-package, you get an "expected newline" error pointing to the first -, as that’s where the name production stops parsing. We can improve that using dsl::require(). This rule raises an error with the specified tag if the pattern would not match at the input, but it doesn’t actually consume anything.

name production with dsl::require
struct name
{
    struct invalid_character (1)
    {
        static constexpr auto name = "invalid name character"; (2)
    };

    static constexpr auto rule = [] {
        

        return dsl::capture(lead_char + dsl::while_(trailing_char))
               + dsl::require(dsl::ascii::space).error<invalid_character>; (3)
    }();
};
  1. Define a tag.

  2. Give it a custom message.

  3. Issue the error unless the name is followed by the required space character (either trailing whitespace or the newline).

Now the error message looks like this instead.

Invalid name character error
error: while parsing name
     |
 1:10| name = my-package
     |        ~~^ invalid name character

Likewise, we can use dsl::prevent(), which fails if a pattern would match, to reject a build string in our version.

version production with dsl::prevent()
struct version
{
    struct forbidden_build_string (1)
    {
        static constexpr auto name = "build string not supported"; (2)
    };

    static constexpr auto rule = [] {
        

        return number + dot + number + dot + number
               + dsl::prevent(dsl::lit_c<'-'>).error<forbidden_build_string>; (3)
    }();
};
  1. Define a tag.

  2. Give it a custom message.

  3. Raise the error when the beginning of a build string is encountered.

Forbidden build string
error: while parsing version
     |
 2:16| version = 0.0.0-alpha
     |           ~~~~~^ build string not supported

Error Recovery

lexy can also recover from an error and continue parsing. In the easy cases, this error recovery is done automatically for us, for example when parsing an author field of ["author 1" "author 2"]. Even though the comma is missing (and we’ll get the appropriate error), parsing continues and we’re getting a config object with the two authors.

However, sometimes we need to do error recovery ourselves. This can be done with the dsl::try_() rule. It parses a given rule, and will do something to recover from it if parsing fails.

Consider the code that parses one config field:

auto make_field = [](auto name, auto rule) {
    return name >> dsl::lit_c<'='> + rule + dsl::newline;
};

Note that the = sign between the name and the value is not required to be able to parse it; something like version 1.0.0 is not ambiguous. So instead of specifying dsl::lit_c<'='>, we can use dsl::try_(dsl::lit_c<'='>): this tries to parse an = sign and issues an error if there isn’t one, but then it just continues as if nothing happened. So version 1.0.0 will lead to an error message complaining about the missing =, but still give you the appropriate config object. Note that this is unlike dsl::if_(dsl::lit_c<'='>), which would not raise an error if there is no =, as there the = is optional.

Similarly, we can help recover if there isn’t a newline after the value. Input like name = my-package will raise the invalid name character error as demonstrated above. This is not a fatal error by design of dsl::require(), so parsing continues and tries to parse dsl::newline. The latter will fail though and abort parsing.

Instead of dsl::newline, we can use dsl::try_(dsl::newline, dsl::until(dsl::newline)). This tries to parse a newline, but if there isn’t one, it consumes all input until it finds one. Then name = my-package will set the name to my, raise an invalid name character error and a missing newline error, but then continue with the next field entry.
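Putting both recoveries together, the field helper could now look like this sketch:

auto make_field = [](auto name, auto rule) {
    // Issue an error but continue parsing if the equal sign is missing.
    auto equal_sign = dsl::try_(dsl::lit_c<'='>);
    // If the newline is missing, skip everything up to and including the next one.
    auto newline = dsl::try_(dsl::newline, dsl::until(dsl::newline));

    return name >> equal_sign + rule + newline;
};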

We can also leverage dsl::try_() to recover from input like version = 1.0 (instead of 1.0.0). Simply make every dot and number "optional" as shown here:

auto number      = dsl::try_(dsl::integer<int>(dsl::digits<>), dsl::value_c<0>); (1)
auto dot         = dsl::try_(dsl::period); (2)
auto dot_version = number + dot + number + dot + number
                   + dsl::prevent(dsl::lit_c<'-'>).error<forbidden_build_string>;
  1. If we didn’t have an integer, produce a 0.

  2. If we didn’t have a dot, just ignore it.

Many more things can be done, once common errors are known, but this is enough for the tutorial.


Congratulations, you’ve worked through your first parser!

Now you know everything to get started with parsing your own input. Check out the reference documentation for specific rules.