Header lexy/dsl/whitespace.hpp

Facilities for skipping whitespace.

By default, lexy does not treat whitespace in any particular way and it has to be parsed just like anything else in the input. However, as there are grammars that allow whitespace in a lot of places, it is often convenient to have it taken care of. lexy can be instructed to handle whitespace, using either manual or automatic whitespace skipping.

Manual whitespace skipping is done using lexy::dsl::whitespace(ws). It skips zero or more occurrences of the whitespace defined by ws and can be inserted anywhere you want to skip over whitespace. This method is recommended where whitespace is an essential part of the grammar. See email.cpp or xml.cpp for examples of manual whitespace skipping.

Automatic whitespace skipping is done by adding a static constexpr auto whitespace member to the root production. This is a rule that defines the default whitespace for the entire grammar, just as the ws argument does for manual skipping. lexy then skips zero or more occurrences of whitespace after every token rule in the grammar, unless it has been manually disabled (see below). This method is recommended where whitespace is not important and is just there to format the input nicely. See tutorial.cpp or json.cpp for examples of automatic whitespace skipping.

Note
"Whitespace" does not mean literal whitespace characters. It can also include comments (or whatever else you want).

Rule lexy::dsl::whitespace (manual)

lexy/dsl/whitespace.hpp
namespace lexy::dsl
{
    class ws-rule // models rule
    {};

    constexpr ws-rule whitespace(rule auto ws);

    constexpr ws-rule operator|(ws-rule lhs, rule auto rhs);
    constexpr ws-rule operator|(rule auto lhs, ws-rule rhs);

    constexpr ws-rule operator/(ws-rule lhs, token-rule auto rhs);
    constexpr ws-rule operator/(token-rule auto lhs, ws-rule rhs);
}

The manual whitespace overload is a rule that skips whitespace as defined by its argument.

Requires
Parses

Parses lexy::dsl::loop(ws | lexy::dsl::else_ >> lexy::dsl::break_) in a context where whitespace skipping is disabled.

Errors

All errors raised during parsing of ws | lexy::dsl::else_ >> lexy::dsl::break_. The rule then fails if ws has failed, even in a branch context.

Parse tree

A single token node with the lexy::predefined_token_kind lexy::whitespace_token_kind whose range covers everything consumed; all individual token nodes of the whitespace rules are merged into this one. It is only added to the parse tree if it is not empty.
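
The loop and its error behavior can be modelled in plain C++ without lexy. The following is a minimal sketch, not lexy's implementation: `ws_result`, `match_space`, `match_c_comment` and `skip_whitespace` are hypothetical names, and the choice of ws (one ASCII whitespace character, or a C-style comment) is an assumption made for illustration.

```cpp
#include <cstddef>
#include <optional>
#include <string_view>

// Outcome of trying one whitespace alternative at a position.
enum class ws_result { no_match, matched, error };

// Models a rule that consumes exactly one ASCII whitespace character.
inline ws_result match_space(std::string_view in, std::size_t& pos)
{
    if (pos < in.size()) {
        const char c = in[pos];
        if (c == ' ' || c == '\t' || c == '\n' || c == '\r' || c == '\f' || c == '\v') {
            ++pos;
            return ws_result::matched;
        }
    }
    return ws_result::no_match;
}

// Models a branch "/*" >> until("*/"): once "/*" has been seen the branch
// is taken, so a missing "*/" is an error rather than "no match".
inline ws_result match_c_comment(std::string_view in, std::size_t& pos)
{
    if (in.substr(pos, 2) != "/*")
        return ws_result::no_match;
    const std::size_t end = in.find("*/", pos + 2);
    if (end == std::string_view::npos)
        return ws_result::error;
    pos = end + 2;
    return ws_result::matched;
}

// Models loop(ws | else_ >> break_): try ws repeatedly and stop at the
// first position where no alternative matches. Returns the number of
// characters skipped (possibly zero), or std::nullopt if ws itself failed.
inline std::optional<std::size_t> skip_whitespace(std::string_view in, std::size_t pos)
{
    const std::size_t begin = pos;
    for (;;) {
        ws_result r = match_space(in, pos);
        if (r == ws_result::no_match)
            r = match_c_comment(in, pos);
        if (r == ws_result::error)
            return std::nullopt;
        if (r == ws_result::no_match)
            break;
    }
    return pos - begin;
}
```

In this model, input without any whitespace is skipped successfully with a length of zero, while an unterminated comment makes the whole skip fail.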

For convenience, operator| and operator/ are overloaded for the whitespace rule. Here, whitespace(a) | b is entirely equivalent to whitespace(a | b), and likewise for the other overloads. They simply allow adding more whitespace to a rule after it has already been wrapped in whitespace.
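
The equivalence can be illustrated with a small stand-alone model. This is only a sketch under simplifying assumptions: rules are reduced to sets of single characters, and `char_set`, `ws_rule` and the free functions below are hypothetical stand-ins, not lexy types.

```cpp
#include <cstddef>
#include <string>
#include <string_view>
#include <utility>

// Hypothetical stand-in for a rule: matches one character out of `chars`.
struct char_set
{
    std::string chars;
};

// a | b matches whatever a or b matches.
inline char_set operator|(char_set a, const char_set& b)
{
    a.chars += b.chars;
    return a;
}

// Hypothetical stand-in for ws-rule: skips zero or more characters of `ws`.
struct ws_rule
{
    char_set ws;

    // Returns the position of the first character that is not whitespace.
    std::size_t skip(std::string_view in, std::size_t pos) const
    {
        while (pos < in.size() && ws.chars.find(in[pos]) != std::string::npos)
            ++pos;
        return pos;
    }
};

inline ws_rule whitespace(char_set ws)
{
    return ws_rule{std::move(ws)};
}

// whitespace(a) | b and b | whitespace(a) both forward to whitespace(a | b).
inline ws_rule operator|(ws_rule lhs, const char_set& rhs)
{
    return whitespace(std::move(lhs.ws) | rhs);
}
inline ws_rule operator|(const char_set& lhs, ws_rule rhs)
{
    return whitespace(lhs | std::move(rhs.ws));
}
```

With these definitions, `whitespace(char_set{" "}) | char_set{"\t"}` and `whitespace(char_set{" "} | char_set{"\t"})` skip exactly the same input.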

Example 1. Simple manual whitespace skipping
struct production
{
    static constexpr auto rule = [] {
        auto ws = dsl::whitespace(dsl::ascii::space);
        return LEXY_LIT("Hello") + ws + LEXY_LIT("World") //
               + ws + dsl::exclamation_mark + ws + dsl::eof;
    }();
};
Tip
Use lexy::dsl::ascii::space to skip all ASCII whitespace characters.

Rule lexy::dsl::whitespace (automatic)

lexy/dsl/whitespace.hpp
namespace lexy::dsl
{
    constexpr rule whitespace;
}

The automatic whitespace rule skips whitespace as defined by the grammar.

It behaves exactly like the manual whitespace(ws) overload, where ws is determined as follows:

  1. If automatic whitespace skipping has been disabled (e.g. by using lexy::dsl::no_whitespace()), ws is the rule that matches the empty string. As such, whitespace does not advance the reader.

  2. If lexy::production_whitespace for the current production and the root production is non-void, ws is that rule. Here, the root production is determined by following any lexy::dsl::p or lexy::dsl::recurse calls backwards, until either the top-level production originally passed to a parse function or a production inheriting from lexy::token_production is reached. This is then the root production.

  3. Otherwise (if it is void), ws is the rule that matches the empty string and whitespace does not advance the reader at all.
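
The three cases can be sketched as a compile-time lookup. The code below is a simplified model and not lexy's implementation: whitespace is reduced to a single character, `'\0'` stands for "the rule that matches the empty string", and `has_whitespace` and `effective_whitespace` are hypothetical helpers. It assumes, as the comments in Example 4 indicate, that a whitespace member of the current production takes precedence over the one of the root production. The production names are borrowed from Example 4.

```cpp
#include <type_traits>

// Detects a static `whitespace` member (hypothetical helper, C++17).
template <typename T, typename = void>
struct has_whitespace : std::false_type {};
template <typename T>
struct has_whitespace<T, std::void_t<decltype(T::whitespace)>> : std::true_type {};

// Returns the whitespace character in effect, or '\0' if nothing is skipped.
template <typename Current, typename Root>
constexpr char effective_whitespace(bool skipping_disabled)
{
    if (skipping_disabled)
        return '\0'; // case 1: automatic skipping has been disabled

    if constexpr (has_whitespace<Current>::value)
        return Current::whitespace; // case 2: the current production overrides
    else if constexpr (has_whitespace<Root>::value)
        return Root::whitespace; // case 2: fall back to the root production
    else
        return '\0'; // case 3: no whitespace defined anywhere
}

struct root_production { static constexpr char whitespace = '+'; };
struct inner_normal {};
struct inner_override { static constexpr char whitespace = '-'; };
struct root_without_whitespace {};

static_assert(effective_whitespace<inner_normal, root_production>(false) == '+');
static_assert(effective_whitespace<inner_override, root_production>(false) == '-');
static_assert(effective_whitespace<inner_override, root_production>(true) == '\0');
static_assert(effective_whitespace<inner_normal, root_without_whitespace>(false) == '\0');
```

The model leaves out how the root production itself is found; it only shows the lookup once the current and root productions are known.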

This rule is automatically parsed after every token rule, a lexy::dsl::p or lexy::dsl::recurse rule that parses a production inheriting from lexy::token_production, or after a lexy::dsl::no_whitespace rule. Note that unless whitespace has been defined, this has no effect.

Example 2. Simple automatic whitespace skipping
struct production
{
    static constexpr auto whitespace = dsl::ascii::space;
    static constexpr auto rule                  //
        = LEXY_LIT("Hello") + LEXY_LIT("World") //
          + dsl::exclamation_mark + dsl::eof;
};
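
To make "skipped after every token rule" concrete, the rule of Example 2 can be modelled by hand in standard C++. This is a sketch under assumptions, not how lexy parses: `token_then_whitespace` and `parse_hello_world` are hypothetical names, and only the space character counts as whitespace here, whereas the example uses dsl::ascii::space.

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical helper: matches `literal` at `pos`, then skips any spaces
// that follow it, modelling "whitespace is skipped after every token rule".
inline bool token_then_whitespace(std::string_view in, std::size_t& pos,
                                  std::string_view literal)
{
    if (in.substr(pos, literal.size()) != literal)
        return false;
    pos += literal.size();
    while (pos < in.size() && in[pos] == ' ')
        ++pos;
    return true;
}

// Models "Hello" + "World" + '!' + eof with ' ' as the whitespace.
inline bool parse_hello_world(std::string_view in)
{
    std::size_t pos = 0;
    return token_then_whitespace(in, pos, "Hello")
           && token_then_whitespace(in, pos, "World")
           && token_then_whitespace(in, pos, "!")
           && pos == in.size(); // eof: nothing may remain
}
```

Because skipping only happens after a token, this model accepts `"Hello World !"` and `"HelloWorld!"`, but rejects input with leading whitespace such as `" Hello World!"`.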
Example 3. Comments can be whitespace too
struct production
{
    // Note that an unterminated C comment will raise an error.
    static constexpr auto whitespace
        = dsl::ascii::space | LEXY_LIT("/*") >> dsl::until(LEXY_LIT("*/"));

    static constexpr auto rule //
        = LEXY_LIT("Hello") + LEXY_LIT("World") + dsl::exclamation_mark;
};
Example 4. How whitespace is determined
// An inner production that does not override the whitespace.
struct inner_normal
{
    // After every token in this rule, the whitespace is '+',
    // as determined by its root production `production`.
    static constexpr auto rule //
        = dsl::parenthesized(LEXY_LIT("inner") + LEXY_LIT("normal"));
};

// An inner production that overrides the current whitespace definition.
struct inner_override
{
    static constexpr auto whitespace = dsl::lit_c<'-'>;

    // After every token in this rule, the whitespace is '-',
    // as determined by the `whitespace` member of the current production.
    static constexpr auto rule //
        = dsl::parenthesized(LEXY_LIT("inner") + LEXY_LIT("override"));
};

// A token production that does not have inner whitespace.
struct inner_token : lexy::token_production
{
    struct inner_inner
    {
        // No whitespace is skipped here, as its root production is `inner_token`,
        // which does not have a `whitespace` member.
        static constexpr auto rule = LEXY_LIT("inner") + LEXY_LIT("token");
    };

    // No whitespace is skipped here, as the current production inherits from
    // `lexy::token_production`.
    static constexpr auto rule = dsl::parenthesized(dsl::p<inner_inner>);
};

// A token production that does have inner whitespace, but a different one.
struct inner_token_whitespace : lexy::token_production
{
    struct inner_inner
    {
        // After every token in this rule, the whitespace is '_',
        // as determined by its root production `inner_token_whitespace`.
        static constexpr auto rule //
            = LEXY_LIT("inner") + LEXY_LIT("token") + LEXY_LIT("whitespace");
    };

    static constexpr auto whitespace = dsl::lit_c<'_'>;

    // No whitespace is skipped here, as the current production inherits from
    // `lexy::token_production`.
    static constexpr auto rule = dsl::parenthesized(dsl::p<inner_inner>);
};

// The root production defines whitespace.
struct production
{
    static constexpr auto whitespace = dsl::lit_c<'+'>;

    // After every token in this rule, the whitespace is '+',
    // as determined by the `whitespace` member of the current production.
    // Whitespace is also skipped after the two token productions.
    static constexpr auto rule
        = dsl::p<inner_normal> + dsl::comma + dsl::p<inner_override> + dsl::comma
          + dsl::p<inner_token> + dsl::comma + dsl::p<inner_token_whitespace> //
          + dsl::period + dsl::eof;
};
Note

As seen in the example above, all whitespace skipping is disabled directly inside the lexy::token_production inner_token_whitespace, even though it has a ::whitespace member. Otherwise, the last token rule of the production (the ')') would skip whitespace according to the current ::whitespace member ('_'), and the lexy::dsl::p rule of the parent production that started the parse (production) would then also skip whitespace, but according to the ::whitespace member of its own root production ('+'). As such, two different kinds of whitespace would be skipped directly after each other.

To enable whitespace skipping inside a token production, put all logic into a child production and directly and only parse that one via lexy::dsl::p. Inside the child production, whitespace is skipped again, as seen by the inner_inner production.

Tip
Use whitespace to skip optional whitespace at the beginning of the input.
Tip
Use lexy::dsl::ascii::space to skip all ASCII whitespace characters.

Rule lexy::dsl::no_whitespace

lexy/dsl/whitespace.hpp
namespace lexy::dsl
{
    constexpr rule        no_whitespace(rule auto rule);
    constexpr branch-rule no_whitespace(branch-rule auto rule);
}

no_whitespace is a rule that parses rule without automatic whitespace skipping.

(Branch) Parsing

Parses rule in a context where there is no current whitespace rule and lexy::dsl::whitespace does nothing.

Errors

All errors raised by rule. The rule then fails if rule has failed.

Values

All values produced by rule.

Example 5. Disable whitespace between two tokens
struct production
{
    static constexpr auto whitespace = dsl::ascii::space;
    static constexpr auto rule                                      //
        = dsl::no_whitespace(LEXY_LIT("Hello") + LEXY_LIT("World")) //
          + dsl::exclamation_mark + dsl::eof;
};
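
The effect of this example can be modelled with a parser that carries an "automatic skipping enabled" flag. The sketch below is not lexy code: `mini_parser` and `parse_example` are hypothetical, only the space character counts as whitespace, and the skip after the no_whitespace rule follows the statement above that whitespace is parsed after a lexy::dsl::no_whitespace rule.

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical hand-written parser state with a skipping flag.
struct mini_parser
{
    std::string_view input;
    std::size_t      pos      = 0;
    bool             skipping = true;

    // Automatic whitespace: does nothing while skipping is disabled.
    void skip()
    {
        if (!skipping)
            return;
        while (pos < input.size() && input[pos] == ' ')
            ++pos;
    }

    // A token rule: match the literal, then skip whitespace.
    bool token(std::string_view literal)
    {
        if (input.substr(pos, literal.size()) != literal)
            return false;
        pos += literal.size();
        skip();
        return true;
    }

    // Parses `rule` with skipping disabled, restores the previous state,
    // and skips whitespace once after the rule has succeeded.
    template <typename Rule>
    bool no_whitespace(Rule rule)
    {
        const bool saved = skipping;
        skipping         = false;
        const bool ok    = rule(*this);
        skipping         = saved;
        if (ok)
            skip();
        return ok;
    }
};

// Models no_whitespace("Hello" + "World") + '!' + eof.
inline bool parse_example(std::string_view text)
{
    mini_parser p{text};
    return p.no_whitespace([](mini_parser& q) {
               return q.token("Hello") && q.token("World");
           })
           && p.token("!") && p.pos == p.input.size();
}
```

In this model `"HelloWorld !"` is accepted, because whitespace is skipped again once the no_whitespace rule has finished, while `"Hello World!"` is rejected.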
Tip
In most situations, you should prefer a lexy::token_production instead. no_whitespace is mostly used as an implementation detail for rules that should never have whitespace skipping, like lexy::dsl::delimited.
Caution
When rule contains a lexy::dsl::p or lexy::dsl::recurse rule, whitespace skipping is re-enabled while parsing the production.

See also