Header lexy/dsl/whitespace.hpp

Facilities for skipping whitespace.

By default, lexy does not treat whitespace in any particular way and it has to be parsed just like anything else in the input. However, as there are grammars that allow whitespace in a lot of places, it is often convenient to have it taken care of. lexy can be instructed to handle whitespace, using either manual or automatic whitespace skipping.

Manual whitespace skipping is done using lexy::dsl::whitespace. It skips zero or more occurrences of the whitespace defined by its argument ws and can be inserted wherever you want to skip over whitespace. This method is recommended where whitespace is an essential part of the grammar. See email.cpp or xml.cpp for examples of manual whitespace skipping.

Automatic whitespace skipping is done by adding a static constexpr auto whitespace member to the root production. This is a rule that defines the default whitespace for the entire grammar, just as the ws argument does for manual whitespace skipping. lexy then skips zero or more occurrences of whitespace after every token rule in the grammar, unless it has been manually disabled (see below). This method is recommended where whitespace is not important and is just there to format the input nicely. See config.cpp or json.cpp for examples of automatic whitespace skipping.

Note
"Whitespace" does not mean literal whitespace characters. It can also include comments (or whatever else you want).

Rule lexy::dsl::whitespace

lexy/dsl/whitespace.hpp
namespace lexy::dsl
{
    class ws-rule // models rule
    {};

    constexpr ws-rule whitespace(rule auto ws);

    constexpr ws-rule operator|(ws-rule lhs, rule auto rhs);
    constexpr ws-rule operator|(rule auto lhs, ws-rule rhs);
}

whitespace is a rule that skips whitespace as defined by its argument ws.

Requires
Parsing

Parses lexy::dsl::loop(ws | lexy::dsl::else_ >> lexy::dsl::break_) in a context where whitespace skipping is disabled.

Errors

All errors raised while parsing ws | lexy::dsl::else_ >> lexy::dsl::break_. The rule then fails if ws has failed, even in a branch context.

Parse tree

A single token node with the lexy::predefined_token_kind lexy::whitespace_token_kind whose range covers everything consumed; all individual token nodes of the whitespace rules are merged into this one. It is only added to the parse tree if it is not empty.

For convenience, operator| is overloaded for the whitespace rule. Here, whitespace(a) | b is entirely equivalent to whitespace(a | b), and likewise for the other overload. These overloads simply allow adding more whitespace alternatives to a rule after it has already been wrapped in whitespace.
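The skipping loop and the effect of adding alternatives can be modelled without lexy. The following is a minimal sketch using only the standard library; WsRule, either, skip_whitespace, blank and dash are hypothetical names chosen for illustration and are not part of lexy's API, and the code is not how lexy implements the rule.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string_view>

// Hypothetical stand-in for a whitespace rule `ws`: tries to consume one unit
// of whitespace at `pos` and returns the number of characters consumed
// (0 means "did not match").
using WsRule = std::function<std::size_t(std::string_view, std::size_t)>;

// Models `a | b`: try `a` first and fall back to `b`.
WsRule either(WsRule a, WsRule b)
{
    return [a, b](std::string_view in, std::size_t pos) {
        std::size_t n = a(in, pos);
        return n != 0 ? n : b(in, pos);
    };
}

// Models `loop(ws | else_ >> break_)`: consume `ws` zero or more times and
// return the position just after the skipped range.
std::size_t skip_whitespace(std::string_view in, std::size_t pos, const WsRule& ws)
{
    while (pos < in.size())
    {
        std::size_t n = ws(in, pos);
        if (n == 0)
            break; // corresponds to the `else_ >> break_` alternative
        pos += n;
    }
    return pos;
}

// Two simple whitespace definitions used for the demonstration.
std::size_t blank(std::string_view in, std::size_t pos)
{
    return (in[pos] == ' ' || in[pos] == '\t' || in[pos] == '\n') ? 1 : 0;
}

std::size_t dash(std::string_view in, std::size_t pos)
{
    return in[pos] == '-' ? 1 : 0;
}
```

In this model, skip_whitespace(in, 0, either(blank, dash)) plays the role of whitespace(a | b), which the text above states is what whitespace(a) | b means. The half-open range from the start position to the returned position corresponds to the single merged whitespace token, which would only be recorded if that range is non-empty.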

Example 1. Simple manual whitespace skipping
struct production
{
    static constexpr auto rule = [] {
        auto ws = dsl::whitespace(dsl::ascii::space);
        return LEXY_LIT("Hello") + ws + LEXY_LIT("World") //
               + ws + dsl::exclamation_mark + ws + dsl::eof;
    }();
};
Tip
Use lexy::dsl::ascii::space to skip all ASCII whitespace characters.

Automatic whitespace skipping

For automatic whitespace skipping, lexy inserts a lexy::dsl::whitespace(ws) rule after every token rule, after a lexy::dsl::p or lexy::dsl::recurse rule that parses a production inheriting from lexy::token_production, and after a lexy::dsl::no_whitespace rule. It also inserts one when starting to parse a production that defines a new whitespace rule.

Here ws is determined as follows:

  1. If automatic whitespace skipping has been disabled (e.g. by using lexy::dsl::no_whitespace()), ws is the rule that matches the empty string. As such, no automatic whitespace skipping takes place.

  2. If lexy::production_whitespace for the current production and the whitespace production is non-void, ws is that rule. Here, the whitespace production is determined by following any lexy::dsl::p or lexy::dsl::recurse calls backwards until one of the following is reached: a production that defines a ::whitespace member, the top-level production originally passed to a parse function, or a production inheriting from lexy::token_production.

  3. Otherwise (if it is void), ws is the rule that matches the empty string and no whitespace skipping takes place.
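The three steps above can be sketched as a small lookup over the chain of productions currently being parsed. This is a simplified model under stated assumptions, not lexy's implementation: ProductionInfo and current_whitespace are hypothetical names, and a whitespace rule is reduced to a single character for brevity.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <vector>

// Hypothetical description of a production; not a lexy type.
struct ProductionInfo
{
    std::optional<char> whitespace;          // models the `whitespace` member, if any
    bool                is_token_production; // models inheriting from lexy::token_production
};

// `chain` holds the productions from the top-level one (front) to the current
// one (back), i.e. the path of dsl::p / dsl::recurse calls.
// Returns the whitespace character to skip, or nothing if none is skipped.
std::optional<char> current_whitespace(const std::vector<ProductionInfo>& chain,
                                       bool skipping_disabled)
{
    // Step 1: skipping has been disabled, e.g. inside no_whitespace().
    if (skipping_disabled || chain.empty())
        return std::nullopt;

    // Step 2: walk backwards until a production defines whitespace, is a
    // token production, or is the top-level production.
    std::size_t i = chain.size() - 1;
    while (i > 0 && !chain[i].whitespace && !chain[i].is_token_production)
        --i;

    // Step 2 applies if that production defines whitespace; otherwise this
    // returns nothing, which is step 3.
    return chain[i].whitespace;
}
```

Fed with a top-level production whose whitespace is '+', a nested production without a whitespace member resolves to '+', while a nested token production without a whitespace member resolves to nothing, as do any productions parsed from inside it.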

Example 2. Simple automatic whitespace skipping
struct production
{
    static constexpr auto whitespace = dsl::ascii::space;
    static constexpr auto rule                  //
        = LEXY_LIT("Hello") + LEXY_LIT("World") //
          + dsl::exclamation_mark + dsl::eof;
};
Example 3. Comments can be whitespace too
struct production
{
    // Note that an unterminated C comment will raise an error.
    static constexpr auto whitespace
        = dsl::ascii::space | LEXY_LIT("/*") >> dsl::until(LEXY_LIT("*/"));

    static constexpr auto rule //
        = LEXY_LIT("Hello") + LEXY_LIT("World") + dsl::exclamation_mark;
};
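To illustrate why an unterminated comment is an error rather than simply "no more whitespace", here is a minimal standard-library model of skipping spaces and C comments. The names is_ascii_space and skip_space_and_comments are hypothetical, and the set of characters treated as whitespace is an assumption of this sketch; it does not use lexy.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <string_view>

// Assumed set of ASCII whitespace characters for this sketch.
bool is_ascii_space(char c)
{
    return c == ' ' || c == '\t' || c == '\n' || c == '\r' || c == '\f' || c == '\v';
}

// Models skipping `space | "/*" >> until("*/")` zero or more times.
// Requires pos <= in.size(). Returns the position after the skipped
// whitespace, or nothing if a comment was opened but never closed.
std::optional<std::size_t> skip_space_and_comments(std::string_view in, std::size_t pos)
{
    for (;;)
    {
        if (pos < in.size() && is_ascii_space(in[pos]))
        {
            ++pos; // first alternative: one whitespace character
        }
        else if (in.substr(pos, 2) == "/*")
        {
            // The branch condition "/*" matched, so the comment must be closed.
            std::size_t end = in.find("*/", pos + 2);
            if (end == std::string_view::npos)
                return std::nullopt; // unterminated comment: an error
            pos = end + 2;
        }
        else
        {
            return pos; // neither alternative matches: stop skipping
        }
    }
}
```

Once the opening "/*" has been consumed, the model commits to the comment alternative, so a missing "*/" makes the whole skip fail instead of ending the loop quietly.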
Example 4. How whitespace is determined
// An inner production that does not override the whitespace.
struct inner_normal
{
    // After every token in this rule, the whitespace is '+',
    // as determined by its root production `production`.
    static constexpr auto rule //
        = dsl::parenthesized(LEXY_LIT("inner") + LEXY_LIT("normal"));
};

// An inner production that overrides the current whitespace definition.
struct inner_override
{
    static constexpr auto whitespace = dsl::lit_c<'-'>;

    // After every token in this rule, the whitespace is '-',
    // as determined by the `whitespace` member of the current production.
    static constexpr auto rule //
        = dsl::parenthesized(LEXY_LIT("inner") + LEXY_LIT("override"));
};

// A token production that does not have inner whitespace.
struct inner_token : lexy::token_production
{
    struct inner_inner
    {
        // No whitespace is skipped here, as its root production is `inner_token`,
        // which does not have a `whitespace` member.
        static constexpr auto rule = LEXY_LIT("inner") + LEXY_LIT("token");
    };

    // No whitespace is skipped here, as the current production inherits from
    // `lexy::token_production`.
    static constexpr auto rule = dsl::parenthesized(dsl::p<inner_inner>);
};

// A token production that does have inner whitespace, but a different one.
struct inner_token_whitespace : lexy::token_production
{
    struct inner_inner
    {
        // After every token in this rule, the whitespace is '_',
        // as determined by its root production `inner_token_whitespace`.
        static constexpr auto rule //
            = LEXY_LIT("inner") + LEXY_LIT("token") + LEXY_LIT("whitespace");
    };

    static constexpr auto whitespace = dsl::lit_c<'_'>;

    static constexpr auto rule = dsl::parenthesized(dsl::p<inner_inner>);
};

// The root production defines whitespace.
struct production
{
    static constexpr auto whitespace = dsl::lit_c<'+'>;

    // After every token in this rule, the whitespace is '+',
    // as determined by the `whitespace` member of the current production.
    // Whitespace is also skipped after the two token productions.
    static constexpr auto rule
        = dsl::p<inner_normal> + dsl::comma + dsl::p<inner_override> + dsl::comma
          + dsl::p<inner_token> + dsl::comma + dsl::p<inner_token_whitespace> //
          + dsl::period + dsl::eof;
};
Caution
If a production (e.g. a token production) defines a new whitespace rule, that whitespace is skipped after the last token of the production. The whitespace rule of the parent production is then skipped as well, as seen in the example.
Tip
Use lexy::dsl::ascii::space to skip all ASCII whitespace characters.

Rule lexy::dsl::no_whitespace

lexy/dsl/whitespace.hpp
namespace lexy::dsl
{
    constexpr rule        no_whitespace(rule auto rule);
    constexpr branch-rule no_whitespace(branch-rule auto rule);
}

no_whitespace is a rule that parses rule without automatic whitespace skipping.

(Branch) Parsing

Parses rule in a context where there is no current whitespace rule and lexy::dsl::whitespace does nothing.

Errors

All errors raised by rule. The rule then fails if rule has failed.

Values

All values produced by rule.

Example 5. Disable whitespace between two tokens
struct production
{
    static constexpr auto whitespace = dsl::ascii::space;
    static constexpr auto rule                                      //
        = dsl::no_whitespace(LEXY_LIT("Hello") + LEXY_LIT("World")) //
          + dsl::exclamation_mark + dsl::eof;
};
Tip
In most situations, you should prefer a lexy::token_production instead. no_whitespace is mostly used as an implementation detail for rules that should never have whitespace skipping, like lexy::dsl::delimited.
Caution
When rule contains a lexy::dsl::p or lexy::dsl::recurse rule, whitespace skipping is re-enabled while parsing that production.
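The interplay between automatic skipping and no_whitespace in Example 5 can be pictured with a hand-written matcher. This is only a sketch under stated assumptions: skip_spaces, match_literal and match_example are hypothetical helpers, whitespace is simplified to the space character, and the code models the described behaviour rather than using lexy.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>

// Skips spaces only (a simplified stand-in for dsl::ascii::space).
std::size_t skip_spaces(std::string_view in, std::size_t pos)
{
    while (pos < in.size() && in[pos] == ' ')
        ++pos;
    return pos;
}

// Matches the literal `lit` at `pos` and advances past it on success.
bool match_literal(std::string_view in, std::size_t& pos, std::string_view lit)
{
    if (in.substr(pos, lit.size()) != lit)
        return false;
    pos += lit.size();
    return true;
}

// Hand-written model of the production in Example 5.
bool match_example(std::string_view in)
{
    // Whitespace is skipped when starting a production that defines it.
    std::size_t pos = skip_spaces(in, 0);

    // no_whitespace(...): nothing is skipped between the two literals.
    if (!match_literal(in, pos, "Hello") || !match_literal(in, pos, "World"))
        return false;
    pos = skip_spaces(in, pos); // skipped after the no_whitespace rule

    if (!match_literal(in, pos, "!"))
        return false;
    pos = skip_spaces(in, pos); // skipped after the token

    return pos == in.size(); // models dsl::eof
}
```

Under this model, "HelloWorld !" is accepted because whitespace is still skipped after the no_whitespace rule, while "Hello World!" is rejected because no skipping happens between the two literals.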

See also