Header lexy/dsl/delimited.hpp

Rules for parsing delimited/quoted strings with escape sequences.

Rule DSL lexy::dsl::delimited

lexy/dsl/delimited.hpp
namespace lexy
{
    struct missing_delimiter {};
}

namespace lexy::dsl
{
    struct delimited-dsl // note: not a rule itself
    {
        constexpr branch-rule auto open() const;
        constexpr branch-rule auto close() const;

        constexpr delimited-dsl limit(char-class-rule limit);
        template <typename ErrorTag>
        constexpr delimited-dsl limit(char-class-rule limit);

        //=== rules ===//
        constexpr rule auto operator()(char-class-rule c) const;
        constexpr rule auto operator()(char-class-rule c, escape-dsl ... escapes) const;
    };

    constexpr delimited-dsl delimited(branch-rule auto open,
                                      branch-rule auto close);

    constexpr delimited-dsl delimited(branch-rule auto delim)
    {
        return delimited(delim, delim);
    }
}

delimited is not a rule, but a DSL for specifying rules that all parse zero or more "characters" surrounded by delimiters, with optional escape sequences.

It can be created using two overloads. The first overload takes a branch rule that matches the open(ing) delimiter and one that matches the close(ing) delimiter. The second overload takes just one rule that matches both opening and closing delimiter.

Common delimiters, like quotation marks, are predefined (see below).

Note
See lexy::dsl::brackets if you want to parse an arbitrary rule surrounded by brackets. This one is designed for lists of characters.

Branch rules .open() and .close()

lexy/dsl/delimited.hpp
constexpr branch-rule auto open() const;
constexpr branch-rule auto close() const;

.open() and .close() return the branch rules that were passed to the top-level lexy::dsl::delimited().

.limit()

lexy/dsl/delimited.hpp
constexpr delimited-dsl limit(char-class-rule limit);

template <typename ErrorTag>
constexpr delimited-dsl limit(char-class-rule limit);

Provide a limit to detect a missing closing delimiter.

delimited only stops parsing once it matches close; if close is missing in the input, it will consume the entire input. If a limit is specified, which is a char class rule, delimited instead fails as soon as the limit matches before the closing delimiter.

The second overload also specifies an ErrorTag, which is used instead of lexy::missing_delimiter.

Example 1. Detect missing closing delimiter
struct production
{
    static constexpr auto rule = [] {
        // Arbitrary code points that aren't control characters.
        auto c = -dsl::ascii::control;

        // If we have a newline inside our string, we're missing the closing ".
        auto quoted = dsl::quoted.limit(dsl::ascii::newline);
        return quoted(c);
    }();
};
Caution
The limit must match a character that is not allowed inside the delimited.

Rule .operator()

lexy/dsl/delimited.hpp
constexpr rule auto operator()(char-class-rule c) const;
constexpr rule auto operator()(char-class-rule c, escape-dsl ... escapes) const;

.operator() returns a rule that parses zero or more occurrences of the char class rule c inside the delimited, with optional escape sequences.

Requires
  • The encoding of the input is a char encoding.

  • Each escape sequence in escapes must begin with a distinct escape character (e.g. one with a backslash and one with a dollar sign).

Parsing

Parses open(), then enters a loop where it repeatedly does the following in order:

  1. Tries parsing close(). If that succeeds, finishes.

  2. Tries to match any of the char class rules provided as limits, if there are any, and tries to match lexy::dsl::eof. If any of them matches, fails with a missing delimiter error.

  3. For the second overload, tries to parse all escapes (see below for what it does).

  4. Parses c.

While parsing the delimited, automatic whitespace skipping is disabled; whitespace is only skipped after close().
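
The loop can be modelled in plain C++ without lexy. The following is a minimal sketch, not lexy's implementation: it assumes open and close are both a double quote, the limit is a newline, c is "any ASCII character that is not a control character", and there are no escapes. The names outcome, quoted_result and parse_quoted are hypothetical and exist only for this illustration.

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical types for this sketch; lexy reports errors through its own machinery.
enum class outcome { ok, no_open, missing_delimiter };

struct quoted_result
{
    outcome     status;
    std::size_t end;            // position where parsing stopped
    std::size_t bad_characters; // characters discarded during recovery
};

// Standalone model of the delimited loop under the assumptions stated above.
quoted_result parse_quoted(std::string_view input)
{
    std::size_t pos = 0;
    std::size_t bad = 0;

    // Parse open(); as a branch, this is where backtracking would happen.
    if (input.empty() || input[0] != '"')
        return {outcome::no_open, 0, 0};
    ++pos;

    while (true)
    {
        // 1. Try close(). If it matches, we are done.
        if (pos < input.size() && input[pos] == '"')
            return {outcome::ok, pos + 1, bad};

        // 2. A limit or the end of input before close() means the delimiter is missing.
        if (pos == input.size() || input[pos] == '\n')
            return {outcome::missing_delimiter, pos, bad};

        // 3. Escape sequences would be tried here; this sketch has none.

        // 4. Parse c. A character that does not match is discarded and parsing continues.
        const auto ch = static_cast<unsigned char>(input[pos]);
        if (ch < 0x20 || ch == 0x7F)
            ++bad;
        ++pos;
    }
}
```

Checking close() before the limit and before c is what allows c to be a broad character class: the closing delimiter never has to be excluded from c by hand.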

Branch parsing

Tries parsing open() and backtracks if that did not succeed. Otherwise, parses the same loop as described above.

Errors
  • All errors raised by parsing open().

  • lexy::missing_delimiter or a specified ErrorTag: if the limits match. Its range covers everything since the opening delimiter. The rule then fails.

  • All errors raised by parsing escape. Recovers by simply continuing with the next iteration of the loop at the position where escape has left off. Note that no value of escape is produced.

  • All errors raised by parsing c. It can recover by simply discarding the bad character and continuing after it. Otherwise, it fails.

Values

It creates a sink of the current context. The sink is invoked with a lexy::lexeme capturing everything consumed by c; a sequence of contiguous characters is merged into a single lexeme. It is also invoked with every value produced by escape. The invocations happen separately in lexical order. The rule then produces all values of open(), the final value of the sink, and all values of close().
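
To make the order of sink invocations concrete, here is a small standalone model in plain C++. It is a sketch under assumptions, not lexy code: it takes only the content between the delimiters, treats a backslash as the escape character, and uses a hypothetical escape that maps n to a newline and any other character to itself. Each element of the returned vector stands for one sink invocation. The function name sink_invocations is made up for this illustration.

```cpp
#include <cstddef>
#include <string>
#include <string_view>
#include <vector>

// Returns one entry per modelled sink invocation: either a merged run of
// ordinary characters (a lexeme) or the value of a single escape sequence.
std::vector<std::string> sink_invocations(std::string_view content)
{
    std::vector<std::string> calls;
    std::size_t run_begin = 0; // start of the current run of ordinary characters
    std::size_t pos       = 0;

    // Emits the pending run [run_begin, run_end) as one lexeme, if it is non-empty.
    auto flush_run = [&](std::size_t run_end) {
        if (run_end > run_begin)
            calls.push_back(std::string(content.substr(run_begin, run_end - run_begin)));
    };

    while (pos < content.size())
    {
        if (content[pos] == '\\' && pos + 1 < content.size())
        {
            // An escape ends the current run; its value is passed separately.
            flush_run(pos);
            const char escaped = content[pos + 1];
            const char value   = (escaped == 'n') ? '\n' : escaped;
            calls.push_back(std::string(1, value));
            pos += 2;
            run_begin = pos;
        }
        else
        {
            // Ordinary character (a trailing lone backslash is treated as one too).
            ++pos;
        }
    }
    flush_run(pos);
    return calls;
}
```

For the content ab\ncd (with a literal backslash followed by n), the model yields three invocations in lexical order: the lexeme ab, the escape value, and the lexeme cd. A sink that appends each argument to a string would therefore reconstruct the unescaped text.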

Parse tree

delimited does not have any special parse tree handling: it will create the nodes for open(), then the nodes for each c and escape, and the nodes for close(). However, instead of creating separate token nodes for each c, adjacent token nodes are merged into a single one covering as much as possible. A character that is skipped during error recovery will create a token node whose lexy::predefined_token_kind is lexy::error_token_kind.

Example 2. Parse a quoted string
struct production
{
    static constexpr auto rule = [] {
        // Arbitrary code points that aren't control characters.
        auto c = -dsl::ascii::control;

        return dsl::quoted(c);
    }();
};
Example 3. Parse a quoted string with custom error
struct production
{
    struct invalid_character
    {
        static constexpr auto name = "invalid character";
    };

    static constexpr auto rule = [] {
        // Arbitrary code points that aren't control characters.
        auto c = (-dsl::ascii::control).error<invalid_character>;

        return dsl::quoted(c);
    }();
};
Example 4. Parse a quoted string with whitespace and token production
struct quoted : lexy::token_production
{
    static constexpr auto rule = [] {
        // Arbitrary code points that aren't control characters.
        auto c = -dsl::ascii::control;

        return dsl::quoted(c);
    }();
};

struct production
{
    static constexpr auto whitespace = dsl::ascii::space;
    static constexpr auto rule       = dsl::p<quoted> + dsl::semicolon;
};
Tip
Use the sink lexy::as_string to produce a std::string from the rule.

Predefined delimited

lexy/dsl/delimited.hpp
namespace lexy::dsl
{
    constexpr delimited-dsl quoted        = delimited(lit<"\"">);
    constexpr delimited-dsl triple_quoted = delimited(lit<"\"\"\"">);

    constexpr delimited-dsl single_quoted = delimited(lit<"'">);

    constexpr delimited-dsl backticked        = delimited(lit<"`">);
    constexpr delimited-dsl double_backticked = delimited(lit<"``">);
    constexpr delimited-dsl triple_backticked = delimited(lit<"```">);
}

ASCII quotation marks are pre-defined.

Warning
The naming scheme for triple_quoted and single_quoted is not consistent, but the terminology is common elsewhere.

Rule DSL lexy::dsl::escape

lexy/dsl/delimited.hpp
namespace lexy
{
    struct invalid_escape_sequence {};
}

namespace lexy::dsl
{
    struct escape-dsl // note: not a rule itself
    {
        constexpr escape-dsl rule(branch-rule auto r) const;

        constexpr escape-dsl capture(branch-rule auto r) const;

        template <const symbol_table& SymbolTable>
        constexpr escape-dsl symbol(token-rule auto t) const;
        template <const symbol_table& SymbolTable>
        constexpr escape-dsl symbol() const;
    };

    constexpr escape-dsl escape(token-rule auto escape_char);
}

escape is not a rule but a DSL for specifying escape sequences.

It is created by giving it the escape_char, a token rule that matches the initial escape characters. Common escape characters are predefined.

The various member functions all add potential rules that parse the part of an escape sequence after the initial escape character. The resulting DSL can then only be used with delimited, where it is treated like a branch rule and as such documented like one.

Branch parsing

Tries to match and consume escape_char, backtracks otherwise. After escape_char has been consumed, tries to parse each escape sequence (see below) in order of the member function invocations, like a choice would.

Errors
  • All errors raised by each escape sequence. escape then fails but delimited recovers (see above).

  • lexy::invalid_escape_sequence: if none of the escape sequences match; its range covers the escape_char. escape then fails but delimited recovers (see above).
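
The branch behaviour and the two error cases can be illustrated with a plain C++ model. This is a sketch under assumptions rather than lexy's implementation: the escape character is a backslash, and two hypothetical alternatives are tried in order, a small symbol lookup followed by a literal quote or backslash. The names escape_status, escape_result and parse_escape are invented for this example.

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical result types for this sketch.
enum class escape_status { not_an_escape, ok, invalid_escape_sequence };

struct escape_result
{
    escape_status status;
    char          value; // only meaningful when status == ok
    std::size_t   end;   // position after everything that was consumed
};

// Models one escape starting at `pos`, with the alternatives tried in order.
escape_result parse_escape(std::string_view input, std::size_t pos)
{
    // Branch condition: the escape character. Backtrack if it is absent.
    if (pos >= input.size() || input[pos] != '\\')
        return {escape_status::not_an_escape, '\0', pos};
    ++pos;

    if (pos < input.size())
    {
        const char next = input[pos];

        // Alternative 1: symbol lookup.
        if (next == 'n')
            return {escape_status::ok, '\n', pos + 1};
        if (next == 't')
            return {escape_status::ok, '\t', pos + 1};

        // Alternative 2: a literal quote or backslash stands for itself.
        if (next == '"' || next == '\\')
            return {escape_status::ok, next, pos + 1};
    }

    // No alternative matched. Only the escape character has been consumed,
    // so the enclosing loop would continue right after it.
    return {escape_status::invalid_escape_sequence, '\0', pos};
}
```

Because the escape character has already been consumed when the alternatives are tried, a failure at that point is an error rather than a backtrack, which matches the description of lexy::invalid_escape_sequence above.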

Values

All values produced by the selected escape sequence. delimited forwards them to the sink in one invocation.

Example 5. Parse a quoted string with escape sequences
struct production
{
    // A mapping of the simple escape sequences to their replacement values.
    static constexpr auto escaped_symbols = lexy::symbol_table<char> //
                                                .map<'"'>('"')
                                                .map<'\\'>('\\')
                                                .map<'/'>('/')
                                                .map<'b'>('\b')
                                                .map<'f'>('\f')
                                                .map<'n'>('\n')
                                                .map<'r'>('\r')
                                                .map<'t'>('\t');

    static constexpr auto rule = [] {
        // Arbitrary code points that aren't control characters.
        auto c = -dsl::ascii::control;

        // Escape sequences start with a backslash.
        // They either map one of the symbols,
        // or a Unicode code point of the form uXXXX.
        auto escape = dsl::backslash_escape //
                          .symbol<escaped_symbols>()
                          .rule(dsl::lit_c<'u'> >> dsl::code_point_id<4>);
        return dsl::quoted(c, escape);
    }();

    // Need to specify a target encoding to handle the code point.
    static constexpr auto value = lexy::as_string<std::string, lexy::utf8_encoding>;
};

Escape sequence .rule()

lexy/dsl/delimited.hpp
constexpr escape-dsl rule(branch-rule auto r) const;

.rule() specifies an escape sequence that simply tries to parse the branch rule r.

Escape sequence .capture()

lexy/dsl/delimited.hpp
constexpr escape-dsl capture(branch-rule auto r) const
{
    return this->rule(lexy::dsl::capture(r));
}

.capture() specifies an escape sequence that tries to parse the branch rule r and produces a lexy::lexeme.

It is equivalent to passing lexy::dsl::capture(r) to .rule().

Escape sequence .symbol()

lexy/dsl/delimited.hpp
template <const symbol_table& SymbolTable>
constexpr escape-dsl symbol(token-rule auto t) const
{
    return this->rule(lexy::dsl::symbol<SymbolTable>(t));
}

template <const symbol_table& SymbolTable>
constexpr escape-dsl symbol() const
{
    return this->rule(lexy::dsl::symbol<SymbolTable>);
}

.symbol() specifies an escape sequence that parses a symbol.

The first overload forwards to the argument version of lexy::dsl::symbol: it matches t, looks it up in the SymbolTable, and produces the corresponding value. The second overload forwards to the non-argument version, which immediately looks up a symbol of the SymbolTable.

Predefined escapes

lexy/dsl/delimited.hpp
namespace lexy::dsl
{
    constexpr escape-dsl backslash_escape = escape(lit_c<'\\'>);
    constexpr escape-dsl dollar_escape    = escape(lit_c<'$'>);
}

Escape sequences beginning with common ASCII characters are pre-defined.

Note
They don’t actually define any escape sequences, just the initial character.

See also