Header lexy/dsl/identifier.hpp

The identifier and keyword rules.

Rule lexy::dsl::identifier

lexy/dsl/identifier.hpp
namespace lexy
{
    struct reserved_identifier {};
}

namespace lexy::dsl
{
    struct identifier-dsl // models branch-rule
    {
        //=== modifiers ===//
        constexpr identifier-dsl reserve(auto ... rules) const;
        constexpr identifier-dsl reserve_prefix(auto ... rules) const;
        constexpr identifier-dsl reserve_containing(auto ... rules) const;
        constexpr identifier-dsl reserve_suffix(auto ... rules) const;

        //=== sub-rules ===//
        constexpr token-rule auto pattern() const;

        constexpr token-rule auto leading_pattern() const;
        constexpr token-rule auto trailing_pattern() const;
    };

    constexpr identifier-dsl identifier(char-class-rule auto leading,
                                        char-class-rule auto trailing);

    constexpr identifier-dsl identifier(char-class-rule auto c)
    {
        return identifier(c, c);
    }

}

identifier is a rule that parses an identifier.

It can be created using two overloads. The first overload takes one char class rule that matches the leading character of the identifier and another that matches every trailing character after the first. The second overload takes a single char class rule and uses it for both the leading and the trailing characters.

Requires

The encoding of the input is a char encoding.

Parsing

Matches and consumes the token .pattern() (see below). Then verifies that the lexeme formed from .pattern() (excluding any trailing whitespace) is not reserved (see below).

Branch parsing

Tries to match and consume the token .pattern() (see below), backtracking if that fails. Otherwise it checks for reserved identifiers and backtracks if it was reserved. As such, branch parsing only raises errors due to the implicit whitespace skipping.

Errors
  • All errors raised by .pattern(). Outside of branch parsing, the rule then fails.

  • lexy::reserved_identifier: if the identifier is reserved; its range covers the identifier. The rule then recovers.

Values

A single lexy::lexeme that is the parsed identifier (excluding any trailing whitespace).

Parse tree

The single token node created by .pattern() (see below). Its kind cannot be overridden.

Example 1. Parse a C-like identifier
struct production
{
    static constexpr auto rule = [] {
        auto head = dsl::ascii::alpha_underscore;
        auto tail = dsl::ascii::alpha_digit_underscore;
        return dsl::identifier(head, tail);
    }();
};
Example 2. Parse a Unicode-aware C-like identifier
struct production
{
    static constexpr auto rule
        = dsl::identifier(dsl::unicode::xid_start_underscore, // want '_' as well
                          dsl::unicode::xid_continue);
};
Example 3. Parse a case-insensitive identifier
struct production
{
    static constexpr auto rule = dsl::identifier(dsl::ascii::alpha);

    static constexpr auto value
        = lexy::as_string<std::string, lexy::ascii_encoding>.case_folding(dsl::ascii::case_folding);
};
Tip
Use the character classes from lexy::dsl::ascii for simple identifier matching, as seen in the examples.
Tip
Use the callback lexy::as_string to convert the lexy::lexeme to a string.

Reserving identifiers

lexy/dsl/identifier.hpp
constexpr identifier-dsl reserve(auto ... rules) const; (1)
constexpr identifier-dsl reserve_prefix(auto ... rules) const; (2)
constexpr identifier-dsl reserve_containing(auto ... rules) const; (3)
constexpr identifier-dsl reserve_suffix(auto ... rules) const; (4)

Reserves an identifier.

Initially, no identifier is reserved. Identifiers are reserved by calling .reserve() or one of its variants and passing it a literal rule or a lexy::dsl::literal_set. Once anything has been reserved, parsing the identifier rule creates a partial input from the lexeme and matches it against the specified rules as follows:

  • (1) .reserve(): All rules specified here are matched against the partial input. If they match the entire partial input, the identifier is reserved.

  • (2) .reserve_prefix(): All rules specified here are matched against the partial input. If they match a prefix of the partial input, the identifier is reserved.

  • (3) .reserve_containing(): All rules specified here are matched against the partial input. If they match somewhere in the partial input, the identifier is reserved.

  • (4) .reserve_suffix(): All rules specified here are matched against the partial input. If they match a suffix of the partial input, the identifier is reserved.
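
The four checks can be pictured with plain string operations. The following is only a sketch of the semantics described above, not lexy's implementation: the helper names are hypothetical, and each reserved rule is assumed to be a simple string literal.

```cpp
#include <string_view>

// Hypothetical helpers that model the four checks on the identifier lexeme.
// They are not part of lexy's API.

// (1) .reserve(): the literal must match the entire lexeme.
constexpr bool reserved_exact(std::string_view lexeme, std::string_view lit)
{
    return lexeme == lit;
}

// (2) .reserve_prefix(): the literal must match at the start of the lexeme.
constexpr bool reserved_prefix(std::string_view lexeme, std::string_view lit)
{
    return lexeme.substr(0, lit.size()) == lit;
}

// (3) .reserve_containing(): the literal must match somewhere in the lexeme.
constexpr bool reserved_containing(std::string_view lexeme, std::string_view lit)
{
    return lexeme.find(lit) != std::string_view::npos;
}

// (4) .reserve_suffix(): the literal must match at the end of the lexeme.
constexpr bool reserved_suffix(std::string_view lexeme, std::string_view lit)
{
    return lexeme.size() >= lit.size()
           && lexeme.substr(lexeme.size() - lit.size()) == lit;
}

static_assert(reserved_exact("int", "int"));
static_assert(!reserved_exact("integer", "int"));
static_assert(reserved_prefix("_foo", "_"));
static_assert(reserved_containing("a__b", "__"));
static_assert(reserved_suffix("size_t", "_t"));
```

In this model, "integer" is not reserved by (1) even though it starts with "int", because the literal has to cover the whole lexeme.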

If one rule passed to a .reserve() call or variant uses case folding (e.g. lexy::dsl::ascii::case_folding), all other rules in the same call also use that case folding, but not rules in a different call. This is because internally each call creates a fresh lexy::dsl::literal_set, which has that behavior.
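
The per-call scope of case folding can be modelled as follows. This is a sketch under stated assumptions rather than lexy's code: reserved_entry, ascii_lower, equal_literal and is_reserved are hypothetical names, only ASCII folding is modelled, and a call is represented as a list of literals that each record whether they were wrapped in case folding.

```cpp
#include <cstddef>
#include <initializer_list>
#include <string_view>

// Hypothetical ASCII-only lower-casing; stands in for ASCII case folding.
constexpr char ascii_lower(char c)
{
    return (c >= 'A' && c <= 'Z') ? static_cast<char>(c - 'A' + 'a') : c;
}

// Compares two strings, optionally ignoring ASCII case.
constexpr bool equal_literal(std::string_view a, std::string_view b, bool fold)
{
    if (a.size() != b.size())
        return false;
    for (std::size_t i = 0; i != a.size(); ++i)
    {
        char x = fold ? ascii_lower(a[i]) : a[i];
        char y = fold ? ascii_lower(b[i]) : b[i];
        if (x != y)
            return false;
    }
    return true;
}

// One literal passed to a single .reserve() call (hypothetical model type).
struct reserved_entry
{
    std::string_view literal;
    bool             case_folding;
};

// Models one .reserve() call: if any entry folds case, every entry does.
constexpr bool is_reserved(std::string_view lexeme,
                           std::initializer_list<reserved_entry> call)
{
    bool fold = false;
    for (const auto& entry : call)
        fold = fold || entry.case_folding;

    for (const auto& entry : call)
        if (equal_literal(lexeme, entry.literal, fold))
            return true;
    return false;
}

// Same call: folding requested for "int" also applies to "struct".
static_assert(is_reserved("STRUCT", {{"int", true}, {"struct", false}}));
// Separate call without folding: "struct" is matched case-sensitively.
static_assert(!is_reserved("STRUCT", {{"struct", false}}));
```

In this model, reserving case-sensitive and case-insensitive literals independently means putting them in separate calls, i.e. separate lists.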

Example 4. Parse a C-like identifier that is not reserved
struct production
{
    static constexpr auto rule = [] {
        // Define the general identifier syntax.
        auto head = dsl::ascii::alpha_underscore;
        auto tail = dsl::ascii::alpha_digit_underscore;
        auto id   = dsl::identifier(head, tail);

        // Define some keywords.
        auto kw_int    = LEXY_KEYWORD("int", id);
        auto kw_struct = LEXY_KEYWORD("struct", id);
        // ...

        // Parse an identifier
        return id
            // ... that is not a keyword,
            .reserve(kw_int, kw_struct)
            // ... doesn't start with an underscore,
            .reserve_prefix(dsl::lit_c<'_'>)
            // ... and doesn't contain a double underscore.
            .reserve_containing(LEXY_LIT("__"));
    }();
};
Example 5. Parse a C-like identifier with case-insensitive keywords
struct production
{
    static constexpr auto rule = [] {
        // Define the general identifier syntax.
        auto head = dsl::ascii::alpha_underscore;
        auto tail = dsl::ascii::alpha_digit_underscore;
        auto id   = dsl::identifier(head, tail);

        // Define some case insensitive keywords.
        auto kw_int    = dsl::ascii::case_folding(LEXY_KEYWORD("int", id));
        auto kw_struct = dsl::ascii::case_folding(LEXY_KEYWORD("struct", id));
        // ...

        // Parse an identifier that is not a keyword.
        return id.reserve(kw_int, kw_struct);
    }();
};
Caution
The identifier rule doesn’t magically learn about the keywords you have created. They are only reserved if you actually pass them to .reserve(). This design allows you to use a different set of reserved identifiers in different places in the grammar.

Token rule .pattern()

lexy/dsl/identifier.hpp
constexpr token-rule auto pattern() const;

.pattern() is a token rule that matches the basic form of the identifier without checking for reserved identifiers.

Matching

Matches and consumes leading, then matches and consumes lexy::dsl::while_(trailing), where leading and trailing are the arguments passed to identifier(). Whitespace skipping is disabled inside .pattern(), but whitespace is skipped after .pattern().
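
As a rough illustration of this matching behavior, the sketch below re-implements it over a std::string_view without using lexy. The predicates is_head and is_tail are hypothetical stand-ins for dsl::ascii::alpha_underscore and dsl::ascii::alpha_digit_underscore, and whitespace skipping after the token is not modelled.

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical stand-in for the leading char class (ASCII letter or '_').
constexpr bool is_head(char c)
{
    return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || c == '_';
}

// Hypothetical stand-in for the trailing char class (adds ASCII digits).
constexpr bool is_tail(char c)
{
    return is_head(c) || (c >= '0' && c <= '9');
}

// Models .pattern(): one leading character, then zero or more trailing ones.
// Returns the number of characters consumed, or 0 if leading did not match.
constexpr std::size_t match_pattern(std::string_view input)
{
    if (input.empty() || !is_head(input[0]))
        return 0;

    std::size_t length = 1;
    while (length < input.size() && is_tail(input[length]))
        ++length;
    return length;
}

static_assert(match_pattern("foo_1 = 2") == 5); // stops at the space
static_assert(match_pattern("1foo") == 0);      // leading does not match
```

The only way for this model to fail is for the first character not to match the leading class; the loop over trailing characters always succeeds, possibly consuming nothing.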

Errors

All errors raised by matching leading. The rule then fails.

Parse tree

A single token node whose range covers everything consumed. Its lexy::predefined_token_kind is lexy::identifier_token_kind.

Token rules .leading_pattern(), .trailing_pattern()

lexy/dsl/identifier.hpp
constexpr token-rule auto leading_pattern() const;
constexpr token-rule auto trailing_pattern() const;

.leading_pattern() returns leading and .trailing_pattern() returns trailing, i.e. the arguments passed to identifier().

Literal rule lexy::dsl::keyword

lexy/dsl/identifier.hpp
namespace lexy::dsl
{
    template <auto Char>
    constexpr literal-rule auto keyword(identifier-dsl identifier);
    template <auto Str>
    constexpr literal-rule auto keyword(identifier-dsl identifier);
}

#define LEXY_KEYWORD(Str, Identifier) lexy::dsl::keyword<Str>(Identifier)

keyword is a literal rule that matches a keyword.

Matching

Tries to match and consume identifier.pattern(), i.e. the basic pattern of an identifier ignoring any reserved identifiers. Then creates a partial input that covers everything just consumed (without the trailing whitespace) and matches lexy::dsl::lit<Str> on that input. Succeeds only if that consumes the entire partial input.
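
The difference from a plain literal can be sketched without lexy. The functions below are hypothetical models, not lexy's API: id_length plays the role of identifier.pattern() for an assumed C-like identifier, lit_matches models a literal that only needs to match a prefix of the input, and keyword_matches requires the literal to cover the whole identifier.

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical char classes for an assumed C-like identifier.
constexpr bool is_id_head(char c)
{
    return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || c == '_';
}
constexpr bool is_id_tail(char c)
{
    return is_id_head(c) || (c >= '0' && c <= '9');
}

// Models identifier.pattern(): length of the identifier at the start of input.
constexpr std::size_t id_length(std::string_view input)
{
    if (input.empty() || !is_id_head(input[0]))
        return 0;
    std::size_t length = 1;
    while (length < input.size() && is_id_tail(input[length]))
        ++length;
    return length;
}

// Models a plain literal: succeeds if the input starts with str.
constexpr bool lit_matches(std::string_view input, std::string_view str)
{
    return input.substr(0, str.size()) == str;
}

// Models keyword: consume the identifier pattern first, then require that
// str matches the entire consumed lexeme.
constexpr bool keyword_matches(std::string_view input, std::string_view str)
{
    std::size_t length = id_length(input);
    return length != 0 && input.substr(0, length) == str;
}

static_assert(lit_matches("integer", "int"));      // prefix is enough
static_assert(!keyword_matches("integer", "int")); // lexeme is "integer"
static_assert(keyword_matches("int x", "int"));    // lexeme is exactly "int"
```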

Errors

lexy::expected_keyword: if either identifier.pattern() or the lit rule failed. Its range covers everything consumed by identifier.pattern() and its .string() is Str.

Parse tree

Single token node with the lexy::predefined_token_kind lexy::literal_token_kind.

The macro LEXY_KEYWORD(Str, Identifier) is equivalent to keyword<Str>(Identifier), except that it also works on older compilers that do not support C++20’s extended NTTPs. Use this instead of keyword<Str>(identifier) if you need to support them.

Example 6. Parse a keyword
struct production
{
    static constexpr auto rule = [] {
        // Define the general identifier syntax.
        auto head = dsl::ascii::alpha_underscore;
        auto tail = dsl::ascii::alpha_digit_underscore;
        auto id   = dsl::identifier(head, tail);

        // Parse a keyword.
        return LEXY_KEYWORD("int", id);
    }();
};
Note
While lexy::dsl::lit<"int"> would happily consume a prefix of "integer", keyword<"int">(id), for a matching id, would not.
Note
A keyword does not necessarily need to be a reserved identifier or vice-versa.
Note
The encoding caveats of literal rules apply here as well.
Tip
Use lexy::dsl::ascii::case_folding or its Unicode variants to parse a case-insensitive keyword.

See also