Header lexy/dsl/code_point.hpp

Rules for matching (specific) code points.

Token rule lexy::dsl::code_point

lexy/dsl/code_point.hpp
namespace lexy::dsl
{
    class code-point-dsl // models token-rule
    {
    public:
        // see below for member functions
        template <char32_t CodePoint>
        constexpr token-rule auto lit() const;

        template <typename Predicate>
        constexpr token-rule auto if_() const;

        constexpr token-rule auto ascii() const;
        constexpr token-rule auto bmp() const;
        constexpr token-rule auto noncharacter() const;

        template <lexy::code_point::general_category_t Category>
        constexpr token-rule auto general_category() const;
        template <lexy::code_point::gc-group CategoryGroup>
        constexpr token-rule auto general_category() const;

        template <char32_t Low, char32_t High>
        constexpr token-rule auto range() const;
    };

    constexpr code-point-dsl auto code_point;
}

code_point is a token rule that matches a single scalar Unicode code point.

Requires

The input encoding is ASCII, UTF-8, UTF-16, or UTF-32. In particular, lexy::default_encoding and lexy::byte_encoding are not supported.

Matching

Matches and consumes all code units that form a single code point in the input's encoding. For ASCII and UTF-32, this is always a single code unit; for UTF-8, it is up to 4 code units; and for UTF-16, it is up to 2 code units.
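The code unit counts above follow directly from the encodings. As a minimal sketch (the function names are hypothetical and not part of lexy), the number of code units consumed for a given scalar value can be computed like this:

```cpp
#include <cstddef>

// Hypothetical helpers for illustration only; they are not lexy functions.
// Each assumes `cp` is a valid scalar value (no surrogate, at most U+10FFFF).

// Number of UTF-8 code units (bytes) needed to encode `cp`.
constexpr std::size_t utf8_code_units(char32_t cp)
{
    if (cp <= 0x7F)
        return 1;
    if (cp <= 0x7FF)
        return 2;
    if (cp <= 0xFFFF)
        return 3;
    return 4;
}

// Number of UTF-16 code units: one inside the BMP, a surrogate pair outside.
constexpr std::size_t utf16_code_units(char32_t cp)
{
    return cp <= 0xFFFF ? 1 : 2;
}

// UTF-32 (and ASCII, for code points it can represent) always uses one unit.
constexpr std::size_t utf32_code_units(char32_t)
{
    return 1;
}

static_assert(utf8_code_units(0x61) == 1, "U+0061 'a' is one byte");
static_assert(utf8_code_units(0xE4) == 2, "U+00E4 is two bytes");
static_assert(utf8_code_units(0x20AC) == 3, "U+20AC is three bytes");
static_assert(utf8_code_units(0x1F642) == 4, "U+1F642 is four bytes");
static_assert(utf16_code_units(0x20AC) == 1, "BMP code point: one unit");
static_assert(utf16_code_units(0x1F642) == 2, "supplementary: surrogate pair");
```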

Errors

lexy::expected_char_class ("<encoding>.code_point"): if the current code unit(s) do not form a valid code point; at the starting reader position. This includes surrogates, overlong UTF-8 sequences, or out of range code points (especially for ASCII). The rule then fails.
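To make the failure conditions concrete, the following sketch spells out the three checks named above for an already decoded value. It is an illustration under stated assumptions, not lexy's decoder: all names are hypothetical, and `units_used` stands for the number of UTF-8 code units the decoded sequence occupied.

```cpp
#include <cstddef>

// Hypothetical helpers for illustration only; they are not lexy functions.

// A scalar value lies in [U+0000, U+10FFFF] and is not a surrogate.
constexpr bool is_scalar_value(char32_t cp)
{
    const bool in_range  = cp <= 0x10FFFF;
    const bool surrogate = 0xD800 <= cp && cp <= 0xDFFF;
    return in_range && !surrogate;
}

// Shortest possible UTF-8 encoding of `cp`, in code units.
constexpr std::size_t minimal_utf8_length(char32_t cp)
{
    if (cp <= 0x7F)
        return 1;
    if (cp <= 0x7FF)
        return 2;
    if (cp <= 0xFFFF)
        return 3;
    return 4;
}

// A decoded UTF-8 sequence is acceptable only if it yields a scalar value
// and used the shortest encoding; anything longer is "overlong".
constexpr bool is_valid_utf8_decoding(char32_t cp, std::size_t units_used)
{
    return is_scalar_value(cp) && units_used == minimal_utf8_length(cp);
}

// For ASCII input, every value above 0x7F is out of range.
constexpr bool is_valid_ascii(char32_t cp)
{
    return cp <= 0x7F;
}

static_assert(is_valid_utf8_decoding(0x2F, 1), "'/' in one byte is fine");
static_assert(!is_valid_utf8_decoding(0x2F, 2), "'/' in two bytes is overlong");
static_assert(!is_scalar_value(0xD800), "surrogates are rejected");
static_assert(!is_scalar_value(0x110000), "beyond U+10FFFF is rejected");
static_assert(!is_valid_ascii(0xE4), "non-ASCII value in ASCII input");
```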

Parse tree

Single token node with the lexy::predefined_token_kind lexy::any_token_kind.

Example 1. Parse one code point in the input's encoding
struct production
{
    static constexpr auto rule = dsl::code_point + dsl::eof;
};
Caution
As a token rule, it matches whitespace immediately following the code point. As such, the rule is best used in contexts where automatic whitespace skipping is disabled.
Note
If the input has been validated, the rule only fails if the reader is at the end of the input.

Token rule lexy::dsl::code_point.lit

lexy/dsl/code_point.hpp
template <char32_t CodePoint>
constexpr token-rule auto lit() const;

code_point.lit is a token rule that matches the specific CodePoint.

Requires
  • CodePoint is the value of a scalar code point (i.e. non-surrogate and not out of bounds).

  • The input encoding is ASCII, UTF-8, UTF-16, or UTF-32. If it is ASCII, CodePoint is an ASCII character.

Matching

Matches and consumes the code units that encode CodePoint in the encoding of the input. For ASCII and UTF-32, this is a single code unit, for UTF-8 up to four code units, and for UTF-16 up to two code units.

Errors

lexy::expected_literal: if one code unit did not compare equal or the reader reached the end of the input. Its .string() is the encoded version of CodePoint, its .index() is the index of the code unit where the mismatch/missing one occurred, and its .position() is the reader position where it started to match the literal.

Parse tree

Single token node with the lexy::predefined_token_kind lexy::literal_token_kind.

It behaves identically to lexy::dsl::lit<Str>, where Str is determined by encoding CodePoint in the encoding of the input.

Example 2. Match a smiley face
struct production
{
    static constexpr auto rule = dsl::code_point.lit<0x1F642>() + dsl::eof;
};
Note
The caveats of lexy::dsl::lit regarding whitespace skipping and keywords apply here as well.
Caution
If the input contains an ill-formed code unit sequence, this is not checked by this rule; it simply compares each code unit.
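The following sketch illustrates what "encoding CodePoint and comparing code unit by code unit" amounts to for UTF-8 input. It is not lexy's implementation: `utf8_units`, `encode_utf8`, `matched_units`, and the sample arrays are hypothetical names introduced here. The count returned by `matched_units` plays a role similar to the index reported in the error: it is the position of the first mismatching or missing code unit.

```cpp
#include <cstddef>

// Hypothetical types and helpers for illustration only; not lexy's API.
struct utf8_units
{
    unsigned char data[4] = {};
    std::size_t   size    = 0;
};

// Encodes a scalar value as UTF-8 (assumes `cp` is a valid scalar value).
constexpr utf8_units encode_utf8(char32_t cp)
{
    utf8_units r{};
    if (cp <= 0x7F)
    {
        r.data[0] = static_cast<unsigned char>(cp);
        r.size    = 1;
    }
    else if (cp <= 0x7FF)
    {
        r.data[0] = static_cast<unsigned char>(0xC0 | (cp >> 6));
        r.data[1] = static_cast<unsigned char>(0x80 | (cp & 0x3F));
        r.size    = 2;
    }
    else if (cp <= 0xFFFF)
    {
        r.data[0] = static_cast<unsigned char>(0xE0 | (cp >> 12));
        r.data[1] = static_cast<unsigned char>(0x80 | ((cp >> 6) & 0x3F));
        r.data[2] = static_cast<unsigned char>(0x80 | (cp & 0x3F));
        r.size    = 3;
    }
    else
    {
        r.data[0] = static_cast<unsigned char>(0xF0 | (cp >> 18));
        r.data[1] = static_cast<unsigned char>(0x80 | ((cp >> 12) & 0x3F));
        r.data[2] = static_cast<unsigned char>(0x80 | ((cp >> 6) & 0x3F));
        r.data[3] = static_cast<unsigned char>(0x80 | (cp & 0x3F));
        r.size    = 4;
    }
    return r;
}

// Compares code unit by code unit without validating the input.
// Returns the number of matching units; equal to `lit.size` on success.
constexpr std::size_t matched_units(const unsigned char* input, std::size_t length,
                                    const utf8_units& lit)
{
    std::size_t i = 0;
    while (i < lit.size && i < length && input[i] == lit.data[i])
        ++i;
    return i;
}

// Sample inputs: U+1F642 and U+1F600 encoded as UTF-8.
constexpr unsigned char smiley_input[] = {0xF0, 0x9F, 0x99, 0x82};
constexpr unsigned char other_input[]  = {0xF0, 0x9F, 0x98, 0x80};

static_assert(encode_utf8(0x1F642).size == 4, "four code units");
static_assert(matched_units(smiley_input, 4, encode_utf8(0x1F642)) == 4, "match");
static_assert(matched_units(other_input, 4, encode_utf8(0x1F642)) == 2,
              "mismatch at the third code unit");
```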

Token rule lexy::dsl::code_point.if_

lexy/dsl/code_point.hpp
template <std::predicate<lexy::code_point> Predicate>
  requires std::is_default_constructible_v<Predicate>
constexpr token-rule auto if_() const;

code_point.if_ is a token rule that matches a code point fulfilling a given predicate.

Matching

Matches and consumes a code point exactly as the normal code_point rule does.

Errors
  • lexy::expected_char_class ("<type name of Predicate>"): if Predicate{}(cp) == false, where cp is the code point we have just consumed; at the starting reader position. The rule then fails.

  • All errors raised by the normal code_point rule. The rule then fails.

Example 3. Parse even code points only
struct production
{
    struct even
    {
        constexpr bool operator()(lexy::code_point cp) const
        {
            return cp.value() % 2 == 0;
        }
    };

    static constexpr auto rule = dsl::code_point.if_<even>() + dsl::eof;
};
Note
As the rule uses the type name of Predicate in the error, it does not accept a lambda as predicate, but should be called with a named type instead.
Caution
The same caveat about whitespace as for code_point applies here as well.
Note
See lexy::dsl::unicode for common predefined predicates.

Token rule lexy::dsl::code_point.ascii/bmp/noncharacter

lexy/dsl/code_point.hpp
constexpr token-rule auto ascii() const;
constexpr token-rule auto bmp() const;
constexpr token-rule auto noncharacter() const;

code_point.ascii/bmp/noncharacter is a token rule that matches a code point with the specified classification.

Matching

Matches and consumes a code point as the normal code_point rule does to get a lexy::code_point cp, then checks that cp.is_ascii()/cp.is_bmp()/cp.is_noncharacter() is true.

Errors
  • lexy::expected_char_class ("<name>"): if the code point does not have the classification; at the starting reader position. The rule then fails.

  • All errors raised by the normal code_point rule. The rule then fails.

Note
The other classification functions don’t have rules:
  • cp.is_valid() and cp.is_scalar() are always true; cp.is_surrogate() is never true.

  • cp.is_control() is general category Cc.

  • cp.is_private_use() is general category Co.
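For reference, the three classifications can be expressed as plain range checks on the code point value. This is a sketch based on the Unicode definitions, using hypothetical free functions rather than the lexy::code_point member functions:

```cpp
// Hypothetical free functions for illustration only; lexy exposes these
// checks as member functions of lexy::code_point.

// ASCII: U+0000 to U+007F.
constexpr bool is_ascii(char32_t cp)
{
    return cp <= 0x7F;
}

// Basic Multilingual Plane: U+0000 to U+FFFF.
constexpr bool is_bmp(char32_t cp)
{
    return cp <= 0xFFFF;
}

// Noncharacters: U+FDD0 to U+FDEF, plus the last two code points
// of every plane (U+xxFFFE and U+xxFFFF).
constexpr bool is_noncharacter(char32_t cp)
{
    if (0xFDD0 <= cp && cp <= 0xFDEF)
        return true;
    const auto in_plane = cp & 0xFFFF;
    return in_plane == 0xFFFE || in_plane == 0xFFFF;
}

static_assert(is_ascii(0x41) && !is_ascii(0xE4), "ASCII boundary");
static_assert(is_bmp(0x20AC) && !is_bmp(0x1F642), "BMP boundary");
static_assert(is_noncharacter(0xFDD0), "start of the contiguous block");
static_assert(is_noncharacter(0x1FFFE), "second to last code point of plane 1");
static_assert(!is_noncharacter(0x41), "ordinary character");
```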

Token rule lexy::dsl::code_point.general_category

lexy/dsl/code_point.hpp
template <lexy::code_point::general_category_t Category>
constexpr token-rule auto general_category() const;

template <lexy::code_point::gc-group CategoryGroup>
constexpr token-rule auto general_category() const;

code_point.general_category is a token rule that matches a code point with the specified lexy::code_point::general_category_t or group of categories.

Matching

Matches and consumes a code point as the normal code_point rule does to get a lexy::code_point cp, then checks that cp.general_category() == Category or cp.general_category() == CategoryGroup.

Errors
  • lexy::expected_char_class ("<name of Category>"): if the code point is not in the category; at the starting reader position. The rule then fails.

  • All errors raised by the normal code_point rule. The rule then fails.

Note
While cp.general_category() requires the Unicode database, Cc (Other, control) and Co (Other, private use) are fixed. As an optimization, cp.is_control()/cp.is_private_use() are used instead, so they don’t require the Unicode database.
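The two fixed categories can be checked without any lookup table because their ranges are stable in Unicode. A minimal sketch, again with hypothetical free functions standing in for the lexy::code_point members:

```cpp
// Hypothetical free functions for illustration only.

// General category Cc: the C0 controls, DELETE, and the C1 controls.
constexpr bool is_control(char32_t cp)
{
    return cp <= 0x1F || (0x7F <= cp && cp <= 0x9F);
}

// General category Co: the BMP private use area and planes 15 and 16,
// excluding the two noncharacters at the end of each of those planes.
constexpr bool is_private_use(char32_t cp)
{
    return (0xE000 <= cp && cp <= 0xF8FF)
        || (0xF0000 <= cp && cp <= 0xFFFFD)
        || (0x100000 <= cp && cp <= 0x10FFFD);
}

static_assert(is_control(0x0A), "line feed is a control character");
static_assert(is_control(0x7F), "DELETE is a control character");
static_assert(!is_control(0x20), "space is not a control character");
static_assert(is_private_use(0xE000), "start of the BMP private use area");
static_assert(!is_private_use(0xFFFFE), "noncharacter, not private use");
```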

Token rule lexy::dsl::code_point.range

lexy/dsl/code_point.hpp
template <char32_t Low, char32_t High>
constexpr token-rule auto range() const;

code_point.range is a token rule that matches a code point in the range [Low, High].

Matching

Matches and consumes a code point as the normal code_point rule does to get a lexy::code_point cp, then checks that Low <= cp <= High (both bounds inclusive).

Errors
  • lexy::expected_char_class ("code-point.range"): if the code point is not in the range; at the starting reader position. The rule then fails.

  • All errors raised by the normal code_point rule. The rule then fails.
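The range check is inclusive on both ends. As a sketch of the comparison only (the function template `in_range` is a hypothetical name, not part of lexy):

```cpp
// Hypothetical helper for illustration only: the inclusive check
// Low <= cp <= High, with the bounds given as template arguments.
template <char32_t Low, char32_t High>
constexpr bool in_range(char32_t cp)
{
    return Low <= cp && cp <= High;
}

// U+0370 to U+03FF is the "Greek and Coptic" block.
static_assert(in_range<0x0370, 0x03FF>(0x03B1), "alpha lies inside the block");
static_assert(in_range<0x0370, 0x03FF>(0x0370), "lower bound is included");
static_assert(in_range<0x0370, 0x03FF>(0x03FF), "upper bound is included");
static_assert(!in_range<0x0370, 0x03FF>(0x0400), "first code point past the block");
```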

See also