Header lexy/dsl/unicode.hpp

Token rules for matching Unicode character classes.

Unicode character classes

lexy/dsl/unicode.hpp
namespace lexy::dsl
{
    namespace unicode
    {
        constexpr token-rule auto control;

        constexpr token-rule auto blank;
        constexpr token-rule auto newline;
        constexpr token-rule auto other_space;
        constexpr token-rule auto space;

        constexpr token-rule auto digit;

        constexpr token-rule auto lower;
        constexpr token-rule auto upper;
        constexpr token-rule auto alpha;

        constexpr token-rule auto alpha_digit;
        constexpr token-rule auto alnum = alpha_digit;

        constexpr token-rule auto word;

        constexpr token-rule auto graph;
        constexpr token-rule auto print;

        constexpr token-rule auto character;
    }
}

These token rules match one Unicode code point from a character class.

They are implemented using lexy::dsl::code_point.if_ with an appropriate predicate (see the table below) and require the Unicode database.

Each rule matches a superset of the corresponding rule in lexy::dsl::ascii.

The character classes

Token Rule     Character Class
control        Cc (Other, control)
blank          Zs (Separator, space) or \t
newline        \r, \n, NEL, LINE SEPARATOR, or PARAGRAPH SEPARATOR
other_space    \f or \v
space          Whitespace, which is blank, newline or other_space
digit          Nd (Number, decimal digit)
lower          Lowercase
upper          Uppercase
alpha          Alphabetic
alpha_digit    alpha, digit
word           alpha, digit, M (Mark), Pc (Punctuation, connector), join control
graph          everything but space, control, Cs (Other, surrogate), Cn (Other, not assigned)
print          graph or blank but without control
character      any code point that is assigned (i.e. not Cn (Other, not assigned))
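The whitespace rows of the table compose: space is the union of blank, newline and other_space. The following standalone sketch spells out that composition with plain predicates. It does not use lexy and is not how the library implements the rules. The function names are hypothetical, and the Zs list is hard-coded under the assumption of a recent Unicode version, whereas lexy consults its Unicode database.

```cpp
#include <cassert>

// Zs (Separator, space), hard-coded for illustration (assumption: recent Unicode version).
constexpr bool is_zs(char32_t c)
{
    return c == 0x0020 || c == 0x00A0 || c == 0x1680 || (c >= 0x2000 && c <= 0x200A)
           || c == 0x202F || c == 0x205F || c == 0x3000;
}

// blank: Zs or \t
constexpr bool is_blank(char32_t c)
{
    return is_zs(c) || c == U'\t';
}

// newline: \r, \n, NEL, LINE SEPARATOR, or PARAGRAPH SEPARATOR
constexpr bool is_newline(char32_t c)
{
    return c == U'\r' || c == U'\n' || c == 0x0085 || c == 0x2028 || c == 0x2029;
}

// other_space: \f or \v
constexpr bool is_other_space(char32_t c)
{
    return c == U'\f' || c == U'\v';
}

// space: blank, newline or other_space
constexpr bool is_space(char32_t c)
{
    return is_blank(c) || is_newline(c) || is_other_space(c);
}

static_assert(is_space(0x00A0), "NO-BREAK SPACE is Zs, hence blank, hence space");
static_assert(!is_blank(U'\n'), "a line feed is newline, not blank");
static_assert(!is_space(U'a'), "letters are not whitespace");

// Small usage example with run-time checks.
inline void check_whitespace_classes()
{
    assert(is_blank(0x3000));    // IDEOGRAPHIC SPACE
    assert(is_newline(0x2028));  // LINE SEPARATOR
    assert(is_other_space(U'\v'));
    assert(!is_space(0x200B));   // ZERO WIDTH SPACE is Cf, not whitespace
}
```

Every ASCII whitespace character is accepted by one of the three component predicates, consistent with each rule matching a superset of its lexy::dsl::ascii counterpart.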

Caution
Unlike in the ASCII case, alpha is not lower or upper: there are alphabetic characters that don’t have a case.
Caution
Differentiate between lexy::dsl::unicode::newline, which matches \r or \n and others, and lexy::dsl::newline, which matches \r\n or \n!
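To make the difference concrete, here is a standalone sketch that models how many code points each rule would consume at the start of an input, based only on the descriptions above. It does not call lexy. Both helper names are hypothetical, and working on std::u32string is an assumption made to keep the example short.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Models lexy::dsl::unicode::newline: exactly one code point from the newline class.
std::size_t unicode_newline_length(const std::u32string& s)
{
    if (s.empty())
        return 0;
    const char32_t c = s[0];
    const bool is_newline
        = c == U'\r' || c == U'\n' || c == 0x0085 || c == 0x2028 || c == 0x2029;
    return is_newline ? 1 : 0;
}

// Models lexy::dsl::newline: the sequence \r\n or a single \n.
std::size_t line_ending_length(const std::u32string& s)
{
    if (s.compare(0, 2, U"\r\n") == 0)
        return 2;
    if (s.compare(0, 1, U"\n") == 0)
        return 1;
    return 0;
}

// Small usage example.
inline void check_newline_difference()
{
    // On \r\n the Unicode class only consumes the \r ...
    assert(unicode_newline_length(U"\r\nrest") == 1);
    // ... while the line-ending rule consumes both characters.
    assert(line_ending_length(U"\r\nrest") == 2);
    // A lone \r is in the Unicode class but is not a line ending for dsl::newline.
    assert(unicode_newline_length(U"\r") == 1);
    assert(line_ending_length(U"\r") == 0);
}
```

In this model, using the Unicode class to end a line on \r\n input leaves the \n unconsumed.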
Caution
As token rules, they also match any whitespace immediately following the character when automatic whitespace skipping is enabled. As such, they are best used in contexts where automatic whitespace skipping is disabled. They can safely be used as part of the whitespace definition.
Note
There is no dsl::unicode::punct. The Unicode standard defines punctuation as general category P (Punctuation), which is unsatisfactory because, unlike dsl::ascii::punct, it does not include e.g. $ (Unicode classifies it as a currency symbol instead). The POSIX definition includes $ as well as other non-alphabetic symbols, which is also unsatisfactory because dsl::unicode::punct would then include characters that Unicode does not consider punctuation.
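The mismatch is already visible within ASCII. The sketch below uses only the C++ standard library, not lexy, and assumes the default "C" locale. It counts how many of the ASCII characters that Unicode places in general category S (Symbol) are nevertheless punctuation according to std::ispunct. The names ascii_symbols and count_posix_punct are hypothetical.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>

// ASCII characters whose Unicode general category is S (Symbol) rather than
// P (Punctuation): $ is Sc, + < = > | ~ are Sm, ^ and ` are Sk.
const std::string ascii_symbols = "$+<=>^`|~";

// Counts the characters of s that std::ispunct classifies as punctuation.
std::size_t count_posix_punct(const std::string& s)
{
    std::size_t count = 0;
    for (unsigned char c : s)
        if (std::ispunct(c) != 0)
            ++count;
    return count;
}

// Small usage example.
inline void check_punct_mismatch()
{
    // All nine symbols count as punctuation in the "C" locale.
    assert(count_posix_punct(ascii_symbols) == ascii_symbols.size());
    // Characters that Unicode also puts in category P agree in both definitions.
    assert(count_posix_punct("!?") == 2);
}
```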

Unicode identifier classes

lexy/dsl/unicode.hpp
namespace lexy::dsl
{
    namespace unicode
    {
        constexpr token-rule auto xid_start;
        constexpr token-rule auto xid_start_underscore;
        constexpr token-rule auto xid_continue;
    }
}

These token rules match one Unicode code point from the XID_Start/XID_Continue character classes. They are used to parse Unicode-aware lexy::dsl::identifier.

They are implemented using lexy::dsl::code_point.if_ with an appropriate predicate and require the Unicode database.

Example 1. Parse a Unicode-aware C-like identifier
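#include <lexy/dsl.hpp>

namespace dsl = lexy::dsl;

struct production
{
    // The leading character may be XID_Start or '_'; trailing characters are XID_Continue.
    static constexpr auto rule
        = dsl::identifier(dsl::unicode::xid_start_underscore, // want '_' as well
                          dsl::unicode::xid_continue);
};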
Warning
xid_start does not include _ (underscore)!
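The effect of the missing underscore can be sketched without lexy by restricting the classes to their ASCII subsets, where XID_Start is the letters and XID_Continue is letters, digits and the underscore. This is a simplification for illustration only: all names below are hypothetical, the real rules also accept non-ASCII code points, and the helper checks a whole string whereas lexy::dsl::identifier matches a prefix of the input.

```cpp
#include <cassert>
#include <string>

// ASCII subset of XID_Start: letters only, no underscore.
bool ascii_xid_start(char c)
{
    return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
}

// ASCII subset of xid_start_underscore: XID_Start or '_'.
bool ascii_xid_start_underscore(char c)
{
    return ascii_xid_start(c) || c == '_';
}

// ASCII subset of XID_Continue: letters, digits, and '_'.
bool ascii_xid_continue(char c)
{
    return ascii_xid_start(c) || (c >= '0' && c <= '9') || c == '_';
}

// True if s consists of one leading character followed by zero or more trailing characters.
bool is_identifier(const std::string& s, bool (*leading)(char), bool (*trailing)(char))
{
    if (s.empty() || !leading(s[0]))
        return false;
    for (std::string::size_type i = 1; i < s.size(); ++i)
        if (!trailing(s[i]))
            return false;
    return true;
}

// Small usage example.
inline void check_identifier_classes()
{
    // With xid_start as the leading class, a leading underscore is rejected ...
    assert(!is_identifier("_tmp", ascii_xid_start, ascii_xid_continue));
    // ... and accepted with xid_start_underscore.
    assert(is_identifier("_tmp", ascii_xid_start_underscore, ascii_xid_continue));
    // Underscores after the first character are fine either way.
    assert(is_identifier("tmp_1", ascii_xid_start, ascii_xid_continue));
}
```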
