Header lexy/dsl/unicode.hpp

Char class rules for matching Unicode char classes.

Unicode char classes

lexy/dsl/unicode.hpp
namespace lexy::dsl
{
    namespace unicode
    {
        constexpr char-class-rule auto control;

        constexpr char-class-rule auto blank;
        constexpr char-class-rule auto newline;
        constexpr char-class-rule auto other_space;
        constexpr char-class-rule auto space;

        constexpr char-class-rule auto digit;

        constexpr char-class-rule auto lower;
        constexpr char-class-rule auto upper;
        constexpr char-class-rule auto alpha;

        constexpr char-class-rule auto alpha_digit;
        constexpr char-class-rule auto alnum = alpha_digit;

        constexpr char-class-rule auto word;

        constexpr char-class-rule auto graph;
        constexpr char-class-rule auto print;

        constexpr char-class-rule auto character;
    }
}

These char class rules match one Unicode code point from a char class, as specified in the table below.

Each class is a superset of the corresponding rule in lexy::dsl::ascii. They require the Unicode database.

The char classes
Token Rule     Char Class
control        Cc (Other, control)
blank          Zs (Separator, space) or \t
newline        \r, \n, NEL, LINE SEPARATOR, or PARAGRAPH SEPARATOR
other_space    \f or \v
space          Whitespace, which is blank, newline or other_space
digit          Nd (Number, decimal digit)
lower          Lowercase
upper          Uppercase
alpha          Alphabetic
alpha_digit    alpha, digit
word           alpha, digit, M (Mark), Pc (Punctuation, connector), join control
graph          everything but space, control, Cs (Other, surrogate), Cn (Other, not assigned)
print          graph or blank but without control
character      any code point that is assigned (i.e. not Cn (Other, not assigned))
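The way space is composed from blank, newline and other_space can be sketched with plain predicates. This is only an illustration, not lexy's implementation: the helper names are hypothetical, and the Zs code points are hard-coded on the assumption of a recent Unicode version, whereas the real rules consult the Unicode database.

```cpp
// Standalone illustration; these helpers are hypothetical and not part of lexy.

// blank: '\t' or a code point of general category Zs (Separator, space).
// The Zs list is hard-coded here (assumption: recent Unicode version).
constexpr bool is_blank(char32_t c)
{
    return c == U'\t' || c == U' ' || c == 0x00A0 || c == 0x1680
           || (c >= 0x2000 && c <= 0x200A) || c == 0x202F || c == 0x205F
           || c == 0x3000;
}

// newline: '\r', '\n', NEL, LINE SEPARATOR, or PARAGRAPH SEPARATOR.
constexpr bool is_newline(char32_t c)
{
    return c == U'\r' || c == U'\n' || c == 0x0085 || c == 0x2028 || c == 0x2029;
}

// other_space: '\f' or '\v'.
constexpr bool is_other_space(char32_t c)
{
    return c == U'\f' || c == U'\v';
}

// space: the union of the three classes above.
constexpr bool is_space(char32_t c)
{
    return is_blank(c) || is_newline(c) || is_other_space(c);
}
```

For instance, is_space(0x3000) (IDEOGRAPHIC SPACE) holds through is_blank, while is_space(0x200B) (ZERO WIDTH SPACE) does not, because that code point is not in Zs.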

Caution
Unlike in the ASCII case, alpha is not just the union of lower and upper: there are alphabetic characters that don’t have a case.
Caution
Differentiate between lexy::dsl::unicode::newline, which matches a single code point such as \r or \n (among others), and lexy::dsl::newline, which matches the sequence \r\n or \n!
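The difference shows up in how much input each rule consumes. The following sketch uses two hypothetical stand-in functions (they are not lexy APIs) that return the number of code points matched at the start of the input, assuming the behavior described in this caution.

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical stand-in for dsl::unicode::newline:
// matches exactly one code point from the newline char class.
constexpr std::size_t match_unicode_newline(std::u32string_view in)
{
    if (in.empty())
        return 0;
    const char32_t c = in.front();
    const bool hit = c == U'\r' || c == U'\n' || c == 0x0085
                     || c == 0x2028 || c == 0x2029;
    return hit ? 1 : 0;
}

// Hypothetical stand-in for dsl::newline:
// matches the sequence "\r\n" or a lone "\n".
constexpr std::size_t match_line_ending(std::u32string_view in)
{
    if (in.size() >= 2 && in[0] == U'\r' && in[1] == U'\n')
        return 2;
    if (!in.empty() && in[0] == U'\n')
        return 1;
    return 0;
}
```

On the input "\r\n", the first function consumes only the \r, while the second consumes both code points. On a lone "\r", the first matches and the second does not.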
Caution
As token rules, they also match whitespace immediately following the character. As such, they are best used in contexts where automatic whitespace skipping is disabled. They can safely be used as part of the whitespace definition.
Note
There is no dsl::unicode::punct. The Unicode standard defines punctuation as general category P (Punctuation), which is unsatisfactory because it excludes e.g. $ (Unicode classifies it as a currency symbol), even though dsl::ascii::punct matches it. POSIX includes $ as well as other non-alphabetic symbols, which is equally unsatisfactory because dsl::unicode::punct would then match characters that Unicode does not consider punctuation.

Unicode identifier classes

lexy/dsl/unicode.hpp
namespace lexy::dsl
{
    namespace unicode
    {
        constexpr char-class-rule auto xid_start;
        constexpr char-class-rule auto xid_start_underscore;
        constexpr char-class-rule auto xid_continue;
    }
}

These char class rules match one Unicode code point from the XID_Start/XID_Continue character classes. They are used to parse Unicode-aware lexy::dsl::identifier.

They require the Unicode database.

Example 1. Parse a Unicode-aware C-like identifier
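#include <lexy/dsl.hpp>

namespace dsl = lexy::dsl;

struct production
{
    // Leading character, then any number of trailing characters.
    static constexpr auto rule
        = dsl::identifier(dsl::unicode::xid_start_underscore, // want '_' as well
                          dsl::unicode::xid_continue);
};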
Warning
xid_start does not include _ (underscore)!

See also