Header `lexy/dsl/unicode.hpp`

Char class rules for matching Unicode char classes.

Unicode char classes

lexy/dsl/unicode.hpp

namespace lexy::dsl
{
    namespace unicode
    {
        constexpr char-class-rule auto control;

        constexpr char-class-rule auto blank;
        constexpr char-class-rule auto newline;
        constexpr char-class-rule auto other_space;
        constexpr char-class-rule auto space;

        constexpr char-class-rule auto digit;

        constexpr char-class-rule auto lower;
        constexpr char-class-rule auto upper;
        constexpr char-class-rule auto alpha;

        constexpr char-class-rule auto alpha_digit;
        constexpr char-class-rule auto alnum = alpha_digit;

        constexpr char-class-rule auto word;

        constexpr char-class-rule auto graph;
        constexpr char-class-rule auto print;

        constexpr char-class-rule auto character;
    }
}

These char class rules match one Unicode code point from a char class, as specified in the table below.

Each class is a superset of the corresponding rule in lexy::dsl::ascii. They require the Unicode database.

The char classes

Token Rule	Char Class
`control`	`Cc` (Other, control)
`blank`	`Zs` (Separator, space) or `\t`
`newline`	`\r`, `\n`, `NEL`, `LINE SEPARATOR`, or `PARAGRAPH SEPARATOR`
`other_space`	`\f` or `\v`
`space`	`Whitespace`, which is `blank`, `newline` or `other_space`
`digit`	`Nd` (Number, decimal digit)
`lower`	`Lowercase`
`upper`	`Uppercase`
`alpha`	`Alphabetic`
`alpha_digit`	`alpha`, `digit`
`word`	`alpha`, `digit`, `M` (Mark), `Pc` (Punctuation, connector), join control
`graph`	everything but `space`, `control`, `Cs` (Other, surrogate), `Cn` (Other, not assigned)
`print`	`graph` or `blank` but without `control`
`character`	any code point that is assigned (i.e. not `Cn` (Other, not assigned))
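
To make the composition of `space` concrete, the following is a minimal standalone sketch of the `blank`, `newline`, `other_space` and `space` rows as plain predicates over code points. It is not lexy's implementation and does not use the Unicode database: the function names are hypothetical, and the `Zs` code points are hard-coded from the Unicode general category as an assumption of this sketch.

```cpp
// Illustration only: hand-written predicates, not lexy's implementation.

// `blank`: general category Zs (Separator, space) or \t.
// The Zs code points are hard-coded here (assumption of this sketch).
constexpr bool is_blank(char32_t cp)
{
    return cp == U'\t' || cp == 0x0020 || cp == 0x00A0 || cp == 0x1680
           || (cp >= 0x2000 && cp <= 0x200A) || cp == 0x202F || cp == 0x205F
           || cp == 0x3000;
}

// `newline`: \r, \n, NEL (U+0085), LINE SEPARATOR (U+2028),
// or PARAGRAPH SEPARATOR (U+2029).
constexpr bool is_newline(char32_t cp)
{
    return cp == U'\r' || cp == U'\n' || cp == 0x0085 || cp == 0x2028 || cp == 0x2029;
}

// `other_space`: \f or \v.
constexpr bool is_other_space(char32_t cp)
{
    return cp == U'\f' || cp == U'\v';
}

// `space`: the union of the three classes above.
constexpr bool is_space(char32_t cp)
{
    return is_blank(cp) || is_newline(cp) || is_other_space(cp);
}

static_assert(is_space(U' '), "ASCII space is blank");
static_assert(is_space(0x3000), "IDEOGRAPHIC SPACE is blank");
static_assert(is_space(0x2028), "LINE SEPARATOR is a newline");
static_assert(!is_space(U'a'), "letters are not whitespace");
```

In real grammars, prefer the `dsl::unicode` rules themselves, which consult the Unicode database instead of a fixed list.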

Caution

Unlike in the ASCII case, alpha is not just the union of lower and upper: there are alphabetic characters that don’t have a case.

Caution

Do not confuse lexy::dsl::unicode::newline, which matches a single code point such as \r or \n (among others), with lexy::dsl::newline, which matches the sequence \r\n or \n!
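
The difference between the two newline rules can be sketched without lexy. The two helpers below are hypothetical stand-ins that only mirror the behaviour described above: one consumes a single code point from the newline char class, the other consumes the sequence \r\n or \n. Each returns the number of code points it would consume, or 0 if it does not match. The sketch assumes C++17 for `std::u32string_view`.

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical stand-in for the char class: consumes exactly one code point
// out of \r, \n, NEL, LINE SEPARATOR, PARAGRAPH SEPARATOR.
constexpr std::size_t match_unicode_newline(std::u32string_view in)
{
    if (in.empty())
        return 0;
    char32_t cp = in.front();
    bool in_class = cp == U'\r' || cp == U'\n' || cp == 0x0085
                    || cp == 0x2028 || cp == 0x2029;
    return in_class ? 1 : 0;
}

// Hypothetical stand-in for the line-ending rule: consumes "\r\n" or "\n".
constexpr std::size_t match_line_ending(std::u32string_view in)
{
    if (in.substr(0, 2) == U"\r\n")
        return 2;
    if (!in.empty() && in.front() == U'\n')
        return 1;
    return 0;
}

// On "\r\n" the char class stops after \r, the line ending takes both.
static_assert(match_unicode_newline(U"\r\n") == 1);
static_assert(match_line_ending(U"\r\n") == 2);
// A lone \r or a LINE SEPARATOR is in the char class but is not a line ending.
static_assert(match_unicode_newline(U"\r") == 1);
static_assert(match_line_ending(U"\r") == 0);
static_assert(match_unicode_newline(U"\u2028") == 1);
static_assert(match_line_ending(U"\u2028") == 0);
```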

Caution

As token rules, they also match whitespace immediately following the code point. As such, they are best used in contexts where automatic whitespace skipping is disabled. They can safely be used as part of the whitespace definition.

Note

There is no dsl::unicode::punct. The Unicode standard defines punctuation as general category P (Punctuation), which is unsatisfactory because, unlike dsl::ascii::punct, it does not include e.g. $ (Unicode classifies it as a currency symbol instead). The POSIX definition includes $ as well as other non-alphabetic symbols, which is also unsatisfactory because dsl::unicode::punct would then include characters that Unicode does not consider punctuation.

Unicode identifier classes

lexy/dsl/unicode.hpp

namespace lexy::dsl
{
    namespace unicode
    {
        constexpr char-class-rule auto xid_start;
        constexpr char-class-rule auto xid_start_underscore;
        constexpr char-class-rule auto xid_continue;
    }
}

These char class rules match one Unicode code point from the XID_Start/XID_Continue character classes. They are used with lexy::dsl::identifier to parse Unicode-aware identifiers.

xid_start matches any Unicode character that can occur at the beginning of an identifier. It is a superset of lexy::dsl::ascii::alpha.
xid_start_underscore matches xid_start or _ (underscore). It is a superset of lexy::dsl::ascii::alpha_underscore.
xid_continue matches any Unicode character that can occur after the initial character of an identifier. It is a superset of lexy::dsl::ascii::alpha_digit_underscore.

They require the Unicode database.

Example 1. Parse a Unicode-aware C-like identifier

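#include <lexy/dsl.hpp>

namespace dsl = lexy::dsl;

struct production
{
    // The leading character may be '_' as well; later characters use xid_continue.
    static constexpr auto rule
        = dsl::identifier(dsl::unicode::xid_start_underscore, // want '_' as well
                          dsl::unicode::xid_continue);
};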

Warning

xid_start does not include _ (underscore)!
