Header lexy/encoding.hpp
Facilities for handling the encoding of an input.
template <typename T>
concept encoding = …;

An encoding determines the value type of the input and how it is interpreted by the rules.
Some rules require a certain encoding.
For example, lexy::dsl::code_point does not work with lexy::default_encoding.
lexy supports two kinds of encodings: char encodings and node encodings.
A char encoding is an encoding where the input is a sequence of characters.
Its primary character type ::char_type is the actual character type an input gives you.
It can also have secondary character types, which are accepted as well but internally normalized to the primary character type.
Its integer type ::int_type is an integer type that can store any valid character as well as a special EOF value.
If this is the same type as the primary character type, some optimizations are possible.
Char encodings are further divided into text encodings (the input is human readable text) and byte encodings (the input contains arbitrary bytes).
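For example, a minimal sketch using lexy::zstring_input (the string contents are placeholders) showing that lexy::utf8_encoding accepts both its primary and its secondary character type:

#include <lexy/encoding.hpp>
#include <lexy/input/string_input.hpp>

// utf8_encoding's primary character type is char8_t, so this is the
// character type the resulting input gives you.
auto primary = lexy::zstring_input<lexy::utf8_encoding>(u8"hello");

// Plain char is a secondary character type of utf8_encoding: it is
// accepted as well, but normalized to char8_t internally.
auto secondary = lexy::zstring_input<lexy::utf8_encoding>("hello");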
A node encoding is an encoding where the input is already parsed into a lexy::parse_tree.
Its value type ::value_type is the type of the nodes in the parse tree, and its ::char_encoding is the underlying char encoding of the parse tree.
For details about node encodings, see lexy::parse_tree_input.
Supported char encodings
lexy/encoding.hpp

namespace lexy
{
    struct default_encoding {};   // any 8-bit encoding
    struct ascii_encoding {};     // ASCII
    struct utf8_encoding {};      // UTF-8, char8_t
    struct utf8_char_encoding {}; // UTF-8, char
    struct utf16_encoding {};     // UTF-16
    struct utf32_encoding {};     // UTF-32
    struct byte_encoding {};      // not text
}

Tag types for the char encodings supported by lexy.
The character types

| Encoding | Primary character type | Secondary character type(s) |
|---|---|---|
| default_encoding | char | none |
| ascii_encoding | char | none |
| utf8_encoding | char8_t | char |
| utf8_char_encoding | char | char8_t |
| utf16_encoding | char16_t | wchar_t, if wchar_t is 16 bits (e.g. Windows) |
| utf32_encoding | char32_t | wchar_t, if wchar_t is 32 bits (e.g. Linux) |
| byte_encoding | unsigned char | char, std::byte |
Each tag type corresponds to the encoding indicated and it uses the character types defined in the table above. If an encoding is specified, lexy assumes that the input is valid for this encoding and will not perform any checking unless otherwise indicated. If the input contains invalid code units or invalid sequences of code units, lexy may or may not generate bogus parse errors.
Note | For example, if the input contains the code unit lexy uses internally to signal EOF, which is never a valid code unit for the encoding, lexy will raise an unexpected EOF error before the actual end of the input. |
default_encoding is used as a fallback if no encoding is specified or could be deduced.
This encoding works with any 8-bit encoding where the code units 0x00-0x7F correspond to the ASCII characters (ASCII, UTF-8, Windows code pages, etc.).
Rules that require knowledge about the actual encoding, such as lexy::dsl::code_point, do not work, and some optimizations are impossible, as lexy has to assume that every 8-bit value is a valid code unit.
Tip | If you know that your input is ASCII or UTF-8, use ascii_encoding/utf8_encoding instead. |
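For instance, a minimal sketch of picking the encoding when creating an input (the string contents are placeholders):

#include <lexy/input/string_input.hpp>

// Without an explicit encoding, char input falls back to default_encoding.
auto fallback = lexy::zstring_input("[1, 2, 3]");

// If the input is known to be ASCII, saying so enables encoding-aware
// rules such as lexy::dsl::code_point and additional optimizations.
auto ascii = lexy::zstring_input<lexy::ascii_encoding>("[1, 2, 3]");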
byte_encoding is used to indicate that the input does not contain actual text but arbitrary bytes.
It can also be used if you’re parsing text consisting of a mix of different encodings.
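For example, a minimal sketch constructing a byte input from raw memory via lexy::buffer (the byte values are placeholders):

#include <lexy/input/buffer.hpp>

// byte_encoding's primary character type is unsigned char;
// char and std::byte are accepted as secondary character types.
const unsigned char bytes[] = {0xDE, 0xAD, 0xBE, 0xEF};
auto input = lexy::buffer<lexy::byte_encoding>(bytes, sizeof(bytes));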
Alias template lexy::deduce_encoding
lexy/encoding.hpp

namespace lexy
{
    template <typename CharT>
    using deduce_encoding = …;
}

Deduces a char encoding from the character type of a string.
Deduced encoding

| Character type | Encoding |
|---|---|
| char | default_encoding |
| char8_t | utf8_encoding |
| char16_t | utf16_encoding |
| char32_t | utf32_encoding |
| unsigned char | byte_encoding |
| std::byte | byte_encoding |
What encoding is deduced from a character type is specified in the table above.
If CharT is not listed there, the alias is ill-formed.
The encoding corresponding to char can be overridden by defining the macro LEXY_ENCODING_OF_CHAR to one of lexy::default_encoding, lexy::ascii_encoding, lexy::utf8_encoding, and lexy::utf8_char_encoding.
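To illustrate the deduction, a few static_asserts (a sketch assuming a C++20 compiler for char8_t and no LEXY_ENCODING_OF_CHAR override):

#include <type_traits>
#include <lexy/encoding.hpp>

// char maps to default_encoding unless LEXY_ENCODING_OF_CHAR is
// defined before the header is included.
static_assert(std::is_same_v<lexy::deduce_encoding<char>, lexy::default_encoding>);
static_assert(std::is_same_v<lexy::deduce_encoding<char8_t>, lexy::utf8_encoding>);
static_assert(std::is_same_v<lexy::deduce_encoding<char16_t>, lexy::utf16_encoding>);
static_assert(std::is_same_v<lexy::deduce_encoding<char32_t>, lexy::utf32_encoding>);
static_assert(std::is_same_v<lexy::deduce_encoding<unsigned char>, lexy::byte_encoding>);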
Enum class lexy::encoding_endianness
lexy/encoding.hpp

namespace lexy
{
    enum class encoding_endianness
    {
        little,
        big,
        bom,
    };
}

Defines the endianness of a multi-byte char encoding when necessary.
In memory, UTF-16 and UTF-32 can be encoded in either big or little endian.
If raw bytes need to be interpreted as code units in those encodings,
as is done by lexy::read_file, the desired endianness can be specified using encoding_endianness.
little: The input is stored in little endian. For single-byte encodings, this has no effect.
big: The input is stored in big endian. For single-byte encodings, this has no effect.
bom: The input is stored in the endianness specified by the byte order mark (BOM) at the beginning of the input. This option only has an effect for UTF-8, UTF-16, and UTF-32. For UTF-8, it skips an optional BOM but otherwise does nothing. If no BOM is present, it defaults to big endian.
lexy::make_buffer_from_raw and lexy::read_file use this enum to handle the endianness of the raw memory,
and the rule lexy::dsl::bom uses it to specify a given BOM.
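As an illustration, a minimal sketch combining both uses (the file name and byte values are placeholders):

#include <lexy/input/buffer.hpp>
#include <lexy/input/file.hpp>

int main()
{
    // Read a UTF-16 file, picking the endianness from its BOM
    // (big endian if no BOM is present).
    auto file = lexy::read_file<lexy::utf16_encoding,
                                lexy::encoding_endianness::bom>("doc.txt");

    // Interpret raw bytes already in memory as little-endian UTF-16;
    // these four bytes are the code units of u"hi".
    const unsigned char raw[] = {0x68, 0x00, 0x69, 0x00};
    auto buffer = lexy::make_buffer_from_raw<lexy::utf16_encoding,
                                             lexy::encoding_endianness::little>(raw, sizeof(raw));
}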