Header lexy/dsl/bom.hpp

Token rule lexy::dsl::bom

lexy/dsl/bom.hpp
namespace lexy::dsl
{
    template <typename Encoding, lexy::encoding_endianness Endianness>
    constexpr token-rule auto bom;
}

bom is a token rule that matches the byte order mark (BOM) of the given encoding in the given lexy::encoding_endianness.

Requires
  • The combination of Encoding and Endianness is listed in the table below.

  • The input encoding is lexy::byte_encoding.

Matching

Matches and consumes the byte sequence given in the table below. This is done by repeatedly comparing the code unit against the numeric values.

Errors

lexy::expected_char_class ("BOM.<encoding>-<endianness>"): at the starting position. The rule then fails.

Parse tree

Single token node with the lexy::predefined_token_kind lexy::literal_token_kind.

The possible BOMs
Encoding Endianness BOM

UTF-8

ignored

0xEF, 0xBB, 0xBF

UTF-16

little

0xFF, 0xFE

UTF-16

big

0xFE, 0xFF

UTF-32

little

0xFF, 0xFE, 0x00, 0x00

UTF-32

big

0x00, 0x00, 0xFE, 0xFF

Example 1. Skip an optional UTF-8 BOM
struct production
{
    static constexpr auto rule = [] {
        auto bom = dsl::bom<lexy::utf8_encoding,
                            // Doesn't matter for UTF-8.
                            lexy::encoding_endianness::little>;
        return dsl::opt(bom) + LEXY_LIT("Hello!") + dsl::eof;
    }();
};
Caution
As a token rule, it matches whitespace immediately following the BOM. As such, the rule is best used in contexts where automatic whitespace skipping is disabled.
Note
When using lexy::read_file() as input, BOM has been taken care of by default.

See also