
Enhance lexer for native records and related syntax in OTP 29#34

Open
garazdawi wants to merge 15 commits into elixir-makeup:master from garazdawi:master

Conversation

@garazdawi
Contributor

I started by adding native record support and then decided to do a full facelift of many of the features that have landed in the last few releases, fixing some bugs along the way. This PR adds support for:

  • native records
  • triple (and quadruple) quoted strings
  • numbers with underscore digit separators (`1_000_000`)

and a bunch of small fixes.

garazdawi and others added 15 commits May 6, 2026 20:48
Native records introduced new syntactic forms in Erlang/OTP 29
(erlang/otp PR #11090). The shape that the existing record rule
does not match is the external construction / pattern / field
access form `#Module:Name{...}` and `#Module:Name.field`, where
`Module:Name` appears between the `#` and the `{` (or `.`).

Tokenize the module qualifier as `:name_class` (matching the
existing namespace pattern) and the record name as the existing
record-name token (`:string_symbol`). Local construction
(`#Name{...}`) is identical in shape to a tuple-based record and
needs no change — the lexer cannot disambiguate from local
context, so both colour the same way.

The other native-record forms — `-record #Name{...}.` definition
attribute, `-export_record([...])`, `-import_record(Mod, [...])`
— already tokenize correctly under the existing module_attribute
rule (the attribute name is captured generically and `(` is
already optional). Tests added to lock the expected output.
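A minimal regex sketch of the new external shape (not the actual NimbleParsec rule; name classes are simplified to lowercase atoms, and the module/name groups are illustrative):

```python
import re

# Sketch: the external native-record form is `#Module:Name` followed by
# `{` (construction/pattern) or `.` (field access). The lookahead keeps
# the `{` / `.` out of the matched name, as the lexer tokenizes it
# separately.
EXTERNAL = re.compile(
    r"#(?P<module>[a-z][a-zA-Z0-9_@]*):(?P<name>[a-z][a-zA-Z0-9_@]*)(?=[{.])"
)

m = EXTERNAL.match("#mymod:vector{x = 1}")
print(m.group("module"), m.group("name"))
```

The same pattern matches the field-access form `#mymod:vector.x`, since the lookahead accepts either `{` or `.`.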

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Add tests for native record patterns and updates

The native-records commit covered construction (`#mod:name{f=v}`)
but not the symmetric forms: pattern matching
(`#mod:name{f = X} = Y`) and updates via a prefixed variable
(`Y#mod:name{f = 2}`). Both already work via the same rule but
nothing tested them; lock the coverage so a future ordering tweak
in the choice can't silently regress them.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Accept variable-shape and keyword names in native records

OTP 29's native-record syntax relaxes the record-name rule: per the
spec at https://www.erlang.org/doc/system/data_types.html, "it is
not necessary to quote atoms that look like variable names or
keywords." So `#State{}`, `#div{}`, `#case{}`, `#fun{}` are all
valid record references even though `State` is variable-shape and
`div`/`case`/`fun` are reserved words.

The previous rule used `atom_name` only, which requires lowercase
or quoted. `#State{}` therefore fell through to the punctuation /
variable rules and produced disjoint tokens with no record-shape
grouping. Add a `record_name` combinator that accepts either
`atom_name` or `variable_name`, both tagged `:string_symbol`, and
use it in both `record` and `native_record_external`. Tuple-based
records don't actually allow these forms, but the lexer can't tell
the two record kinds apart from local context — so accept the
union.

Keyword and word-operator names (`#case{}`, `#div{}`, `#fun{}`)
still get re-tagged by postprocess to `:keyword` /
`:operator_word`. That's accepted output: the surrounding `#...{`
shape still groups visually as a record reference, and themes that
care can render keywords-in-record-position differently if
desired.

Tests cover all four name shapes (lowercase atom, variable,
keyword, quoted) in three positions (local construction, external
construction, definition attribute), including the OTP 29 spec
example `-record #vector{x = 0.0, y = 0.0}.`.
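A rough sketch of the relaxed name rule as a regex union (purely illustrative; the real `record_name` is a NimbleParsec combinator and quoted atoms are elided here):

```python
import re

# Sketch: a record name may now be atom-shaped (lowercase start),
# variable-shaped (uppercase/underscore start), or a reserved word,
# which is atom-shaped anyway. Unioning the two shapes covers all of
# `#state{`, `#State{`, `#case{`, `#div{`.
RECORD_NAME = re.compile(r"#([a-zA-Z_][a-zA-Z0-9_@]*)\{")

for src in ["#state{", "#State{", "#case{", "#div{", "#fun{"]:
    assert RECORD_NAME.match(src), src
```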

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Don't reclassify keyword / builtin record names in postprocess

When `#case{}` is a record reference, the existing postprocess
clauses re-tagged the inner `:string_symbol "case"` as `:keyword`,
because the conversion was unconditional on the value matching
the keyword list. Visually that meant the record-name slot
flipped colour depending on whether the chosen name happened to
be a reserved word — confusing for readers, and wrong because the
text in that position is a record name, not an expression keyword.

Tag record-name tokens with a `record_name: true` meta marker via
the `record_name` combinator. Add a postprocess clause that
matches the marker and bypasses the keyword / builtin /
word-operator reclassification, then strips the marker so it
doesn't leak into rendered tokens. Both `:string_symbol` clauses
(keyword, builtin, word-operator) are guarded by the marker check
implicitly because pattern matching is order-sensitive and the
marker clause comes first.

Tests for `#case{}`, `#fun{}`, `#div{}`, and `#mod:case{}` now
assert `:string_symbol` for the record name (matching the
lowercase-name behaviour) instead of `:keyword` /
`:operator_word`. A new test verifies the marker doesn't leak
through to output meta.
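The clause ordering can be modeled like this (a hedged sketch with invented token shapes; the real lexer uses Makeup token tuples and Elixir pattern matching):

```python
# Sketch: the marker clause comes first, keeps the :string_symbol tag,
# and strips the meta marker; only unmarked tokens reach the keyword
# reclassification clause.
KEYWORDS = {"case", "fun", "div", "when", "end"}

def postprocess(token):
    tag, meta, value = token
    if meta.get("record_name"):            # marker clause: tried first
        return (tag, {}, value)            # drop marker, keep tag
    if tag == "string_symbol" and value in KEYWORDS:
        return ("keyword", meta, value)    # normal reclassification
    return token

assert postprocess(("string_symbol", {"record_name": True}, "case")) \
    == ("string_symbol", {}, "case")
assert postprocess(("string_symbol", {}, "case"))[0] == "keyword"
```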

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
OTP 27 added `_` as a digit-group separator in numeric literals:
`1_000_000`, `16#FF_FF`, `0.1_5e1_0`. Extend the digit character
classes so numeric tokens accept these forms.

The lexer is intentionally tolerant about position — it does not
validate that underscores only sit between digits and not at the
edges of the literal — because the lexer's job is highlighting,
not validation. The compiler will reject malformed literals with
a real error.

Tighten `number_integer` to require a leading digit so a bare
underscore can't accidentally start a number; the digit tail then
absorbs further `[0-9_]+`. Weird-base integers (`16#FF_FF`) now
include `_` in the post-`#` character set.
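A regex sketch of the extended digit classes (illustrative only; the lexer builds these as NimbleParsec character classes):

```python
import re

# Sketch: a required leading digit, then a tail that tolerates `_`
# anywhere. Underscore position is deliberately not validated -- the
# compiler, not the highlighter, rejects malformed literals.
INTEGER = re.compile(r"[0-9][0-9_]*")
BASED   = re.compile(r"[0-9][0-9_]*#[0-9a-fA-F_]+")

assert INTEGER.fullmatch("1_000_000")
assert BASED.fullmatch("16#FF_FF")
assert INTEGER.match("_000") is None   # bare underscore can't start a number
```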

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`?=` is the maybe-expression match operator added in OTP 25
(stable in OTP 27). Without it in `syntax_operators`,
`X ?= Y` was lexed as two operator tokens (`?` and `=`),
which is wrong both visually and semantically — it broke
inside `maybe ... end` blocks.
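The fix boils down to longest-match ordering in the operator list, which a small sketch makes concrete (illustrative; the real list lives in `syntax_operators`):

```python
# Sketch: operators must be tried longest-first, or `?=` is consumed
# as `?` followed by `=`.
OPERATORS = sorted(["?=", "=>", ":=", "?", "="], key=len, reverse=True)

def first_operator(src):
    for op in OPERATORS:          # longer operators win
        if src.startswith(op):
            return op

assert first_operator("?= Y") == "?="   # one token, not "?" then "="
```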

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The `character` rule tokenized `$\` followed by a single byte —
fine for simple `$\n` and `$\t`, but wrong for hex (`$\xFF`,
`$\x{1F600}`), octal (`$\077`) and control (`$\^A`) forms.
Those were splitting into a partial char token plus stray name
or integer tokens, which rendered as broken syntax in the docs.

Add a dedicated `character_escape` rule that, after consuming
the leading backslash, tries the structured escapes (hex with
its two `x`-prefixed forms, octal, control) before falling back
to any single char. The order matters: `escape_hex` and
`escape_octal` must precede the single-char fallback so the
multi-character forms are consumed whole.
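The ordering requirement can be seen in a single-regex sketch (an approximation; the real rule is a NimbleParsec choice):

```python
import re

# Sketch: after `$\`, structured escapes are tried before the
# single-char fallback `.` -- otherwise `$\xFF` would stop after `x`.
CHAR = re.compile(
    r"\$\\(x\{[0-9a-fA-F]+\}|x[0-9a-fA-F]{2}|[0-7]{1,3}|\^[A-Za-z]|.)"
)

for src in [r"$\xFF", r"$\x{1F600}", r"$\077", r"$\^A", r"$\n"]:
    assert CHAR.fullmatch(src), src
```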

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Emit `:string_escape` sub-tokens inside double-quoted strings

The `triple_quoted_string` rule already emitted `:string_escape`
sub-tokens for each escape sequence inside the string body. Plain
double-quoted strings did not — they used a literal `\"` recogniser
that only stopped the closing-quote logic from triggering early,
without producing a distinct token for the escape itself. Themes
that wanted to colour escapes differently from the surrounding
string body had no token to hook on.

Replace the special-purpose `escape_double_quote` with the generic
`escaped_char`, which itself was extended to consume structured
escapes (`\xFF`, `\x{...}`, `\077`, `\^A`) whole rather than
truncating after the leading byte. `string_like` now sees the same
sub-token vocabulary in `"..."` strings and `"""..."""`
triple-quoted strings.

Existing string-escape tests updated to match the new (richer)
output.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The `function` rule eagerly matches any atom-shaped token followed
by `(` and tags it `:name_function`. For reserved words written
adjacent to `(` — most commonly `fun(X) -> ... end` — that loses
the keyword classification, because the postprocess pass only
checked `:string_symbol` tokens against the keyword list.

Add a postprocess clause that converts `:name_function` tokens
whose value is in the keyword list back to `:keyword`. Reserved
words can't legally be defined as function names in Erlang, so
this is unambiguous.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The static `@builtins` list had bit-rotted: it was missing
post-OTP-19 BIFs (`map_get/2`, `is_map_key/2`, `binary_part/2,3`,
`floor/1`, `ceil/1`, `min/2`, `max/2`, `unique_integer/{0,1}`,
`monotonic_time/{0,1}`, etc.) and contained at least one typo
(`resume_processround` — no such BIF, presumably a merge of
`resume_process` and `round`).

Replace it with a compile-time-generated list sourced from
`erl_internal:bif/2` — the same predicate the Erlang compiler
uses to decide what's auto-imported. Every rebuild of
`makeup_erlang` re-syncs the list with the OTP version we
compile against. 122 BIFs vs the previous ~85.

Also add a postprocess clause that converts `:name_function`
tokens whose value is a BIF back to `:name_builtin` (analogous
to the keyword-recovery clause). Closes makeup_erlang elixir-makeup#13:
`length(L)` and similar BIF calls now render as builtins
instead of plain function calls.

The pre-existing string-symbol → name_builtin clause was
unchanged and still applies in positions not followed by `(`
(e.g. `length` standalone in documentation prose). Both
clauses share the same `@builtins` list.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The `erl_prompt` rule used to require a literal `\n` immediately
before the prompt body. When the generic `whitespace` rule earlier
in the choice consumed a multi-character whitespace block ending in
`\n` (e.g. `"\n  \n1> ok."`), no `\n` remained at the prompt rule's
expected position and the prompt was lexed as plain
`[number_integer, operator]` instead of `:generic_prompt`. See
makeup_elixir elixir-makeup#28 for the same-shape bug.

Match any leading whitespace block that contains at least one `\n`,
which keeps the rule anchored to a line boundary while tolerating
preceding spaces / tabs / further newlines. False-positives on `1 > 2`
and `x. 1> a.` are still rejected because neither contains a `\n`
between the operand and `>`.
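A regex sketch of the relaxed anchor (approximate; the real rule is a NimbleParsec combinator and also matches the prompt body):

```python
import re

# Sketch: require at least one `\n` somewhere in the whitespace run
# preceding the prompt, instead of a literal `\n` immediately before it.
PROMPT = re.compile(r"\n\s*[0-9]+> ")

assert PROMPT.search("\n  \n1> ok.")       # tolerates interior blanks
assert PROMPT.search("1 > 2") is None      # no newline: comparison, not prompt
assert PROMPT.search("x. 1> a.") is None   # same-line prompt shape rejected
```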

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Erlang's grammar treats any identifier starting with `_` followed by
identifier characters as a variable (typically a "don't bother to
warn me about this" hint). The lexer was tokenising `_5` as
`[punctuation "_", number_integer 5]` because `_` appears in the
generic punctuation list and was matched before the variable rule.

Add a dedicated `underscore_identifier` rule that matches `_`
followed by at least one identifier character and emits `:name`,
placed before `punctuation` in the choice. Bare `_` (the wildcard
pattern) remains a punctuation token so themes can render the two
distinctly.
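The rule's shape as a regex sketch (illustrative; identifier character class simplified):

```python
import re

# Sketch: `_` followed by at least one identifier character is a
# variable-like name; bare `_` deliberately fails so it stays
# punctuation (the wildcard pattern).
UNDERSCORE_ID = re.compile(r"_[a-zA-Z0-9_@]+")

assert UNDERSCORE_ID.fullmatch("_5")
assert UNDERSCORE_ID.fullmatch("_Ignored")
assert UNDERSCORE_ID.fullmatch("_") is None
```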

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The `module_attribute` rule already accepts any atom-shaped name as
the attribute, so all current and future OTP attributes work without
lexer changes. Add an explicit list of every current attribute
(`-callback`, `-optional_callbacks`, `-on_load`, `-nifs`,
`-deprecated`, `-removed`, `-feature`, `-export_type`, `-export_record`
and `-import_record` from the native-records work, plus the
historically-supported set) and assert each one tokenises as
`:name_attribute`. Catches accidental regressions if anyone ever
narrows the rule.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`makeup_elixir` splits `@foo` and `@foo(...)` into different tokens.
The Erlang equivalent — `?FOO` vs `?FOO(args)` — used to collapse
both into `:name_constant`, and worse, the `?` operator in
`syntax_operators` was tried first in the choice and ate the leading
`?` of any macro reference, leaving `?FOO` to lex as
`[operator "?", name "FOO"]`.

Add a separate `macro_call` rule that matches `?<name>(`-style
references and emits `:name_function`, keep the existing `macro`
rule (now `:name_constant`) for parameterless references, and move
both ahead of `syntax_operators` in the choice so the operator
rule no longer captures the `?`.
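The three-way ordering can be modeled as a classifier sketch (hedged: token names mirror the Makeup tags, but the matching here is a plain regex approximation):

```python
import re

# Sketch: try `?Name(` (macro call), then `?Name` (constant reference),
# and only then the bare `?` operator, so the operator never eats the
# leading `?` of a macro reference.
MACRO_CALL = re.compile(r"\?[A-Za-z_][A-Za-z0-9_@]*(?=\()")
MACRO      = re.compile(r"\?[A-Za-z_][A-Za-z0-9_@]*")

def classify(src):
    if MACRO_CALL.match(src):
        return "name_function"
    if MACRO.match(src):
        return "name_constant"
    return "operator"

assert classify("?FOO(args)") == "name_function"
assert classify("?FOO") == "name_constant"
assert classify("?= Y") == "operator"
```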

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
OTP 27's triple-quoted-string spec extends to N quotes (N >= 3): an
opening run of N quotes on its own line opens the string and a
matching run of N quotes on its own line closes it. The lexer only
recognised N=3, so any string using a 4-quote opener (which is the
canonical way to embed a literal `"""` in the body) was lexed as
multiple unrelated tokens.

Add explicit `quadruple_quoted_string` and `quintuple_quoted_string`
rules — NimbleParsec doesn't support dynamic delimiter lengths, so
each width needs its own rule. Place the longer-quote variants
ahead of the triple-quote rule in the choice so the longest
matching opener wins.

Also extend `sigil_delimiters` with `""""\n` / `\n""""` and the
quintuple analogue (plus the matching `''''` / `'''''` variants),
so sigil-prefixed multi-quoted strings (`~b""""...""""`,
`~B""""...""""`, etc.) get the same coverage.

The sub-token vocabulary inside the body — `:string_escape` for
escape sequences, `:string_interpol` for `~p` / `~b` etc. — is
identical across all widths, since they all share the same element
list.
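The longest-opener-first requirement in a tiny sketch (illustrative; the real rules are separate fixed-width NimbleParsec combinators in the choice):

```python
# Sketch: with one rule per quote width, the wider openers must be
# tried first so a 4-quote opener isn't half-consumed by the 3-quote
# rule.
OPENERS = ['"""""', '""""', '"""']   # longest first

def opener(src):
    for q in OPENERS:
        if src.startswith(q):
            return q

assert opener('""""\nbody with """ inside\n""""') == '""""'
assert opener('"""\nplain\n"""') == '"""'
```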

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Lock OTP 27 sigil-delimiter spec coverage with tests

The spec at https://www.erlang.org/doc/system/data_types.html#sigil
defines the allowed sigil delimiters as:

* pair forms: `()` `[]` `{}` `<>`
* symmetric forms: `/` `|` `'` `"` `` ` `` `#`
* triple-quote forms: `"""` `'''` (with quad/quint extensions for
  bodies that need to contain a literal `"""` / `""""`)

The current `sigil_delimiters` list already covers every entry, but
nothing locked the coverage. Add per-delimiter tests so a future
narrowing of the list trips a test rather than silently dropping a
valid sigil form.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Doc attributes are nearly universal in OTP 27+ modules and the
canonical use case for triple-quoted strings. Lock coverage of
the common shapes: triple-quoted body, single-line string body,
and a `-doc """..."""` attribute followed by a function clause
(which exercises the boundary between the doc string close `"""`
and the function head).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Function-head guards exercise the interaction between several rule
families: keyword recognition (`when`), word operators
(`andalso`, `orelse`), comparison operators (`>`, `<`, `=/=`),
BIF recognition (`is_integer`, `is_atom`), and the comma/semicolon
guard separator. Lock the common shapes so a regression in any
one of those would surface as a guard test failure.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Map comprehensions (OTP 26) and bitstring comprehensions
(pre-existing, but with scarce tests) exercise interactions between
several operator and punctuation tokens that the lexer hasn't
explicitly tested in combination: `=>` and `:=` next to `||`,
`<-`, `<=`, and the `#{...}` map-open punctuation. Also lock
strict-generator `<:-` (OTP 27) coverage with an explicit
positive test rather than the operator-list catch-all.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Most lexer tests are minimal isolated inputs that pin one rule's
output. The richer interaction-shaped failures (a rule's order in
the choice perturbing how a sibling rule fires) need a test that
threads many features through one input. Add a small module
fragment that combines:

* `-module` / `-export` attributes
* a `-doc """..."""` doc attribute with multi-line body
* a function head with a `when` guard and BIF call
* a map comprehension (`#{K => V || K := V <- M, ...}`)
* a body with comparison operator and number

If a future change breaks any of those rules' interactions, this
test catches it whereas the per-feature tests would still pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@garazdawi
Contributor Author

/cc @bjorng

