
Enhance lexer for native records and related syntax in OTP 29#34

Open
garazdawi wants to merge 15 commits into elixir-makeup:master from garazdawi:master

Conversation

@garazdawi
Contributor

I started by adding native record support and then decided to do a full facelift of many of the features that have landed in the last few releases, fixing some bugs along the way. This PR adds support for:

  • native records
  • triple (and quadruple) quoted strings
  • numbers with underscore digit separators (`1_000_000`)

and a bunch of small fixes.

garazdawi and others added 15 commits May 6, 2026 20:48
Native records introduced new syntactic forms in Erlang/OTP 29
(erlang/otp PR #11090). The shape that the existing record rule
does not match is the external construction / pattern / field
access form `#Module:Name{...}` and `#Module:Name.field`, where
`Module:Name` appears between the `#` and the `{` (or `.`).

Tokenize the module qualifier as `:name_class` (matching the
existing namespace pattern) and the record name as the existing
record-name token (`:string_symbol`). Local construction
(`#Name{...}`) is identical in shape to a tuple-based record and
needs no change — the lexer cannot disambiguate from local
context, so both colour the same way.

The other native-record forms — `-record #Name{...}.` definition
attribute, `-export_record([...])`, `-import_record(Mod, [...])`
— already tokenize correctly under the existing module_attribute
rule (the attribute name is captured generically and `(` is
already optional). Tests added to lock the expected output.
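A minimal regex sketch of the new external shape (not the actual NimbleParsec rule; name classes are simplified to lowercase atoms, and the module/name groups are illustrative):

```python
import re

# Sketch: the external native-record form is `#Module:Name` followed by
# `{` (construction/pattern) or `.` (field access). The lookahead keeps
# the `{` / `.` out of the matched name, as the lexer tokenizes it
# separately.
EXTERNAL = re.compile(
    r"#(?P<module>[a-z][a-zA-Z0-9_@]*):(?P<name>[a-z][a-zA-Z0-9_@]*)(?=[{.])"
)

m = EXTERNAL.match("#mymod:vector{x = 1}")
print(m.group("module"), m.group("name"))
```

The same pattern matches the field-access form `#mymod:vector.x`, since the lookahead accepts either `{` or `.`.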

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Add tests for native record patterns and updates

The native-records commit covered construction (`#mod:name{f=v}`)
but not the symmetric forms: pattern matching
(`#mod:name{f = X} = Y`) and updates via a prefixed variable
(`Y#mod:name{f = 2}`). Both already work via the same rule but
nothing tested them; lock the coverage so a future ordering tweak
in the choice can't silently regress them.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Accept variable-shape and keyword names in native records

OTP 29's native-record syntax relaxes the record-name rule: per the
spec at https://www.erlang.org/doc/system/data_types.html, "it is
not necessary to quote atoms that look like variable names or
keywords." So `#State{}`, `#div{}`, `#case{}`, `#fun{}` are all
valid record references even though `State` is variable-shape and
`div`/`case`/`fun` are reserved words.

The previous rule used `atom_name` only, which requires lowercase
or quoted. `#State{}` therefore fell through to the punctuation /
variable rules and produced disjoint tokens with no record-shape
grouping. Add a `record_name` combinator that accepts either
`atom_name` or `variable_name`, both tagged `:string_symbol`, and
use it in both `record` and `native_record_external`. Tuple-based
records don't actually allow these forms, but the lexer can't tell
the two record kinds apart from local context — so accept the
union.

Keyword and word-operator names (`#case{}`, `#div{}`, `#fun{}`)
still get re-tagged by postprocess to `:keyword` /
`:operator_word`. That's accepted output: the surrounding `#...{`
shape still groups visually as a record reference, and themes that
care can render keywords-in-record-position differently if
desired.

Tests cover all four name shapes (lowercase atom, variable,
keyword, quoted) in three positions (local construction, external
construction, definition attribute), including the OTP 29 spec
example `-record #vector{x = 0.0, y = 0.0}.`.
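A rough sketch of the relaxed name rule as a regex union (purely illustrative; the real `record_name` is a NimbleParsec combinator and quoted atoms are elided here):

```python
import re

# Sketch: a record name may now be atom-shaped (lowercase start),
# variable-shaped (uppercase/underscore start), or a reserved word,
# which is atom-shaped anyway. Unioning the two shapes covers all of
# `#state{`, `#State{`, `#case{`, `#div{`.
RECORD_NAME = re.compile(r"#([a-zA-Z_][a-zA-Z0-9_@]*)\{")

for src in ["#state{", "#State{", "#case{", "#div{", "#fun{"]:
    assert RECORD_NAME.match(src), src
```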

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Don't reclassify keyword / builtin record names in postprocess

When `#case{}` is a record reference, the existing postprocess
clauses re-tagged the inner `:string_symbol "case"` as `:keyword`,
because the conversion was unconditional on the value matching
the keyword list. Visually that meant the record-name slot
flipped colour depending on whether the chosen name happened to
be a reserved word — confusing for readers, and wrong because the
text in that position is a record name, not an expression keyword.

Tag record-name tokens with a `record_name: true` meta marker via
the `record_name` combinator. Add a postprocess clause that
matches the marker and bypasses the keyword / builtin /
word-operator reclassification, then strips the marker so it
doesn't leak into rendered tokens. Both `:string_symbol` clauses
(keyword, builtin, word-operator) are guarded by the marker check
implicitly because pattern matching is order-sensitive and the
marker clause comes first.

Tests for `#case{}`, `#fun{}`, `#div{}`, and `#mod:case{}` now
assert `:string_symbol` for the record name (matching the
lowercase-name behaviour) instead of `:keyword` /
`:operator_word`. A new test verifies the marker doesn't leak
through to output meta.
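The clause ordering can be modeled like this (a hedged sketch with invented token shapes; the real lexer uses Makeup token tuples and Elixir pattern matching):

```python
# Sketch: the marker clause comes first, keeps the :string_symbol tag,
# and strips the meta marker; only unmarked tokens reach the keyword
# reclassification clause.
KEYWORDS = {"case", "fun", "div", "when", "end"}

def postprocess(token):
    tag, meta, value = token
    if meta.get("record_name"):            # marker clause: tried first
        return (tag, {}, value)            # drop marker, keep tag
    if tag == "string_symbol" and value in KEYWORDS:
        return ("keyword", meta, value)    # normal reclassification
    return token

assert postprocess(("string_symbol", {"record_name": True}, "case")) \
    == ("string_symbol", {}, "case")
assert postprocess(("string_symbol", {}, "case"))[0] == "keyword"
```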

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
OTP 27 added `_` as a digit-group separator in numeric literals:
`1_000_000`, `16#FF_FF`, `0.1_5e1_0`. Extend the digit character
classes so numeric tokens accept these forms.

The lexer is intentionally tolerant about position — it does not
validate that underscores only sit between digits and not at the
edges of the literal — because the lexer's job is highlighting,
not validation. The compiler will reject malformed literals with
a real error.

Tighten `number_integer` to require a leading digit so a bare
underscore can't accidentally start a number; the digit tail then
absorbs further `[0-9_]+`. Weird-base integers (`16#FF_FF`) now
include `_` in the post-`#` character set.
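A regex sketch of the extended digit classes (illustrative only; the lexer builds these as NimbleParsec character classes):

```python
import re

# Sketch: a required leading digit, then a tail that tolerates `_`
# anywhere. Underscore position is deliberately not validated -- the
# compiler, not the highlighter, rejects malformed literals.
INTEGER = re.compile(r"[0-9][0-9_]*")
BASED   = re.compile(r"[0-9][0-9_]*#[0-9a-fA-F_]+")

assert INTEGER.fullmatch("1_000_000")
assert BASED.fullmatch("16#FF_FF")
assert INTEGER.match("_000") is None   # bare underscore can't start a number
```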

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`?=` is the maybe-expression match operator added in OTP 25
(stable in OTP 27). Without it in `syntax_operators`,
`X ?= Y` was lexed as two operator tokens (`?` and `=`),
which is wrong both visually and semantically — it broke
inside `maybe ... end` blocks.
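The fix boils down to longest-match ordering in the operator list, which a small sketch makes concrete (illustrative; the real list lives in `syntax_operators`):

```python
# Sketch: operators must be tried longest-first, or `?=` is consumed
# as `?` followed by `=`.
OPERATORS = sorted(["?=", "=>", ":=", "?", "="], key=len, reverse=True)

def first_operator(src):
    for op in OPERATORS:          # longer operators win
        if src.startswith(op):
            return op

assert first_operator("?= Y") == "?="   # one token, not "?" then "="
```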

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The `character` rule tokenized `$\` followed by a single byte —
fine for simple `$\n` and `$\t`, but wrong for hex (`$\xFF`,
`$\x{1F600}`), octal (`$\077`) and control (`$\^A`) forms.
Those were splitting into a partial char token plus stray name
or integer tokens, which rendered as broken syntax in the docs.

Add a dedicated `character_escape` rule that, after consuming
the leading backslash, tries the structured escapes (hex with
its two `x`-prefixed forms, octal, control) before falling back
to any single char. The order matters: `escape_hex` and
`escape_octal` must precede the single-char fallback so the
multi-character forms are consumed whole.
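The ordering requirement can be seen in a single-regex sketch (an approximation; the real rule is a NimbleParsec choice):

```python
import re

# Sketch: after `$\`, structured escapes are tried before the
# single-char fallback `.` -- otherwise `$\xFF` would stop after `x`.
CHAR = re.compile(
    r"\$\\(x\{[0-9a-fA-F]+\}|x[0-9a-fA-F]{2}|[0-7]{1,3}|\^[A-Za-z]|.)"
)

for src in [r"$\xFF", r"$\x{1F600}", r"$\077", r"$\^A", r"$\n"]:
    assert CHAR.fullmatch(src), src
```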

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Emit `:string_escape` sub-tokens inside double-quoted strings

The `triple_quoted_string` rule already emitted `:string_escape`
sub-tokens for each escape sequence inside the string body. Plain
double-quoted strings did not — they used a literal `\"` recogniser
that only stopped the closing-quote logic from triggering early,
without producing a distinct token for the escape itself. Themes
that wanted to colour escapes differently from the surrounding
string body had no token to hook on.

Replace the special-purpose `escape_double_quote` with the generic
`escaped_char`, which itself was extended to consume structured
escapes (`\xFF`, `\x{...}`, `\077`, `\^A`) whole rather than
truncating after the leading byte. `string_like` now sees the same
sub-token vocabulary in `"..."` strings and `"""..."""`
triple-quoted strings.

Existing string-escape tests updated to match the new (richer)
output.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The `function` rule eagerly matches any atom-shaped token followed
by `(` and tags it `:name_function`. For reserved words written
adjacent to `(` — most commonly `fun(X) -> ... end` — that loses
the keyword classification, because the postprocess pass only
checked `:string_symbol` tokens against the keyword list.

Add a postprocess clause that converts `:name_function` tokens
whose value is in the keyword list back to `:keyword`. Reserved
words can't legally be defined as function names in Erlang, so
this is unambiguous.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The static `@builtins` list had bit-rotted: it was missing
post-OTP-19 BIFs (`map_get/2`, `is_map_key/2`, `binary_part/2,3`,
`floor/1`, `ceil/1`, `min/2`, `max/2`, `unique_integer/{0,1}`,
`monotonic_time/{0,1}`, etc.) and contained at least one typo
(`resume_processround` — no such BIF, presumably a merge of
`resume_process` and `round`).

Replace it with a compile-time-generated list sourced from
`erl_internal:bif/2` — the same predicate the Erlang compiler
uses to decide what's auto-imported. Every rebuild of
`makeup_erlang` re-syncs the list with the OTP version we
compile against. 122 BIFs vs the previous ~85.

Also add a postprocess clause that converts `:name_function`
tokens whose value is a BIF back to `:name_builtin` (analogous
to the keyword-recovery clause). Closes makeup_erlang elixir-makeup#13:
`length(L)` and similar BIF calls now render as builtins
instead of plain function calls.

The pre-existing string-symbol → name_builtin clause was
unchanged and still applies in positions not followed by `(`
(e.g. `length` standalone in documentation prose). Both
clauses share the same `@builtins` list.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The `erl_prompt` rule used to require a literal `\n` immediately
before the prompt body. When the generic `whitespace` rule earlier
in the choice consumed a multi-character whitespace block ending in
`\n` (e.g. `"\n  \n1> ok."`), no `\n` remained at the prompt rule's
expected position and the prompt was lexed as plain
`[number_integer, operator]` instead of `:generic_prompt`. See
makeup_elixir elixir-makeup#28 for the same-shape bug.

Match any leading whitespace block that contains at least one `\n`,
which keeps the rule anchored to a line boundary while tolerating
preceding spaces / tabs / further newlines. False-positives on `1 > 2`
and `x. 1> a.` are still rejected because neither contains a `\n`
between the operand and `>`.
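A regex sketch of the relaxed anchor (approximate; the real rule is a NimbleParsec combinator and also matches the prompt body):

```python
import re

# Sketch: require at least one `\n` somewhere in the whitespace run
# preceding the prompt, instead of a literal `\n` immediately before it.
PROMPT = re.compile(r"\n\s*[0-9]+> ")

assert PROMPT.search("\n  \n1> ok.")       # tolerates interior blanks
assert PROMPT.search("1 > 2") is None      # no newline: comparison, not prompt
assert PROMPT.search("x. 1> a.") is None   # same-line prompt shape rejected
```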

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Erlang's grammar treats any identifier starting with `_` followed by
identifier characters as a variable (typically a "don't bother to
warn me about this" hint). The lexer was tokenising `_5` as
`[punctuation "_", number_integer 5]` because `_` appears in the
generic punctuation list and was matched before the variable rule.

Add a dedicated `underscore_identifier` rule that matches `_`
followed by at least one identifier character and emits `:name`,
placed before `punctuation` in the choice. Bare `_` (the wildcard
pattern) remains a punctuation token so themes can render the two
distinctly.
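The rule's shape as a regex sketch (illustrative; identifier character class simplified):

```python
import re

# Sketch: `_` followed by at least one identifier character is a
# variable-like name; bare `_` deliberately fails so it stays
# punctuation (the wildcard pattern).
UNDERSCORE_ID = re.compile(r"_[a-zA-Z0-9_@]+")

assert UNDERSCORE_ID.fullmatch("_5")
assert UNDERSCORE_ID.fullmatch("_Ignored")
assert UNDERSCORE_ID.fullmatch("_") is None
```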

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The `module_attribute` rule already accepts any atom-shaped name as
the attribute, so all current and future OTP attributes work without
lexer changes. Add an explicit list of every current attribute
(`-callback`, `-optional_callbacks`, `-on_load`, `-nifs`,
`-deprecated`, `-removed`, `-feature`, `-export_type`, `-export_record`
and `-import_record` from the native-records work, plus the
historically-supported set) and assert each one tokenises as
`:name_attribute`. Catches accidental regressions if anyone ever
narrows the rule.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`makeup_elixir` splits `@foo` and `@foo(...)` into different tokens.
The Erlang equivalent — `?FOO` vs `?FOO(args)` — used to collapse
both into `:name_constant`, and worse, the `?` operator in
`syntax_operators` was tried first in the choice and ate the leading
`?` of any macro reference, leaving `?FOO` to lex as
`[operator "?", name "FOO"]`.

Add a separate `macro_call` rule that matches `?<name>(`-style
references and emits `:name_function`, keep the existing `macro`
rule (now `:name_constant`) for parameterless references, and move
both ahead of `syntax_operators` in the choice so the operator
rule no longer captures the `?`.
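The three-way ordering can be modeled as a classifier sketch (hedged: token names mirror the Makeup tags, but the matching here is a plain regex approximation):

```python
import re

# Sketch: try `?Name(` (macro call), then `?Name` (constant reference),
# and only then the bare `?` operator, so the operator never eats the
# leading `?` of a macro reference.
MACRO_CALL = re.compile(r"\?[A-Za-z_][A-Za-z0-9_@]*(?=\()")
MACRO      = re.compile(r"\?[A-Za-z_][A-Za-z0-9_@]*")

def classify(src):
    if MACRO_CALL.match(src):
        return "name_function"
    if MACRO.match(src):
        return "name_constant"
    return "operator"

assert classify("?FOO(args)") == "name_function"
assert classify("?FOO") == "name_constant"
assert classify("?= Y") == "operator"
```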

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
OTP 27's triple-quoted-string spec extends to N quotes (N >= 3): an
opening run of N quotes on its own line opens the string and a
matching run of N quotes on its own line closes it. The lexer only
recognised N=3, so any string using a 4-quote opener (which is the
canonical way to embed a literal `"""` in the body) was lexed as
multiple unrelated tokens.

Add explicit `quadruple_quoted_string` and `quintuple_quoted_string`
rules — NimbleParsec doesn't support dynamic delimiter lengths, so
each width needs its own rule. Place the longer-quote variants
ahead of the triple-quote rule in the choice so the longest
matching opener wins.

Also extend `sigil_delimiters` with `""""\n` / `\n""""` and the
quintuple analogue (plus the matching `''''` / `'''''` variants),
so sigil-prefixed multi-quoted strings (`~b""""...""""`,
`~B""""...""""`, etc.) get the same coverage.

The sub-token vocabulary inside the body — `:string_escape` for
escape sequences, `:string_interpol` for `~p` / `~b` etc. — is
identical across all widths, since they all share the same element
list.
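The longest-opener-first requirement in a tiny sketch (illustrative; the real rules are separate fixed-width NimbleParsec combinators in the choice):

```python
# Sketch: with one rule per quote width, the wider openers must be
# tried first so a 4-quote opener isn't half-consumed by the 3-quote
# rule.
OPENERS = ['"""""', '""""', '"""']   # longest first

def opener(src):
    for q in OPENERS:
        if src.startswith(q):
            return q

assert opener('""""\nbody with """ inside\n""""') == '""""'
assert opener('"""\nplain\n"""') == '"""'
```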

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Lock OTP 27 sigil-delimiter spec coverage with tests

The spec at https://www.erlang.org/doc/system/data_types.html#sigil
defines the allowed sigil delimiters as:

* pair forms: `()` `[]` `{}` `<>`
* symmetric forms: `/` `|` `'` `"` `` ` `` `#`
* triple-quote forms: `"""` `'''` (with quad/quint extensions for
  bodies that need to contain a literal `"""` / `""""`)

The current `sigil_delimiters` list already covers every entry, but
nothing locked the coverage. Add per-delimiter tests so a future
narrowing of the list trips a test rather than silently dropping a
valid sigil form.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Doc attributes are nearly universal in OTP 27+ modules and the
canonical use case for triple-quoted strings. Lock coverage of
the common shapes: triple-quoted body, single-line string body,
and a `-doc """..."""` attribute followed by a function clause
(which exercises the boundary between the doc string close `"""`
and the function head).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Function-head guards exercise the interaction between several rule
families: keyword recognition (`when`), word operators
(`andalso`, `orelse`), comparison operators (`>`, `<`, `=/=`),
BIF recognition (`is_integer`, `is_atom`), and the comma/semicolon
guard separator. Lock the common shapes so a regression in any
one of those would surface as a guard test failure.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Map comprehensions (OTP 26) and bitstring comprehensions
(pre-existing, but with scarce tests) exercise interactions between
several operator and punctuation tokens that the lexer hasn't
explicitly tested in combination: `=>` and `:=` next to `||`,
`<-`, `<=`, and the `#{...}` map-open punctuation. Also lock
strict-generator `<:-` (OTP 27) coverage with an explicit
positive test rather than the operator-list catch-all.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Most lexer tests are minimal isolated inputs that pin one rule's
output. The richer interaction-shaped failures (a rule's order in
the choice perturbing how a sibling rule fires) need a test that
threads many features through one input. Add a small module
fragment that combines:

* `-module` / `-export` attributes
* a `-doc """..."""` doc attribute with multi-line body
* a function head with a `when` guard and BIF call
* a map comprehension (`#{K => V || K := V <- M, ...}`)
* a body with comparison operator and number

If a future change breaks any of those rules' interactions, this
test catches it whereas the per-feature tests would still pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@garazdawi
Contributor Author

/cc @bjorng

