Enhance lexer for native records and related syntax in OTP 29 #34
Open
garazdawi wants to merge 15 commits into elixir-makeup:master from
Conversation
Native records introduced new syntactic forms in Erlang/OTP 29
(erlang/otp PR #11090). The shape that the existing record rule
does not match is the external construction / pattern / field
access form `#Module:Name{...}` and `#Module:Name.field`, where
`Module:Name` appears between the `#` and the `{` (or `.`).
Tokenize the module qualifier as `:name_class` (matching the
existing namespace pattern) and the record name as the existing
record-name token (`:string_symbol`). Local construction
(`#Name{...}`) is identical in shape to a tuple-based record and
needs no change — the lexer cannot disambiguate from local
context, so both colour the same way.
The other native-record forms — `-record #Name{...}.` definition
attribute, `-export_record([...])`, `-import_record(Mod, [...])`
— already tokenize correctly under the existing module_attribute
rule (the attribute name is captured generically and `(` is
already optional). Tests added to lock the expected output.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
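The external construction / field-access shape described above can be sketched as a Python regex (an illustration only: the real rule is a NimbleParsec combinator, and the atom character class here is an approximation):

```python
import re

# Hypothetical sketch of the external native-record shape:
# '#' Module ':' Name followed by '{' (construction/pattern) or '.' (field access).
ATOM = r"[a-z][a-zA-Z0-9_@]*"
EXTERNAL_RECORD = re.compile(rf"#(?P<module>{ATOM}):(?P<name>{ATOM})(?=[{{.])")

m = EXTERNAL_RECORD.match("#mod:vec{x = 1}")
assert m and m.group("module") == "mod" and m.group("name") == "vec"

# Field access uses the same prefix, with '.' instead of '{':
m = EXTERNAL_RECORD.match("#mod:vec.x")
assert m and m.group("name") == "vec"
```

In the lexer the `module` slot is emitted as `:name_class` and the `name` slot as `:string_symbol`; the sketch only shows which span of input each slot covers.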
Add tests for native record patterns and updates
The native-records commit covered construction (`#mod:name{f=v}`)
but not the symmetric forms: pattern matching
(`#mod:name{f = X} = Y`) and updates via a prefixed variable
(`Y#mod:name{f = 2}`). Both already work via the same rule but
nothing tested them; lock the coverage so a future ordering tweak
in the choice can't silently regress them.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Accept variable-shape and keyword names in native records
OTP 29's native-record syntax relaxes the record-name rule: per the
spec at https://www.erlang.org/doc/system/data_types.html, "it is
not necessary to quote atoms that look like variable names or
keywords." So `#State{}`, `#div{}`, `#case{}`, `#fun{}` are all
valid record references even though `State` is variable-shape and
`div`/`case`/`fun` are reserved words.
The previous rule used `atom_name` only, which requires lowercase
or quoted. `#State{}` therefore fell through to the punctuation /
variable rules and produced disjoint tokens with no record-shape
grouping. Add a `record_name` combinator that accepts either
`atom_name` or `variable_name`, both tagged `:string_symbol`, and
use it in both `record` and `native_record_external`. Tuple-based
records don't actually allow these forms, but the lexer can't tell
the two record kinds apart from local context — so accept the
union.
Keyword and word-operator names (`#case{}`, `#div{}`, `#fun{}`)
still get re-tagged by postprocess to `:keyword` /
`:operator_word`. That's accepted output: the surrounding `#...{`
shape still groups visually as a record reference, and themes that
care can render keywords-in-record-position differently if
desired.
Tests cover all four name shapes (lowercase atom, variable,
keyword, quoted) in three positions (local construction, external
construction, definition attribute), including the OTP 29 spec
example `-record #vector{x = 0.0, y = 0.0}.`.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
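The widened name rule can be illustrated with a Python regex sketch (the character classes and name `RECORD_NAME` are approximations of the combinator, not its actual definition):

```python
import re

# Sketch: a record name is either an atom-shaped name (lowercase or quoted)
# or a variable-shaped name, per the relaxed OTP 29 rule described above.
ATOM_NAME = r"(?:[a-z][a-zA-Z0-9_@]*|'[^']*')"
VARIABLE_NAME = r"[A-Z_][a-zA-Z0-9_@]*"
RECORD_NAME = re.compile(rf"#(?:{ATOM_NAME}|{VARIABLE_NAME})(?={{)")

for src in ("#state{", "#State{", "#case{", "#'weird name'{"):
    assert RECORD_NAME.match(src), src
```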
Don't reclassify keyword / builtin record names in postprocess
When `#case{}` is a record reference, the existing postprocess
clauses re-tagged the inner `:string_symbol "case"` as `:keyword`,
because the conversion was unconditional on the value matching
the keyword list. Visually that meant the record-name slot
flipped colour depending on whether the chosen name happened to
be a reserved word — confusing for readers, and wrong because the
text in that position is a record name, not an expression keyword.
Tag record-name tokens with a `record_name: true` meta marker via
the `record_name` combinator. Add a postprocess clause that
matches the marker and bypasses the keyword / builtin /
word-operator reclassification, then strips the marker so it
doesn't leak into rendered tokens. Both `:string_symbol` clauses
(keyword, builtin, word-operator) are guarded by the marker check
implicitly because pattern matching is order-sensitive and the
marker clause comes first.
Tests for `#case{}`, `#fun{}`, `#div{}`, and `#mod:case{}` now
assert `:string_symbol` for the record name (matching the
lowercase-name behaviour) instead of `:keyword` /
`:operator_word`. A new test verifies the marker doesn't leak
through to output meta.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
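The order-sensitive bypass can be sketched in Python (token shape and names are illustrative; the real pass is Elixir pattern-matching clauses):

```python
# Sketch: a record_name-marked string_symbol is passed through (with the
# marker stripped) BEFORE the generic keyword clause can reclassify it.
KEYWORDS = {"case", "fun", "if", "when", "end"}

def postprocess(tokens):
    out = []
    for ttype, meta, value in tokens:
        if ttype == "string_symbol" and meta.get("record_name"):
            # Marker clause comes first: keep string_symbol, strip the marker.
            meta = {k: v for k, v in meta.items() if k != "record_name"}
            out.append((ttype, meta, value))
        elif ttype == "string_symbol" and value in KEYWORDS:
            # Ordinary reclassification for expression-position keywords.
            out.append(("keyword", meta, value))
        else:
            out.append((ttype, meta, value))
    return out

toks = postprocess([
    ("string_symbol", {"record_name": True}, "case"),  # name inside #case{}
    ("string_symbol", {}, "case"),                     # expression keyword
])
assert toks[0] == ("string_symbol", {}, "case")
assert toks[1][0] == "keyword"
```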
OTP 27 added `_` as a digit-group separator in numeric literals: `1_000_000`, `16#FF_FF`, `0.1_5e1_0`. Extend the digit character classes so numeric tokens accept these forms. The lexer is intentionally tolerant about position — it does not validate that underscores only sit between digits and not at the edges of the literal — because the lexer's job is highlighting, not validation. The compiler will reject malformed literals with a real error.

Tighten `number_integer` to require a leading digit so a bare underscore can't accidentally start a number; the digit tail then absorbs further `[0-9_]+`. Weird-base integers (`16#FF_FF`) now include `_` in the post-`#` character set.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
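The widened classes can be sketched as Python regexes (illustrative only; `DECIMAL` and `BASED` are hypothetical names, not the lexer's rules):

```python
import re

# Sketch: leading digit required, then a tail that may contain '_';
# based integers allow '_' in the post-'#' character set too.
DECIMAL = re.compile(r"[0-9][0-9_]*")
BASED = re.compile(r"[0-9][0-9_]*#[0-9a-fA-F_]+")

assert DECIMAL.fullmatch("1_000_000")
assert BASED.fullmatch("16#FF_FF")
assert not DECIMAL.match("_500")  # a bare underscore can't start a number
```

Note the sketch mirrors the deliberate tolerance described above: a trailing or doubled underscore still matches, and is left for the compiler to reject.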
`?=` is the maybe-expression match operator added in OTP 25 (stable in OTP 27). Without it in `syntax_operators`, `X ?= Y` was lexed as two operator tokens (`?` and `=`), which is wrong both visually and semantically — it broke inside `maybe ... end` blocks.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The `character` rule tokenized `$\` followed by a single byte —
fine for simple `$\n`, `$\t`, but wrong for hex (`$\xFF`,
`$\x{1F600}`), octal (`$\077`) and control (`$\^A`) forms.
Those were splitting into a partial char-token plus stray name
or integer tokens, which rendered as broken syntax in the docs.
Add a dedicated `character_escape` rule that, after consuming
the leading backslash, tries the structured escapes (hex with
its two `x`-prefixed forms, octal, control) before falling back
to any single char. The order matters: `escape_hex` and
`escape_octal` must precede the single-char fallback so the
multi-character forms are consumed whole.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
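The ordered alternation can be sketched as a single Python regex, where alternation order plays the same role as the combinator's choice order (names and classes are illustrative):

```python
import re

# Sketch: structured escapes (hex, octal, control) are tried before the
# single-char fallback so multi-character forms are consumed whole.
ESCAPE = re.compile(
    r"\\(?:"
    r"x\{[0-9a-fA-F]+\}"   # $\x{1F600}
    r"|x[0-9a-fA-F]{2}"    # $\xFF
    r"|[0-7]{1,3}"         # $\077
    r"|\^[a-zA-Z]"         # $\^A
    r"|."                  # fallback: any single char ($\n, $\t, ...)
    r")"
)

for src in (r"\x{1F600}", r"\xFF", r"\077", r"\^A", r"\n"):
    m = ESCAPE.match(src)
    assert m and m.group(0) == src, src
```

Swapping the fallback branch to the front would reproduce the bug described above: `\xFF` would match as just `\x`, leaving `FF` to lex as a stray name.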
Emit `:string_escape` sub-tokens inside double-quoted strings
The `triple_quoted_string` rule already emitted `:string_escape`
sub-tokens for each escape sequence inside the string body. Plain
double-quoted strings did not — they used a literal `\"` recogniser
that only stopped the closing-quote logic from triggering early,
without producing a distinct token for the escape itself. Themes
that wanted to colour escapes differently from the surrounding
string body had no token to hook on.
Replace the special-purpose `escape_double_quote` with the generic
`escaped_char`, which itself was extended to consume structured
escapes (`\xFF`, `\x{...}`, `\077`, `\^A`) whole rather than
truncating after the leading byte. `string_like` now sees the same
sub-token vocabulary in `"..."` strings and `"""..."""`
triple-quoted strings.
Existing string-escape tests updated to match the new (richer)
output.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The `function` rule eagerly matches any atom-shaped token followed by `(` and tags it `:name_function`. For reserved words written adjacent to `(` — most commonly `fun(X) -> ... end` — that loses the keyword classification, because the postprocess pass only checked `:string_symbol` tokens against the keyword list.

Add a postprocess clause that converts `:name_function` tokens whose value is in the keyword list back to `:keyword`. Reserved words can't legally be defined as function names in Erlang, so this is unambiguous.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
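The recovery clause can be sketched in Python (the simplified two-element token shape and the function name are illustrative):

```python
# Sketch: a name_function token whose value is a reserved word is
# converted back to keyword, since reserved words can't be function names.
KEYWORDS = {"fun", "case", "if", "receive", "try", "begin"}

def recover_keywords(tokens):
    return [
        ("keyword", value) if ttype == "name_function" and value in KEYWORDS
        else (ttype, value)
        for ttype, value in tokens
    ]

# In `fun(X) -> X end`, the function rule had tagged `fun` as name_function:
toks = recover_keywords([("name_function", "fun"), ("punctuation", "(")])
assert toks[0] == ("keyword", "fun")
```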
The static `@builtins` list had bit-rotted: it was missing
post-OTP-19 BIFs (`map_get/2`, `is_map_key/2`, `binary_part/2,3`,
`floor/1`, `ceil/1`, `min/2`, `max/2`, `unique_integer/{0,1}`,
`monotonic_time/{0,1}`, etc.) and contained at least one typo
(`resume_processround` — no such BIF, presumably a merge of
`resume_process` and `round`).
Replace it with a compile-time-generated list sourced from
`erl_internal:bif/2` — the same predicate the Erlang compiler
uses to decide what's auto-imported. Every rebuild of
`makeup_erlang` re-syncs the list with the OTP version we
compile against. 122 BIFs vs the previous ~85.
Also add a postprocess clause that converts `:name_function`
tokens whose value is a BIF back to `:name_builtin` (analogous
to the keyword-recovery clause). Closes elixir-makeup/makeup_erlang#13:
`length(L)` and similar BIF calls now render as builtins
instead of plain function calls.
The pre-existing string-symbol → name_builtin clause was
unchanged and still applies in positions not followed by `(`
(e.g. `length` standalone in documentation prose). Both
clauses share the same `@builtins` list.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The `erl_prompt` rule used to require a literal `\n` immediately before the prompt body. When the generic `whitespace` rule earlier in the choice consumed a multi-character whitespace block ending in `\n` (e.g. `"\n \n1> ok."`), no `\n` remained at the prompt rule's expected position and the prompt was lexed as plain `[number_integer, operator]` instead of `:generic_prompt`. See elixir-makeup/makeup_elixir#28 for the same-shape bug.

Match any leading whitespace block that contains at least one `\n`, which keeps the rule anchored to a line boundary while tolerating preceding spaces / tabs / further newlines. False positives on `1 > 2` and `x. 1> a.` are still rejected because neither contains a `\n` between the operand and `>`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
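The relaxed anchor can be sketched as a Python regex (an approximation of the rule's shape, not the NimbleParsec definition):

```python
import re

# Sketch: any leading whitespace block containing at least one '\n',
# then the numeric shell prompt `N>`.
PROMPT = re.compile(r"[ \t\n]*\n[ \t]*(?P<prompt>\d+>)")

assert PROMPT.match("\n \n1> ok.")    # whitespace block ending in '\n'
assert not PROMPT.match("1 > 2")      # no newline: stays a comparison
assert not PROMPT.match("x. 1> a.")   # no newline before '1>' either
```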
Erlang's grammar treats any identifier starting with `_` followed by identifier characters as a variable (typically a "don't bother to warn me about this" hint). The lexer was tokenising `_5` as `[punctuation "_", number_integer 5]` because `_` appears in the generic punctuation list and was matched before the variable rule.

Add a dedicated `underscore_identifier` rule that matches `_` followed by at least one identifier character and emits `:name`, placed before `punctuation` in the choice. Bare `_` (the wildcard pattern) remains a punctuation token so themes can render the two distinctly.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
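The rule's shape can be sketched as a Python regex (illustrative; the identifier character class is an approximation):

```python
import re

# Sketch: '_' plus at least one identifier character is a variable-shaped
# name; a bare '_' deliberately does NOT match, so it stays punctuation.
UNDERSCORE_IDENT = re.compile(r"_[a-zA-Z0-9_@]+")

assert UNDERSCORE_IDENT.fullmatch("_5")
assert UNDERSCORE_IDENT.fullmatch("_Ignored")
assert not UNDERSCORE_IDENT.fullmatch("_")  # wildcard: not this rule
```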
The `module_attribute` rule already accepts any atom-shaped name as the attribute, so all current and future OTP attributes work without lexer changes.

Add an explicit list of every current attribute (`-callback`, `-optional_callbacks`, `-on_load`, `-nifs`, `-deprecated`, `-removed`, `-feature`, `-export_type`, `-export_record` and `-import_record` from the native-records work, plus the historically supported set) and assert each one tokenises as `:name_attribute`. This catches accidental regressions if anyone ever narrows the rule.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`makeup_elixir` splits `@foo` and `@foo(...)` into different tokens. The Erlang equivalent — `?FOO` vs `?FOO(args)` — used to collapse both into `:name_constant`, and worse, the `?` operator in `syntax_operators` was tried first in the choice and ate the leading `?` of any macro reference, leaving `?FOO` to lex as `[operator "?", name "FOO"]`.

Add a separate `macro_call` rule that matches `?<name>(`-style references and emits `:name_function`, keep the existing `macro` rule (now `:name_constant`) for parameterless references, and move both ahead of `syntax_operators` in the choice so the operator rule no longer captures the `?`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
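The try-longest-first ordering can be sketched in Python (the `classify` helper and name classes are illustrative stand-ins for the choice order):

```python
import re

# Sketch: try `?Name(` (macro call) first, then bare `?Name` (constant),
# and only then fall back to treating `?` as an operator.
MACRO_CALL = re.compile(r"\?[A-Za-z_][A-Za-z0-9_@]*(?=\()")
MACRO = re.compile(r"\?[A-Za-z_][A-Za-z0-9_@]*")

def classify(src):
    if MACRO_CALL.match(src):
        return "name_function"
    if MACRO.match(src):
        return "name_constant"
    if src.startswith("?"):
        return "operator"
    return None

assert classify("?FOO(1, 2)") == "name_function"
assert classify("?FOO") == "name_constant"
assert classify("?= Y") == "operator"  # `?=` still reaches the operator rule
```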
OTP 27's triple-quoted-string spec extends to N quotes (N >= 3): an opening run of N quotes on its own line opens the string and a matching run of N quotes on its own line closes it. The lexer only recognised N=3, so any string using a 4-quote opener (which is the canonical way to embed a literal `"""` in the body) was lexed as multiple unrelated tokens.

Add explicit `quadruple_quoted_string` and `quintuple_quoted_string` rules — NimbleParsec doesn't support dynamic delimiter lengths, so each width needs its own rule. Place the longer-quote variants ahead of the triple-quote rule in the choice so the longest matching opener wins. Also extend `sigil_delimiters` with `""""\n` / `\n""""` and the quintuple analogue (plus the matching `''''` / `'''''` variants), so sigil-prefixed multi-quoted strings (`~b""""..."""" `, `~B""""..."""" `, etc.) get the same coverage.

The sub-token vocabulary inside the body — `:string_escape` for escape sequences, `:string_interpol` for `~p` / `~b` etc. — is identical across all widths, since they all share the same element list.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Lock OTP 27 sigil-delimiter spec coverage with tests

The spec at https://www.erlang.org/doc/system/data_types.html#sigil defines the allowed sigil delimiters as:

* pair forms: `()` `[]` `{}` `<>`
* symmetric forms: `/` `|` `'` `"` `` ` `` `#`
* triple-quote forms: `"""` `'''` (with quad/quint extensions for bodies that need to contain a literal `"""` / `""""`)

The current `sigil_delimiters` list already covers every entry, but nothing locked the coverage. Add per-delimiter tests so a future narrowing of the list trips a test rather than silently dropping a valid sigil form.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
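One fixed-width variant of the multi-quote rule can be sketched as a Python regex (illustrative; the real lexer builds one NimbleParsec rule per width rather than a regex):

```python
import re

# Sketch of the 4-quote width: a 4-quote opener at a line boundary, a body,
# then a matching 4-quote run on its own line. The body may freely contain
# a literal three-quote run, which is the point of the wider delimiter.
QUAD_QUOTED = re.compile(r'""""\n(?P<body>.*?)\n\s*""""', re.DOTALL)

src = '""""\ncontains a literal """ inside\n""""'
m = QUAD_QUOTED.match(src)
assert m and '"""' in m.group("body")
```

Trying this width before the 3-quote rule mirrors the longest-opener-wins ordering described above: a 3-quote rule tried first would misread the 4-quote opener as `"""` plus a stray quote.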
Doc attributes are nearly universal in OTP 27+ modules and the canonical use case for triple-quoted strings. Lock coverage of the common shapes: triple-quoted body, single-line string body, and a `-doc """..."""` attribute followed by a function clause (which exercises the boundary between the doc string's closing `"""` and the function head).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Function-head guards exercise the interaction between several rule families: keyword recognition (`when`), word operators (`andalso`, `orelse`), comparison operators (`>`, `<`, `=/=`), BIF recognition (`is_integer`, `is_atom`), and the comma/semicolon guard separator. Lock the common shapes so a regression in any one of those surfaces as a guard test failure.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Map comprehensions (OTP 26) and bitstring comprehensions
(pre-existing, but tests scarce) exercise the interactions between
several operator and punctuation tokens that the lexer hasn't
explicitly tested in combination: `=>` and `:=` next to `||`,
`<-`, `<=`, and the `#{...}` map-open punctuation. Also lock
strict-generator `<:-` (OTP 27) coverage with an explicit
positive test rather than the operator-list catch-all.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Most lexer tests are minimal isolated inputs that pin one rule's
output. The richer interaction-shaped failures (a rule's order in
the choice perturbing how a sibling rule fires) need a test that
threads many features through one input. Add a small module
fragment that combines:
* `-module` / `-export` attributes
* a `-doc """..."""` doc attribute with multi-line body
* a function head with a `when` guard and BIF call
* a map comprehension (`#{K => V || K := V <- M, ...}`)
* a body with comparison operator and number
If a future change breaks any of those rules' interactions, this
test catches it whereas the per-feature tests would still pass.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
/cc @bjorng
I started by adding native record support and then decided to do a full facelift of many of the features that have landed the last releases and fix some bugs along the way. This PR adds support for:
and a bunch of small fixes.