From 4b2d7c90fa2b578015516567c9e7eed6d54ec63a Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Lukas=20Backstr=C3=B6m?=
Date: Wed, 6 May 2026 14:24:43 +0200
Subject: [PATCH 01/15] Add lexer rules for native records (OTP 29)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Native records introduced new syntactic forms in Erlang/OTP 29
(erlang/otp PR #11090). The shape that the existing record rule does not
match is the external construction / pattern / field access form
`#Module:Name{...}` and `#Module:Name.field`, where `Module:Name`
appears between the `#` and the `{` (or `.`).

Tokenize the module qualifier as `:name_class` (matching the existing
namespace pattern) and the record name as the existing record-name token
(`:string_symbol`). Local construction (`#Name{...}`) is identical in
shape to a tuple-based record and needs no change — the lexer cannot
disambiguate from local context, so both colour the same way.

The other native-record forms — `-record #Name{...}.` definition
attribute, `-export_record([...])`, `-import_record(Mod, [...])` —
already tokenize correctly under the existing module_attribute rule
(the attribute name is captured generically and `(` is already
optional).

Tests added to lock the expected output.

Co-Authored-By: Claude Opus 4.7 (1M context)

Add tests for native record patterns and updates

The native-records commit covered construction (`#mod:name{f=v}`) but
not the symmetric forms: pattern matching (`#mod:name{f = X} = Y`) and
updates via a prefixed variable (`Y#mod:name{f = 2}`). Both already
work via the same rule but nothing tested them; lock the coverage so a
future ordering tweak in the choice can't silently regress them.
Co-Authored-By: Claude Opus 4.7 (1M context)

Accept variable-shape and keyword names in native records

OTP 29's native-record syntax relaxes the record-name rule: per the
spec at https://www.erlang.org/doc/system/data_types.html, "it is not
necessary to quote atoms that look like variable names or keywords."
So `#State{}`, `#div{}`, `#case{}`, `#fun{}` are all valid record
references even though `State` is variable-shape and `div`/`case`/`fun`
are reserved words.

The previous rule used `atom_name` only, which requires lowercase or
quoted. `#State{}` therefore fell through to the punctuation / variable
rules and produced disjoint tokens with no record-shape grouping.

Add a `record_name` combinator that accepts either `atom_name` or
`variable_name`, both tagged `:string_symbol`, and use it in both
`record` and `native_record_external`. Tuple-based records don't
actually allow these forms, but the lexer can't tell the two record
kinds apart from local context — so accept the union.

Keyword and word-operator names (`#case{}`, `#div{}`, `#fun{}`) still
get re-tagged by postprocess to `:keyword` / `:operator_word`. That's
accepted output: the surrounding `#...{` shape still groups visually as
a record reference, and themes that care can render
keywords-in-record-position differently if desired.

Tests cover all four name shapes (lowercase atom, variable, keyword,
quoted) in three positions (local construction, external construction,
definition attribute), including the OTP 29 spec example
`-record #vector{x = 0.0, y = 0.0}.`.

Co-Authored-By: Claude Opus 4.7 (1M context)

Don't reclassify keyword / builtin record names in postprocess

When `#case{}` is a record reference, the existing postprocess clauses
re-tagged the inner `:string_symbol "case"` as `:keyword`, because the
conversion was unconditional on the value matching the keyword list.
Visually that meant the record-name slot flipped colour depending on
whether the chosen name happened to be a reserved word — confusing for
readers, and wrong because the text in that position is a record name,
not an expression keyword.

Tag record-name tokens with a `record_name: true` meta marker via the
`record_name` combinator. Add a postprocess clause that matches the
marker and bypasses the keyword / builtin / word-operator
reclassification, then strips the marker so it doesn't leak into
rendered tokens. Both `:string_symbol` clauses (keyword, builtin,
word-operator) are guarded by the marker check implicitly because
pattern matching is order-sensitive and the marker clause comes first.

Tests for `#case{}`, `#fun{}`, `#div{}`, and `#mod:case{}` now assert
`:string_symbol` for the record name (matching the lowercase-name
behaviour) instead of `:keyword` / `:operator_word`. A new test
verifies the marker doesn't leak through to output meta.

Co-Authored-By: Claude Opus 4.7 (1M context)
---
 lib/makeup/lexers/erlang_lexer.ex             |  44 ++-
 .../erlang_lexer_tokenizer_test.exs           | 279 ++++++++++++++++++
 2 files changed, 322 insertions(+), 1 deletion(-)

diff --git a/lib/makeup/lexers/erlang_lexer.ex b/lib/makeup/lexers/erlang_lexer.ex
index eb94a3f..9a65a46 100644
--- a/lib/makeup/lexers/erlang_lexer.ex
+++ b/lib/makeup/lexers/erlang_lexer.ex
@@ -224,9 +224,43 @@ defmodule Makeup.Lexers.ErlangLexer do
       :operator
     )
 
+  # OTP 29 native records relax the record-name rule: per the spec
+  # (https://www.erlang.org/doc/system/data_types.html), "it is not necessary
+  # to quote atoms that look like variable names or keywords." So `#State{}`,
+  # `#div{}`, `#case{}` are all valid record references even though `State`
+  # is variable-shape and `div`/`case` are reserved words. Tuple-based records
+  # don't allow these forms, but the lexer can't tell the two record kinds
+  # apart from local context — so accept the union.
+  #
+  # The `record_name: true` meta marker tells postprocess to skip the
+  # keyword / builtin / word-operator conversion for this position. Without
+  # it, `#case{}` would tokenise as `[#, keyword case, {]` — visually
+  # confusing because `case` here names a record, not an expression keyword.
+  record_name =
+    choice([
+      token(atom_name, :string_symbol, %{record_name: true}),
+      token(variable_name, :string_symbol, %{record_name: true})
+    ])
+
+  # External native record construction / pattern / field access:
+  #     #Module:Name{F = V}
+  #     #Module:Name.field
+  # The `Module:Name` shape between `#` and `{` (or `.`) was added in OTP 29
+  # alongside native records. Local construction (`#Name{...}`) is identical
+  # in shape to a tuple-based record and is handled by the rule below.
+  native_record_external =
+    token(string("#"), :operator)
+    |> concat(token(atom_name, :name_class))
+    |> concat(token(":", :punctuation))
+    |> concat(record_name)
+    |> choice([
+      token("{", :punctuation),
+      token(".", :punctuation)
+    ])
+
   record =
     token(string("#"), :operator)
-    |> concat(atom)
+    |> concat(record_name)
     |> choice([
       token("{", :punctuation),
       token(".", :punctuation)
@@ -304,6 +338,7 @@ defmodule Makeup.Lexers.ErlangLexer do
     ] ++
       all_sigils ++
       [
+        native_record_external,
         record,
         punctuation,
         # `tuple` might be unnecessary
@@ -379,6 +414,13 @@ defmodule Makeup.Lexers.ErlangLexer do
 
   @word_operators ~W[and andalso band bnot bor bsl bsr bxor div not or orelse rem xor]
 
+  # Record names tagged by the `record_name` combinator should not be
+  # reclassified as keywords / builtins / word-operators even if their
+  # text happens to match. Strip the marker after acting on it so it
+  # doesn't leak into the rendered output.
+  defp postprocess_helper([{:string_symbol, %{record_name: true} = meta, value} | tokens]),
+    do: [{:string_symbol, Map.delete(meta, :record_name), value} | postprocess_helper(tokens)]
+
   defp postprocess_helper([{:string_symbol, meta, value} | tokens]) when value in @keywords,
     do: [{:keyword, meta, value} | postprocess_helper(tokens)]
 
diff --git a/test/makeup/erlang_lexer/erlang_lexer_tokenizer_test.exs b/test/makeup/erlang_lexer/erlang_lexer_tokenizer_test.exs
index 84ba145..f75085b 100644
--- a/test/makeup/erlang_lexer/erlang_lexer_tokenizer_test.exs
+++ b/test/makeup/erlang_lexer/erlang_lexer_tokenizer_test.exs
@@ -553,6 +553,285 @@ defmodule ErlangLexerTokenizer do
     end
   end
 
+  describe "native records (OTP 29)" do
+    test "tokenizes external native record construction" do
+      assert [
+               {:operator, %{}, "#"},
+               {:name_class, %{}, "vector_lib"},
+               {:punctuation, %{}, ":"},
+               {:string_symbol, %{}, "vector"},
+               {:punctuation, %{}, "{"} | _
+             ] = lex("#vector_lib:vector{x = 1.0, y = 2.0}")
+    end
+
+    test "tokenizes external native record print form" do
+      assert [
+               {:operator, %{}, "#"},
+               {:name_class, %{}, "example"},
+               {:punctuation, %{}, ":"},
+               {:string_symbol, %{}, "pair"},
+               {:punctuation, %{}, "{"} | _
+             ] = lex("#example:pair{a = 1, b = 2}")
+    end
+
+    test "tokenizes external native record field access" do
+      assert [
+               {_, %{}, "X"},
+               {:operator, %{}, "#"},
+               {:name_class, %{}, "vector_lib"},
+               {:punctuation, %{}, ":"},
+               {:string_symbol, %{}, "vector"},
+               {:punctuation, %{}, "."} | _
+             ] = lex("X#vector_lib:vector.x")
+    end
+
+    test "tokenizes local native record construction the same as tuple-based records" do
+      assert [
+               {:operator, %{}, "#"},
+               {:string_symbol, %{}, "pair"},
+               {:punctuation, %{}, "{"} | _
+             ] = lex("#pair{a = 1, b = 2}")
+    end
+
+    test "tokenizes -record #Name{...} native definition attribute" do
+      tokens = lex("\n-record #pair{a, b}.")
+      assert {:name_attribute, %{}, "record"} in tokens
+      assert {:operator, %{}, "#"} in tokens
+      assert {:string_symbol, %{}, "pair"} in tokens
+    end
+
+    test "tokenizes -export_record attribute" do
+      assert [
+               {:whitespace, %{}, "\n"},
+               {:punctuation, %{}, "-"},
+               {:name_attribute, %{}, "export_record"} | _
+             ] = lex("\n-export_record([vector, position]).")
+    end
+
+    test "tokenizes -import_record attribute" do
+      assert [
+               {:whitespace, %{}, "\n"},
+               {:punctuation, %{}, "-"},
+               {:name_attribute, %{}, "import_record"} | _
+             ] = lex("\n-import_record(vector_lib, [vector, position]).")
+    end
+
+    test "does not break the existing local-record rule when there is no `:`" do
+      tokens = lex("X#name{f = 1}")
+      assert {:operator, %{}, "#"} in tokens
+      assert {:string_symbol, %{}, "name"} in tokens
+      refute Enum.any?(tokens, fn t -> match?({:name_class, _, _}, t) end)
+    end
+
+    test "external native record pattern match" do
+      assert [
+               {:operator, %{}, "#"},
+               {:name_class, %{}, "mod"},
+               {:punctuation, %{}, ":"},
+               {:string_symbol, %{}, "name"},
+               {:punctuation, _, "{"},
+               {:string_symbol, %{}, "f"},
+               {:whitespace, %{}, " "},
+               {:operator, %{}, "="},
+               {:whitespace, %{}, " "},
+               {:name, %{}, "X"},
+               {:punctuation, _, "}"},
+               {:whitespace, %{}, " "},
+               {:operator, %{}, "="},
+               {:whitespace, %{}, " "},
+               {:name, %{}, "Y"}
+             ] = lex("#mod:name{f = X} = Y")
+    end
+
+    test "external native record update via prefixed variable" do
+      assert [
+               {:name, %{}, "Y"},
+               {:operator, %{}, "#"},
+               {:name_class, %{}, "mod"},
+               {:punctuation, %{}, ":"},
+               {:string_symbol, %{}, "name"},
+               {:punctuation, _, "{"} | _
+             ] = lex("Y#mod:name{f = 2}")
+    end
+
+    # Native records relax the record-name rule:
+    # https://www.erlang.org/doc/system/data_types.html says "it is not
+    # necessary to quote atoms that look like variable names or keywords."
+    # So `#State{}`, `#div{}`, `#case{}` are all valid.
+    test "variable-shape name (`#State{}`)" do
+      assert [
+               {:operator, %{}, "#"},
+               {:string_symbol, %{}, "State"},
+               {:punctuation, _, "{"},
+               {:string_symbol, %{}, "x"},
+               {:whitespace, %{}, " "},
+               {:operator, %{}, "="},
+               {:whitespace, %{}, " "},
+               {:number_integer, %{}, "1"},
+               {:punctuation, _, "}"}
+             ] = lex("#State{x = 1}")
+    end
+
+    test "external native record with variable-shape name" do
+      assert [
+               {:operator, %{}, "#"},
+               {:name_class, %{}, "mod"},
+               {:punctuation, %{}, ":"},
+               {:string_symbol, %{}, "State"},
+               {:punctuation, _, "{"} | _
+             ] = lex("#mod:State{x = 1}")
+    end
+
+    # Keyword and word-operator names stay as `:string_symbol` in record
+    # position. Postprocess sees the `record_name: true` meta marker and
+    # skips the usual conversion to `:keyword` / `:operator_word`, so the
+    # surrounding `#...{` shape renders consistently regardless of whether
+    # the name happens to be a reserved word.
+    test "keyword name (`#case{}`) stays as :string_symbol" do
+      assert [
+               {:operator, %{}, "#"},
+               {:string_symbol, %{}, "case"},
+               {:punctuation, _, "{"} | _
+             ] = lex("#case{x = 1}")
+    end
+
+    test "keyword name (`#fun{}`)" do
+      assert [
+               {:operator, %{}, "#"},
+               {:string_symbol, %{}, "fun"},
+               {:punctuation, _, "{"} | _
+             ] = lex("#fun{f = g}")
+    end
+
+    test "word-operator name (`#div{}`)" do
+      assert [
+               {:operator, %{}, "#"},
+               {:string_symbol, %{}, "div"},
+               {:punctuation, _, "{"} | _
+             ] = lex("#div{class}")
+    end
+
+    test "external native record with keyword name (`#mod:case{}`)" do
+      assert [
+               {:operator, %{}, "#"},
+               {:name_class, %{}, "mod"},
+               {:punctuation, %{}, ":"},
+               {:string_symbol, %{}, "case"},
+               {:punctuation, _, "{"} | _
+             ] = lex("#mod:case{x = 1}")
+    end
+
+    test "quoted-atom record name (`#'42'{}`)" do
+      assert [
+               {:operator, %{}, "#"},
+               {:string_symbol, %{}, "'42'"},
+               {:punctuation, _, "{"} | _
+             ] = lex("#'42'{}")
+    end
+
+    # Declaration syntax: `-record #Name{...}.` (no parens around the name).
+    # This is the OTP 29 native-record definition form, distinct from the
+    # tuple-based `-record(name, {...}).` form. The same name flexibility
+    # (lowercase / variable-shape / keyword / quoted) applies.
+    test "definition with lowercase name" do
+      assert lex("\n-record #pair{a, b}.") == [
+               {:whitespace, %{}, "\n"},
+               {:punctuation, %{}, "-"},
+               {:name_attribute, %{}, "record"},
+               {:whitespace, %{}, " "},
+               {:operator, %{}, "#"},
+               {:string_symbol, %{}, "pair"},
+               {:punctuation, %{group_id: "group-1"}, "{"},
+               {:string_symbol, %{}, "a"},
+               {:punctuation, %{}, ","},
+               {:whitespace, %{}, " "},
+               {:string_symbol, %{}, "b"},
+               {:punctuation, %{group_id: "group-1"}, "}"},
+               {:punctuation, %{}, "."}
+             ]
+    end
+
+    test "definition with variable-shape name (`-record #State{x}.`)" do
+      assert lex("\n-record #State{x}.") == [
+               {:whitespace, %{}, "\n"},
+               {:punctuation, %{}, "-"},
+               {:name_attribute, %{}, "record"},
+               {:whitespace, %{}, " "},
+               {:operator, %{}, "#"},
+               {:string_symbol, %{}, "State"},
+               {:punctuation, %{group_id: "group-1"}, "{"},
+               {:string_symbol, %{}, "x"},
+               {:punctuation, %{group_id: "group-1"}, "}"},
+               {:punctuation, %{}, "."}
+             ]
+    end
+
+    test "definition with keyword name (`-record #div{class}.`)" do
+      assert lex("\n-record #div{class}.") == [
+               {:whitespace, %{}, "\n"},
+               {:punctuation, %{}, "-"},
+               {:name_attribute, %{}, "record"},
+               {:whitespace, %{}, " "},
+               {:operator, %{}, "#"},
+               {:string_symbol, %{}, "div"},
+               {:punctuation, %{group_id: "group-1"}, "{"},
+               {:string_symbol, %{}, "class"},
+               {:punctuation, %{group_id: "group-1"}, "}"},
+               {:punctuation, %{}, "."}
+             ]
+    end
+
+    test "definition with quoted name (`-record #'42'{}.`)" do
+      assert lex("\n-record #'42'{}.") == [
+               {:whitespace, %{}, "\n"},
+               {:punctuation, %{}, "-"},
+               {:name_attribute, %{}, "record"},
+               {:whitespace, %{}, " "},
+               {:operator, %{}, "#"},
+               {:string_symbol, %{}, "'42'"},
+               {:punctuation, %{group_id: "group-1"}, "{"},
+               {:punctuation, %{group_id: "group-1"}, "}"},
+               {:punctuation, %{}, "."}
+             ]
+    end
+
+    test "the record_name meta marker does not leak into output tokens" do
+      # Postprocess strips the marker after acting on it. End-to-end the
+      # token's meta should be the same as for any other :string_symbol.
+      [_, {:string_symbol, meta_kw, "case"} | _] = lex("#case{x = 1}")
+      [_, {:string_symbol, meta_lc, "vector"} | _] = lex("#vector{x = 1}")
+      assert meta_kw == meta_lc
+      refute Map.has_key?(meta_kw, :record_name)
+    end
+
+    test "definition with default values" do
+      # `-record #vector{x = 0.0, y = 0.0}.` — the OTP 29 spec example.
+      assert lex("\n-record #vector{x = 0.0, y = 0.0}.") == [
+               {:whitespace, %{}, "\n"},
+               {:punctuation, %{}, "-"},
+               {:name_attribute, %{}, "record"},
+               {:whitespace, %{}, " "},
+               {:operator, %{}, "#"},
+               {:string_symbol, %{}, "vector"},
+               {:punctuation, %{group_id: "group-1"}, "{"},
+               {:string_symbol, %{}, "x"},
+               {:whitespace, %{}, " "},
+               {:operator, %{}, "="},
+               {:whitespace, %{}, " "},
+               {:number_float, %{}, "0.0"},
+               {:punctuation, %{}, ","},
+               {:whitespace, %{}, " "},
+               {:string_symbol, %{}, "y"},
+               {:whitespace, %{}, " "},
+               {:operator, %{}, "="},
+               {:whitespace, %{}, " "},
+               {:number_float, %{}, "0.0"},
+               {:punctuation, %{group_id: "group-1"}, "}"},
+               {:punctuation, %{}, "."}
+             ]
+    end
+  end
+
   describe "function_arity" do
     test "is tokenized correctly for the syntax function_name/arity" do
       assert [

From 5161d41ac910e9d64e0ee1d23498b97b9a69869c Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Lukas=20Backstr=C3=B6m?=
Date: Wed, 6 May 2026 14:25:31 +0200
Subject: [PATCH 02/15] Accept underscore separators in numeric literals
 (OTP 27)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

OTP 27 added `_` as a digit-group separator in numeric literals:
`1_000_000`, `16#FF_FF`, `0.1_5e1_0`. Extend the digit character
classes so numeric tokens accept these forms.
The lexer is intentionally tolerant about position — it does not
validate that underscores only sit between digits and not at the edges
of the literal — because the lexer's job is highlighting, not
validation. The compiler will reject malformed literals with a real
error.

Tightened `number_integer` to require a leading digit so a bare
underscore can't accidentally start a number; the digit-tail then
absorbs further `[0-9_]+`. Weird-base integers (`16#FF_FF`) now include
`_` in the post-`#` character set.

Co-Authored-By: Claude Opus 4.7 (1M context)
---
 lib/makeup/lexers/erlang_lexer.ex             | 13 ++++++---
 .../erlang_lexer_tokenizer_test.exs           | 27 +++++++++++++++++++
 2 files changed, 36 insertions(+), 4 deletions(-)

diff --git a/lib/makeup/lexers/erlang_lexer.ex b/lib/makeup/lexers/erlang_lexer.ex
index 9a65a46..b67fb21 100644
--- a/lib/makeup/lexers/erlang_lexer.ex
+++ b/lib/makeup/lexers/erlang_lexer.ex
@@ -59,24 +59,29 @@ defmodule Makeup.Lexers.ErlangLexer do
     ])
 
   # Numbers
-  digits = ascii_string([?0..?9], min: 1)
+  #
+  # Erlang/OTP 27 added underscore separators in numeric literals
+  # (`1_000_000`, `16#FF_FF`, `0.1_5e1_0`). Lexer-tolerant: underscores are
+  # accepted anywhere inside the digit run; we don't validate position.
+  digits = ascii_string([?0..?9, ?_], min: 1)
 
   number_integer =
     optional(ascii_char([?+, ?-]))
-    |> concat(digits)
+    |> ascii_char([?0..?9])
+    |> optional(ascii_string([?0..?9, ?_], min: 1))
     |> token(:number_integer)
 
   number_integer_in_weird_base =
     optional(ascii_char([?+, ?-]))
     |> concat(numeric_base)
     |> string("#")
-    |> ascii_string([?0..?9, ?a..?z, ?A..?Z], min: 1)
+    |> ascii_string([?0..?9, ?a..?z, ?A..?Z, ?_], min: 1)
    |> token(:number_integer)
 
   # Floating point numbers
   float_scientific_notation_part =
     ascii_string([?e, ?E], 1)
-    |> optional(string("-"))
+    |> optional(ascii_char([?+, ?-]))
     |> concat(digits)
 
   number_float =
diff --git a/test/makeup/erlang_lexer/erlang_lexer_tokenizer_test.exs b/test/makeup/erlang_lexer/erlang_lexer_tokenizer_test.exs
index f75085b..47f728e 100644
--- a/test/makeup/erlang_lexer/erlang_lexer_tokenizer_test.exs
+++ b/test/makeup/erlang_lexer/erlang_lexer_tokenizer_test.exs
@@ -94,6 +94,33 @@ defmodule ErlangLexerTokenizer do
     assert lex("1.05e12") == [{:number_float, %{}, "1.05e12"}]
     assert lex("1.05e-6") == [{:number_float, %{}, "1.05e-6"}]
     assert lex("1.05e-12") == [{:number_float, %{}, "1.05e-12"}]
+    assert lex("1.05e+6") == [{:number_float, %{}, "1.05e+6"}]
+    assert lex("1.0e+10") == [{:number_float, %{}, "1.0e+10"}]
+  end
+
+  # Numeric separators (`_`) are valid inside numeric literals since OTP 27.
+  test "integers with underscore separators" do
+    assert lex("1_000") == [{:number_integer, %{}, "1_000"}]
+    assert lex("1_000_000") == [{:number_integer, %{}, "1_000_000"}]
+  end
+
+  test "floats with underscore separators" do
+    assert lex("1_000.5") == [{:number_float, %{}, "1_000.5"}]
+    assert lex("3.14_15") == [{:number_float, %{}, "3.14_15"}]
+  end
+
+  test "weird-base integers with underscore separators" do
+    assert lex("16#FF_FF") == [{:number_integer, %{}, "16#FF_FF"}]
+    assert lex("2#1010_1010") == [{:number_integer, %{}, "2#1010_1010"}]
+  end
+
+  test "trailing identifier after a number is not absorbed via underscore" do
+    # `1_000` is a number; the bare identifier following with whitespace is separate.
+    assert [
+             {:number_integer, %{}, "1_000"},
+             {:whitespace, %{}, " "},
+             {:name, %{}, "X"}
+           ] = lex("1_000 X")
+  end
 end

From e5ed1f36d9228b7122e62ffba745d44c580afb94 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Lukas=20Backstr=C3=B6m?=
Date: Wed, 6 May 2026 14:33:54 +0200
Subject: [PATCH 03/15] Tokenize `?=` as a single operator
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

`?=` is the maybe-expression match operator added in OTP 25 (stable in
OTP 27). Without it in `syntax_operators`, `X ?= Y` was lexed as two
operator tokens (`?` and `=`), which is wrong both visually and
semantically — it broke inside `maybe ... end` blocks.

Co-Authored-By: Claude Opus 4.7 (1M context)
---
 lib/makeup/lexers/erlang_lexer.ex             |  2 +-
 .../erlang_lexer_tokenizer_test.exs           | 18 ++++++++++++++++++
 2 files changed, 19 insertions(+), 1 deletion(-)

diff --git a/lib/makeup/lexers/erlang_lexer.ex b/lib/makeup/lexers/erlang_lexer.ex
index b67fb21..fa66bc9 100644
--- a/lib/makeup/lexers/erlang_lexer.ex
+++ b/lib/makeup/lexers/erlang_lexer.ex
@@ -225,7 +225,7 @@ defmodule Makeup.Lexers.ErlangLexer do
   syntax_operators =
     word_from_list(
-      ~W[+ - +? ++ = == -- * / < > /= =:= =/= =< >= ==? <- <:- <= <:= ! ? ?!],
+      ~W[+ - +? ++ = == -- * / < > /= =:= =/= =< >= ==? <- <:- <= <:= ! ? ?! ?=],
       :operator
     )
 
diff --git a/test/makeup/erlang_lexer/erlang_lexer_tokenizer_test.exs b/test/makeup/erlang_lexer/erlang_lexer_tokenizer_test.exs
index 47f728e..cbc2b3a 100644
--- a/test/makeup/erlang_lexer/erlang_lexer_tokenizer_test.exs
+++ b/test/makeup/erlang_lexer/erlang_lexer_tokenizer_test.exs
@@ -431,6 +431,7 @@ defmodule ErlangLexerTokenizer do
     assert lex("<:-") == [{:operator, %{}, "<:-"}]
     assert lex("<=") == [{:operator, %{}, "<="}]
     assert lex("<:=") == [{:operator, %{}, "<:="}]
+    assert lex("?=") == [{:operator, %{}, "?="}]
   end
 
   test "word operators are tokenized as operator" do
@@ -580,6 +581,23 @@ defmodule ErlangLexerTokenizer do
     end
   end
 
+  describe "maybe expression" do
+    # `?=` is the maybe-expression match operator added in OTP 25.
+    test "tokenizes ?= as a single operator inside a maybe block" do
+      assert lex("maybe X ?= ok end") == [
+               {:keyword, %{}, "maybe"},
+               {:whitespace, %{}, " "},
+               {:name, %{}, "X"},
+               {:whitespace, %{}, " "},
+               {:operator, %{}, "?="},
+               {:whitespace, %{}, " "},
+               {:string_symbol, %{}, "ok"},
+               {:whitespace, %{}, " "},
+               {:keyword, %{}, "end"}
+             ]
+    end
+  end
+
   describe "native records (OTP 29)" do
     test "tokenizes external native record construction" do
       assert [

From 315529933964a178a6006b47fb9e23f4ad59678d Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Lukas=20Backstr=C3=B6m?=
Date: Wed, 6 May 2026 14:37:12 +0200
Subject: [PATCH 04/15] Tokenize multi-character escape sequences in
 `$\\...` chars
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

The `character` rule tokenized `$\\` followed by a single byte — fine
for simple `$\\n`, `$\\t`, but wrong for hex (`$\\xFF`, `$\\x{1F600}`),
octal (`$\\077`) and control (`$\\^A`) forms. Those were splitting into
a partial char-token plus stray name or integer tokens, which rendered
as broken syntax in the docs.
Add a dedicated `character_escape` rule that, after consuming the
leading backslash, tries the structured escapes (hex with its two
`x`-prefixed forms, octal, control) before falling back to any single
char. The order matters: `escape_hex` and `escape_octal` must precede
the single-char fallback so the multi-character forms are consumed
whole.

Co-Authored-By: Claude Opus 4.7 (1M context)

Emit `:string_escape` sub-tokens inside double-quoted strings

The `triple_quoted_string` rule already emitted `:string_escape`
sub-tokens for each escape sequence inside the string body. Plain
double-quoted strings did not — they used a literal `\"` recogniser
that only stopped the closing-quote logic from triggering early,
without producing a distinct token for the escape itself. Themes that
wanted to colour escapes differently from the surrounding string body
had no token to hook on.

Replace the special-purpose `escape_double_quote` with the generic
`escaped_char`, which itself was extended to consume structured escapes
(`\\xFF`, `\\x{...}`, `\\077`, `\\^A`) whole rather than truncating
after the leading byte. `string_like` now sees the same sub-token
vocabulary in `"..."` strings and `"""..."""` triple-quoted strings.

Existing string-escape tests updated to match the new (richer) output.

Co-Authored-By: Claude Opus 4.7 (1M context)
---
 lib/makeup/lexers/erlang_lexer.ex             | 31 ++++++--
 .../erlang_lexer_tokenizer_test.exs           | 77 +++++++++++++++++--
 2 files changed, 98 insertions(+), 10 deletions(-)

diff --git a/lib/makeup/lexers/erlang_lexer.ex b/lib/makeup/lexers/erlang_lexer.ex
index fa66bc9..db3e7f4 100644
--- a/lib/makeup/lexers/erlang_lexer.ex
+++ b/lib/makeup/lexers/erlang_lexer.ex
@@ -157,10 +157,23 @@ defmodule Makeup.Lexers.ErlangLexer do
     |> optional(string(".") |> concat(atom_name))
     |> token(:name_label)
 
+  # `$\xFF`, `$\x{1F600}`, `$\077`, `$\^A`, plus simple `$\n` / `$\t` / `$\\` /
+  # `$\"` / `$\'` etc.
+  # The structured escapes (octal, hex, ctrl) must be tried
+  # before the single-char fallback so multi-character sequences are consumed
+  # whole.
+  character_escape =
+    string("\\")
+    |> choice([
+      escape_hex,
+      escape_octal,
+      escape_ctrl,
+      utf8_char([])
+    ])
+
   character =
     string("$")
     |> choice([
-      string("\\") |> utf8_char([]),
+      character_escape,
       utf8_char(not: ?\\)
     ])
     |> token(:string_char)
@@ -171,14 +184,22 @@ defmodule Makeup.Lexers.ErlangLexer do
     |> ascii_char(to_charlist("~#+BPWXb-ginpswx"))
     |> token(:string_interpol)
 
-  escape_double_quote = string(~s/\\"/)
-  erlang_string = string_like(~s/"/, ~s/"/, [escape_double_quote, string_interpol], :string)
-
+  # Sub-token emitted inside string literals for escape sequences. Mirrors
+  # the `character_escape` shape so multi-character escapes (`\xFF`,
+  # `\x{1F600}`, `\077`, `\^A`) are consumed whole instead of getting
+  # cut at the first byte. Themes can render these distinctly.
   escaped_char =
     string("\\")
-    |> utf8_string([], 1)
+    |> choice([
+      escape_hex,
+      escape_octal,
+      escape_ctrl,
+      utf8_char([])
+    ])
     |> token(:string_escape)
 
+  erlang_string = string_like(~s/"/, ~s/"/, [escaped_char, string_interpol], :string)
+
   triple_quoted_string =
     lookahead_string(string(~s/"""\n/), string(~s/\n"""/), [escaped_char, string_interpol])
 
diff --git a/test/makeup/erlang_lexer/erlang_lexer_tokenizer_test.exs b/test/makeup/erlang_lexer/erlang_lexer_tokenizer_test.exs
index cbc2b3a..0bcc194 100644
--- a/test/makeup/erlang_lexer/erlang_lexer_tokenizer_test.exs
+++ b/test/makeup/erlang_lexer/erlang_lexer_tokenizer_test.exs
@@ -20,6 +20,36 @@ defmodule ErlangLexerTokenizer do
     assert lex("$🫂") == [{:string_char, %{}, "$🫂"}]
   end
 
+  describe "character escape sequences" do
+    test "named escapes" do
+      assert lex("$\\n") == [{:string_char, %{}, "$\\n"}]
+      assert lex("$\\t") == [{:string_char, %{}, "$\\t"}]
+      assert lex("$\\\\") == [{:string_char, %{}, "$\\\\"}]
+      assert lex("$\\\"") == [{:string_char, %{}, "$\\\""}]
+    end
+
+    test "octal escape" do
+      assert lex("$\\7") == [{:string_char, %{}, "$\\7"}]
+      assert lex("$\\07") == [{:string_char, %{}, "$\\07"}]
+      assert lex("$\\077") == [{:string_char, %{}, "$\\077"}]
+    end
+
+    test "hex escape (two-digit form)" do
+      assert lex("$\\xFF") == [{:string_char, %{}, "$\\xFF"}]
+      assert lex("$\\x4a") == [{:string_char, %{}, "$\\x4a"}]
+    end
+
+    test "hex escape (braced form)" do
+      assert lex("$\\x{1F600}") == [{:string_char, %{}, "$\\x{1F600}"}]
+      assert lex("$\\x{0}") == [{:string_char, %{}, "$\\x{0}"}]
+    end
+
+    test "control escape" do
+      assert lex("$\\^A") == [{:string_char, %{}, "$\\^A"}]
+      assert lex("$\\^z") == [{:string_char, %{}, "$\\^z"}]
+    end
+  end
+
   test "comment" do
     assert lex("%abc") == [{:comment_single, %{}, "%abc"}]
     assert lex("% abc") == [{:comment_single, %{}, "% abc"}]
@@ -148,16 +178,53 @@ defmodule ErlangLexerTokenizer do
     end
 
     test "tokenizes escape of double quotes correctly" do
-      assert [{:string, %{}, ~s/"escape \\"double quote\\""/}] ==
-               lex(~s/"escape \\"double quote\\""/)
+      # Strings now produce :string_escape sub-tokens for each escape
+      # sequence (mirroring the triple-quoted-string behaviour and
+      # `makeup_elixir`). Themes can render escapes distinctly from the
+      # surrounding string body.
+      assert [
+               {:string, %{}, ~s/"escape /},
+               {:string_escape, %{}, ~s/\\"/},
+               {:string, %{}, "double quote"},
+               {:string_escape, %{}, ~s/\\"/},
+               {:string, %{}, "\""}
+             ] = lex(~s/"escape \\"double quote\\""/)
 
-      assert [{:string, %{}, ~s/"\\"quote\\""/}] == lex(~s/"\\"quote\\""/)
       assert {:string, %{}, ~s/"invalid string\\"/} not in lex(~s/"invalid string\\"/)
     end
 
     test "tokenizes literal escaped characters correctly" do
-      assert [{:string, %{}, ~s/"\\b"/}] == lex(~s/"\\b"/)
-      assert [{:string, %{}, ~s/"\\\\b"/}] == lex(~s/"\\\\b"/)
+      assert [
+               {:string, %{}, "\""},
+               {:string_escape, %{}, "\\b"},
+               {:string, %{}, "\""}
+             ] = lex(~s/"\\b"/)
+
+      assert [
+               {:string, %{}, "\""},
+               {:string_escape, %{}, "\\\\"},
+               {:string, %{}, "b\""}
+             ] = lex(~s/"\\\\b"/)
+    end
+
+    test "tokenizes hex / octal / control escapes inside strings" do
+      assert [
+               {:string, %{}, ~s/"a/},
+               {:string_escape, %{}, ~s/\\xFF/},
+               {:string, %{}, "b\""}
+             ] = lex(~s/"a\\xFFb"/)
+
+      assert [
+               {:string, %{}, ~s/"a/},
+               {:string_escape, %{}, "\\077"},
+               {:string, %{}, "b\""}
+             ] = lex(~s/"a\\077b"/)
+
+      assert [
+               {:string, %{}, ~s/"a/},
+               {:string_escape, %{}, "\\^A"},
+               {:string, %{}, "b\""}
+             ] = lex(~s/"a\\^Ab"/)
+    end
   end

From 7d236a658bf786a3ad8217cdbf34629f1dd85c09 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Lukas=20Backstr=C3=B6m?=
Date: Wed, 6 May 2026 14:38:08 +0200
Subject: [PATCH 05/15] Recover keywords misclassified as function names
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

The `function` rule eagerly matches any atom-shaped token followed by
`(` and tags it `:name_function`. For reserved words written adjacent
to `(` — most commonly `fun(X) -> ... end` — that loses the keyword
classification, because the postprocess pass only checked
`:string_symbol` tokens against the keyword list.

Add a postprocess clause that converts `:name_function` tokens whose
value is in the keyword list back to `:keyword`.
Reserved words can't legally be defined as function names in Erlang, so this is unambiguous. Co-Authored-By: Claude Opus 4.7 (1M context) --- lib/makeup/lexers/erlang_lexer.ex | 8 +++++++ .../erlang_lexer_tokenizer_test.exs | 23 +++++++++++++++++++ 2 files changed, 31 insertions(+) diff --git a/lib/makeup/lexers/erlang_lexer.ex b/lib/makeup/lexers/erlang_lexer.ex index db3e7f4..b57daad 100644 --- a/lib/makeup/lexers/erlang_lexer.ex +++ b/lib/makeup/lexers/erlang_lexer.ex @@ -450,6 +450,14 @@ defmodule Makeup.Lexers.ErlangLexer do defp postprocess_helper([{:string_symbol, meta, value} | tokens]) when value in @keywords, do: [{:keyword, meta, value} | postprocess_helper(tokens)] + # Keywords followed by `(` are first matched by the `function` rule and + # tagged `:name_function`. Recover them here. The most common case is + # `fun(X) -> ... end`; the rule also covers any other keyword that gets + # written next to `(` (e.g. `if(X)` in a teaching example of invalid + # syntax). + defp postprocess_helper([{:name_function, meta, value} | tokens]) when value in @keywords, + do: [{:keyword, meta, value} | postprocess_helper(tokens)] + defp postprocess_helper([{:string_symbol, meta, value} | tokens]) when value in @builtins, do: [{:name_builtin, meta, value} | postprocess_helper(tokens)] diff --git a/test/makeup/erlang_lexer/erlang_lexer_tokenizer_test.exs b/test/makeup/erlang_lexer/erlang_lexer_tokenizer_test.exs index 0bcc194..01dffc0 100644 --- a/test/makeup/erlang_lexer/erlang_lexer_tokenizer_test.exs +++ b/test/makeup/erlang_lexer/erlang_lexer_tokenizer_test.exs @@ -665,6 +665,29 @@ defmodule ErlangLexerTokenizer do end end + describe "fun keyword vs function call" do + test "fun(X) -> ... 
end tokenizes `fun` as keyword, not function name" do + assert [ + {:keyword, %{}, "fun"}, + {:punctuation, _, "("}, + {:name, %{}, "X"}, + {:punctuation, _, ")"} | _ + ] = lex("fun(X) -> X end") + end + + test "fun mod:func/2 still tokenizes correctly" do + assert [ + {:keyword, %{}, "fun"}, + {:whitespace, %{}, " "}, + {:name_class, %{}, "mod"}, + {:punctuation, %{}, ":"}, + {:string_symbol, %{}, "func"}, + {:punctuation, %{}, "/"}, + {:number_integer, %{}, "2"} + ] = lex("fun mod:func/2") + end + end + describe "native records (OTP 29)" do test "tokenizes external native record construction" do assert [ From 65258a802b3c1315d7136f8d4f898967a374849f Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Lukas=20Backstr=C3=B6m?= Date: Wed, 6 May 2026 14:52:17 +0200 Subject: [PATCH 06/15] Generate the BIF list at compile time from `erl_internal:bif/2` MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit The static `@builtins` list had bit-rotted: it was missing post-OTP-19 BIFs (`map_get/2`, `is_map_key/2`, `binary_part/2,3`, `floor/1`, `ceil/1`, `min/2`, `max/2`, `unique_integer/{0,1}`, `monotonic_time/{0,1}`, etc.) and contained at least one typo (`resume_processround` — no such BIF, presumably a merge of `resume_process` and `round`). Replace it with a compile-time-generated list sourced from `erl_internal:bif/2` — the same predicate the Erlang compiler uses to decide what's auto-imported. Every rebuild of `makeup_erlang` re-syncs the list with the OTP version we compile against. 122 BIFs vs the previous ~85. Also add a postprocess clause that converts `:name_function` tokens whose value is a BIF back to `:name_builtin` (analogous to the keyword-recovery clause). Closes makeup_erlang #13: `length(L)` and similar BIF calls now render as builtins instead of plain function calls. The pre-existing string-symbol → name_builtin clause was unchanged and still applies in non-`(`-followed positions (e.g. 
`length` standalone in documentation prose). Both clauses share the same `@builtins` list. Co-Authored-By: Claude Opus 4.7 (1M context) --- lib/makeup/lexers/erlang_lexer.ex | 38 ++++++++----------- .../erlang_lexer_tokenizer_test.exs | 32 +++++++++++++++- 2 files changed, 46 insertions(+), 24 deletions(-) diff --git a/lib/makeup/lexers/erlang_lexer.ex b/lib/makeup/lexers/erlang_lexer.ex index b57daad..fe5c9e6 100644 --- a/lib/makeup/lexers/erlang_lexer.ex +++ b/lib/makeup/lexers/erlang_lexer.ex @@ -414,29 +414,15 @@ defmodule Makeup.Lexers.ErlangLexer do @keywords ~W[after begin case catch cond end fun if let of query receive try when maybe else] - @builtins ~W[ - abs append_element apply atom_to_list binary_to_list bitstring_to_list - binary_to_term bit_size bump_reductions byte_size cancel_timer - check_process_code delete_module demonitor disconnect_node display - element erase exit float float_to_list fun_info fun_to_list - function_exported garbage_collect get get_keys group_leader hash - hd integer_to_list iolist_to_binary iolist_size is_atom is_binary - is_bitstring is_boolean is_builtin is_float is_function is_integer - is_list is_number is_pid is_port is_process_alive is_record is_reference - is_tuple length link list_to_atom list_to_binary list_to_bitstring - list_to_existing_atom list_to_float list_to_integer list_to_pid - list_to_tuple load_module localtime_to_universaltime make_tuple - md5 md5_final md5_update memory module_loaded monitor monitor_node - node nodes open_port phash phash2 pid_to_list port_close port_command - port_connect port_control port_call port_info port_to_list - process_display process_flag process_info purge_module put read_timer - ref_to_list register resume_processround send send_after send_nosuspend - set_cookie setelement size spawn spawn_link spawn_monitor spawn_opt - split_binary start_timer statistics suspend_process system_flag - system_info system_monitor system_profile term_to_binary tl trace - trace_delivered
trace_info trace_pattern trunc tuple_size tuple_to_list - universaltime_to_localtime unlink unregister whereis - ] + # Auto-imported BIFs, sourced at compile time from `erl_internal:bif/2` — + # the same predicate the Erlang compiler uses to decide what's auto-imported. + # Refreshed every time `makeup_erlang` is rebuilt, so the list stays in sync + # with the OTP version we compile against and never bit-rots. + @builtins :erlang.module_info(:exports) + |> Enum.filter(fn {name, arity} -> :erl_internal.bif(name, arity) end) + |> Enum.map(fn {name, _arity} -> Atom.to_string(name) end) + |> Enum.uniq() + |> Enum.sort() @word_operators ~W[and andalso band bnot bor bsl bsr bxor div not or orelse rem xor] @@ -461,6 +447,12 @@ defmodule Makeup.Lexers.ErlangLexer do defp postprocess_helper([{:string_symbol, meta, value} | tokens]) when value in @builtins, do: [{:name_builtin, meta, value} | postprocess_helper(tokens)] + # Same recovery for builtins: when a BIF is called as `length(L)` it is + # first matched by the `function` rule and tagged `:name_function`. Closes + # makeup_erlang #13. + defp postprocess_helper([{:name_function, meta, value} | tokens]) when value in @builtins, + do: [{:name_builtin, meta, value} | postprocess_helper(tokens)] + defp postprocess_helper([{:string_symbol, meta, value} | tokens]) when value in @word_operators, do: [{:operator_word, meta, value} | postprocess_helper(tokens)] diff --git a/test/makeup/erlang_lexer/erlang_lexer_tokenizer_test.exs b/test/makeup/erlang_lexer/erlang_lexer_tokenizer_test.exs index 01dffc0..b1ca0bf 100644 --- a/test/makeup/erlang_lexer/erlang_lexer_tokenizer_test.exs +++ b/test/makeup/erlang_lexer/erlang_lexer_tokenizer_test.exs @@ -665,6 +665,36 @@ defmodule ErlangLexerTokenizer do end end + describe "builtin (BIF) recognition" do + # The @builtins list is generated at compile time from `erl_internal:bif/2`. 
+ test "atoms that are auto-imported BIFs render as :name_builtin" do + assert [{:name_builtin, %{}, "length"}] = lex("length") + assert [{:name_builtin, %{}, "tuple_size"}] = lex("tuple_size") + end + + test "BIF calls (`name(...)`) render as :name_builtin not :name_function" do + # makeup_erlang #13. Before this fix, `length(L)` rendered as a regular + # function call instead of a builtin. + assert [{:name_builtin, %{}, "length"} | _] = lex("length(L)") + assert [{:name_builtin, %{}, "is_atom"} | _] = lex("is_atom(X)") + assert [{:name_builtin, %{}, "tuple_size"} | _] = lex("tuple_size(T)") + end + + test "post-OTP-19 BIFs are recognised (proves the static list is gone)" do + assert [{:name_builtin, %{}, "map_get"} | _] = lex("map_get(K, M)") + assert [{:name_builtin, %{}, "is_map_key"} | _] = lex("is_map_key(K, M)") + assert [{:name_builtin, %{}, "binary_part"} | _] = lex("binary_part(B, 0, 4)") + assert [{:name_builtin, %{}, "floor"} | _] = lex("floor(X)") + assert [{:name_builtin, %{}, "ceil"} | _] = lex("ceil(X)") + end + + test "module_info and nif_error are not classified as BIFs" do + # Both are exported from `erlang` but neither is auto-imported. + refute Enum.any?(lex("module_info"), &match?({:name_builtin, _, "module_info"}, &1)) + refute Enum.any?(lex("nif_error"), &match?({:name_builtin, _, "nif_error"}, &1)) + end + end + describe "fun keyword vs function call" do test "fun(X) -> ... 
end tokenizes `fun` as keyword, not function name" do assert [ @@ -1106,7 +1136,7 @@ defmodule ErlangLexerTokenizer do *** argument 1: not an iolist term """) == [ {:generic_prompt, %{selectable: false}, "1> "}, - {:name_function, %{}, "list_to_binary"}, + {:name_builtin, %{}, "list_to_binary"}, {:punctuation, %{group_id: "group-1"}, "("}, {:punctuation, %{group_id: "group-2"}, "<<"}, {:punctuation, %{group_id: "group-2"}, ">>"}, From 31b82cbd5a3b10b77bdb8288cd1f2fd1936776f9 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Lukas=20Backstr=C3=B6m?= Date: Wed, 6 May 2026 15:08:52 +0200 Subject: [PATCH 07/15] Detect prompts after multi-line whitespace blocks The `erl_prompt` rule used to require a literal `\n` immediately before the prompt body. When the generic `whitespace` rule earlier in the choice consumed a multi-character whitespace block ending in `\n` (e.g. `"\n \n1> ok."`), no `\n` remained at the prompt rule's expected position and the prompt was lexed as plain `[number_integer, operator]` instead of `:generic_prompt`. See makeup_elixir #28 for the same-shape bug. Match any leading whitespace block that contains at least one `\n`, which keeps the rule anchored to a line boundary while tolerating preceding spaces / tabs / further newlines. False-positives on `1 > 2` and `x. 1> a.` are still rejected because neither contains a `\n` between the operand and `>`. Co-Authored-By: Claude Opus 4.7 (1M context) --- lib/makeup/lexers/erlang_lexer.ex | 13 +++++++++++-- .../erlang_lexer/erlang_lexer_tokenizer_test.exs | 13 +++++++++++++ 2 files changed, 24 insertions(+), 2 deletions(-) diff --git a/lib/makeup/lexers/erlang_lexer.ex b/lib/makeup/lexers/erlang_lexer.ex index fe5c9e6..fc305e8 100644 --- a/lib/makeup/lexers/erlang_lexer.ex +++ b/lib/makeup/lexers/erlang_lexer.ex @@ -309,10 +309,19 @@ defmodule Makeup.Lexers.ErlangLexer do |> concat(token("/", :punctuation)) |> concat(number_integer) - # Erlang prompt + # Erlang prompt. 
Anchored to a line boundary by requiring the leading + # whitespace to contain at least one `\n`. The original rule required + # the `\n` immediately before the prompt body, which broke when the + # generic `whitespace` rule had already consumed the trailing `\n` of + # a multi-character whitespace block (see makeup_elixir #28). Allowing + # any leading non-newline whitespace before the `\n` and any further + # whitespace after lets the rule match in those cases without + # false-positiving on `1 > 2` or `x. 1> a.` (neither contains a `\n` + # in the relevant position). erl_prompt = - ascii_string([?\s, ?\r, ?\t], min: 0) + ascii_string([?\s, ?\f, ?\r, ?\t], min: 0) |> string("\n") + |> optional(ascii_string([?\s, ?\f, ?\r, ?\n, ?\t], min: 1)) |> token(:whitespace) |> concat( optional(string("(") |> concat(atom_name) |> string(")")) diff --git a/test/makeup/erlang_lexer/erlang_lexer_tokenizer_test.exs b/test/makeup/erlang_lexer/erlang_lexer_tokenizer_test.exs index b1ca0bf..310fd3a 100644 --- a/test/makeup/erlang_lexer/erlang_lexer_tokenizer_test.exs +++ b/test/makeup/erlang_lexer/erlang_lexer_tokenizer_test.exs @@ -1059,6 +1059,19 @@ defmodule ErlangLexerTokenizer do ] end + # makeup_elixir #28 analogue. The whitespace rule used to consume + # multi-line whitespace blocks greedily, leaving no `\n` for the prompt + # rule to anchor against. The prompt rule now matches any leading + # whitespace block that contains a `\n`. + test "is detected after a multi-line whitespace block" do + assert [ + {:whitespace, %{}, "\n \n"}, + {:generic_prompt, %{selectable: false}, "1> "}, + {:string_symbol, %{}, "ok"}, + {:punctuation, %{}, "."} + ] = lex("\n \n1> ok.") + end + test "with newlines" do assert lex("x. 
1> a.") == [ {:string_symbol, %{}, "x"}, From 13133427f5adafee0390e7ede756067a482a250f Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Lukas=20Backstr=C3=B6m?= Date: Wed, 6 May 2026 15:09:35 +0200 Subject: [PATCH 08/15] Lex `_5` / `_X` / `_unused` as a single variable Erlang's grammar treats any identifier starting with `_` followed by identifier characters as a variable (typically a "don't bother to warn me about this" hint). The lexer was tokenising `_5` as `[punctuation "_", number_integer 5]` because `_` appears in the generic punctuation list and was matched before the variable rule. Add a dedicated `underscore_identifier` rule that matches `_` followed by at least one identifier character and emits `:name`, placed before `punctuation` in the choice. Bare `_` (the wildcard pattern) remains a punctuation token so themes can render the two distinctly. Co-Authored-By: Claude Opus 4.7 (1M context) --- lib/makeup/lexers/erlang_lexer.ex | 11 +++++++ .../erlang_lexer_tokenizer_test.exs | 30 +++++++++++++++++++ 2 files changed, 41 insertions(+) diff --git a/lib/makeup/lexers/erlang_lexer.ex b/lib/makeup/lexers/erlang_lexer.ex index fc305e8..b56ce45 100644 --- a/lib/makeup/lexers/erlang_lexer.ex +++ b/lib/makeup/lexers/erlang_lexer.ex @@ -96,6 +96,16 @@ defmodule Makeup.Lexers.ErlangLexer do ascii_string([?A..?Z, ?_], 1) |> optional(ascii_string([?a..?z, ?_, ?0..?9, ?A..?Z], min: 1)) + # An underscore followed by at least one identifier character (`_5`, + # `_X`, `_unused`). Bare `_` stays as a punctuation token (the wildcard + # pattern), but underscore-prefixed identifiers like `_5` are variables + # in Erlang's grammar and should render as `:name`. Without this rule the + # `_` is matched first by the `punctuation` rule and the rest of the + # identifier falls through.
+ underscore_identifier = + string("_") + |> ascii_string([?a..?z, ?_, ?0..?9, ?A..?Z], min: 1) + |> token(:name) + simple_atom_name = ascii_string([?a..?z], 1) |> optional(ascii_string([?a..?z, ?_, ?@, ?0..?9, ?A..?Z], min: 1)) @@ -375,6 +385,7 @@ defmodule Makeup.Lexers.ErlangLexer do [ native_record_external, record, + underscore_identifier, punctuation, # `tuple` might be unnecessary tuple, diff --git a/test/makeup/erlang_lexer/erlang_lexer_tokenizer_test.exs b/test/makeup/erlang_lexer/erlang_lexer_tokenizer_test.exs index 310fd3a..c26a977 100644 --- a/test/makeup/erlang_lexer/erlang_lexer_tokenizer_test.exs +++ b/test/makeup/erlang_lexer/erlang_lexer_tokenizer_test.exs @@ -80,6 +80,36 @@ defmodule ErlangLexerTokenizer do assert lex("A_b1") == [{:name, %{}, "A_b1"}] end + describe "underscore-prefixed variables" do + test "underscore + digit lexes as a single variable" do + assert lex("_5") == [{:name, %{}, "_5"}] + end + + test "underscore + lowercase lexes as a single variable" do + assert lex("_unused") == [{:name, %{}, "_unused"}] + end + + test "underscore + uppercase lexes as a single variable" do + assert lex("_X") == [{:name, %{}, "_X"}] + end + + test "bare underscore (wildcard) stays as punctuation" do + # Pattern wildcard. Treat as punctuation so themes can render it + # distinctly from a variable name. 
+ assert [ + {:keyword, %{}, "case"}, + {:whitespace, %{}, " "}, + {:name, %{}, "X"}, + {:whitespace, %{}, " "}, + {:keyword, %{}, "of"}, + {:whitespace, %{}, " "}, + {:punctuation, %{}, "_"}, + {:whitespace, %{}, " "}, + {:punctuation, %{}, "->"} | _ + ] = lex("case X of _ -> ok end") + end + end + test "function call" do assert lex("f(") == [ {:name_function, %{}, "f"}, From d9e00e1b2c915783b6127c142be2de51b5f87339 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Lukas=20Backstr=C3=B6m?= Date: Wed, 6 May 2026 15:10:15 +0200 Subject: [PATCH 09/15] Lock the current OTP module-attribute set with tests The `module_attribute` rule already accepts any atom-shaped name as the attribute, so all current and future OTP attributes work without lexer changes. Add an explicit list of every current attribute (`-callback`, `-optional_callbacks`, `-on_load`, `-nifs`, `-deprecated`, `-removed`, `-feature`, `-export_type`, `-export_record` and `-import_record` from the native-records work, plus the historically-supported set) and assert each one tokenises as `:name_attribute`. Catches accidental regressions if anyone ever narrows the rule. Co-Authored-By: Claude Opus 4.7 (1M context) --- .../erlang_lexer_tokenizer_test.exs | 34 +++++++++++++++++++ 1 file changed, 34 insertions(+) diff --git a/test/makeup/erlang_lexer/erlang_lexer_tokenizer_test.exs b/test/makeup/erlang_lexer/erlang_lexer_tokenizer_test.exs index c26a977..11175bd 100644 --- a/test/makeup/erlang_lexer/erlang_lexer_tokenizer_test.exs +++ b/test/makeup/erlang_lexer/erlang_lexer_tokenizer_test.exs @@ -748,6 +748,40 @@ defmodule ErlangLexerTokenizer do end end + describe "OTP-current module attribute coverage" do + # The generic `module_attribute` rule accepts any `atom_name`, which + # means new attributes ship without lexer changes. Lock the current + # OTP-supported set with an explicit assertion list so the rule + # keeps covering them. 
@known_attributes ~w[module export import behaviour behavior callback + optional_callbacks on_load nifs deprecated removed + feature compile export_type record export_record + import_record spec type opaque doc moduledoc define + ifdef ifndef else endif if elif vsn] + + test "every current OTP module attribute lexes as :name_attribute" do + for attr <- @known_attributes do + # Use `(Body)` so the body is one well-known token. The point of + # the test is the attribute name, not the body shape. + expected = [ + {:whitespace, %{}, "\n"}, + {:punctuation, %{}, "-"}, + {:name_attribute, %{}, attr}, + {:punctuation, %{group_id: "group-1"}, "("}, + {:name, %{}, "Body"}, + {:punctuation, %{group_id: "group-1"}, ")"} + ] + + actual = lex("\n-" <> attr <> "(Body)") + + assert actual == expected, + "expected -#{attr} to lex as :name_attribute\n" <> + "expected: #{inspect(expected)}\n" <> + "actual: #{inspect(actual)}" + end + end + end + describe "native records (OTP 29)" do test "tokenizes external native record construction" do assert [ From 27f46947fb176666a22aaeda634af9391dfc1e91 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Lukas=20Backstr=C3=B6m?= Date: Wed, 6 May 2026 15:12:43 +0200 Subject: [PATCH 10/15] Distinguish parameterised macros from parameterless ones MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit `makeup_elixir` splits `@foo` and `@foo(...)` into different tokens. The Erlang equivalent — `?FOO` vs `?FOO(args)` — used to collapse both into `:name_constant`, and worse, the `?` operator in `syntax_operators` was tried first in the choice and ate the leading `?` of any macro reference, leaving `?FOO` to lex as `[operator "?", name "FOO"]`. Add a separate `macro_call` rule that matches `?Name(...)`-style references and emits `:name_function`, keep the existing `macro` rule (now `:name_constant`) for parameterless references, and move both ahead of `syntax_operators` in the choice so the operator rule no longer captures the `?`.
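The intended classification can be sketched as a standalone regex toy (the real lexer uses NimbleParsec combinators; this only illustrates the split between a bare macro head and one followed by `(`):

```elixir
defmodule MacroShapeSketch do
  # Regex stand-in for the real rules: a macro head is `?` followed by an
  # atom- or variable-shaped name.
  @head ~r/^\?[a-zA-Z_][a-zA-Z0-9_@]*/

  def classify(src) do
    case Regex.run(@head, src) do
      nil ->
        :not_a_macro

      [head] ->
        rest = src |> String.replace_prefix(head, "") |> String.trim_leading()
        # A following `(` makes it a parameterised reference.
        if String.starts_with?(rest, "("),
          do: {:name_function, head},
          else: {:name_constant, head}
    end
  end
end

MacroShapeSketch.classify("?FOO(X)")  # => {:name_function, "?FOO"}
MacroShapeSketch.classify("?FOO, X")  # => {:name_constant, "?FOO"}
```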
Co-Authored-By: Claude Opus 4.7 (1M context) --- lib/makeup/lexers/erlang_lexer.ex | 20 +++++++++++++++- .../erlang_lexer_tokenizer_test.exs | 24 +++++++++++++++++++ 2 files changed, 43 insertions(+), 1 deletion(-) diff --git a/lib/makeup/lexers/erlang_lexer.ex b/lib/makeup/lexers/erlang_lexer.ex index b56ce45..8612cb6 100644 --- a/lib/makeup/lexers/erlang_lexer.ex +++ b/lib/makeup/lexers/erlang_lexer.ex @@ -156,6 +156,20 @@ defmodule Makeup.Lexers.ErlangLexer do macro_name = choice([variable_name, atom_name]) + # Parameterised macro reference: `?FOO(arg1, arg2)`. Tokenised + # separately from the parameterless form so themes can render the two + # distinctly (matches `makeup_elixir`'s split between `@foo` and + # `@foo(...)`). The macro head emits as `:name_function`; the trailing + # `(` opens the standard punctuation group so paren matching still + # works. + macro_call = + string("?") + |> concat(macro_name) + |> token(:name_function) + |> concat(optional(whitespace)) + |> concat(token("(", :punctuation)) + + # Parameterless macro: `?FOO`. Constants by convention. macro = string("?") |> concat(macro_name) @@ -386,6 +400,11 @@ defmodule Makeup.Lexers.ErlangLexer do native_record_external, record, underscore_identifier, + # Macros must be tried before `syntax_operators`, since the + # operator list contains `?` and `?=` and would otherwise eat the + # leading `?` of `?FOO` / `?FOO(X)`. 
+ macro_call, + macro, punctuation, # `tuple` might be unnecessary tuple, @@ -400,7 +419,6 @@ defmodule Makeup.Lexers.ErlangLexer do function_arity, function, atom, - macro, character, label, # If we can't parse any of the above, we highlight the next character as an error diff --git a/test/makeup/erlang_lexer/erlang_lexer_tokenizer_test.exs b/test/makeup/erlang_lexer/erlang_lexer_tokenizer_test.exs index 11175bd..4f28d94 100644 --- a/test/makeup/erlang_lexer/erlang_lexer_tokenizer_test.exs +++ b/test/makeup/erlang_lexer/erlang_lexer_tokenizer_test.exs @@ -725,6 +725,30 @@ defmodule ErlangLexerTokenizer do end end + describe "macros" do + test "parameterless macro tokenizes as :name_constant" do + assert lex("?FOO") == [{:name_constant, %{}, "?FOO"}] + assert lex("?bar") == [{:name_constant, %{}, "?bar"}] + end + + test "parameterised macro head tokenizes as :name_function" do + assert [ + {:name_function, %{}, "?FOO"}, + {:punctuation, _, "("}, + {:name, %{}, "X"}, + {:punctuation, _, ")"} + ] = lex("?FOO(X)") + end + + test "parameterless macro followed by punctuation stays as constant" do + # `?FOO,` shouldn't be lured into the parameterised form. + assert [ + {:name_constant, %{}, "?FOO"}, + {:punctuation, %{}, ","} | _ + ] = lex("?FOO, X") + end + end + describe "fun keyword vs function call" do test "fun(X) -> ... end tokenizes `fun` as keyword, not function name" do assert [ From a62069ab8ed9a6dbb4a4e13e8f4a2fdf988e70d0 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Lukas=20Backstr=C3=B6m?= Date: Wed, 6 May 2026 15:17:08 +0200 Subject: [PATCH 11/15] Support quadruple- and quintuple-quoted strings (OTP 27+) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit OTP 27's triple-quoted-string spec extends to N quotes (N >= 3): an opening run of N quotes on its own line opens the string and a matching run of N quotes on its own line closes it. 
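For concreteness, the two widths written as escaped Elixir string literals (the 4-quote opener exists precisely so the body can hold a literal 3-quote run):

```elixir
# Source text of a triple- and a quadruple-quoted Erlang string:
triple = "\"\"\"\nplain body\n\"\"\""
quad   = "\"\"\"\"\nbody with a literal \"\"\" inside\n\"\"\"\""
```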
The lexer only recognised N=3, so any string using a 4-quote opener (which is the canonical way to embed a literal `"""` in the body) was lexed as multiple unrelated tokens. Add explicit `quadruple_quoted_string` and `quintuple_quoted_string` rules — NimbleParsec doesn't support dynamic delimiter lengths, so each width needs its own rule. Place the longer-quote variants ahead of the triple-quote rule in the choice so the longest matching opener wins. Also extend `sigil_delimiters` with `""""\n` / `\n""""` and the quintuple analogue (plus the matching `''''` / `'''''` variants), so sigil-prefixed multi-quoted strings (`~b""""..."""" `, `~B""""..."""" `, etc.) get the same coverage. The sub-token vocabulary inside the body — `:string_escape` for escape sequences, `:string_interpol` for `~p` / `~b` etc. — is identical across all widths, since they all share the same element list. Co-Authored-By: Claude Opus 4.7 (1M context) Lock OTP 27 sigil-delimiter spec coverage with tests The spec at https://www.erlang.org/doc/system/data_types.html#sigil defines the allowed sigil delimiters as: * pair forms: `()` `[]` `{}` `<>` * symmetric forms: `/` `|` `'` `"` `` ` `` `#` * triple-quote forms: `"""` `'''` (with quad/quint extensions for bodies that need to contain a literal `"""` / `""""`) The current `sigil_delimiters` list already covers every entry, but nothing locked the coverage. Add per-delimiter tests so a future narrowing of the list trips a test rather than silently dropping a valid sigil form. 
Co-Authored-By: Claude Opus 4.7 (1M context) --- lib/makeup/lexers/erlang_lexer.ex | 31 +++++++ .../erlang_lexer_tokenizer_test.exs | 86 +++++++++++++++++++ 2 files changed, 117 insertions(+) diff --git a/lib/makeup/lexers/erlang_lexer.ex b/lib/makeup/lexers/erlang_lexer.ex index 8612cb6..4a05f97 100644 --- a/lib/makeup/lexers/erlang_lexer.ex +++ b/lib/makeup/lexers/erlang_lexer.ex @@ -224,10 +224,39 @@ defmodule Makeup.Lexers.ErlangLexer do erlang_string = string_like(~s/"/, ~s/"/, [escaped_char, string_interpol], :string) + # Multi-quoted strings (OTP 27+). The opening run of `"""` (or more) on + # its own line opens the string; a matching run on its own line closes + # it. Use a quadruple/quintuple opener when the body needs to contain + # `"""` literally. Each variant is a separate rule because NimbleParsec + # doesn't support dynamic delimiter lengths; longer-quote variants must + # be tried first so the triple-quote rule doesn't claim them prematurely. + quintuple_quoted_string = + lookahead_string( + string(~s/"""""\n/), + string(~s/\n"""""/), + [escaped_char, string_interpol] + ) + + quadruple_quoted_string = + lookahead_string( + string(~s/""""\n/), + string(~s/\n""""/), + [escaped_char, string_interpol] + ) + triple_quoted_string = lookahead_string(string(~s/"""\n/), string(~s/\n"""/), [escaped_char, string_interpol]) + # Longer-quote variants must come first so the longest matching delimiter + # wins for sigils like `~"""""..."""""` (quintuple) or `~""""..."""" ` + # (quadruple) — these are needed when the sigil body has to contain + # `"""` or `""""` literally, mirroring the rule for plain multi-quoted + # strings above. 
sigil_delimiters = [ + {~s["""""\n], ~s[\n"""""]}, + {"'''''\n", "\n'''''"}, + {~s[""""\n], ~s[\n""""]}, + {"''''\n", "\n''''"}, {~s["""\n], ~s[\n"""]}, {"'''\n", "\n'''"}, {"\"", "\""}, @@ -392,6 +421,8 @@ defmodule Makeup.Lexers.ErlangLexer do hashbang, whitespace, comment, + quintuple_quoted_string, + quadruple_quoted_string, triple_quoted_string, erlang_string ] ++ diff --git a/test/makeup/erlang_lexer/erlang_lexer_tokenizer_test.exs b/test/makeup/erlang_lexer/erlang_lexer_tokenizer_test.exs index 4f28d94..9533765 100644 --- a/test/makeup/erlang_lexer/erlang_lexer_tokenizer_test.exs +++ b/test/makeup/erlang_lexer/erlang_lexer_tokenizer_test.exs @@ -772,6 +772,92 @@ defmodule ErlangLexerTokenizer do end end + # https://www.erlang.org/doc/system/data_types.html#sigil + describe "sigil delimiters (OTP 27 spec coverage)" do + # Pair delimiters: () [] {} <> + test "pair delimiters" do + for {open, close} <- [{"(", ")"}, {"[", "]"}, {"{", "}"}, {"<", ">"}] do + src = "~b" <> open <> "hi" <> close + + assert [{:string, %{}, ^src}] = lex(src), + "expected ~b#{open}hi#{close} to lex as a single :string" + end + end + + # Symmetric delimiters: / | ' " ` # + test "symmetric delimiters" do + for delim <- ["/", "|", "'", "\"", "`", "#"] do + src = "~b" <> delim <> "hi" <> delim + + assert [{:string, %{}, ^src}] = lex(src), + "expected ~b#{delim}hi#{delim} to lex as a single :string" + end + end + + test "triple-quote and triple-single-quote" do + assert [{:string, %{}, "~b\"\"\"\nhi\n\"\"\""}] = + lex("~b\"\"\"\nhi\n\"\"\"") + + assert [{:string, %{}, "~b'''\nhi\n'''"}] = + lex("~b'''\nhi\n'''") + end + + test "all sigil prefix kinds (~ ~b ~B ~s ~S) work with the same delimiters" do + for prefix <- ["~", "~b", "~B", "~s", "~S"] do + src = prefix <> "/hi/" + + assert [{:string, %{}, ^src}] = lex(src), + "expected #{prefix}/hi/ to lex as a single :string" + end + end + end + + describe "multi-quoted strings (OTP 27+)" do + test "triple-quoted string lexes as a single :string" 
do + assert [{:string, %{}, "\"\"\"\nfoo\n\"\"\""}] = lex("\"\"\"\nfoo\n\"\"\"") + end + + test "quadruple-quoted string lexes as a single :string" do + assert [{:string, %{}, "\"\"\"\"\nfoo\n\"\"\"\""}] = + lex("\"\"\"\"\nfoo\n\"\"\"\"") + end + + test "quadruple-quoted string can contain triple quotes in its body" do + # The whole point of using a quadruple opener: lets the body include + # `"""` literally without ending the string. + assert [{:string, %{}, body}] = + lex("\"\"\"\"\nhello \"\"\" inside\n\"\"\"\"") + + assert body =~ "\"\"\"" + end + + test "quintuple-quoted string can contain quadruple quotes in its body" do + assert [{:string, %{}, body}] = + lex("\"\"\"\"\"\nhi \"\"\"\" foo\n\"\"\"\"\"") + + assert body =~ "\"\"\"\"" + end + + test "escape sub-tokens still emitted inside quadruple-quoted strings" do + assert [ + {:string, %{}, "\"\"\"\"\nhi "}, + {:string_escape, %{}, "\\xFF"}, + {:string, %{}, " there\n\"\"\"\""} + ] = lex("\"\"\"\"\nhi \\xFF there\n\"\"\"\"") + end + + test "sigil prefixes work with quadruple-quoted strings" do + assert [{:string, %{}, "~b\"\"\"\"\nfoo\n\"\"\"\""}] = + lex("~b\"\"\"\"\nfoo\n\"\"\"\"") + + assert [{:string, %{}, "~B\"\"\"\"\nhello \"\"\" inside\n\"\"\"\""}] = + lex("~B\"\"\"\"\nhello \"\"\" inside\n\"\"\"\"") + + assert [{:string, %{}, "~\"\"\"\"\nhi\n\"\"\"\""}] = + lex("~\"\"\"\"\nhi\n\"\"\"\"") + end + end + describe "OTP-current module attribute coverage" do # The generic `module_attribute` rule accepts any `atom_name`, which # means new attributes ship without lexer changes. Lock the current From 0cf6946f449a5a89edb63f988d9fba98e5ea615e Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Lukas=20Backstr=C3=B6m?= Date: Wed, 6 May 2026 15:29:56 +0200 Subject: [PATCH 12/15] Add tests for -doc / -moduledoc attributes Doc attributes are nearly universal in OTP 27+ modules and the canonical use case for triple-quoted strings. 
Lock coverage of the common shapes: triple-quoted body, single-line string body, and a `-doc """..."""` attribute followed by a function clause (which exercises the boundary between the doc string close `"""` and the function head). Co-Authored-By: Claude Opus 4.7 (1M context) --- .../erlang_lexer_tokenizer_test.exs | 55 +++++++++++++++++++ 1 file changed, 55 insertions(+) diff --git a/test/makeup/erlang_lexer/erlang_lexer_tokenizer_test.exs b/test/makeup/erlang_lexer/erlang_lexer_tokenizer_test.exs index 9533765..ed0bbd3 100644 --- a/test/makeup/erlang_lexer/erlang_lexer_tokenizer_test.exs +++ b/test/makeup/erlang_lexer/erlang_lexer_tokenizer_test.exs @@ -858,6 +858,61 @@ defmodule ErlangLexerTokenizer do end end + describe "doc / moduledoc attributes (OTP 27+)" do + test "moduledoc with triple-quoted body" do + src = "-moduledoc \"\"\"\nThis module does X.\n\"\"\"" + + assert [ + {:punctuation, %{}, "-"}, + {:name_attribute, %{}, "moduledoc"}, + {:whitespace, %{}, " "}, + {:string, %{}, "\"\"\"\nThis module does X.\n\"\"\""} + ] = lex(src) + end + + test "doc attribute followed by a function definition" do + src = "-doc \"\"\"\nReturns true if X is positive.\n\"\"\".\nis_pos(X) when X > 0 -> true." 
+ + assert lex(src) == [ + {:punctuation, %{}, "-"}, + {:name_attribute, %{}, "doc"}, + {:whitespace, %{}, " "}, + {:string, %{}, "\"\"\"\nReturns true if X is positive.\n\"\"\""}, + {:punctuation, %{}, "."}, + {:whitespace, %{}, "\n"}, + {:name_function, %{}, "is_pos"}, + {:punctuation, %{group_id: "group-1"}, "("}, + {:name, %{}, "X"}, + {:punctuation, %{group_id: "group-1"}, ")"}, + {:whitespace, %{}, " "}, + {:keyword, %{}, "when"}, + {:whitespace, %{}, " "}, + {:name, %{}, "X"}, + {:whitespace, %{}, " "}, + {:operator, %{}, ">"}, + {:whitespace, %{}, " "}, + {:number_integer, %{}, "0"}, + {:whitespace, %{}, " "}, + {:punctuation, %{}, "->"}, + {:whitespace, %{}, " "}, + {:string_symbol, %{}, "true"}, + {:punctuation, %{}, "."} + ] + end + + test "doc with single-line string body still works" do + src = "-doc \"short\"." + + assert [ + {:punctuation, %{}, "-"}, + {:name_attribute, %{}, "doc"}, + {:whitespace, %{}, " "}, + {:string, %{}, "\"short\""}, + {:punctuation, %{}, "."} + ] = lex(src) + end + end + describe "OTP-current module attribute coverage" do # The generic `module_attribute` rule accepts any `atom_name`, which # means new attributes ship without lexer changes. Lock the current From bcad191ec5307c90d072f144990515603cd0c76a Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Lukas=20Backstr=C3=B6m?= Date: Wed, 6 May 2026 15:30:26 +0200 Subject: [PATCH 13/15] Add tests for function clauses with guards Function-head guards exercise the interaction between several rule families: keyword recognition (`when`), word operators (`andalso`, `orelse`), comparison operators (`>`, `<`, `=/=`), BIF recognition (`is_integer`, `is_atom`), and the comma/semicolon guard separator. Lock the common shapes so a regression in any one of those would surface as a guard test failure. 
Co-Authored-By: Claude Opus 4.7 (1M context) --- .../erlang_lexer_tokenizer_test.exs | 85 +++++++++++++++++++ 1 file changed, 85 insertions(+) diff --git a/test/makeup/erlang_lexer/erlang_lexer_tokenizer_test.exs b/test/makeup/erlang_lexer/erlang_lexer_tokenizer_test.exs index ed0bbd3..12871a7 100644 --- a/test/makeup/erlang_lexer/erlang_lexer_tokenizer_test.exs +++ b/test/makeup/erlang_lexer/erlang_lexer_tokenizer_test.exs @@ -749,6 +749,91 @@ defmodule ErlangLexerTokenizer do end end + describe "function clauses with guards" do + test "guard with operator and BIF" do + assert [ + {:name_function, %{}, "f"}, + {:punctuation, _, "("}, + {:name, %{}, "X"}, + {:punctuation, _, ")"}, + {:whitespace, %{}, " "}, + {:keyword, %{}, "when"}, + {:whitespace, %{}, " "}, + {:name, %{}, "X"}, + {:whitespace, %{}, " "}, + {:operator, %{}, ">"}, + {:whitespace, %{}, " "}, + {:number_integer, %{}, "0"}, + {:punctuation, %{}, ","}, + {:whitespace, %{}, " "}, + {:name_builtin, %{}, "is_integer"}, + {:punctuation, _, "("}, + {:name, %{}, "X"}, + {:punctuation, _, ")"}, + {:whitespace, %{}, " "}, + {:punctuation, %{}, "->"} | _ + ] = lex("f(X) when X > 0, is_integer(X) -> X * 2.") + end + + test "guard sequence with `;` (alternative guards)" do + assert lex("f(X) when X < 0; X > 100 -> out_of_range.") == [ + {:name_function, %{}, "f"}, + {:punctuation, %{group_id: "group-1"}, "("}, + {:name, %{}, "X"}, + {:punctuation, %{group_id: "group-1"}, ")"}, + {:whitespace, %{}, " "}, + {:keyword, %{}, "when"}, + {:whitespace, %{}, " "}, + {:name, %{}, "X"}, + {:whitespace, %{}, " "}, + {:operator, %{}, "<"}, + {:whitespace, %{}, " "}, + {:number_integer, %{}, "0"}, + {:punctuation, %{}, ";"}, + {:whitespace, %{}, " "}, + {:name, %{}, "X"}, + {:whitespace, %{}, " "}, + {:operator, %{}, ">"}, + {:whitespace, %{}, " "}, + {:number_integer, %{}, "100"}, + {:whitespace, %{}, " "}, + {:punctuation, %{}, "->"}, + {:whitespace, %{}, " "}, + {:string_symbol, %{}, "out_of_range"}, + {:punctuation, %{}, 
"."} + ] + end + + test "guard with word operators (`andalso`, `orelse`)" do + assert lex("f(X) when is_atom(X) andalso X =/= undefined -> ok.") == [ + {:name_function, %{}, "f"}, + {:punctuation, %{group_id: "group-1"}, "("}, + {:name, %{}, "X"}, + {:punctuation, %{group_id: "group-1"}, ")"}, + {:whitespace, %{}, " "}, + {:keyword, %{}, "when"}, + {:whitespace, %{}, " "}, + {:name_builtin, %{}, "is_atom"}, + {:punctuation, %{group_id: "group-2"}, "("}, + {:name, %{}, "X"}, + {:punctuation, %{group_id: "group-2"}, ")"}, + {:whitespace, %{}, " "}, + {:operator_word, %{}, "andalso"}, + {:whitespace, %{}, " "}, + {:name, %{}, "X"}, + {:whitespace, %{}, " "}, + {:operator, %{}, "=/="}, + {:whitespace, %{}, " "}, + {:string_symbol, %{}, "undefined"}, + {:whitespace, %{}, " "}, + {:punctuation, %{}, "->"}, + {:whitespace, %{}, " "}, + {:string_symbol, %{}, "ok"}, + {:punctuation, %{}, "."} + ] + end + end + describe "fun keyword vs function call" do test "fun(X) -> ... end tokenizes `fun` as keyword, not function name" do assert [ From 55660ff4f755124edc47616bb62e2020f6ba24be Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Lukas=20Backstr=C3=B6m?= Date: Wed, 6 May 2026 15:31:12 +0200 Subject: [PATCH 14/15] Add tests for map and bitstring comprehensions Map comprehensions (OTP 26) and bitstring comprehensions (pre-existing, but tests scarce) exercise the interactions between several operator and punctuation tokens that the lexer hasn't explicitly tested in combination: `=>` and `:=` next to `||`, `<-`, `<=`, and the `\#{...}` map-open punctuation. Also lock strict-generator `<:-` (OTP 27) coverage with an explicit positive test rather than the operator-list catch-all. 
Co-Authored-By: Claude Opus 4.7 (1M context) --- .../erlang_lexer_tokenizer_test.exs | 77 +++++++++++++++++++ 1 file changed, 77 insertions(+) diff --git a/test/makeup/erlang_lexer/erlang_lexer_tokenizer_test.exs b/test/makeup/erlang_lexer/erlang_lexer_tokenizer_test.exs index 12871a7..a8b0d97 100644 --- a/test/makeup/erlang_lexer/erlang_lexer_tokenizer_test.exs +++ b/test/makeup/erlang_lexer/erlang_lexer_tokenizer_test.exs @@ -834,6 +834,83 @@ defmodule ErlangLexerTokenizer do end end + describe "newer comprehensions (OTP 26 / 27)" do + test "list comprehension with strict generator (OTP 27)" do + assert lex("[X || X <:- L]") == [ + {:punctuation, %{group_id: "group-1"}, "["}, + {:name, %{}, "X"}, + {:whitespace, %{}, " "}, + {:punctuation, %{}, "||"}, + {:whitespace, %{}, " "}, + {:name, %{}, "X"}, + {:whitespace, %{}, " "}, + {:operator, %{}, "<:-"}, + {:whitespace, %{}, " "}, + {:name, %{}, "L"}, + {:punctuation, %{group_id: "group-1"}, "]"} + ] + end + + test "map comprehension (OTP 26)" do + # `#{K => V * 2 || K := V <- M}` exercises map-open `\#{`, + # map arrow `=>`, comprehension separator `||`, map match + # operator `:=`, and the list-generator operator `<-`. 
+ assert lex("\#{K => V * 2 || K := V <- M}") == [ + {:punctuation, %{group_id: "group-1"}, "\#{"}, + {:name, %{}, "K"}, + {:whitespace, %{}, " "}, + {:punctuation, %{}, "=>"}, + {:whitespace, %{}, " "}, + {:name, %{}, "V"}, + {:whitespace, %{}, " "}, + {:operator, %{}, "*"}, + {:whitespace, %{}, " "}, + {:number_integer, %{}, "2"}, + {:whitespace, %{}, " "}, + {:punctuation, %{}, "||"}, + {:whitespace, %{}, " "}, + {:name, %{}, "K"}, + {:whitespace, %{}, " "}, + {:punctuation, %{}, ":="}, + {:whitespace, %{}, " "}, + {:name, %{}, "V"}, + {:whitespace, %{}, " "}, + {:operator, %{}, "<-"}, + {:whitespace, %{}, " "}, + {:name, %{}, "M"}, + {:punctuation, %{group_id: "group-1"}, "}"} + ] + end + + test "bitstring comprehension with `<=` generator" do + # `<<>>` brackets, the bitstring-generator operator `<=`, and + # nested `<<X:8>>` segment patterns inside. + assert lex("<< <<X:8>> || <<X:8>> <= Bin >>") == [ + {:punctuation, %{group_id: "group-1"}, "<<"}, + {:whitespace, %{}, " "}, + {:punctuation, %{group_id: "group-2"}, "<<"}, + {:name, %{}, "X"}, + {:punctuation, %{}, ":"}, + {:number_integer, %{}, "8"}, + {:punctuation, %{group_id: "group-2"}, ">>"}, + {:whitespace, %{}, " "}, + {:punctuation, %{}, "||"}, + {:whitespace, %{}, " "}, + {:punctuation, %{group_id: "group-3"}, "<<"}, + {:name, %{}, "X"}, + {:punctuation, %{}, ":"}, + {:number_integer, %{}, "8"}, + {:punctuation, %{group_id: "group-3"}, ">>"}, + {:whitespace, %{}, " "}, + {:operator, %{}, "<="}, + {:whitespace, %{}, " "}, + {:name, %{}, "Bin"}, + {:whitespace, %{}, " "}, + {:punctuation, %{group_id: "group-1"}, ">>"} + ] + end + end + + describe "fun keyword vs function call" do + test "fun(X) -> ... 
end tokenizes `fun` as keyword, not function name" do assert [ From 3ec73ecb152cd53bd4a32388e65ea3c0d48ea7c9 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Lukas=20Backstr=C3=B6m?= Date: Wed, 6 May 2026 15:32:16 +0200 Subject: [PATCH 15/15] Add integration test exercising a small real Erlang module Most lexer tests are minimal isolated inputs that pin one rule's output. The richer interaction-shaped failures (a rule's order in the choice perturbing how a sibling rule fires) need a test that threads many features through one input. Add a small module fragment that combines: * `-module` / `-export` attributes * a `-doc """..."""` doc attribute with multi-line body * a function head with a `when` guard and BIF call * a map comprehension (`#{K => V || K := V <- M, ...}`) * a body with comparison operator and number If a future change breaks any of those rules' interactions, this test catches it whereas the per-feature tests would still pass. Co-Authored-By: Claude Opus 4.7 (1M context) --- .../erlang_lexer_tokenizer_test.exs | 97 +++++++++++++++++++ 1 file changed, 97 insertions(+) diff --git a/test/makeup/erlang_lexer/erlang_lexer_tokenizer_test.exs b/test/makeup/erlang_lexer/erlang_lexer_tokenizer_test.exs index a8b0d97..c4baa6b 100644 --- a/test/makeup/erlang_lexer/erlang_lexer_tokenizer_test.exs +++ b/test/makeup/erlang_lexer/erlang_lexer_tokenizer_test.exs @@ -911,6 +911,103 @@ defmodule ErlangLexerTokenizer do end end + describe "real-world module fragment (integration)" do + # Exercises module attribute, doc string, function head with guard, + # body with a map comprehension and BIF calls. If any rule's choice + # order gets perturbed, this is the test most likely to catch it. + test "small module with -doc, guard, map, and BIF call" do + src = """ + + -module(positives). + -export([keep/1]). + + -doc \"\"\" + Keep map entries whose values are positive integers. + \"\"\". + keep(M) when is_map(M) -> + \#{K => V || K := V <- M, is_integer(V), V > 0}. 
+ """ + + assert lex(src) == [ + {:whitespace, %{}, "\n"}, + {:punctuation, %{}, "-"}, + {:name_attribute, %{}, "module"}, + {:punctuation, %{group_id: "group-1"}, "("}, + {:string_symbol, %{}, "positives"}, + {:punctuation, %{group_id: "group-1"}, ")"}, + {:punctuation, %{}, "."}, + {:whitespace, %{}, "\n"}, + {:punctuation, %{}, "-"}, + {:name_attribute, %{}, "export"}, + {:punctuation, %{group_id: "group-2"}, "("}, + {:punctuation, %{group_id: "group-3"}, "["}, + {:string_symbol, %{}, "keep"}, + {:punctuation, %{}, "/"}, + {:number_integer, %{}, "1"}, + {:punctuation, %{group_id: "group-3"}, "]"}, + {:punctuation, %{group_id: "group-2"}, ")"}, + {:punctuation, %{}, "."}, + {:whitespace, %{}, "\n"}, + {:whitespace, %{}, "\n"}, + {:punctuation, %{}, "-"}, + {:name_attribute, %{}, "doc"}, + {:whitespace, %{}, " "}, + {:string, %{}, + "\"\"\"\nKeep map entries whose values are positive integers.\n\"\"\""}, + {:punctuation, %{}, "."}, + {:whitespace, %{}, "\n"}, + {:name_function, %{}, "keep"}, + {:punctuation, %{group_id: "group-4"}, "("}, + {:name, %{}, "M"}, + {:punctuation, %{group_id: "group-4"}, ")"}, + {:whitespace, %{}, " "}, + {:keyword, %{}, "when"}, + {:whitespace, %{}, " "}, + {:name_builtin, %{}, "is_map"}, + {:punctuation, %{group_id: "group-5"}, "("}, + {:name, %{}, "M"}, + {:punctuation, %{group_id: "group-5"}, ")"}, + {:whitespace, %{}, " "}, + {:punctuation, %{}, "->"}, + {:whitespace, %{}, "\n "}, + {:punctuation, %{group_id: "group-6"}, "\#{"}, + {:name, %{}, "K"}, + {:whitespace, %{}, " "}, + {:punctuation, %{}, "=>"}, + {:whitespace, %{}, " "}, + {:name, %{}, "V"}, + {:whitespace, %{}, " "}, + {:punctuation, %{}, "||"}, + {:whitespace, %{}, " "}, + {:name, %{}, "K"}, + {:whitespace, %{}, " "}, + {:punctuation, %{}, ":="}, + {:whitespace, %{}, " "}, + {:name, %{}, "V"}, + {:whitespace, %{}, " "}, + {:operator, %{}, "<-"}, + {:whitespace, %{}, " "}, + {:name, %{}, "M"}, + {:punctuation, %{}, ","}, + {:whitespace, %{}, " "}, + {:name_builtin, %{}, 
"is_integer"}, + {:punctuation, %{group_id: "group-7"}, "("}, + {:name, %{}, "V"}, + {:punctuation, %{group_id: "group-7"}, ")"}, + {:punctuation, %{}, ","}, + {:whitespace, %{}, " "}, + {:name, %{}, "V"}, + {:whitespace, %{}, " "}, + {:operator, %{}, ">"}, + {:whitespace, %{}, " "}, + {:number_integer, %{}, "0"}, + {:punctuation, %{group_id: "group-6"}, "}"}, + {:punctuation, %{}, "."}, + {:whitespace, %{}, "\n"} + ] + end + end + describe "fun keyword vs function call" do test "fun(X) -> ... end tokenizes `fun` as keyword, not function name" do assert [