macOS: make Ime::Preedit cursor range surrogate-safe in setMarkedText #4596

Open
William-Selna wants to merge 1 commit into rust-windowing:master from William-Selna:fix/macos-ime-preedit-surrogate-cursor

@William-Selna William-Selna commented Jun 12, 2026

The macOS backend converts the IME's UTF-16 selectedRange into the UTF-8 byte offsets in Ime::Preedit by taking substringToIndex: prefixes and measuring them with NSString::len() (lengthOfBytesUsingEncoding:NSUTF8StringEncoding). When an offset lands inside a surrogate pair, the prefix ends in a lone high surrogate, which UTF-8 can't represent, and lengthOfBytesUsingEncoding: returns 0 for the whole prefix instead of the length of the convertible part, so the offset collapses to 0.

For preedit "a😀b":

  • selectedRange {2, 0} emits Some((0, 0)) instead of (1, 1) — the cursor jumps to the start and the valid "a" prefix is dropped.
  • selectedRange {1, 1} emits Some((1, 0)), i.e. lower > upper. Highlighting the active clause with &preedit[range.0..range.1]
    then panics: byte index starts at 1 but ends at 0.
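
The lone-surrogate situation can be shown without AppKit. The following is only an illustrative sketch in plain Rust, not the backend code: it uses the standard library's strict `String::from_utf16` as a stand-in for a conversion that rejects a lone surrogate, and `preedit_utf16_units` and `utf16_prefix_is_convertible` are hypothetical helper names.

```rust
/// UTF-16 code units of the example preedit "a😀b".
/// The emoji (U+1F600) occupies two units: a high and a low surrogate.
fn preedit_utf16_units() -> Vec<u16> {
    "a😀b".encode_utf16().collect()
}

/// Returns true if the first `utf16_len` code units of `s` are well-formed
/// UTF-16, i.e. the prefix does not end in a lone high surrogate.
/// Hypothetical helper, for illustration only.
fn utf16_prefix_is_convertible(s: &str, utf16_len: usize) -> bool {
    let units: Vec<u16> = s.encode_utf16().collect();
    let end = utf16_len.min(units.len());
    // `String::from_utf16` fails on unpaired surrogates.
    String::from_utf16(&units[..end]).is_ok()
}
```

A prefix of length 2 ends between the two surrogates of 😀, which is the case that collapsed to 0 in the examples above; prefixes of length 1 and 3 end on character boundaries.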

This is the macOS counterpart of #3967, which was fixed for Windows in #4201; the macOS backend has its own copy of the conversion that never got the same fix.

The fix drops the substringToIndex: calls and maps the UTF-16 offset to a UTF-8 offset over the already-decoded Rust string, snapping a mid-surrogate offset down to the char boundary and clamping out-of-range offsets to s.len():

fn utf16_to_utf8_offset(s: &str, utf16_offset: usize) -> usize {
    let mut utf16_pos = 0;
    for (utf8_pos, ch) in s.char_indices() {
        if utf16_pos >= utf16_offset {
            return utf8_pos;
        }
        utf16_pos += ch.len_utf16();
        if utf16_pos > utf16_offset {
            return utf8_pos; // offset splits this char's surrogate pair
        }
    }
    s.len()
}
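
As a rough usage sketch, the helper can be applied to both ends of the selected range. `utf16_range_to_utf8` is a hypothetical wrapper name, and the use of `saturating_add` for the end offset is an assumption made here, not necessarily how the PR computes it; the helper is repeated so the block stands alone.

```rust
// Copy of the helper from the description, repeated for self-containment.
fn utf16_to_utf8_offset(s: &str, utf16_offset: usize) -> usize {
    let mut utf16_pos = 0;
    for (utf8_pos, ch) in s.char_indices() {
        if utf16_pos >= utf16_offset {
            return utf8_pos;
        }
        utf16_pos += ch.len_utf16();
        if utf16_pos > utf16_offset {
            return utf8_pos; // offset splits this char's surrogate pair
        }
    }
    s.len()
}

/// Hypothetical wrapper: maps a UTF-16 (location, length) pair, as carried by
/// an NSRange, to a UTF-8 byte range (start, end) within `s`.
fn utf16_range_to_utf8(s: &str, location: usize, length: usize) -> (usize, usize) {
    // Assumption: saturate instead of overflowing on very large inputs.
    let end = location.saturating_add(length);
    (utf16_to_utf8_offset(s, location), utf16_to_utf8_offset(s, end))
}
```

For "a😀b", the two ranges from the description both map to `(1, 1)`, and a range covering the whole emoji, `{1, 2}`, maps to `(1, 5)`, so slicing the preedit with the result stays on char boundaries.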

The helper is monotone non-decreasing in the UTF-16 offset, and an NSRange's location is never greater than its end (location + length), so the emitted range can't come out inverted. On well-formed input the result is byte-for-byte identical to the old code. And since it no longer calls any NSString range API, it also covers the out-of-bounds case the .min(len) clamp in rust-windowing/winit#4494 was added for (no NSRangeException possible) and drops two ObjC allocations per preedit update.

Malformed offsets here aren't hypothetical: rust-windowing/winit#4494 added that clamp because the native Pinyin IME sends out-of-range indices (alacritty/alacritty#8791), and rust-windowing/winit#3967 came from a non-BMP preedit (a 🐖 from the Microsoft Japanese IME) corrupting the range. The same kind of input, off by one code unit, lands mid-pair on macOS.

Tests are a #[cfg(test)] mod tests of pure functions (no AppKit, so they run on any host): the mid-surrogate snap, the previously-inverted and previously-collapsed cases, the out-of-bounds clamp, identity on well-formed input, and a monotonicity sweep. cargo build / cargo clippy -p winit-appkit are clean on macOS.
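
For readers who want to check the properties themselves, here is a minimal sketch of two of them as plain predicates. These are not the tests from the PR; `is_monotone` and `lands_on_char_boundary` are hypothetical names, and the helper is repeated so the block is self-contained.

```rust
// Copy of the helper from the description, repeated for self-containment.
fn utf16_to_utf8_offset(s: &str, utf16_offset: usize) -> usize {
    let mut utf16_pos = 0;
    for (utf8_pos, ch) in s.char_indices() {
        if utf16_pos >= utf16_offset {
            return utf8_pos;
        }
        utf16_pos += ch.len_utf16();
        if utf16_pos > utf16_offset {
            return utf8_pos; // offset splits this char's surrogate pair
        }
    }
    s.len()
}

/// Sweeps every UTF-16 offset up to a little past the end of `s` and checks
/// that the mapped UTF-8 offset never decreases.
fn is_monotone(s: &str) -> bool {
    let max = s.encode_utf16().count() + 2;
    (0..max).all(|i| utf16_to_utf8_offset(s, i) <= utf16_to_utf8_offset(s, i + 1))
}

/// Checks that every mapped offset is a valid `str` slicing position,
/// including mid-surrogate and out-of-bounds inputs.
fn lands_on_char_boundary(s: &str) -> bool {
    let max = s.encode_utf16().count() + 2;
    (0..=max).all(|i| s.is_char_boundary(utf16_to_utf8_offset(s, i)))
}
```

Together the two predicates imply that slicing `s` with a mapped range whose UTF-16 start is no greater than its UTF-16 end does not panic.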

…ed_text

`setMarkedText:selectedRange:replacementRange:` converted the IME's
UTF-16 `selectedRange` into UTF-8 byte offsets by taking
`substringToIndex:` prefixes and measuring them with
`NSString::len()` (`lengthOfBytesUsingEncoding:NSUTF8StringEncoding`).
When an index falls inside a surrogate pair, the prefix ends in a lone
high surrogate, which UTF-8 cannot represent, so
`lengthOfBytesUsingEncoding:` returns 0 for the entire prefix and the
offset silently collapses to 0.

For preedit "a😀b":
- `selectedRange {2,0}` emitted `Some((0, 0))` instead of `Some((1, 1))`,
  discarding the valid "a" prefix (cursor jumps to the start).
- `selectedRange {1,1}` emitted `Some((1, 0))` — an inverted range with
  `lower > upper`, which panics the natural consumer pattern
  `&preedit[range.0..range.1]` used to highlight the selected clause.

Replace the `substringToIndex:`/`len()` conversion with a pure, total
helper `utf16_to_utf8_offset` that walks the already-materialized Rust
`String`, snapping an offset that would split a surrogate pair down to
the character boundary and clamping an out-of-bounds offset to the
string length. The helper is monotone non-decreasing, so the emitted
range can never be inverted; the happy path (well-formed boundary
indices) is byte-for-byte unchanged.

Because no `NSString` range API is called anymore, this also subsumes
the out-of-bounds `.min(len)` clamp added in rust-windowing#4494 (no `NSRangeException`
is possible) and drops two Objective-C string allocations per preedit
update.

This is the macOS twin of the Windows bug reported in rust-windowing#3967 and fixed by
rust-windowing#4201; the macOS backend has its own independent copy of the conversion
that was never given the equivalent fix.

Adds pure unit tests (no AppKit) covering the mid-surrogate snap, the
previously-inverted and previously-collapsed cases, the out-of-bounds
clamp, identity on well-formed inputs, and a monotonicity sweep.
@William-Selna William-Selna requested a review from madsmtm as a code owner June 12, 2026 21:09