macOS: make Ime::Preedit cursor range surrogate-safe in setMarkedText#4596
Open
William-Selna wants to merge 1 commit into
Open
macOS: make Ime::Preedit cursor range surrogate-safe in setMarkedText#4596William-Selna wants to merge 1 commit into
William-Selna wants to merge 1 commit into
Conversation
…ed_text
`setMarkedText:selectedRange:replacementRange:` converted the IME's
UTF-16 `selectedRange` into UTF-8 byte offsets by taking
`substringToIndex:` prefixes and measuring them with
`NSString::len()` (`lengthOfBytesUsingEncoding:NSUTF8StringEncoding`).
When an index falls inside a surrogate pair, the prefix ends in a lone
high surrogate, which UTF-8 cannot represent, so
`lengthOfBytesUsingEncoding:` returns 0 for the entire prefix and the
offset silently collapses to 0.
For preedit "a😀b":
- `selectedRange {2,0}` emitted `Some((0, 0))` instead of `Some((1, 1))`,
discarding the valid "a" prefix (cursor jumps to the start).
- `selectedRange {1,1}` emitted `Some((1, 0))` — an inverted range with
`lower > upper`, which panics the natural consumer pattern
`&preedit[range.0..range.1]` used to highlight the selected clause.
Replace the `substringToIndex:`/`len()` conversion with a pure, total
helper `utf16_to_utf8_offset` that walks the already-materialized Rust
`String`, snapping an offset that would split a surrogate pair down to
the character boundary and clamping an out-of-bounds offset to the
string length. The helper is monotone non-decreasing, so the emitted
range can never be inverted; the happy path (well-formed boundary
indices) is byte-for-byte unchanged.
Because no `NSString` range API is called anymore, this also subsumes
the out-of-bounds `.min(len)` clamp added in rust-windowing#4494 (no `NSRangeException`
is possible) and drops two Objective-C string allocations per preedit
update.
This is the macOS twin of the Windows bug reported in rust-windowing#3967 and fixed by
rust-windowing#4201; the macOS backend has its own independent copy of the conversion
that was never given the equivalent fix.
Adds pure unit tests (no AppKit) covering the mid-surrogate snap, the
previously-inverted and previously-collapsed cases, the out-of-bounds
clamp, identity on well-formed inputs, and a monotonicity sweep.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
The macOS backend turns the IME's UTF-16 selectedRange into the UTF-8 byte offsets in Ime::Preedit by slicing substringToIndex: prefixes and measuring them with NSString::len() (lengthOfBytesUsingEncoding:NSUTF8StringEncoding). When an offset lands inside a surrogate pair the prefix ends in a lone high surrogate, which UTF-8 can't represent, and lengthOfBytesUsingEncoding: returns 0 for the whole prefix instead of the convertible part — so the offset collapses to 0.
For preedit "a😀b":
then panics: byte index starts at 1 but ends at 0.
This is the macOS counterpart of #3967 which was fixed for Windows in #4201; the macOS backend has its own copy of the conversion that never got the same fix.
The fix drops the substringToIndex: calls and maps the UTF-16 offset to a UTF-8 offset over the already-decoded Rust string, snapping a mid-surrogate offset down to the char boundary and clamping out-of-range offsets to s.len():
It's monotone and location <= end for an NSRange, so the range can't come out inverted. On well-formed input it's byte-for-byte identical to the old code. And since it no longer calls any NSString range API, it also covers the out-of-bounds case the .min(len) clamp in rust-windowing/winit#4494 was added for (no NSRangeException possible) and drops two ObjC allocations per preedit update.
Malformed offsets here aren't hypothetical: rust-windowing/winit#4494 added that clamp because the native Pinyin IME sends out-of-range indices (alacritty/alacritty#8791), and rust-windowing/winit#3967 came from a non-BMP preedit (a 🐖 from the Microsoft Japanese IME) corrupting the range — same shape, one code unit over, lands mid-pair on macOS.
Tests are a #[cfg(test)] mod tests of pure functions (no AppKit, so they run on any host): the mid-surrogate snap, the previously-inverted and previously-collapsed cases, the out-of-bounds clamp, identity on well-formed input, and a monotonicity sweep. cargo build / cargo clippy -p winit-appkit are clean on macOS.