Skip to content

Improvements to the safety of boa_string#5385

Merged
hansl merged 4 commits into
boa-dev:mainfrom
Manishearth:safety-improvements
May 31, 2026
Merged

Improvements to the safety of boa_string#5385
hansl merged 4 commits into
boa-dev:mainfrom
Manishearth:safety-improvements

Conversation

@Manishearth

@Manishearth Manishearth commented May 29, 2026

Copy link
Copy Markdown
Contributor

I had it on my todo list to unsafe review boa_string. I didn't get around to it: it's a lot of unsafe code, but I did spend some time asking an appropriately primed agent to perform a review, and it had some findings.

It found some issues around ascii-checking.

Full report:

Details

Unsafe Rust Review

This review performs a first pass safety audit of the unsafe code in boa_string v0.21.1), applying the Rust Safety Comments guidelines.

Executive Summary

The codebase generally follows good safety practices, including many explicit // SAFETY: comments. However, we identified:

  1. One logical inconsistency in Segment::is_ascii which affects safe builder behavior (confusing behavior and missed optimizations, though technically sound).
  2. One severe latent bug in JsStringBuilder::<u16>::is_ascii which returns incorrect results for non-ASCII UTF-16 characters.
  3. A few undocumented/under-documented unsafe assumptions (e.g., unwrap_unchecked in layout math, redundant-looking transmute).
  4. Opportunities to improve safety comments to be more "locally verifiable" as per the guidelines.

Detailed Findings by File

1. third_party/rust/boa_string/v0_21/src/builder.rs

🔴 Inconsistent Segment::is_ascii (Logical Inconsistency)

  • Location: builder.rs:L616-624

  • Code:

    fn is_ascii(&self) -> bool {
        match self {
            Segment::String(s) => s.as_str().is_latin1(),
            Segment::Str(s) => s.is_latin1(),
            Segment::Latin1(b) => *b <= 0x7f,
            Segment::CodePoint(ch) => *ch as u32 <= 0x7F,
        }
    }
  • Analysis:
    This method is highly inconsistent:

    • For String and Str segments, it checks is_latin1(), which returns true for any Latin1 string (values 0..=255), including non-ASCII characters (e.g., é / 0xEB).
    • For Latin1 (u8) and CodePoint (char) segments, it strictly checks for ASCII (<= 0x7F).

    This causes inconsistent behavior in safe CommonJsStringBuilder::build():

    • If you have a Segment::String containing non-ASCII Latin1, is_ascii() returns true, and it builds via build_as_latin1.
    • If you have a Segment::Latin1(0xEB) (same character), is_ascii() returns false, forcing a fallback to UTF-16 encoding.

    Safety Impact Evaluation:
    While this method acts as a guard for the unsafe function build_as_latin1(), this inconsistency does not lead to unsoundness (UB):

    • The "permissive" case (allowing non-ASCII String/Str) is safe because these segments are already Latin1 and fit perfectly into u8 without loss of representation or truncation.
    • The "strict" cases (restricting Latin1/CodePoint to ASCII) are overly conservative but safe. They merely cause a safe fallback to UTF-16.
  • Recommendation:
    Rename is_ascii to can_be_latin1 (or similar) and make the checks consistent:

    fn can_be_latin1(&self) -> bool {
        match self {
            Segment::String(s) => s.as_str().is_latin1(),
            Segment::Str(s) => s.is_latin1(),
            Segment::Latin1(_) => true, // u8 is always valid Latin1
            Segment::CodePoint(ch) => *ch as u32 <= 0xFF, // Latin1 range
        }
    }

    If the intent was strictly ASCII, then String and Str segments must be scanned to ensure they contain no characters > 0x7F.


🔴 Severe Bug in JsStringBuilder::<u16>::is_ascii (Potential Unsoundness)

  • Location: builder.rs:L309-317

  • Code:

    pub fn is_ascii(&self) -> bool {
        // SAFETY:
        // `NonNull` verified for us that the pointer returned by `alloc` is valid,
        // meaning we can read to its pointed memory.
        let data = unsafe {
            std::slice::from_raw_parts(self.data().cast::<u8>(), self.allocated_data_byte_len())
        };
        data.is_ascii()
    }
  • Analysis:
    When D = u16 (i.e., Utf16JsStringBuilder), this function casts the raw *mut u16 data pointer to *const u8 and checks if the raw bytes are ASCII.
    This is broken on Little Endian systems for certain non-ASCII UTF-16 characters:

    • Example: The character 0x0100 (Ā) is represented in memory as bytes [0x00, 0x01].
    • Since both 0x00 and 0x01 are < 128, data.is_ascii() will return true.
    • Thus, the builder falsely claims a non-ASCII string is ASCII.

    Safety Impact Evaluation:
    Currently, this method is only called on Latin1JsStringBuilder (JsStringBuilder<u8>), where it behaves correctly. However, because JsStringBuilder is generic and this method is pub, this is a critical latent bug. If future code or external users rely on is_ascii() to guard unsafe operations on Utf16JsStringBuilder (e.g., assuming it can be safely downgraded to Latin1/u8), it would lead to silent data corruption or UB.

  • Recommendation:
    Move is_ascii to a specialized impl JsStringBuilder<u8> block so it cannot be called on the u16 variant, or implement it correctly for u16 (e.g., checking if all u16 elements are <= 0x7F).


🟡 Missing Safety Justification for unwrap_unchecked

  • Location: builder.rs:L124-125
  • Code:
    Layout::for_value(self.inner.as_ref())
        .extend(Layout::array::<D>(self.capacity()).unwrap_unchecked())
        .unwrap_unchecked()
  • Analysis:
    unwrap_unchecked is used on Layout::array and Layout::extend. This is unsafe if they return Err (which happens on overflow).
    • Why it is likely safe: This layout is reconstructed for an already allocated object (self.inner). Since it was successfully allocated previously (via new_layout), we know the capacity and type size do not overflow.
    • Required Action: Add a // SAFETY: comment explaining this invariant.
  • Draft Safety Comment:
    // SAFETY:
    // 1. `self.inner` is verified to be allocated.
    // 2. `unwrap_unchecked` is safe because this layout was successfully 
    //    allocated previously with the same capacity, so it cannot overflow.

🟢 extend_from_slice_unchecked Pointers copy

  • Location: builder.rs:L202-208
  • Code:
    pub const unsafe fn extend_from_slice_unchecked(&mut self, v: &[D]) {
        // SAFETY: Caller should ensure the capacity is large enough to hold elements.
        unsafe {
            ptr::copy_nonoverlapping(v.as_ptr(), self.data().add(self.len()), v.len());
        }
        self.len += v.len();
    }
  • Analysis:
    The safety comment is correct but could be more thorough by addressing the other preconditions of copy_nonoverlapping (alignment, overlap):
  • Draft Safety Comment:
    // SAFETY:
    // 1. Caller must ensure `self.len() + v.len() <= self.capacity()` so the destination pointer is in-bounds.
    // 2. Pointers are aligned: `v` is aligned by Rust's slice guarantee; `self.data()` is aligned because the allocation layout was padded to `D`'s alignment.
    // 3. Regions do not overlap because `v` is an immutable reference and `self` is an exclusive mutable reference.

2. third_party/rust/boa_string/v0_21/src/common.rs

🟡 Redundant/Confusing transmute in get_string

  • Location: common.rs:L73-74
  • Code:
    let str = *RAW_STATICS_CACHE.get(string)?;
    
    // SAFETY: Type of T in is `&'static JsStr<'static>`, so this is safe.
    let ptr = unsafe { std::mem::transmute::<&JsStr<'_>, &'static JsStr<'static>>(str) };
  • Analysis:
    • RAW_STATICS_CACHE is defined as LazyLock<FxHashSet<&'static JsStr<'static>>>.
    • RAW_STATICS_CACHE.get should return Option<&&'static JsStr<'static>>.
    • Dereferencing it once (*) should yield &'static JsStr<'static>.
    • If str is already &'static JsStr<'static>, the transmute from &JsStr<'_> to &'static JsStr<'static> is redundant and confusing.
    • If the compiler forced a shorter lifetime due to type inference, the safety relies on the fact that the value definitely came from the static cache.
  • Recommendation:
    1. Try to remove the transmute and see if it compiles. It might require binding the cache deref to a static lifetime variable first:
      let cache: &'static FxHashSet<&'static JsStr<'static>> = &*RAW_STATICS_CACHE;
      let str = *cache.get(string)?;
      // `str` should now have `'static` lifetime naturally.
    2. If transmute is indeed required, clarify the safety comment:
      // SAFETY: `str` is retrieved from `RAW_STATICS_CACHE` which only contains 
      // references to static memory (`&'static JsStr<'static>`), so promoting 
      // the lifetime to `'static` is sound.

3. third_party/rust/boa_string/v0_21/src/str.rs

  • Status: PASS
  • JsStr uses std::slice::from_raw_parts to reconstruct slices from raw pointers in variant() and as_latin1().
  • Justification: These are well-guarded by the is_latin1 tag and the fact that JsStr is constructed from valid Rust slices with lifetime 'a (which is preserved in JsStr<'a>).
  • The Sync and Send implementations are correctly documented and sound (the type is immutable).

@Manishearth Manishearth requested a review from a team as a code owner May 29, 2026 04:51
@Manishearth Manishearth requested a review from jedel1043 May 29, 2026 04:51
@github-actions github-actions Bot added the Waiting On Review Waiting on reviews from the maintainers label May 29, 2026
@github-actions github-actions Bot added this to the v1.0.0 milestone May 29, 2026
@Manishearth Manishearth force-pushed the safety-improvements branch from 68f925f to 58ba1c2 Compare May 29, 2026 04:52
@github-actions

Copy link
Copy Markdown

Test262 conformance changes

Test result main count PR count difference
Total 53,125 53,125 0
Passed 51,071 51,072 +1
Ignored 1,482 1,482 0
Failed 572 571 -1
Panics 0 0 0
Conformance 96.13% 96.14% +0.00%
Fixed tests (1):
test/staging/sm/Math/acosh-approx.js (previously Failed)

Tested main commit: 8f5ef6542d641fd22320e51234e914b59e623717
Tested PR commit: 58ba1c2a7509f3d0950c4781d483527c711d9966
Compare commits: 8f5ef65...58ba1c2

@codecov

codecov Bot commented May 29, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 80.00000% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 60.21%. Comparing base (6ddc2b4) to head (58ba1c2).
⚠️ Report is 967 commits behind head on main.

Files with missing lines Patch % Lines
core/string/src/builder.rs 80.00% 2 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff             @@
##             main    #5385       +/-   ##
===========================================
+ Coverage   47.24%   60.21%   +12.97%     
===========================================
  Files         476      566       +90     
  Lines       46892    63023    +16131     
===========================================
+ Hits        22154    37951    +15797     
- Misses      24738    25072      +334     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@Manishearth

Copy link
Copy Markdown
Contributor Author

Idk why the test is failing, could that be intermittent somehow?

@hansl hansl added this pull request to the merge queue May 31, 2026
Merged via the queue into boa-dev:main with commit 7f2da65 May 31, 2026
35 of 37 checks passed
@github-actions github-actions Bot removed the Waiting On Review Waiting on reviews from the maintainers label May 31, 2026
@Manishearth Manishearth deleted the safety-improvements branch May 31, 2026 07:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants