refactor(otel-thread-ctx): inline thread-local resolution#2129
refactor(otel-thread-ctx): inline thread-local resolution#2129cataphract wants to merge 3 commits into
Conversation
|
Haven't looked at it yet, but stupid question: if we're going the route of arhitecture-specific code, and TLSDESC is the default on arch, couldn't we just have a native Rust thread-local in this case and thus avoid entirely messing with assembly in the arm64 case? Offering of course the same API in both cases |
@yannham We could if the feature was stable. See https://play.rust-lang.org/?version=stable&mode=debug&edition=2024&gist=8030d0cb93bd361efe963e035d8005d5 |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 0cc8540401
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
|
@cataphract sorry, I don't understand what you mean. Which feature are you thinking of? On my side, I meant using standard Rust thread-locals, without any |
@yannham but afaik there is no way to produce an exported unmangled symbol with thread_local!. The macro expands to something like (for a const initializer): pub const SLOT: LocalKey<Cell<*mut u8>> = {
const __RUST_STD_INTERNAL_INIT: Cell<*mut u8> = Cell::new(std::ptr::null_mut());
unsafe {
LocalKey::new(const {
if std::mem::needs_drop::<Cell<*mut u8>>() {
|_| {
#[thread_local]
static __RUST_STD_INTERNAL_VAL: EagerStorage<Cell<*mut u8>> =
EagerStorage::new(__RUST_STD_INTERNAL_INIT);
__RUST_STD_INTERNAL_VAL.get()
}
} else {
|_| {
#[thread_local]
static __RUST_STD_INTERNAL_VAL: Cell<*mut u8> = __RUST_STD_INTERNAL_INIT;
&__RUST_STD_INTERNAL_VAL
}
}
})
}
};Because your type is a pointer we take the else branch, which makes it very close to what we want. But we still need the thread local symbol to be named exactly |
|
Ah, I didn't think of the symbol name indeed... Too bad, but makes sense, thanks 👍 |
0cc8540 to
e515ba8
Compare
🎉 All green!🧪 All tests passed 🎯 Code Coverage (details) 🔗 Commit SHA: d71783d | Docs | Datadog PR Page | Give us feedback! |
📚 Documentation Check Results📦
|
Clippy Allow Annotation ReportComparing clippy allow annotations between branches:
Summary by Rule
Annotation Counts by File
Annotation Stats by Crate
About This ReportThis report tracks Clippy allow annotations for specific rules, showing how they've changed in this PR. Decreasing the number of these annotations generally improves code quality. |
🔒 Cargo Deny Results📦
|
e515ba8 to
ee8bb2e
Compare
Artifact Size Benchmark Reportaarch64-alpine-linux-musl
aarch64-unknown-linux-gnu
libdatadog-x64-windows
libdatadog-x86-windows
x86_64-alpine-linux-musl
x86_64-unknown-linux-gnu
|
yannham
left a comment
There was a problem hiding this comment.
In general, putting inline ASM is rather scary and fragile. However, I can see how it makes everything simpler on the build side, and remove a lot of code and scripts, which is appealing. It also gives guaranteed inlining without any toolchain reauirement. Since TLSDESC sequences are very simple and are likely to be very very stable (since the linker, the dynamic loader and the compiler must all agree), I suppose it's a reasonable approach.
I've tested the dynamic build and a static build with TLS relaxation on x86 (I did cross-compile for aarch64 as well and tested both, but had to take some detour there; the CI will test the shared build at least anyway). The ASM sequences look canonical, to the best of my knowledge.
I'm requiring changes only for the potentially missing align for otel_thread_ctx_v1 symbol, but otherwise, looks fine!
| #[inline(always)] | ||
| unsafe fn tls_slot() -> *mut *mut ThreadContextRecord { | ||
| let ptr: usize; | ||
| core::arch::asm!( |
There was a problem hiding this comment.
Nit: we should probably put a big WARNING scary comment around the TLS asm sequence here and lower. I've played with some harmless variations and linker relaxation is actually really sensitive to the precise instructions used. So, basically, tell people to just never touch this or they might break TLS relaxation.
There was a problem hiding this comment.
I can do that, but what kind of relaxation do you expect?
Is this library supposed to be linked statically? If so, the linker will relax accesses to local-exec. We need to tell the linker to export otel_thread_ctx_v1 explicitly. Even then, in the test I made, this generated a relocation for a simple offset in the GOT (R_X86_64_TPOFF64, for an initial-exec like access, I imagine), so no TLSDESC.
If it's supposed to be linked against the .so, then I don't see what relaxation we could have.
There was a problem hiding this comment.
Is this library supposed to be linked statically
Yes, in the case of a Rust tracer/profiler typically (dd-trace-rs). The OTel spec mentions TLSDESC but it's been clarified to mean "the TLSDESC dialect as implemented by the compiler": in the statically linked case, this will be relaxed to local exec which is fine and supported. We could want the relaxation for improved perf here (even if keeping the slower generic TLSDESC sequence would still be correct).
But I believe there's worse than the perf question: I've played a bit with the assembly here (to avoid offsetting xo two times), and Claude affirms (I haven't gone as far as testing it though) that my innocent-looking refactoring completely broke the relaxation, though the effect/semantics were exactly the same. It made some instructions not visible to the linker anymore (no associated relocation). Since the linker can't really consider such cases and still patch the instructions that are associated with the relocation, this results in a completely broken partial relaxation which loads a gibberish memory address during the sequence.
In general having inline ASM pretending it's a compiler-generated TLS access sequence is already quite fragile and dangereous. I feel like the current one is ok after thorough review, but I would rather not have anyone touch this until they understand very, very well what they're doing (if the comment can also scare off LLMs that's probably a good thing). Many things could break in many different ways down the line.
There was a problem hiding this comment.
(and, I must add, because I've tried very hard some time ago when I initially misunderstood the spec: it's just impossible right now to prevent LLD from performing TLS relaxation, so there's no way to say "hey I know this looks strange but please keep my TLSDESC intact at link time". There's just no knob, so we can't prevent linker relaxation in a reasonable setup to happen anway, which is why the spec accepts relaxed accesses like local exec, or being spec compliant with a statically linked executable would just be a nightmare)
There was a problem hiding this comment.
I feel like the spec is confusing by mixing the concepts of exporting a symbol and its access model/relocation.
-
It does make sense, if this rust cdylib is consumed as shared library, to both say that 1) it should export the symbol and 2) use the TLSDESC model, because only TLSDESC (and general dynamic) allow for the symbol to accessed when the library is dlopen()ed (with a minor exception that's only applicable to glibc anyway).
-
If it is consumed as a static library by an executable, it's not even clear why there should be a relocation. It's possible to export a symbol and not have any relocation against it, because the linker knows where it lies inside the executable. And like, you said, it seems impossible to have the linker not relax the TLSDESC in the static library. The behavior I saw was:
a) without exporting the symbol, the accesses follow the local-exec model,
b) if the symbol is exported, strangely, a relocation of type R_X86_64_TPOFF64 is produced, and accesses inside the executable change to initial-exec, even though the symbol doesn't seem preemptible (e.g. with LD_PRELOAD). I was only able to prevent the relocation and force local-exec behavior again using a hidden alias.
This means that, if we want to support executables linking against the static library, we need to be able to find the symbol regardless of the type of relocation, regardless of even whethe the relocation exists. Whether TLSDESC is used or not becomes immaterial. Resolving the symbol likely involves getting its offset from dynsym and the module id from libc internals and doing the equivalent of __tls_get_addr
There was a problem hiding this comment.
It does make sense, if this rust cdylib is consumed as shared library, to both say that 1) it should export the symbol and 2) use the TLSDESC model, because only TLSDESC (and general dynamic) allow for the symbol to accessed when the library is dlopen()ed (with a minor exception that's only applicable to glibc anyway).
Yup, agreed.
If it is consumed as a static library by an executable, it's not even clear why there should be a relocation
Putting the symbol in the dynamic table is just a hack to have the reader be able to locate it. Otherwise, it's a local access and the linker has no reason in a statically linked executable to expose the symbol to the outside world in any circumstance. However, it doesn't have to come with an associated relocation, it can just be here in the table. There's a way to do so with --dynamic-list and the linker.
I have a simple example crate that uses libdd-otel-thread-ctxt statically, has a local exec relaxed access, has the symbol in the dynamic symbol table but no relocation associated (and particularly no TLS relocation). I can share it if you'd like to give it a look?
Also, I think we're getting a bit off track here discussing the spec. It hasn't been merged yet, so it's good to have feedback and proposals, but it probably doesn't belong here. Happy to chat about all those things on another channel!
I believe my point still stands and is independent from everything that has been discussed so far:
- we generate a TLSDESC access at compile-time with our inline assembly, as would do a compiler.
- the linker will attempt to relax the access for a statically link executable at link time, and there's no way to prevent that.
- this relaxation is quite "dumb" and really fragile with respect the exact sequence of assembly emitted, because it requires some specific instructions in the sequence to be associated with a relocation. This is fine because it's usually emitted by the compiler and thus true. We're doing it manually so we should match that carefully.
- messing with the assembly can cause broken relaxation which will link into a broken executable, that will for example segfault, because it's just not a case the linker relaxation code was written for.
Thus we should warn very loudly people to not mess with this assembly. Do you disagree with any step of this reasoning? That sounds cheap and safe to do, IMHO.
There was a problem hiding this comment.
I've added a comment and a test to verify the relaxation does happen.
As for the rest, I think we're in agreement. I'll take my comments on the spec to it.
There was a problem hiding this comment.
Sounds good! As for the added test, would that be possible to rebase this PR on the top of #2095 (that we can get reviewed and merged quickly if needed)? The latter changed the elf_properties test (basically using a programmatic test instead of using external tools like readelf and whatnot), which became simpler. I feel like doing it in the other direction, I wouldn't be entirely confident that the result would be the right one...
|
Hey @cataphract, haven't had a chance to look at this properly yet, but some more details around what went wrong exactly what probably not go astray - swapping this out for inline asm feels like a big change so it would be nice to convince ourselves its the right move |
|
From my side, I think this overall looks like a win. So my thoughts are "how can we validate this?", and also, the cousin question of "What could go wrong, even if unlikely, and is there something we could do to avoid it?" Any thoughts on these @cataphract ? |
@scottgerring I noticed the inlining wasn't happening in dd-trace-php, but I confess I didn't get to the bottom of it. In general, we shouldn't rely on such toolchain quirks, as we can't always control it (e.g. clients can and do build dd-trace-php themselves). |
Co-authored-by: Yann Hamdaoui <yann.hamdaoui@gmail.com>
a263aeb to
4acf961
Compare
|
One thing that occurred to me is that we could add a CI test that the inline assembly matches the compilation of |
4acf961 to
d71783d
Compare
What does this PR do?
The inlining of the resolution of the address of
otel_thread_ctx_v1depended on an alignment of toolchain version/options that's not trivial to accomplish. When it's not inlined and wrapped in a regular C function, some of the performance advantages of TLSDESC are negated, like its special calling convention.How to test the change?
By building DataDog/dd-trace-php#3970 and checking if the call is inlined.