[WIP] Trivial hot-path performance optimizations #425
Draft
- getLockIndex: % -> & (CONCURRENCY_LEVEL is power-of-2)
- flightRecorder: hoist numContextAttributes() out of per-event loop
- flightRecorder: remove duplicate flushIfNeeded in recordEvent dispatcher
- NativeFrameResolution: carry CodeCache* to avoid re-scanning in walkVM
- Libraries::findLibraryByAddress: thread-local last-hit cache (63x hot-case speedup)
- wallClock: bind reservoir.sample() result by reference, not copy (150x speedup)
- profiler: guard TSC reads with #ifdef COUNTERS
- RecordingBuffer: mark final for compiler devirtualization of limit()
- benchmarks: add hot_path_benchmark covering fixes 1/5/6

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

Process.exitValue() throws IllegalThreadStateException if the process hasn't terminated yet. After destroyForcibly(), give the OS up to 5s to complete termination before calling exitValue(). Triggered on slow CI runners.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

DTLS initialization for shared libraries calls calloc internally. If a profiler signal fires on a thread whose TLS block hasn't been set up yet while that thread is inside malloc, the re-entrant calloc deadlocks on the allocator lock, causing 45+ min hangs. Replace thread_local struct with a plain static volatile int last-hit index. The cache update is benignly racy (worst case: a cache miss), and no allocator calls are made from the signal handler path.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

Replace thread_local TLLibCache with the signal-safe static volatile int variant that matches the real libraries.cpp fix.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
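
The signal-safe variant described in the two commits above can be sketched roughly as follows. This is a minimal illustration, not the actual libraries.cpp code: `LibRange`, `g_libs`, `g_lib_count`, and `g_last_hit` are hypothetical names, and the library table is a fixed array for the sake of the example. The point it shows is that the cache is a single process-wide `volatile int`, so the lookup never touches thread-local storage and therefore never triggers lazy TLS allocation inside a signal handler.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for one loaded library's address range.
struct LibRange {
    uintptr_t start;
    uintptr_t end;  // exclusive
    bool contains(uintptr_t addr) const { return addr >= start && addr < end; }
};

// Hypothetical fixed library table; the real profiler maintains its own list.
static const LibRange g_libs[] = {
    {0x1000, 0x2000},
    {0x4000, 0x8000},
    {0x9000, 0xA000},
};
static const int g_lib_count = sizeof(g_libs) / sizeof(g_libs[0]);

// Process-wide last-hit index. It is not thread_local, so reading or writing
// it cannot call into the allocator.
static volatile int g_last_hit = -1;

const LibRange* findLibraryByAddress(uintptr_t addr) {
    // Read the hint once; another thread may overwrite it at any time.
    int hint = g_last_hit;
    if (hint >= 0 && hint < g_lib_count && g_libs[hint].contains(addr)) {
        return &g_libs[hint];  // hot case: no scan
    }
    // Cold case: linear scan, then publish the new hint.
    for (int i = 0; i < g_lib_count; i++) {
        if (g_libs[i].contains(addr)) {
            g_last_hit = i;  // racy by design; a stale value only costs a miss
            return &g_libs[i];
        }
    }
    return nullptr;
}
```

Because the hint is always validated with `contains()` before use, a stale or concurrently overwritten index can only cause a fallback to the scan, never a wrong result. Under the C++ memory model, unsynchronized `volatile` accesses from several threads are formally a data race; a `std::atomic<int>` with relaxed ordering would be the strictly conforming alternative and is typically lock-free on mainstream platforms.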
CI Test Results
Run: #23321679242 | Commit:
Status Overview
Legend: ✅ passed | ❌ failed | ⚪ skipped | 🚫 cancelled
Summary: Total: 32 | Passed: 32 | Failed: 0
Updated: 2026-03-19 23:35:59 UTC
What does this PR do?:
Eight targeted fixes to the profiler's hot paths (signal handler → stack walk → trace storage → JFR write), all identified by code audit with exact line numbers. Changes are non-speculative: each fix corrects a clear inefficiency.
| # | Fix | File(s) |
|---|-----|---------|
| 1 | `getLockIndex`: `% CONCURRENCY_LEVEL` → `& (CONCURRENCY_LEVEL - 1)` at 10 call sites | `profiler.cpp` |
| 2 | Hoist `Profiler::instance()->numContextAttributes()` out of per-event attribute loop | `flightRecorder.cpp` |
| 3 | Remove duplicate `flushIfNeeded` in `recordEvent` dispatcher (each `record*` callee already calls it) | `flightRecorder.cpp` |
| 4 | Carry `CodeCache*` through `NativeFrameResolution` to avoid 1–2 extra O(N) `findLibraryByAddress` scans per native frame in `walkVM` | `profiler.h`, `profiler.cpp`, `stackWalker.cpp` |
| 5 | Last-hit cache in `Libraries::findLibraryByAddress` | `libraries.cpp` |
| 6 | Bind `reservoir.sample()` result by reference instead of copying the vector on every wall-clock epoch | `wallClock.h` |
| 7 | Guard `TSC::ticks()` reads with `#ifdef COUNTERS`; they were unconditional even in production builds where `Counters::increment` is a no-op | `profiler.cpp` |
| 8 | Mark `RecordingBuffer` as `final` to enable compiler devirtualization of `limit()` in `put*` hot path | `buffers.h` |

Motivation:
Code audit of the profiler hot paths to remove unnecessary overhead that occurs on every sample recording.
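
As one illustration of the per-sample overhead targeted here, the difference behind fix 6 can be sketched as below. This is a hedged example, not the wallClock.h code: the `Reservoir` class, its `int` element type, and the helper names `copiesStorage` and `bindsStorage` are hypothetical, and the sketch assumes `sample()` returns a reference to an internal vector, as the "bind by reference, not copy" wording suggests.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical reservoir that hands out a reference to its internal buffer.
class Reservoir {
public:
    explicit Reservoir(std::size_t n) : _items(n, 0) {}
    std::vector<int>& sample() { return _items; }
private:
    std::vector<int> _items;
};

// Declaring the local as a value copy-constructs the whole vector:
// one allocation plus one element-wise copy per call.
bool copiesStorage(Reservoir& r) {
    std::vector<int> copy = r.sample();
    return copy.data() != r.sample().data();  // distinct storage
}

// Declaring the local as a reference binds to the existing vector:
// no allocation and no copy.
bool bindsStorage(Reservoir& r) {
    std::vector<int>& ref = r.sample();
    return ref.data() == r.sample().data();  // same storage
}
```

The same pitfall applies to `auto`: `auto v = r.sample();` deduces a value type and copies, while `auto& v = r.sample();` binds to the returned reference.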
Additional Notes:
Benchmark results for the directly measurable fixes (platform: macOS aarch64, clang++ -O2, 2M iterations) cover four cases:

- getLockIndex % → & (-O2)*
- findLibraryByAddress hot case
- findLibraryByAddress cold case
- reservoir.sample() reference

* Fix 1 is a no-op at -O2 (compiler already folds % 16 to & 15). It is kept for clarity, correctness in debug builds, and consistency across all 9 call sites.

The Fix 5 cold-case overhead (+1.3%) is the cost of an unconditional TLS write on every cache miss. In real profiling this is never sustained: consecutive native frames within a single stack trace cluster in 2–3 libraries, so the hot-case hit rate is high.
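
The Fix 1 footnote rests on the identity that, for a power-of-two divisor and an unsigned operand, the remainder equals a bitwise AND with the divisor minus one. A minimal sketch of that identity follows; the value 16 comes from the footnote, while the function names `lockIndexMod` and `lockIndexMask` and the `uint32_t` operand type are assumptions made for illustration.

```cpp
#include <cstdint>

// 16 matches the "% 16 to & 15" folding mentioned in the footnote.
constexpr uint32_t CONCURRENCY_LEVEL = 16;

// The mask form is only valid when the level is a power of two.
static_assert((CONCURRENCY_LEVEL & (CONCURRENCY_LEVEL - 1)) == 0,
              "CONCURRENCY_LEVEL must be a power of two");

// Original form: remainder by the concurrency level.
constexpr uint32_t lockIndexMod(uint32_t hash) {
    return hash % CONCURRENCY_LEVEL;
}

// Rewritten form: keep only the low log2(CONCURRENCY_LEVEL) bits.
constexpr uint32_t lockIndexMask(uint32_t hash) {
    return hash & (CONCURRENCY_LEVEL - 1);
}

// Spot checks evaluated at compile time.
static_assert(lockIndexMod(0) == lockIndexMask(0), "0");
static_assert(lockIndexMod(17) == lockIndexMask(17), "17");
static_assert(lockIndexMod(0xFFFFFFFFu) == lockIndexMask(0xFFFFFFFFu), "max");
```

The equivalence does not carry over to signed operands: in C++ the remainder of a negative value is negative or zero, whereas the mask always yields a value in the range 0 to 15, so a signed index would need separate care.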
A new benchmark binary hot_path_benchmark is added covering fixes 1, 5, and 6.

How to test the change?:
- ./gradlew :ddprof-lib:build (must succeed on Debug and Release)
- ./gradlew test (no regressions)
- ./gradlew :ddprof-lib:benchmarks:runHotPathBenchmark (verify hot-path benchmark output)

For Datadog employees:
credentials of any kind, I've requested a review from
@DataDog/security-design-and-guidance.