Skip to content

[WIP] Trivial hot-path performance optimizations#425

Draft
jbachorik wants to merge 4 commits intomainfrom
jb/trivial_perf
Draft

[WIP] Trivial hot-path performance optimizations#425
jbachorik wants to merge 4 commits intomainfrom
jb/trivial_perf

Conversation

@jbachorik
Copy link
Collaborator

What does this PR do?:
Eight targeted fixes to the profiler's hot paths (signal handler → stack walk → trace storage → JFR write), all identified by code audit with exact line numbers. Changes are non-speculative: each fix corrects a clear inefficiency.

# Fix File(s)
1 getLockIndex: % CONCURRENCY_LEVEL& (CONCURRENCY_LEVEL - 1) at 10 call sites profiler.cpp
2 Hoist Profiler::instance()->numContextAttributes() out of per-event attribute loop flightRecorder.cpp
3 Remove duplicate flushIfNeeded in recordEvent dispatcher (each record* callee already calls it) flightRecorder.cpp
4 Carry CodeCache* through NativeFrameResolution to avoid 1–2 extra O(N) findLibraryByAddress scans per native frame in walkVM profiler.h, profiler.cpp, stackWalker.cpp
5 Thread-local last-hit cache in Libraries::findLibraryByAddress libraries.cpp
6 Bind reservoir.sample() result by reference instead of copying the vector on every wall-clock epoch wallClock.h
7 Guard TSC::ticks() reads with #ifdef COUNTERS — they were unconditional even in production builds where Counters::increment is a no-op profiler.cpp
8 Mark RecordingBuffer as final to enable compiler devirtualization of limit() in put* hot path buffers.h

Motivation:
Code audit of the profiler hot paths to remove unnecessary overhead that occurs on every sample recording.

Additional Notes:
Benchmark results for the directly measurable fixes (platform: macOS aarch64, clang++ -O2, 2M iterations):

Fix Before After Speedup
Fix 1 — getLockIndex %& 0.284 ns 0.286 ns ~0% at -O2*
Fix 5 — findLibraryByAddress hot case 24.1 ns 0.4 ns 63×
Fix 5 — findLibraryByAddress cold case 22.9 ns 23.2 ns −1.3% overhead
Fix 6 — reservoir.sample() reference 35.2 ns 0.23 ns 150×

* Fix 1 is a no-op at -O2 (compiler already folds % 16 to & 15). It is kept for clarity, correctness in debug builds, and consistency across all 9 call sites.

The Fix 5 cold-case overhead (+1.3%) is the cost of an unconditional TLS write on every cache miss. In real profiling this is never sustained: consecutive native frames within a single stack trace cluster in 2–3 libraries, so the hot-case hit rate is high.

A new benchmark binary hot_path_benchmark is added covering fixes 1, 5, and 6:

./gradlew :ddprof-lib:benchmarks:runHotPathBenchmark \
    -Pargs="--warmup 200000 --iterations 2000000"

How to test the change?:

  • ./gradlew :ddprof-lib:build — must succeed on Debug and Release
  • ./gradlew test — no regressions
  • ./gradlew :ddprof-lib:benchmarks:runHotPathBenchmark — verify hot-path benchmark output

For Datadog employees:

  • If this PR touches code that signs or publishes builds or packages, or handles
    credentials of any kind, I've requested a review from @DataDog/security-design-and-guidance.
  • This PR doesn't touch any of that.
  • JIRA: [JIRA-XXXX]

- getLockIndex: % -> & (CONCURRENCY_LEVEL is power-of-2)
- flightRecorder: hoist numContextAttributes() out of per-event loop
- flightRecorder: remove duplicate flushIfNeeded in recordEvent dispatcher
- NativeFrameResolution: carry CodeCache* to avoid re-scanning in walkVM
- Libraries::findLibraryByAddress: thread-local last-hit cache (63x hot-case speedup)
- wallClock: bind reservoir.sample() result by reference, not copy (150x speedup)
- profiler: guard TSC reads with #ifdef COUNTERS
- RecordingBuffer: mark final for compiler devirtualization of limit()
- benchmarks: add hot_path_benchmark covering fixes 1/5/6

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
@jbachorik jbachorik added the AI label Mar 19, 2026
@jbachorik jbachorik changed the title Trivial hot-path performance optimizations [WIP] Trivial hot-path performance optimizations Mar 19, 2026
jbachorik and others added 3 commits March 20, 2026 00:13
Process.exitValue() throws IllegalThreadStateException if the
process hasn't terminated yet. After destroyForcibly(), give the
OS up to 5s to complete termination before calling exitValue().
Triggered on slow CI runners.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
DTLS initialization for shared libraries calls calloc internally.
If a profiler signal fires on a thread whose TLS block hasn't been
set up yet while that thread is inside malloc, the re-entrant
calloc deadlocks on the allocator lock — causing 45+ min hangs.

Replace thread_local struct with a plain static volatile int
last-hit index. The cache update is benignly racy (worst case: a
cache miss), and no allocator calls are made from the signal
handler path.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Replace thread_local TLLibCache with the signal-safe static volatile
int variant that matches the real libraries.cpp fix.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
@dd-octo-sts
Copy link

dd-octo-sts bot commented Mar 19, 2026

CI Test Results

Run: #23321679242 | Commit: d451581 | Duration: 10m 20s (longest job)

All 32 test jobs passed

Status Overview

JDK glibc-aarch64/debug glibc-amd64/debug musl-aarch64/debug musl-amd64/debug
8 - - -
8-ibm - - -
8-j9 - -
8-librca - -
8-orcl - - -
11 - - -
11-j9 - -
11-librca - -
17 - -
17-graal - -
17-j9 - -
17-librca - -
21 - -
21-graal - -
21-librca - -
25 - -
25-graal - -
25-librca - -

Legend: ✅ passed | ❌ failed | ⚪ skipped | 🚫 cancelled

Summary: Total: 32 | Passed: 32 | Failed: 0


Updated: 2026-03-19 23:35:59 UTC

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant