[WIP] Trivial hot-path performance optimizations by jbachorik · Pull Request #425 · DataDog/java-profiler

jbachorik · 2026-03-19T22:20:14Z

What does this PR do?:
Eight targeted fixes to the profiler's hot paths (signal handler → stack walk → trace storage → JFR write), all identified by code audit with exact line numbers. Changes are non-speculative: each fix corrects a clear inefficiency.

#	Fix	File(s)
1	`getLockIndex`: `% CONCURRENCY_LEVEL` → `& (CONCURRENCY_LEVEL - 1)` at 10 call sites	`profiler.cpp`
2	Hoist `Profiler::instance()->numContextAttributes()` out of per-event attribute loop	`flightRecorder.cpp`
3	Remove duplicate `flushIfNeeded` in `recordEvent` dispatcher (each `record*` callee already calls it)	`flightRecorder.cpp`
4	Carry `CodeCache*` through `NativeFrameResolution` to avoid 1–2 extra O(N) `findLibraryByAddress` scans per native frame in `walkVM`	`profiler.h`, `profiler.cpp`, `stackWalker.cpp`
5	Thread-local last-hit cache in `Libraries::findLibraryByAddress`	`libraries.cpp`
6	Bind `reservoir.sample()` result by reference instead of copying the vector on every wall-clock epoch	`wallClock.h`
7	Guard `TSC::ticks()` reads with `#ifdef COUNTERS` — they were unconditional even in production builds where `Counters::increment` is a no-op	`profiler.cpp`
8	Mark `RecordingBuffer` as `final` to enable compiler devirtualization of `limit()` in `put*` hot path	`buffers.h`

Motivation:
Code audit of the profiler hot paths to remove unnecessary overhead that occurs on every sample recording.

Additional Notes:
Benchmark results for the directly measurable fixes (platform: macOS aarch64, clang++ -O2, 2M iterations):

Fix	Before	After	Speedup
Fix 1 — `getLockIndex` `%`→`&`	0.284 ns	0.286 ns	~0% at `-O2`*
Fix 5 — `findLibraryByAddress` hot case	24.1 ns	0.4 ns	63×
Fix 5 — `findLibraryByAddress` cold case	22.9 ns	23.2 ns	−1.3% overhead
Fix 6 — `reservoir.sample()` reference	35.2 ns	0.23 ns	150×

* Fix 1 is a no-op at -O2 (compiler already folds % 16 to & 15). It is kept for clarity, correctness in debug builds, and consistency across all 9 call sites.

The Fix 5 cold-case overhead (+1.3%) is the cost of an unconditional TLS write on every cache miss. In real profiling this is never sustained: consecutive native frames within a single stack trace cluster in 2–3 libraries, so the hot-case hit rate is high.

A new benchmark binary hot_path_benchmark is added covering fixes 1, 5, and 6:

./gradlew :ddprof-lib:benchmarks:runHotPathBenchmark \
    -Pargs="--warmup 200000 --iterations 2000000"

How to test the change?:

./gradlew :ddprof-lib:build — must succeed on Debug and Release
./gradlew test — no regressions
./gradlew :ddprof-lib:benchmarks:runHotPathBenchmark — verify hot-path benchmark output

For Datadog employees:

If this PR touches code that signs or publishes builds or packages, or handles
credentials of any kind, I've requested a review from @DataDog/security-design-and-guidance.
This PR doesn't touch any of that.
JIRA: [JIRA-XXXX]

- getLockIndex: % -> & (CONCURRENCY_LEVEL is power-of-2) - flightRecorder: hoist numContextAttributes() out of per-event loop - flightRecorder: remove duplicate flushIfNeeded in recordEvent dispatcher - NativeFrameResolution: carry CodeCache* to avoid re-scanning in walkVM - Libraries::findLibraryByAddress: thread-local last-hit cache (63x hot-case speedup) - wallClock: bind reservoir.sample() result by reference, not copy (150x speedup) - profiler: guard TSC reads with #ifdef COUNTERS - RecordingBuffer: mark final for compiler devirtualization of limit() - benchmarks: add hot_path_benchmark covering fixes 1/5/6 Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

Process.exitValue() throws IllegalThreadStateException if the process hasn't terminated yet. After destroyForcibly(), give the OS up to 5s to complete termination before calling exitValue(). Triggered on slow CI runners. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

DTLS initialization for shared libraries calls calloc internally. If a profiler signal fires on a thread whose TLS block hasn't been set up yet while that thread is inside malloc, the re-entrant calloc deadlocks on the allocator lock — causing 45+ min hangs. Replace thread_local struct with a plain static volatile int last-hit index. The cache update is benignly racy (worst case: a cache miss), and no allocator calls are made from the signal handler path. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

Replace thread_local TLLibCache with the signal-safe static volatile int variant that matches the real libraries.cpp fix. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

dd-octo-sts · 2026-03-19T23:36:01Z

CI Test Results

Run: #23321679242 | Commit: d451581 | Duration: 10m 20s (longest job)

✅ All 32 test jobs passed

Status Overview

JDK	glibc-aarch64/debug	glibc-amd64/debug	musl-aarch64/debug	musl-amd64/debug
8	-	✅	-	-
8-ibm	-	✅	-	-
8-j9	✅	✅	-	-
8-librca	-	-	✅	✅
8-orcl	-	✅	-	-
11	-	✅	-	-
11-j9	✅	✅	-	-
11-librca	-	-	✅	✅
17	✅	✅	-	-
17-graal	✅	✅	-	-
17-j9	✅	✅	-	-
17-librca	-	-	✅	✅
21	✅	✅	-	-
21-graal	✅	✅	-	-
21-librca	-	-	✅	✅
25	✅	✅	-	-
25-graal	✅	✅	-	-
25-librca	-	-	✅	✅

Legend: ✅ passed | ❌ failed | ⚪ skipped | 🚫 cancelled

Summary: Total: 32 | Passed: 32 | Failed: 0

Updated: 2026-03-19 23:35:59 UTC

jbachorik added the AI label Mar 19, 2026

jbachorik changed the title ~~Trivial hot-path performance optimizations~~ [WIP] Trivial hot-path performance optimizations Mar 19, 2026

jbachorik and others added 3 commits March 20, 2026 00:13

Benchmark the actual last-hit-index implementation

a49590d

Replace thread_local TLLibCache with the signal-safe static volatile int variant that matches the real libraries.cpp fix. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[WIP] Trivial hot-path performance optimizations#425

[WIP] Trivial hot-path performance optimizations#425
jbachorik wants to merge 4 commits intomainfrom
jb/trivial_perf

jbachorik commented Mar 19, 2026

Uh oh!

dd-octo-sts bot commented Mar 19, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

jbachorik commented Mar 19, 2026

Uh oh!

dd-octo-sts bot commented Mar 19, 2026

CI Test Results

Status Overview

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant