Performance & Profiling

Measure first, optimize second

Premature optimization wastes time; ignoring production signals breaches SLAs. A sensible loop: reproduce → profile (CPU, allocation, locks) → fix the dominant cost → verify with benchmarks and load tests.

Question	Tool
Is this method faster than that one?	JMH (controlled JVM)
Who uses CPU in production?	async-profiler, JFR
Why is heap growing?	Heap dump + MAT / YourKit
Did JIT compile hot methods?	JFR, `-XX:+PrintCompilation`

Deeper GC and class loading topics live in JVM Internals; lock contention in Concurrency.

Benchmarking with JMH

Java Microbenchmark Harness (JMH) is the standard tool for JVM microbenchmarks: it handles warm-up, forked JVMs, blackholes, and multiple benchmark modes. Add the JMH Maven/Gradle plugin or generate a project from the archetype; keep benchmarks in a separate module and run them from the command line so that an IDE “Run” does not skew results.

Why `System.nanoTime()` loops are wrong

Hand-rolled timing loops almost always lie because:

Dead code elimination (DCE): the JIT removes computation whose result is unused
Constant folding: work on constant inputs is computed once at compile time and hoisted out of the loop
No warm-up: the first iterations run in the interpreter; later ones run compiled code
Coordinated omission: a loop that waits for each call to finish before issuing the next never records the requests that a stall delayed, so tail latency looks better than it is
GC pauses: one full GC dominates a short loop
CPU frequency scaling: clock speed (turbo, thermal throttling) differs from run to run

// Misleading — JIT may eliminate the hash entirely
long start = System.nanoTime();
for (int i = 0; i < 1_000_000; i++) {
    Integer.valueOf(i).hashCode();
}
long ns = System.nanoTime() - start;
System.out.println(ns);  // not a trustworthy benchmark
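Coordinated omission, from the list above, is easier to see with numbers. The following is a minimal, deterministic sketch rather than a measurement tool: it simulates a single client that should send one request every `intervalMs`, and compares the latency a naive loop would record with the latency measured from each request's intended start time. The class name, method names, and the service times are all made up for illustration.

```java
import java.util.Arrays;

final class CoordinatedOmissionDemo {

    /** What a naive closed loop records: the service time of each call only. */
    static long[] naiveLatencies(long[] serviceMs) {
        return serviceMs.clone();
    }

    /**
     * Latency measured from the moment each request was supposed to start
     * at a fixed rate, so requests queued behind a stall are charged for the wait.
     */
    static long[] correctedLatencies(long[] serviceMs, long intervalMs) {
        long[] out = new long[serviceMs.length];
        long clientFreeAt = 0; // simulated time at which the client can send again
        for (int i = 0; i < serviceMs.length; i++) {
            long intended = i * intervalMs;
            long actualStart = Math.max(intended, clientFreeAt);
            long end = actualStart + serviceMs[i];
            out[i] = end - intended;
            clientFreeAt = end;
        }
        return out;
    }

    public static void main(String[] args) {
        long[] service = {1, 1, 100, 1, 1}; // hypothetical service times, one stall
        System.out.println(Arrays.toString(naiveLatencies(service)));         // [1, 1, 100, 1, 1]
        System.out.println(Arrays.toString(correctedLatencies(service, 10))); // [1, 1, 100, 91, 82]
    }
}
```

The naive view reports one slow call; the corrected view shows that the two requests behind it were slow as well.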

JMH setup and annotations

Typical project: org.openjdk.jmh:jmh-core plus jmh-generator-annprocess as an annotation processor at compile time. Run via java -jar benchmarks.jar or a Maven exec goal.

Annotation	Purpose
`@Benchmark`	Method to measure
`@Warmup`	Iterations/time before scoring (JIT compile)
`@Measurement`	Iterations/time recorded as results
`@BenchmarkMode`	Throughput, average time, sample, single-shot, …
`@OutputTimeUnit`	Time unit for results (ns, µs, ms, s); throughput is reported as ops per unit
`@Fork`	Fresh JVM processes (isolate noise)
`@State`	Shared data scope: Thread, Group, Benchmark

import java.util.concurrent.TimeUnit;
import org.openjdk.jmh.annotations.*;
import org.openjdk.jmh.infra.Blackhole;

@BenchmarkMode(Mode.Throughput)
@OutputTimeUnit(TimeUnit.SECONDS)
@Warmup(iterations = 3, time = 1)
@Measurement(iterations = 5, time = 1)
@Fork(1)
@State(Scope.Thread)
public class HashBench {

    // void: every result goes to the Blackhole, so there is nothing left to return
    @Benchmark
    public void hashCodeScalar(Blackhole bh) {
        for (int i = 0; i < 1024; i++) {
            bh.consume(Integer.valueOf(i).hashCode());  // prevent DCE
        }
    }
}

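Blackhole.consume makes the JIT treat a value as used, so the computation that produced it cannot be eliminated as dead code; this is essential for a fair comparison. JMH also consumes the return value of a @Benchmark method implicitly, so returning the result is enough when a method produces only one.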

Avoiding DCE and false sharing in benchmarks

Do not share mutable fields between benchmark threads without understanding false sharing: two independent atomic counters on the same cache line (64 bytes on most x86) make the writing cores invalidate each other’s copy of the line, and throughput collapses. Pad the fields (@Contended, which outside JDK classes only takes effect with -XX:-RestrictContended, or manual padding in benchmarks) or keep per-thread data in @State(Scope.Thread).

Benchmark the same JVM version and GC you use in production; use multiple forks and compare distributions, not a single number.
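To make "compare distributions, not a single number" concrete, here is a small sketch that summarizes per-fork scores with a median and a maximum next to the mean. The class name, the nearest-rank percentile helper, and the sample scores are hypothetical; real numbers would come from JMH's result output.

```java
import java.util.Arrays;

final class ForkSummary {

    /** Nearest-rank percentile of the samples, for p in (0, 100]. */
    static double percentile(double[] samples, double p) {
        double[] sorted = samples.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p / 100.0 * sorted.length);
        return sorted[Math.max(rank, 1) - 1];
    }

    static double mean(double[] samples) {
        return Arrays.stream(samples).average().orElse(Double.NaN);
    }

    public static void main(String[] args) {
        // Hypothetical ns/op scores from five forks; one fork hit a noisy neighbour.
        double[] scores = {102, 98, 101, 99, 140};
        System.out.println("mean   = " + mean(scores));             // 108.0
        System.out.println("median = " + percentile(scores, 50));   // 101.0
        System.out.println("max    = " + percentile(scores, 100));  // 140.0
    }
}
```

A single outlier fork moves the mean by several percent while the median barely changes, which is why one aggregate number can hide or invent a regression.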

⚠️ Pitfall

Microbenchmark wins rarely predict service latency—I/O, locks, and queueing dominate real systems. Use JMH for library-level decisions; use load tests for SLAs.

Memory management

The JVM GC reclaims unreachable objects—leaks are reachable objects you no longer need but still reference. Diagnosis combines metrics (heap after GC), heap dumps, and dominator trees.
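The "heap after GC" metric mentioned above can be read in-process through the standard management beans. This is a minimal sketch, assuming the running JVM exposes collection usage for its heap pools (pools that do not support it return null and are skipped); the class and method names are illustrative only.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryType;
import java.lang.management.MemoryUsage;

final class HeapAfterGc {

    /** Sums the bytes still in use after the most recent collection of each heap pool. */
    static long usedAfterLastGcBytes() {
        long total = 0;
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            if (pool.getType() != MemoryType.HEAP) {
                continue; // skip metaspace, code cache, and other non-heap pools
            }
            MemoryUsage afterGc = pool.getCollectionUsage(); // null if the pool does not support it
            if (afterGc != null) {
                total += afterGc.getUsed();
            }
        }
        return total;
    }

    public static void main(String[] args) {
        System.gc(); // only a hint; used here so the demo has a recent collection to report
        System.out.println("heap used after last GC: " + usedAfterLastGcBytes() + " bytes");
    }
}
```

A value that keeps rising across collections under steady load is the usual signal that a heap dump is worth taking.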

Object size — shallow vs retained heap

Shallow size — bytes of the object itself (header + fields). Retained size — shallow size plus everything that would become unreachable if this object alone were removed (dominator view). A small HashMap entry may retain a large byte[] value array.

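Estimation rules of thumb on 64-bit HotSpot: an object header of 12 bytes with compressed class pointers (16 without), aligned fields, objects padded to a multiple of 8 bytes, and a 4-byte length field on arrays. Use MAT or a class histogram (jcmd <pid> GC.class_histogram) for ground truth rather than manual guessing in production incidents.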
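The rule of thumb can be written down as arithmetic. The sketch below is an approximation under stated assumptions: 8-byte object alignment (the HotSpot default), a header size passed in by the caller, and no attempt to model field alignment gaps. The class and method names are hypothetical.

```java
final class ShallowSizeEstimate {

    static final int ALIGNMENT = 8; // assumed HotSpot default object alignment

    /** Header plus field sizes, rounded up to the next multiple of ALIGNMENT. */
    static long estimate(int headerBytes, int... fieldBytes) {
        long size = headerBytes;
        for (int f : fieldBytes) {
            size += f;
        }
        return (size + ALIGNMENT - 1) / ALIGNMENT * ALIGNMENT;
    }

    public static void main(String[] args) {
        System.out.println(estimate(12));        // empty object: 16
        System.out.println(estimate(12, 4));     // one int field: 16
        System.out.println(estimate(12, 4, 8));  // int + long: 24
    }
}
```

This estimates shallow size only; retained size depends on the object graph and needs a heap analysis tool.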

Common memory leak patterns

Static collections: caches without eviction grow until OOM
Inner (non-static nested) and anonymous classes: the implicit reference to the outer instance keeps large outer graphs reachable
ThreadLocal misuse: values survive thread pool reuse; always call remove() in a finally block on pooled threads
Unclosed streams: native buffers and socket handles leak; see I/O
Listeners not unregistered: long-lived publishers hold subscribers
Classloader leaks: hot redeploys that leave references into the old loader keep its whole class graph alive

// ThreadLocal — must clear on pooled threads
private static final ThreadLocal<UserContext> CTX = new ThreadLocal<>();

void handle(Request req) {
    try {
        CTX.set(loadContext(req));
        process();
    } finally {
        CTX.remove();  // prevent leak across requests
    }
}
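For the static-collection pattern in the list above, the usual fix is to bound the cache. One possible sketch uses LinkedHashMap in access order with removeEldestEntry, which gives simple LRU eviction; the class name and the capacity are illustrative, and the map is not thread-safe on its own.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class BoundedCache<K, V> extends LinkedHashMap<K, V> {

    private final int maxEntries;

    BoundedCache(int maxEntries) {
        super(16, 0.75f, true); // true = access order, so the eldest entry is the least recently used
        this.maxEntries = maxEntries;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > maxEntries; // evict after an insert pushes the map over its limit
    }

    public static void main(String[] args) {
        BoundedCache<String, Integer> cache = new BoundedCache<>(2);
        cache.put("a", 1);
        cache.put("b", 2);
        cache.get("a");      // touch "a" so "b" becomes the eldest
        cache.put("c", 3);   // evicts "b"
        System.out.println(cache.keySet()); // [a, c]
    }
}
```

For concurrent use, wrap the instance with Collections.synchronizedMap or use a dedicated caching library.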

Heap dump analysis — jmap, MAT, YourKit

Capture when heap is high after a full GC if possible—otherwise you dump mostly garbage.

# HPROF binary dump
jmap -dump:live,format=b,file=heap.hprof <pid>

# jcmd equivalent (preferred over jmap on current JDKs)
jcmd <pid> GC.heap_dump heap.hprof

Eclipse MAT: dominator tree, leak suspects report, path to GC roots, histogram by class. YourKit: similar, with strong live allocation recording and CPU pairing. Look for collections with unexpected size, duplicate char[]/byte[] arrays backing strings, and class loaders retaining old WARs.

🔧 Under the Hood

-XX:+HeapDumpOnOutOfMemoryError writes a dump on OOM—set -XX:HeapDumpPath on servers. Dumps are large; copy off-box before analysis.
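A dump can also be triggered from inside the process, which is handy for an admin endpoint or a test. This is a sketch that assumes a HotSpot-based JVM, where com.sun.management.HotSpotDiagnosticMXBean is available; the class name, method name, and the temp-directory location are illustrative choices.

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.lang.management.ManagementFactory;
import java.nio.file.Files;
import java.nio.file.Path;

final class HeapDumper {

    /** Writes an HPROF dump of live objects into a fresh temp directory and returns its path. */
    static Path dumpLiveHeap() {
        try {
            Path dir = Files.createTempDirectory("heap-dumps");
            // dumpHeap refuses to overwrite, so use a name that cannot exist yet
            Path file = dir.resolve("heap-" + System.currentTimeMillis() + ".hprof");
            HotSpotDiagnosticMXBean bean =
                    ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
            bean.dumpHeap(file.toString(), true); // true = only objects reachable at dump time
            return file;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("dump written to " + dumpLiveHeap());
    }
}
```

As with the command-line tools, the dump pauses the JVM while it is written, so trigger it deliberately rather than on a timer.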

CPU performance

HotSpot compiles hot bytecode to native code (C1 then C2). Microarchitecture effects—cache lines, branches—still matter after the JIT does its job.

JIT compilation warm-up

Methods start in the interpreter, move to C1 (fast compile), and then often to C2 (aggressive optimization) after enough invocations. Until C2 has compiled the hot paths, throughput varies wildly: load tests need warm-up phases, and profiles taken in the first seconds mislead. -XX:+PrintCompilation logs compilations; JFR “Compilation” events show tier transitions.

Deoptimization happens when assumptions break (e.g. monomorphic call site becomes megamorphic)—rare but visible as perf cliffs after deploys.
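One rough way to watch warm-up from inside a process is the standard CompilationMXBean, which reports accumulated JIT compile time. The sketch below assumes the JVM supports compilation time monitoring (it returns -1 otherwise); the workload and the names are invented for the demo, and the printed numbers will differ on every run.

```java
import java.lang.management.CompilationMXBean;
import java.lang.management.ManagementFactory;

final class JitWarmupProbe {

    /** Total JIT compile time so far in milliseconds, or -1 if the JVM does not report it. */
    static long jitMillis() {
        CompilationMXBean jit = ManagementFactory.getCompilationMXBean();
        if (jit == null || !jit.isCompilationTimeMonitoringSupported()) {
            return -1;
        }
        return jit.getTotalCompilationTime();
    }

    /** Arbitrary CPU-bound work that gives the JIT something to compile. */
    static long work(int n) {
        long acc = 0;
        for (int i = 0; i < n; i++) {
            acc += Integer.toString(i).hashCode();
        }
        return acc;
    }

    public static void main(String[] args) {
        long sink = 0;
        for (int round = 1; round <= 10; round++) {
            long before = jitMillis();
            sink += work(200_000);
            System.out.printf("round %d: +%d ms of JIT time%n", round, jitMillis() - before);
        }
        System.out.println(sink); // keep the result observable
    }
}
```

Typically most compile time lands in the first rounds and then tapers off, which is the point at which a load test or profile starts to be representative.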

False sharing (64-byte cache lines)

CPU caches move data in cache lines (typically 64 bytes). Independent variables on the same line ping-pong between cores when different threads write to them. Common with unpadded counters, queue head/tail indices, or AtomicLong fields allocated adjacent in objects. Mitigations: @jdk.internal.vm.annotation.Contended (JDK libs), field padding, per-thread accumulators merged later. See also Concurrency: atomics.

// Two hot counters on same cache line — false sharing risk
class Counters {
    volatile long a;  // consider padding between a and b
    volatile long b;
}
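A manually padded variant of the class above might look like the following sketch. It relies on HotSpot laying out superclass fields before subclass fields, which is common behaviour but not guaranteed by the language specification, and it assumes 64-byte cache lines. All class and field names are illustrative.

```java
final class PaddedCountersDemo {

    static class HotA { volatile long a; }

    // Seven longs = 56 bytes of padding placed between a and b by the class hierarchy.
    static class Padding extends HotA { long p1, p2, p3, p4, p5, p6, p7; }

    static final class Counters extends Padding { volatile long b; }

    /** Two threads each increment their own counter; returns {a, b} after both finish. */
    static long[] run(int n) {
        Counters c = new Counters();
        // Each field has exactly one writer, so the non-atomic ++ is safe here.
        Thread ta = new Thread(() -> { for (int i = 0; i < n; i++) c.a++; });
        Thread tb = new Thread(() -> { for (int i = 0; i < n; i++) c.b++; });
        ta.start();
        tb.start();
        try {
            ta.join();
            tb.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return new long[] {c.a, c.b};
    }

    public static void main(String[] args) {
        long[] result = run(1_000_000);
        System.out.println(result[0] + " " + result[1]); // 1000000 1000000
    }
}
```

Whether padding helps should be confirmed with a benchmark on the target hardware; for plain counters, java.util.concurrent.atomic.LongAdder already spreads updates across padded cells.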

Branch prediction and loop unrolling

CPUs speculate on branches: predictable branches are cheap; unpredictable ones (random data, polymorphic hot loops) stall the pipeline. HotSpot unrolls loops and inlines hot methods when its profile data is stable, so manual unrolling is rarely needed in application code.

Prefer sorted data or branch-free math in inner loops only when profiling proves branch misprediction dominates—not by default.
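For reference, this is what "branch-free math" can look like for a filter-and-sum loop. It is a sketch under the assumption that `v - threshold` cannot overflow (for example, values in 0..255); the JIT may already compile the branchy version to a conditional move, so measure before adopting it. The names are hypothetical.

```java
final class BranchFreeSum {

    /** Straightforward version: one data-dependent branch per element. */
    static long sumAtLeastBranchy(int[] data, int threshold) {
        long sum = 0;
        for (int v : data) {
            if (v >= threshold) {
                sum += v;
            }
        }
        return sum;
    }

    /** Same result without a branch in the loop body (assumes v - threshold does not overflow). */
    static long sumAtLeastBranchFree(int[] data, int threshold) {
        long sum = 0;
        for (int v : data) {
            // (v - threshold) >> 31 is all ones when v < threshold and zero otherwise;
            // inverting it yields a mask that keeps v only when v >= threshold.
            int mask = ~((v - threshold) >> 31);
            sum += v & mask;
        }
        return sum;
    }

    public static void main(String[] args) {
        int[] data = {1, 5, 3, 200, 128, 127};
        System.out.println(sumAtLeastBranchy(data, 128));    // 328
        System.out.println(sumAtLeastBranchFree(data, 128)); // 328
    }
}
```

The branch-free form is harder to read, which is the reason to reach for it only when a profile points at mispredictions.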

String concatenation in loops

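result += s in a loop creates a new String on each iteration, which makes the total copying O(n²). The compiler optimizes concatenation within a single expression but not across loop iterations. Use StringBuilder (not thread-safe), or StringBuffer when synchronization is required; in modern code, String.join, formatted strings, or Collectors.joining on a stream are often clearer.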

// Bad: quadratic copies
String slow = "";
for (String s : parts) slow += s;

// Good: one growing buffer, one final copy
var sb = new StringBuilder();
for (String s : parts) sb.append(s);
String fast = sb.toString();
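The library alternatives mentioned above look like this in practice. Both calls are standard JDK APIs; the class name, method names, and the delimiters are arbitrary choices for the example.

```java
import java.util.List;
import java.util.stream.Collectors;

final class JoinExamples {

    /** Plain delimiter between elements. */
    static String joinPlain(List<String> parts) {
        return String.join(",", parts);
    }

    /** Delimiter plus a prefix and suffix, via a stream collector. */
    static String joinDecorated(List<String> parts) {
        return parts.stream().collect(Collectors.joining(", ", "[", "]"));
    }

    public static void main(String[] args) {
        List<String> parts = List.of("a", "b", "c");
        System.out.println(joinPlain(parts));     // a,b,c
        System.out.println(joinDecorated(parts)); // [a, b, c]
    }
}
```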

Profiling tools

Profilers attribute CPU time and allocations to methods—use low-overhead tools in production briefly, heavier tools in staging reproducing load.

async-profiler — CPU and allocation, flame graphs

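async-profiler combines HotSpot-specific stack-walking APIs with perf events on Linux, which keeps overhead low; it works on Linux and macOS. Modes: cpu (the default, with itimer as a fallback when perf events are unavailable), wall for wall-clock time, alloc for allocation hotspots, lock for contention. Output HTML flame graphs or JFR format for JMC. Releases from 3.0 ship the launcher as asprof rather than profiler.sh.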

# 60s CPU profile, flame graph HTML
./profiler.sh -d 60 -f flame.html <pid>

# Allocation profile
./profiler.sh -e alloc -d 30 -f alloc.html <pid>

Read flame graphs bottom-up: each bar's width is its share of samples, so wide bars are hot. Look for unexpected JSON parsing, logging, regex, and synchronized blocks.

JFR and Java Mission Control (JMC)

Java Flight Recorder records events inside the JVM—method samples, GC, locks, I/O, exceptions—with typically < 1% overhead when tuned. Production-safe: start a short recording during incidents instead of leaving full profilers attached.

# Start 5-minute recording (JDK 11+)
jcmd <pid> JFR.start name=incident settings=profile duration=300s filename=rec.jfr

# Dump if already running
jcmd <pid> JFR.dump name=incident filename=rec.jfr

Java Mission Control opens .jfr files—Method Profiling, GC pause histogram, Hot Methods, allocation by class. Compare recordings before/after deploy to spot regressions.

JDK Mission Control is no longer bundled with the JDK (since JDK 11); install it separately from Oracle or an OpenJDK vendor. A current JMC release can generally open recordings made by older JDKs.
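Recordings can also be started from code through the jdk.jfr API, for example around a suspicious operation in a test. The following is a sketch assuming JDK 11 or later with the jdk.jfr module present; it uses the built-in "profile" settings, and the class name, recording name, and output location are illustrative.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.text.ParseException;
import jdk.jfr.Configuration;
import jdk.jfr.Recording;
import jdk.jfr.consumer.RecordingFile;

final class JfrSnippet {

    /** Records the workload with the built-in "profile" settings and returns the .jfr file. */
    static Path record(Runnable workload) {
        try {
            Path file = Files.createTempDirectory("jfr").resolve("incident.jfr");
            Configuration profile = Configuration.getConfiguration("profile");
            try (Recording recording = new Recording(profile)) {
                recording.setName("incident");
                recording.start();
                workload.run();
                recording.stop();
                recording.dump(file); // must happen before the recording is closed
            }
            return file;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ParseException e) {
            throw new IllegalStateException("could not parse JFR settings", e);
        }
    }

    /** Number of events in a recording file, read back with the consumer API. */
    static int eventCount(Path file) {
        try {
            return RecordingFile.readAllEvents(file).size();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Path file = record(() -> {
            long acc = 0;
            for (int i = 0; i < 5_000_000; i++) acc += Integer.toString(i).hashCode();
            System.out.println(acc);
        });
        System.out.println(file + " contains " + eventCount(file) + " events");
    }
}
```

The resulting file opens in JMC exactly like one produced by jcmd.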

`-XX:+PrintCompilation` and diagnostic VM options

Unlock deeper logging for JIT investigation (verbose—dev/staging only):

java -XX:+UnlockDiagnosticVMOptions \
     -XX:+PrintCompilation \
     -XX:+LogCompilation \
     -XX:LogFile=compile.log \
     -jar app.jar

PrintCompilation prints one line per compilation (tier, method, time). LogCompilation (+ log file) adds detail for inlining decisions—use when JFR compilation events are not enough.

Flag / tool	Use
async-profiler	Quick CPU/alloc flame graphs on live JVM
JFR + JMC	Time-correlated production incident analysis
MAT / YourKit heap	Memory leaks, retained set
JMH	Controlled microbenchmarks
`PrintCompilation`	JIT tier and compile timeline

🎯 Interview Tip

Explain why hand-timed loops fail (DCE, warm-up). Name JMH Blackhole, false sharing at cache-line level, ThreadLocal leak in pools, and when you’d choose JFR over attaching a traditional sampler profiler.

📦 Real World

Runbooks: on latency spike → 2–5 min JFR profile → check GC pauses and hot methods → heap dump only if memory correlates. Baseline JFR nightly on canary for regression diff.
