# Performance: hot path & baselines

## Short answer

Over five phases (P1 + P3 + P4 + P5 + P6) the user-time per benchmark dropped by ~57–62 % vs the pre-P1 baseline:

| Benchmark  | Pre-P1  | Post-hotPath | Δ     |
|------------|---------|--------------|-------|
| Dhrystone  | 8.09 s  | 3.48 s       | −57 % |
| CoreMark   | 14.02 s | 5.82 s       | −58 % |
| MicroBench | 85.82 s | 32.91 s      | −62 % |

See `../PROGRESS.md` §Phase 9 for the full table, and `../spec/perfBusFastPath/SPEC.md` / `../spec/perfHotPath/SPEC.md` for the per-phase designs.

## Where time goes today

On the post-hotPath profile, the dominant buckets are roughly:

| Bucket                                     | Share  | Character                  |
|--------------------------------------------|--------|----------------------------|
| `xdb::main` (dispatch + decode + execute)  | ~30 %  | Interpreter core           |
| MMU entry (`checked_*` + `access_bus`)     | ~10 %  | Per load/store             |
| Mtimer deadline gate                       | <1 %   | Per-step (post-P3)         |
| Typed RAM access                           | <2 %   | Per load/store (post-P6)   |
| Device ticks (UART / PLIC / VirtIO)        | <1 %   | Slow path, every 64 steps  |

The pre-P1 baseline spent 33–40 % in `pthread_mutex_*`; that bucket is now 0 %, because the `Bus` is owned directly rather than shared behind `Arc<Mutex<_>>` (sketched below).
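For concreteness, a minimal before/after sketch of that ownership change. The type and field names here are hypothetical, not the real xdb definitions:

```rust
use std::sync::{Arc, Mutex};

struct Bus; // stand-in for the real RAM + MMIO bus

// Pre-P1: the bus was shared, so every load/store paid a
// pthread mutex lock/unlock pair, even with a single hart.
struct HartShared {
    bus: Arc<Mutex<Bus>>,
}

impl HartShared {
    fn load(&self) {
        let _bus = self.bus.lock().unwrap(); // atomic RMW on every access
        // ... perform the access through the guard ...
    }
}

// Post-P1: the hart owns the bus inline; a load/store is a plain
// method call the compiler can inline, with no locking at all.
struct HartOwned {
    bus: Bus,
}

impl HartOwned {
    fn load(&mut self) {
        let _bus = &mut self.bus; // direct access, no synchronisation
        // ... perform the access ...
    }
}

fn main() {
    let shared = HartShared { bus: Arc::new(Mutex::new(Bus)) };
    shared.load();
    let mut owned = HartOwned { bus: Bus };
    owned.load();
}
```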

## The five landed phases

| Phase              | Subject                                                        | Win                    | Risk                   |
|--------------------|----------------------------------------------------------------|------------------------|------------------------|
| P1 busFastPath     | Drop `Arc<Mutex<Bus>>`, own the bus inline (sketched above)    | −45…−52 % wall         | Low                    |
| P3 Mtimer deadline | Cache `next_fire_mtime`, short-circuit tick (sketched below)   | Mtimer bucket → <1 %   | Very low               |
| P4 icache          | Per-hart decoded-inst cache, 4 K entries                       | `xdb::main` bucket −10 pp | Medium (invalidation) |
| P5 MMU inline      | `#[inline]` pressure through the fast path                     | MMU bucket −3 pp       | Low                    |
| P6 memmove bypass  | Typed reads on aligned 1/2/4/8-byte accesses (sketched below)  | memmove bucket → <2 %  | Low–Medium (`unsafe`)  |
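Two of those one-line subjects compress a concrete mechanism. First, a minimal sketch of the P3 deadline gate, assuming illustrative names (`next_fire_mtime` comes from the table; the surrounding struct is hypothetical, not the real xdb type):

```rust
struct Mtimer {
    mtime: u64,
    mtimecmp: u64,
    next_fire_mtime: u64, // cached deadline, refreshed only on mtimecmp writes
    mtip: bool,           // pending machine-timer interrupt
}

impl Mtimer {
    /// Hot path: one compare per step instead of a full device tick.
    #[inline]
    fn tick(&mut self) {
        self.mtime += 1;
        if self.mtime < self.next_fire_mtime {
            return; // short-circuit: deadline not due yet
        }
        self.mtip = true; // slow path: raise MTIP
    }

    /// Guest writes to mtimecmp are rare; recompute the cached deadline here.
    fn write_mtimecmp(&mut self, value: u64) {
        self.mtimecmp = value;
        self.next_fire_mtime = value;
        self.mtip = self.mtime >= value; // MTIP mirrors mtime >= mtimecmp
    }
}
```

Second, the P6 memmove bypass: an aligned power-of-two access reads the backing byte buffer with one typed load instead of a byte-slice copy (which lowers to `memmove`). Again a sketch under assumed names, not the real xdb helper; the `unsafe` block is where the Low–Medium risk in the table lives:

```rust
#[inline]
fn ram_read_u32(ram: &[u8], offset: usize) -> u32 {
    // The bounds check is what keeps the unsafe block sound.
    assert!(offset + 4 <= ram.len());
    debug_assert_eq!(offset % 4, 0, "caller guarantees natural alignment");
    // SAFETY: in-bounds per the assert above; read_unaligned imposes no
    // alignment requirement on the pointer and still compiles to one load.
    let raw = unsafe { ram.as_ptr().add(offset).cast::<u32>().read_unaligned() };
    u32::from_le(raw) // RISC-V guest memory is little-endian
}
```

A store-side twin would follow the same pattern with `write_unaligned`, and the 1/2/8-byte widths are the same function over `u8`/`u16`/`u64`.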

## Measurement pipeline

Always run from the `ProjectX/` root:

```bash
bash scripts/perf/bench.sh       # → docs/perf/baselines/<today>/data/bench.csv
bash scripts/perf/sample.sh      # → <today>/data/<workload>.sample.txt
python3 scripts/perf/render.py   # → <today>/graphics/*.svg
```
- 3 runs per workload; `user_s` is the stable metric, `real_s` is noisy on macOS under system load.
- Use `DEBUG=n`: PTY mode perturbs timing.
- Commit `data/` and `graphics/` with the phase's MASTER document.

## Phase exit gate pattern

A phase is not done until:

1. `cargo test --workspace` + `make linux` + `make debian` all green (and the `-2hart` variants where applicable).
2. `bench.sh` rerun (3 iterations per workload).
3. `sample.sh` rerun for each of the three benchmarks.
4. The phase's exit gate hit with ≥ 1 pp margin on the bucket it targets.
5. `REPORT.md` deltas committed to the phase's archived MASTER.

## What's next

- P7 multi-hart re-profile: pending; it shapes the Phase 11 SMP work. This is a measurement task, not an optimisation in itself.
- Phase 11 (RFC): true per-hart OS threads. It requires atomic RAM, per-hart reservations, and per-device MMIO locking, and is not part of any current perf phase.