Multi-hart
Today's multi-hart is a single-threaded cooperative round-robin
scheduler in CPU::step. N harts are no faster than 1. The
abstraction exists so the ISA code can reason about per-hart state,
not because the host is running them in parallel.
See ../spec/multiHart/SPEC.md for
the Hart abstraction design.
What's shared, what's per-hart
| Shared | Per-hart |
|---|---|
| Bus (RAM + all devices) | GPR / PC / NPC |
| ACLINT mtime (one host wall-clock source) | CsrFile |
| PLIC state (2 contexts route to 2 harts) | privilege |
| IrqState Arc<AtomicU64> (one set of mip/mie bits per hart) | mmu, pmp |
| | icache |
| | pending_trap |
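As a rough illustration of the IrqState row, the sketch below shows how a device-side handle and a hart-side handle could share one set of pending bits through an Arc<AtomicU64>. The method names (raise, clear, pending) and the memory orderings are assumptions made for this example, not the project's actual API; the bit positions are the standard machine-level mip bits.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;

// Standard machine-level interrupt-pending bit positions in mip.
pub const MSIP: u64 = 1 << 3; // machine software interrupt
pub const MTIP: u64 = 1 << 7; // machine timer interrupt
pub const MEIP: u64 = 1 << 11; // machine external interrupt

/// Hypothetical handle to one hart's pending-interrupt bits.
/// Cloning the handle shares the same underlying atomic word.
#[derive(Clone, Default)]
pub struct IrqState(Arc<AtomicU64>);

impl IrqState {
    /// Device side: set one or more pending bits.
    pub fn raise(&self, bits: u64) {
        self.0.fetch_or(bits, Ordering::Release);
    }

    /// Device side: clear one or more pending bits.
    pub fn clear(&self, bits: u64) {
        self.0.fetch_and(!bits, Ordering::Release);
    }

    /// Hart side: sample the pending bits before deciding whether to trap.
    pub fn pending(&self) -> u64 {
        self.0.load(Ordering::Acquire)
    }
}
```

A device such as the ACLINT would hold one clone per hart and call raise(MTIP) when that hart's timer fires, while the hart polls pending() between instructions.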
Per-hart icache
Each hart has its own 4 K direct-mapped decoded-instruction cache with
its own ctx_tag, so a satp write on one hart does not flush the other
hart's icache. An sfence.vma scoped to a single hart could likewise
leave the other hart's icache untouched, but the current implementation
flushes both harts on any sfence.vma for simplicity (conservative, correct).
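The following is a minimal sketch of how a per-hart, direct-mapped cache with a context tag can behave. It assumes that "4 K" means 4096 entries and that a flush is implemented by bumping ctx_tag rather than walking the lines; the type and method names are hypothetical and the decoded instruction is reduced to a u32 placeholder.

```rust
/// Assumption: "4 K" is read as 4096 entries (a power of two).
const ICACHE_ENTRIES: usize = 4096;

#[derive(Clone, Copy, Default)]
struct Line {
    valid: bool,
    pc: u64,
    ctx_tag: u64,
    decoded: u32, // placeholder for a decoded instruction
}

/// Hypothetical per-hart decoded-instruction cache.
pub struct ICache {
    lines: Vec<Line>,
    ctx_tag: u64,
}

impl ICache {
    pub fn new() -> Self {
        ICache { lines: vec![Line::default(); ICACHE_ENTRIES], ctx_tag: 0 }
    }

    /// Direct-mapped: each pc maps to exactly one slot.
    fn index(pc: u64) -> usize {
        // Instructions are at least 2-byte aligned, so bit 0 carries no information.
        ((pc >> 1) as usize) & (ICACHE_ENTRIES - 1)
    }

    /// Hit only if the slot holds this pc and was filled under the current tag.
    pub fn lookup(&self, pc: u64) -> Option<u32> {
        let line = &self.lines[Self::index(pc)];
        (line.valid && line.pc == pc && line.ctx_tag == self.ctx_tag).then_some(line.decoded)
    }

    pub fn insert(&mut self, pc: u64, decoded: u32) {
        self.lines[Self::index(pc)] = Line { valid: true, pc, ctx_tag: self.ctx_tag, decoded };
    }

    /// Invalidate every line in O(1) by changing the tag they are compared against.
    pub fn flush(&mut self) {
        self.ctx_tag = self.ctx_tag.wrapping_add(1);
    }
}
```

Because each hart owns its own ICache value, flushing one leaves the other's entries valid, which is the property described above.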
Running
cd resource
make linux-2hart # 2 harts, cooperative scheduler
make debian-2hart # same, with VirtIO rootfs
Both harts share the same Bus instance. The scheduler gives each hart a slice of steps in round-robin order before rotating to the next.
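To make the slice-then-rotate behaviour concrete, here is a small sketch of a cooperative round-robin loop. The Hart, Bus and Cpu types below are simplified stand-ins invented for this example, and the slice length is an arbitrary parameter; the real CPU::step may be organised differently.

```rust
/// Stand-in for per-hart state (the real hart also carries GPRs, CSRs, MMU, ...).
#[derive(Debug, Default)]
pub struct Hart {
    pub id: usize,
    pub pc: u64,
    pub steps_retired: u64,
}

/// Stand-in for the shared Bus: one instance serves every hart.
#[derive(Debug, Default)]
pub struct Bus {
    pub accesses: u64,
}

impl Hart {
    /// Pretend to execute one instruction: touch the bus and advance the pc.
    fn step(&mut self, bus: &mut Bus) {
        bus.accesses += 1;
        self.pc = self.pc.wrapping_add(4);
        self.steps_retired += 1;
    }
}

pub struct Cpu {
    pub harts: Vec<Hart>,
    pub bus: Bus,
    current: usize,
    left_in_slice: u32,
    slice: u32,
}

impl Cpu {
    pub fn new(num_harts: usize, slice: u32) -> Self {
        assert!(num_harts > 0 && slice > 0);
        let harts = (0..num_harts).map(|id| Hart { id, ..Hart::default() }).collect();
        Cpu { harts, bus: Bus::default(), current: 0, left_in_slice: slice, slice }
    }

    /// Step the current hart once; rotate to the next hart when its slice is used up.
    /// Everything runs on the calling thread, so N harts cost N times the work.
    pub fn step(&mut self) {
        self.harts[self.current].step(&mut self.bus);
        self.left_in_slice -= 1;
        if self.left_in_slice == 0 {
            self.current = (self.current + 1) % self.harts.len();
            self.left_in_slice = self.slice;
        }
    }
}
```

With two harts and a slice of 3, eight calls to step run hart 0 three times, hart 1 three times, then hart 0 twice more, all against the same Bus.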
Why single-threaded today
P1 (busFastPath) removed the Arc<Mutex<Bus>> that was dead weight
under the cooperative scheduler: there is no real SMP, so the mutex
was pure overhead. Removing it gave a 45–52 % wall-clock improvement.
True SMP (Phase 11 RFC)
Not in any landed phase. To get per-hart OS threads:
- Guest RAM becomes &[AtomicU8] (or unsafe typed access with explicit fences).
- LR/SC reservations become per-hart AtomicUsize.
- Per-device fine-grained sync (or the QEMU MTTCG "BQL on MMIO only" model).
- A runtime that joins / cancels hart threads cleanly.
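The first, second and fourth items can be sketched together as follows. This is an illustration under several assumptions, not a design from the RFC: accesses are byte-wide only, every operation uses SeqCst, and the store-conditional is emulated with a compare-exchange on the value seen by the load-reserved (the approach is ABA-prone and is used here only because it is simple to reason about). All names are hypothetical.

```rust
use std::sync::atomic::{AtomicU8, AtomicUsize, Ordering};
use std::sync::Arc;
use std::thread;

/// Sentinel meaning "this hart holds no reservation".
const NO_RESERVATION: usize = usize::MAX;

/// Hypothetical shared state: guest RAM as atomics plus one reservation slot per hart.
pub struct SharedMem {
    ram: Vec<AtomicU8>,
    reservations: Vec<AtomicUsize>,
}

impl SharedMem {
    pub fn new(size: usize, harts: usize) -> Self {
        SharedMem {
            ram: (0..size).map(|_| AtomicU8::new(0)).collect(),
            reservations: (0..harts).map(|_| AtomicUsize::new(NO_RESERVATION)).collect(),
        }
    }

    /// LR: record the reserved address for this hart, then load the byte.
    pub fn load_reserved(&self, hart: usize, addr: usize) -> u8 {
        self.reservations[hart].store(addr, Ordering::SeqCst);
        self.ram[addr].load(Ordering::SeqCst)
    }

    /// SC: consume the reservation, then store only if memory still holds `expected`.
    pub fn store_conditional(&self, hart: usize, addr: usize, expected: u8, new: u8) -> bool {
        if self.reservations[hart].swap(NO_RESERVATION, Ordering::SeqCst) != addr {
            return false; // no reservation, or reserved a different address
        }
        self.ram[addr]
            .compare_exchange(expected, new, Ordering::SeqCst, Ordering::SeqCst)
            .is_ok()
    }
}

/// Spawn one OS thread per hart, let each increment byte 0 via LR/SC, then join them all.
pub fn run_harts(harts: usize, increments: usize) -> u8 {
    let mem = Arc::new(SharedMem::new(16, harts));
    let handles: Vec<_> = (0..harts)
        .map(|hart| {
            let mem = Arc::clone(&mem);
            thread::spawn(move || {
                for _ in 0..increments {
                    loop {
                        let old = mem.load_reserved(hart, 0);
                        if mem.store_conditional(hart, 0, old, old.wrapping_add(1)) {
                            break; // SC succeeded; otherwise retry like a guest LR/SC loop
                        }
                    }
                }
            })
        })
        .collect();
    for handle in handles {
        handle.join().expect("hart thread panicked");
    }
    mem.ram[0].load(Ordering::SeqCst)
}
```

The reservation slot is an atomic rather than plain per-thread state so that a fuller design could let stores from other harts clear it; that path, wider accesses, device synchronisation and thread cancellation are all left out of this sketch.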
None of this fits in the perf roadmap. See
../PROGRESS.md §Phase 11 for reference designs
(QEMU MTTCG, rv8, Guo 2019 on fast TLB simulation).
Pre-conditions before opening Phase 11
- P1, P2 (bus-access API), P5 (MMU inline) shipped. Done.
- A reproducible 2-hart Linux benchmark in docs/perf/baselines/<date>/ showing the fraction of time actually parallelisable. Not yet measured.
- P7 re-profile results.