Agent harnesses · a journey up the stack
Ark & ArkOS
Building reliable coding agents — from harness to operating system.
① Agents & harnesses →
② Ark →
③ ArkOS →
④ Future agent model
Agenda
Four steps up the stack
1 Agents & harnesses what an agent is · the 7-part harness framework
2 Ark a workflow harness · tiers · lifecycle · specs
3 ArkOS workflow as a service · self-evolving substrate
4 The future agent model how agents will work · autonomous, isolated, at scale
01
Agents & harnesses
A model in a loop is the engine. The harness is everything that makes it ship.
① Agents & harnesses →
② Ark →
③ ArkOS →
④ Future agent model
① Agents & harnesses
What is an agent ?
AGENT
the agentic loop
reason
act
perceive
🧠 Model
reasoning core
🧰 Tools
edit · shell · git · web
💾 Memory
what it knows / recalls
🗺 Planning
decompose · sequence
ENVIRONMENT — codebase · OS · results
An agent = a loop (perceive → reason → act) wired to a model, tools, memory, and planning — acting on its environment until the goal is met.
① Agents & harnesses
The model is only the engine
📦
Framework
a library — you import it
LangChain · Agents SDK · CrewAI
⚙
Harness
an app — you install & run it
Claude Code · Codex · Aider · OpenHands · Ark
Framework = you import it. Harness = you install & run it.
① Agents & harnesses · the framework
Inside the harness
THE HARNESS — everything that isn't the model
🖥 Interface CLI · IDE · web UI — how you talk to it
🔁 Orchestration the agent loop · planning · subagents · retries
🧠 Context & memory prompt · working state · compaction · recall
🧰 Tools & protocols MCP · bash · file I/O · AGENTS.md
📦 Substrate & sandbox worktree → container → microVM
CROSS-CUTTING
📊 Evaluation SWE-bench · RL envs
📡 Observability traces · cost · replay
🛡 Guardrails gates · policy · audit
🧩 Reference impls Claude Code · Codex
apply to every layer
Four layers of runtime, three cross-cutting concerns — the Agent Harness Engineering survey's seven dimensions, in one map.
① Agents & harnesses · in detail
What each dimension does
Dimension What it owns Key techniques
🔁 Orchestration how the agent decides & coordinates work loop · subagents · skill composition
🧠 Context & state what fits in the window vs. what persists compaction · memory injection · planning-as-file
📦 Substrates where code runs, safely & resumably microVM · snapshot/restore · egress policy
🔌 Protocols how the agent talks to tools & repos MCP · AGENTS.md · git-native contracts
📊 Evaluation whether a change is actually an improvement SWE-bench · trajectory audit · RL envs
📡 Observability seeing what the agent did & what it cost tracing · cost attribution · replay
🛡 Guardrails keeping autonomy safe & accountable approval gates · policy · fail-closed
① Agents & harnesses · the thesis
The harness is load-bearing
46→80
same model
two harnesses, one benchmark
+26%
LangChain
harness change only
23→45
scaffold
basic vs tuned, SWE-bench
Same model. Only the wrapper changed.
02
Ark
An opinionated workflow harness — the seven dimensions, with discipline baked in.
① Agents & harnesses →
② Ark →
③ ArkOS →
④ Future agent model
② Ark · the gap
Today's harnesses have no process
🌀 No structure no PRD, no plan, no record of why
👤 No second opinion the agent grades its own work
🧩 No memory knowledge evaporates between sessions
② Ark · positioning
A workflow harness above the agent
Ark — workflow harness
tiered lifecycle · gates · SPECs · journals
coding-agent harness
Claude Code · Codex · OpenCode
model + tools (the agent loop)
host — your machine, git, filesystem
Ark is local, multi-platform, workflow-opinionated — the meta-layer that turns any coding agent into a software-engineering process.
② Ark · the framework
Ark on the seven dimensions
Dimension Ark's answer
🔁 Orchestration tiered lifecycle + researcher / reviewer / verifier subagents
🧠 Context & state structured artifacts on disk · ark context projection
📦 Substrates worktree-per-task · .ark.db snapshots
🔌 Protocols AGENTS.md · slash commands · per-platform templates
📊 Evaluation VERIFY gate · PLAN → REVIEW step
🛡 Guardrails state machine · user-staged atomic commit
Same seven axes — Ark's take is opinionated workflow discipline .
② Ark · the idea
Two pieces, working as one
⌘ ark-cli
the mechanism — a Rust binary you install & run
◇ ark-workflow
the discipline — a tiered lifecycle in workflow.md
Install once → every transition runs through an ark command.
② Ark · in practice
What it looks like
~/my-project — ark
$ ark init # scaffold .ark/ + agent integrations
✓ initialized ark · claude-code, codex, opencode
$ ark context # orient: git + tasks + specs
› /ark:design --deep "refactor auth layer"
DESIGN → PLAN → REVIEW → EXECUTE → VERIFY → COMMIT
✓ one atomic commit: work + task.toml + SPEC + journal
Two layers of commands: the public lifecycle (init / context / …) and what you type inside the agent (/ark:*).
② Ark · CLI
One binary, two layers
⌘ Public — manage Ark's footprint
init context load unload upgrade archive cleanup
◇ Hidden ark agent — the workflow engine
task
new plan review execute verify commit archive
spec
extract import register
You type the slash commands ; they call ark agent. The state machine guards every transition.
② Ark · tiers
Pick the smallest tier that fits
⚡ Quick /ark:quick · reversible
PRD
◆ Standard /ark:design · feature scope
PRD PLAN VERIFY
◈ Deep /ark:design --deep · architectural
PRD PLAN → REVIEW VERIFY SPEC
🔬 Research /ark:research · corpus = deliverable
PRD research/
When in doubt, pick lower — promotion is cheap.
② Ark · workflow
One task, end to end
DESIGN
task new
PLAN
task plan
REVIEW
task review
deep only
fold findings → PLAN
EXECUTE
task execute
VERIFY
task verify
COMMIT
task commit -m
gate: no PENDING
ARCHIVED
ark archive
later · bulk
RESEARCH
/ark:research
research → commit → archive (no PLAN/REVIEW/VERIFY)
Each phase is an ark agent command. You type a slash command; it drives the state machine.
② Ark · lifecycle · decide
Decide what , then how
🟢
DESIGN
write the PRD — What · Why · Outcome · related specs
ark agent task new
⛓ fields filled
🔵
PLAN
elaborate how — goals G-N · architecture · API · validation
ark agent task plan
✓ every G-N → a test
🟣
REVIEW deep · single pass
independent reviewer files R-NNN — fold findings into PLAN, then execute
ark agent task review
you pick the reviewer
② Ark · lifecycle · ship
Build · audit · close
🟠
EXECUTE
implement the PLAN, following project & feature specs
ark agent task execute
✓ tests · lints · build
🟡
VERIFY
audit shipped code — spec compliance · plan fidelity · drift
ark agent task verify
⛓ nothing PENDING
🔶
COMMIT
you stage, then one atomic commit closes the task
ark agent task commit
→ next slide
② Ark · the keystone
⚛ commit — five things, one commit
① VERIFY gate deep: refuse on PENDING
② extract SPEC deep · → features tree
③ task.toml phase = Committed
④ stage files exactly Ark's · no -A
⑤ journal session entry
1 git commit
all-or-nothing
on failure ↺
scoped rollback
your index untouched
② Ark · subagents
The author never grades itself
🔎
ark-researcher
DESIGN · PLAN
gathers what the session lacks → research/
⚖
ark-reviewer
REVIEW · deep
judges the PLAN → verdict + R-NNN
✓
ark-verifier
VERIFY · final gate
audits shipped code → V-NNN
author → hands off → independent agent you choose
② Ark · memory
Memory ≠ context
💨
Context
in-prompt · ephemeral
gone when the session ends · costs tokens every turn
💾
Memory
on disk · durable
survives sessions, agents, platforms · loaded on demand
Ark's bet: state lives in structured files — PRD · PLAN · SPEC · VERIFY · journals. Available, not resident.
② Ark · memory · specs
Specs — the durable contract
📐 Project specs
specs/project/<name>/SPEC.md
user-authored conventions
read before EVERY task · always apply
🧩 Feature specs
specs/features/…/SPEC.md (recursive)
machine-promoted contracts
read only the ones your task touches
deep PLAN
## Spec section
on commit
extract → SPEC.md
+ upsert INDEX leaf→root
VERIFY checks adherence
drift needs a CHANGELOG entry to pass
Anchored, versioned, drift-detected — specs are how knowledge compounds instead of evaporating between sessions.
② Ark · memory · the rest
The rest of the durable state
✎ Journals .ark/workspace/<dev>/ a session block written on every commit
⑂ Worktrees one branch / task parallel tasks, no collisions
⤓ Snapshots unload / load freeze & restore the whole footprint
All on disk, versioned with the repo — the project remembers.
② Ark · architecture
Inside the binary
ARK-CLI · thin adapter
main.rs
agent_cli.rs
parse · dispatch · render(summary)
ARK-CORE · all logic
commands/
init load unload
upgrade context …
commands/agent/
task · spec
workspace · worktree
platforms.rs
one path →
claude·codex·opencode
layout.rs
rooted paths
discover·resolve_safe
state/
manifest·snapshot
.state.toml
io/ — structured FS & git
PathExt · managed blocks · git
agent/state.rs
tier × phase transition table
templates.rs — include_dir!()
ark·claude·codex·opencode trees in binary
The CLI parses & prints; the core does everything. A platform registry drives all three agents from one code path .
03
ArkOS
The harness becomes a substrate — workflow as a service, for agents instead of humans.
① Agents & harnesses →
② Ark →
③ ArkOS →
④ Future agent model
③ ArkOS · RFC 001
Next: ArkOS
What Ark does for humans , ArkOS does for agents .
⌘
Ark
human · gated
a CLI you drive
◈
ArkOS
agent · autonomous
workflow as a service
services → lifecycle task tree memory SPEC storage event log grounding hooks
③ ArkOS · how it evolves
Swap the agent form. The substrate stays.
WORKLOADS
tasks agents work on · autonomous orchestrators · products — supply the grounding signal
SUBSTRATE — peers, different audience
Ark · human audience
CLI harness · human-gated · no self-improve
ArkOS · agent audience
workflow as service · autonomy-operable · self-improves
AGENT RUNTIMES — the swappable "agent form"
Claude Code
Codex
OpenCode
→
native runtime (stage 2)
+ raw LLM API
HOST — POSIX · Linux · git · process isolation
Stage 1: bootstrap on today's runtimes (Ark's primitives re-exposed to agents). Stage 2: grow a native runtime — the agent form swaps, the substrate is constant.
③ ArkOS · self-evolution
Self-evolving — but grounded
ArkOS rev N+1
improved primitives
agents run
a real workload
workload grades
tests · LTP · panel
faster · fewer iterations · smaller budget → keep the change
🚫 the substrate never grades itself
it cannot edit its own evaluation harness · the judge is independent of the generator
Like the Linux kernel: it doesn't self-grade — programs running on it shipping faster is the signal.
04
The future agent model
My vision for how agents will work — autonomous, isolated, and running at scale.
① Agents & harnesses →
② Ark →
③ ArkOS →
④ Future agent model
④ The future · the agent model
From one agent to a fleet of sandboxes
🧑💻
Today
one agent · your terminal
you drive it, step by step
→
🛰️
Next
N isolated sandboxes
each drives its own project, on its own
Isolation is a stack : worktree → container → microVM / hypervisor. The fleet model takes it all the way down.
④ The future · architecture
Hypervisor below · agent sandboxes above
HYPERVISOR
manages VMs · CPU · memory · isolation · scheduling
bare metal / cloud host
VMs
microVM
microVM
microVM
AGENT SANDBOXES
📁 project + 🤖 ArkOS
agent drives it ⟳
📁 project + 🤖 ArkOS
agent drives it ⟳
📁 project + 🤖 ArkOS
agent drives it ⟳
+ … N more
running in parallel
fully isolated
One hypervisor schedules many VMs; each VM hosts one agent sandbox — a project plus its own ArkOS agent. No sibling-interference: every project is walled off.
④ The future · inside a sandbox
Each sandbox runs itself
sandbox (one microVM)
🤖 ArkOS agent
workflow as a service
grounded · self-evolving
📁 the project
code · tests · git
the grounding signal
DESIGN→PLAN→…→COMMIT
tests pass / fail → next move
⟳ continuously, with no human in the loop
Gates from Ark · autonomy & grounding from ArkOS · isolation from the sandbox.
① Agent a model in a loop — the harness makes it reliable
② Ark a workflow harness — gates, review, living specs
③ ArkOS workflow as a service — self-evolving, grounded
④ The fleet autonomous agents, isolated, at scale