Testing your agent
AgentKitTesting is a separate library product that ships scriptable test doubles so you can
test your tools and your agent loop fully offline: no network, no API keys, no
burned cloud calls. Add it as a test-only dependency and import it from your tests
alongside AgentKit.
import AgentKit
import AgentKitTesting
The building blocks:
SimulatedProvider drives a real reason -> tool -> respond loop from a script.
SimulatedTool / SimulatedToolDomain stand in for your tools with handlers you control.
SimulatedGuard forces allow / confirm / deny paths.
RecordingUndoProvider records what a turn added, committed, or rolled back.
ToolSpy wraps any executor and records the calls and outcomes it saw.
Script a multi-turn loop
SimulatedProvider dequeues one SimulatedTurn per provider round-trip. A turn
that requests a tool drives the loop forward; the next turn sees the tool result
fed back into the conversation, exactly as a live provider would.
let provider = SimulatedProvider(turns: [
    .toolCall(id: "1", name: "timeline.trim_clip", arguments: .object(["clip": .string("a")])),
    .text("Done — trimmed clip a."),
])
let registry = ToolRegistry()
try registry.register(SimulatedToolDomain(domainId: "timeline", tools: [
    SimulatedTool(id: "timeline.trim_clip") { _ in
        .success(ToolResultPayload(content: [.text("trimmed")]))
    },
]))
let session = try AgentSession(provider: provider, role: AgentRole(staticPersona: "editor"), registry: registry)
try await session.send("Trim clip a")
#expect(provider.recordedRequests.count == 2) // tool turn, then response turn
#expect(session.currentText.contains("Done"))
Build turns with the validated factories — .text, .toolCall, .assistant
(text plus one or more tool calls plus usage), .failure (throws before any
event), and .assistantThenFailure (well-formed partial output, then the stream
errors). Each emits an event sequence a real provider could actually produce, so
you cannot accidentally script an impossible stream; reach for .unsafeEvents only
to exercise malformed-protocol handling. If the agent asks for more turns than you
scripted, the stream fails with
SimulatedProviderError.scriptExhausted(requestIndex:) rather than trapping.
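The exhausted-script behavior described above can be pinned down in a test. This is a minimal sketch using Swift Testing. It assumes that session.send(_:) rethrows the provider's stream failure. Whether the error arrives as SimulatedProviderError.scriptExhausted directly or wrapped in a session-level error is not stated here, so the assertion only checks that some error is thrown.

```swift
import Testing
import AgentKit
import AgentKitTesting

@Test func exhaustedScriptFailsInsteadOfTrapping() async throws {
    // Only one turn is scripted, but the tool call forces a second round-trip.
    let provider = SimulatedProvider(turns: [
        .toolCall(id: "1", name: "timeline.trim_clip", arguments: .object([:])),
    ])

    let registry = ToolRegistry()
    try registry.register(SimulatedToolDomain(domainId: "timeline", tools: [
        SimulatedTool(id: "timeline.trim_clip") { _ in
            .success(ToolResultPayload(content: [.text("trimmed")]))
        },
    ]))

    let session = try AgentSession(
        provider: provider,
        role: AgentRole(staticPersona: "editor"),
        registry: registry
    )

    // Assumption: send(_:) surfaces the stream failure by throwing.
    await #expect(throws: (any Error).self) {
        try await session.send("Trim clip a")
    }
}
```

If the error does reach the test unwrapped, the assertion could be narrowed to SimulatedProviderError.self.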
Test a custom tool: success, failure, denial, and undo
Give a tool a handler that returns whatever outcome you want to exercise, attach a
RecordingUndoProvider, and assert the effect.
let undo = RecordingUndoProvider()
let domain = try SimulatedToolDomain(domainId: "timeline", tools: [
    SimulatedTool(id: "timeline.trim_clip") { _ in .success(ToolResultPayload(content: [.text("ok")])) },
])
let provider = SimulatedProvider(turns: [
    .toolCall(id: "1", name: "timeline.trim_clip", arguments: .object([:])),
    .text("done"),
])
let session = try AgentSession(
    provider: provider,
    role: AgentRole(staticPersona: "editor"),
    registry: { let r = ToolRegistry(); try r.register(domain); return r }(),
    undoProvider: undo
)
try await session.send("trim")
let transaction = try #require(undo.transactions.first)
#expect(transaction.entries.map(\.toolName) == ["timeline.trim_clip"]) // the successful tool was recorded for undo
#expect(transaction.didCommit)
A handler returning .failed(ToolErrorPayload(...)) lets you test the failure
branch; a handler that throws exercises the rollback path
(transaction.didRollback). To test that policy blocks a call, deny it with a
guard:
let session = try AgentSession(
    provider: provider,
    role: AgentRole(staticPersona: "editor"),
    registry: registry,
    guards: [SimulatedGuard(fixed: .deny(reason: "read-only test"))]
)
// A denied call never reaches your tool's handler.
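The rollback path mentioned earlier (a handler that throws) can be exercised the same way. The sketch below is illustrative only: TrimFailure is a hypothetical error type, and it assumes a rolled-back transaction still appears in undo.transactions. It is not stated whether send(_:) rethrows a tool error, so the call ignores either outcome.

```swift
import Testing
import AgentKit
import AgentKitTesting

// Hypothetical error type, used only to make the handler throw.
struct TrimFailure: Error {}

@Test func throwingHandlerRollsBack() async throws {
    let undo = RecordingUndoProvider()

    let registry = ToolRegistry()
    try registry.register(SimulatedToolDomain(domainId: "timeline", tools: [
        SimulatedTool(id: "timeline.trim_clip") { _ in throw TrimFailure() },
    ]))

    let provider = SimulatedProvider(turns: [
        .toolCall(id: "1", name: "timeline.trim_clip", arguments: .object([:])),
        .text("done"), // kept in case the loop continues after the tool error
    ])

    let session = try AgentSession(
        provider: provider,
        role: AgentRole(staticPersona: "editor"),
        registry: registry,
        undoProvider: undo
    )

    // Assumption: send(_:) may or may not rethrow the tool error; accept both.
    _ = try? await session.send("trim")

    // Assumption: the rolled-back transaction is still recorded.
    let transaction = try #require(undo.transactions.first)
    #expect(transaction.didRollback)
}
```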
Spy on tool calls
Wrap any executor in a ToolSpy to record the calls it received and their outcomes,
including thrown errors:
let spy = ToolSpy(domain.executor)
// ... drive the agent ...
#expect(spy.calls.map(\.name) == ["timeline.trim_clip"])
Simulate the cloud loop
To test how your agent behaves against AgentKit Cloud's loop semantics offline, use
the cloudProfileLoop capabilities preset:
let provider = SimulatedProvider(capabilities: .cloudProfileLoop, turns: [.text("hi")])
This pins the cloud profile's loop capabilities (eager session tools, server-managed system prompt). It does not simulate the cloud transport itself.
Replay recorded HTTP responses
SimulatedProvider tests the agent loop; RecordedTransport tests the layer below
it: a provider's real path from building the request, through streaming bytes, to
parsing them, replayed from recorded HTTP fixtures with no network and no API keys.
Pass its urlSession as a provider's session: parameter and the provider runs its
genuine URLSession streaming against the fixture bytes.
let transport = RecordedTransport.replaying(from: fixturesDir, provider: RecordedTransport.anthropic)
let provider = AnthropicProvider(apiKey: "test", session: transport.urlSession)
for try await event in provider.stream(request) {
    // parsed from the recorded bytes, exactly as in production
}
transport.invalidate() // release the session and unregister
Each transport is isolated by an opaque token, so parallel tests never cross-match.
A request with no matching fixture fails the stream with an inspectable NSError in
the "AgentKitTesting.RecordedTransport" domain carrying the request method and URL,
so a miss is easy to diagnose. The same seam works for every provider, including
BackendRouterProvider in cloud-profile mode via OfflineCloudRequestSigner (which
adds no auth headers, for local tests only).
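To assert on a fixture miss, one option is a small helper that drains a stream and hands back the error when it belongs to the RecordedTransport domain. The helper name recordedTransportMiss(in:) is hypothetical and not part of AgentKitTesting. The sketch also assumes the provider surfaces the NSError unwrapped rather than wrapping it in its own error type.

```swift
import Foundation

/// Drains `stream` and returns the error it failed with, but only when that
/// error belongs to the RecordedTransport domain. Returns nil if the stream
/// finished cleanly or failed with an unrelated error.
func recordedTransportMiss<S: AsyncSequence>(in stream: S) async -> NSError? {
    do {
        for try await _ in stream {}
        return nil // finished without error
    } catch {
        let nsError = error as NSError
        return nsError.domain == "AgentKitTesting.RecordedTransport" ? nsError : nil
    }
}
```

A test could call await recordedTransportMiss(in: provider.stream(request)), assert that the result is non-nil, and then inspect the returned error for the request method and URL it carries.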
Fixtures are matched by request method and URL, not body, because URLSession does not
expose a request's body to the replay layer. Repeated requests to the same (method, URL)
pair are disambiguated by recorded order: recording writes a numbered sequence and replay
returns it in the same order, so a multi-turn loop that calls one endpoint repeatedly
replays faithfully. Replaying past the end of the recorded sequence is a miss.
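The order-based matching can be pictured with a toy model. The types below (RequestKey, OrderedFixtureStore) are hypothetical and are not AgentKitTesting's implementation. They only illustrate the rule: one queue per (method, URL), consumed in recorded order, with a miss once the queue is empty.

```swift
import Foundation

// Hypothetical key: fixtures are looked up by method and URL only, never by body.
struct RequestKey: Hashable {
    let method: String
    let url: URL
}

// Hypothetical store: a toy model of order-based replay.
struct OrderedFixtureStore {
    private var queues: [RequestKey: [Data]] = [:]

    /// Appends a response to the sequence recorded for (method, url).
    mutating func record(method: String, url: URL, body: Data) {
        queues[RequestKey(method: method, url: url), default: []].append(body)
    }

    /// Returns the next response in recorded order, or nil on a miss
    /// (unknown key, or replay past the end of the recorded sequence).
    mutating func next(method: String, url: URL) -> Data? {
        let key = RequestKey(method: method, url: url)
        guard var queue = queues[key], !queue.isEmpty else { return nil }
        let body = queue.removeFirst()
        queues[key] = queue
        return body
    }
}
```

In this model, two POSTs recorded against the same URL replay as first, then second, and a third request to that URL is a miss.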
Recording fixtures from real responses is available behind
@_spi(Experimental) import AgentKitTesting (RecordedTransport.recording(to:provider:));
the default redactor strips auth headers and cookies before anything is written. The
fixture format is stable at v1; replay is the supported 1.0 surface, while recording
remains experimental.