Testing your agent

AgentKitTesting is a separate library product that provides scriptable test doubles, so you can test your tools and your agent loop fully offline: no network, no API keys, no burned cloud calls. Add it as a test-only dependency and import it in your tests alongside AgentKit.

import AgentKit
import AgentKitTesting

The building blocks:

SimulatedProvider drives a real reason -> tool -> respond loop from a script.
SimulatedTool / SimulatedToolDomain stand in for your tools with handlers you control.
SimulatedGuard forces allow / confirm / deny paths.
RecordingUndoProvider records what a turn added, committed, or rolled back.
ToolSpy wraps any executor and records the calls and outcomes it saw.

Script a multi-turn loop

SimulatedProvider dequeues one SimulatedTurn per provider round-trip. A turn that requests a tool drives the loop forward; the next turn sees the tool result fed back into the conversation, exactly as a live provider would.

let provider = SimulatedProvider(turns: [
    .toolCall(id: "1", name: "timeline.trim_clip", arguments: .object(["clip": .string("a")])),
    .text("Done — trimmed clip a."),
])
let registry = ToolRegistry()
try registry.register(SimulatedToolDomain(domainId: "timeline", tools: [
    SimulatedTool(id: "timeline.trim_clip") { _ in
        .success(ToolResultPayload(content: [.text("trimmed")]))
    },
]))
let session = try AgentSession(provider: provider, role: AgentRole(staticPersona: "editor"), registry: registry)

try await session.send("Trim clip a")

#expect(provider.recordedRequests.count == 2)            // tool turn, then response turn
#expect(session.currentText.contains("Done"))

Build turns with the validated factories — .text, .toolCall, .assistant (text plus one or more tool calls plus usage), .failure (throws before any event), and .assistantThenFailure (well-formed partial output, then the stream errors). Each emits an event sequence a real provider could actually produce, so you cannot accidentally script an impossible stream; reach for .unsafeEvents only to exercise malformed-protocol handling. If the agent asks for more turns than you scripted, the stream fails with SimulatedProviderError.scriptExhausted(requestIndex:) rather than trapping.
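As a rough mental model of the dequeue-and-exhaust behaviour described above, the sketch below uses plain Swift with no AgentKit dependency. ScriptedQueue and ScriptError are hypothetical names invented for illustration, and the zero-based request index is an assumption of this sketch, not a statement about how SimulatedProvider numbers its requests.

```swift
// Illustrative model only: ScriptedQueue and ScriptError are hypothetical
// names, not part of AgentKitTesting.
enum ScriptError: Error, Equatable {
    case exhausted(requestIndex: Int)
}

struct ScriptedQueue<Turn> {
    private var turns: [Turn]
    private(set) var requestCount = 0

    init(turns: [Turn]) {
        self.turns = turns
    }

    /// Returns the next scripted turn, or throws once the script has run out.
    mutating func next() throws -> Turn {
        let index = requestCount
        requestCount += 1   // every request is counted, including the failing one
        guard !turns.isEmpty else {
            // Fail with an inspectable error instead of trapping.
            throw ScriptError.exhausted(requestIndex: index)
        }
        return turns.removeFirst()
    }
}

var script = ScriptedQueue(turns: ["tool call", "final text"])
let firstTurn = try script.next()
let secondTurn = try script.next()

// A third request goes past the end of the two-turn script.
var failure: ScriptError?
do {
    _ = try script.next()
} catch {
    failure = error as? ScriptError
}
```

Throwing rather than trapping keeps the failure inside the test that caused it: the test can assert on the error, and the rest of the suite keeps running.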

Test a custom tool: success, failed, denied, and undo

Give a tool a handler that returns whatever outcome you want to exercise, attach a RecordingUndoProvider, and assert the effect.

let undo = RecordingUndoProvider()
let domain = try SimulatedToolDomain(domainId: "timeline", tools: [
    SimulatedTool(id: "timeline.trim_clip") { _ in .success(ToolResultPayload(content: [.text("ok")])) },
])
let provider = SimulatedProvider(turns: [
    .toolCall(id: "1", name: "timeline.trim_clip", arguments: .object([:])),
    .text("done"),
])
let registry = ToolRegistry()
try registry.register(domain)
let session = try AgentSession(
    provider: provider,
    role: AgentRole(staticPersona: "editor"),
    registry: registry,
    undoProvider: undo
)

try await session.send("trim")

let transaction = try #require(undo.transactions.first)
#expect(transaction.entries.map(\.toolName) == ["timeline.trim_clip"])   // the successful tool was recorded for undo
#expect(transaction.didCommit)

A handler returning .failed(ToolErrorPayload(...)) lets you test the failure branch; a handler that throws exercises the rollback path (transaction.didRollback). To test that policy blocks a call, deny it with a guard:

let session = try AgentSession(
    provider: provider,
    role: AgentRole(staticPersona: "editor"),
    registry: registry,
    guards: [SimulatedGuard(fixed: .deny(reason: "read-only test"))]
)
// A denied call never reaches your tool's handler.

Spy on tool calls

Wrap any executor in a ToolSpy to record every call and outcome it received, including the thrown-error path:

let spy = ToolSpy(domain.executor)
// ... drive the agent ...
#expect(spy.calls.map(\.name) == ["timeline.trim_clip"])
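The spy pattern itself can be sketched in plain Swift. This is a minimal illustration of wrapping a function and recording both returned values and thrown errors; CallSpy, Outcome, and EmptyName are hypothetical names, and nothing here reflects ToolSpy's actual signatures.

```swift
// Illustrative model only: CallSpy and its members are hypothetical names,
// not AgentKitTesting's ToolSpy API.
final class CallSpy<Input, Output> {
    enum Outcome {
        case returned(Output)
        case threw(any Error)
    }

    private(set) var calls: [(input: Input, outcome: Outcome)] = []
    private let wrapped: (Input) throws -> Output

    init(wrapping wrapped: @escaping (Input) throws -> Output) {
        self.wrapped = wrapped
    }

    /// Forwards to the wrapped function and records what happened either way.
    func callAsFunction(_ input: Input) throws -> Output {
        do {
            let output = try wrapped(input)
            calls.append((input: input, outcome: .returned(output)))
            return output
        } catch {
            calls.append((input: input, outcome: .threw(error)))
            throw error   // the spy observes the error; it never swallows it
        }
    }
}

struct EmptyName: Error {}

let callSpy = CallSpy<String, String>(wrapping: { name in
    guard !name.isEmpty else { throw EmptyName() }
    return "trimmed \(name)"
})

let trimmed = try callSpy("clip-a")   // recorded as .returned
let rejected = try? callSpy("")       // recorded as .threw, then rethrown
```

Because the wrapper rethrows, the code under test sees exactly the behaviour it would see without the spy; only the recording is added.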

Simulate the cloud loop

To test how your agent behaves against AgentKit Cloud's loop semantics offline, use the cloudProfileLoop capabilities preset:

let provider = SimulatedProvider(capabilities: .cloudProfileLoop, turns: [.text("hi")])

This pins the cloud profile's loop capabilities (eager session tools, server-managed system prompt). It does not simulate the cloud transport itself.

Replay recorded HTTP responses

SimulatedProvider tests the agent loop; RecordedTransport tests the layer below it: a provider's real path from request building through byte streaming to parsing, replayed from recorded HTTP fixtures with no network and no API keys. Pass its urlSession as a provider's session: parameter, and the provider runs its genuine URLSession streaming against the fixture bytes.

let transport = RecordedTransport.replaying(from: fixturesDir, provider: RecordedTransport.anthropic)
let provider = AnthropicProvider(apiKey: "test", session: transport.urlSession)

for try await event in provider.stream(request) {
    // parsed from the recorded bytes, exactly as in production
}

transport.invalidate()   // release the session and unregister

Each transport is isolated by an opaque token, so parallel tests never cross-match. A request with no matching fixture fails the stream with an inspectable NSError in the "AgentKitTesting.RecordedTransport" domain carrying the request method and URL, so a miss is easy to diagnose. The same seam works for every provider, including BackendRouterProvider in cloud-profile mode via OfflineCloudRequestSigner (which adds no auth headers, for local tests only).
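When a replayed stream fails, a small helper can tell a fixture miss apart from other errors by checking the error domain named above. This is a sketch under two assumptions: the provider surfaces the NSError without wrapping it, and the helper name isFixtureMiss is hypothetical. The userInfo keys that carry the method and URL are not specified here, so the sketch does not read them.

```swift
import Foundation

// The domain string is taken from the text above; the helper name is hypothetical.
let recordedTransportErrorDomain = "AgentKitTesting.RecordedTransport"

/// Returns true when the error belongs to the RecordedTransport domain,
/// which the text above associates with a request that matched no fixture.
func isFixtureMiss(_ error: any Error) -> Bool {
    (error as NSError).domain == recordedTransportErrorDomain
}

// Stand-in errors, used only to exercise the helper.
let simulatedMiss = NSError(domain: recordedTransportErrorDomain, code: 0)
let unrelatedError = NSError(domain: NSCocoaErrorDomain, code: 0)

let missDetected = isFixtureMiss(simulatedMiss)
let unrelatedDetected = isFixtureMiss(unrelatedError)
```

In a test, the helper would typically be called in the catch block around the for try await loop, so a missing fixture produces a targeted failure message instead of a generic stream error.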

Fixtures are matched by request method and URL, not by body, because URLSession does not expose a request's body to the replay layer. Repeated requests to the same (method, URL) pair are disambiguated by recorded order: recording writes a numbered sequence and replay returns it in the same order, so a multi-turn loop that calls one endpoint repeatedly replays faithfully. Replaying past the end of the recorded sequence is a miss.
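The matching rule can be modelled in a few lines of plain Swift: key each fixture by (method, URL), keep responses in recorded order, and advance a per-key cursor on replay. FixtureKey, ReplayStore, ReplayMiss, and the example URL are hypothetical and exist only for this illustration; this is not AgentKitTesting's fixture format or implementation.

```swift
// Illustrative model only: FixtureKey, ReplayStore, and ReplayMiss are
// hypothetical names, not AgentKitTesting's fixture format or API.
struct FixtureKey: Hashable {
    let method: String
    let url: String
}

struct ReplayMiss: Error, Equatable {
    let key: FixtureKey
}

struct ReplayStore {
    private var recorded: [FixtureKey: [String]] = [:]   // responses in recorded order
    private var cursor: [FixtureKey: Int] = [:]          // next index to replay, per key

    init() {}

    mutating func record(_ key: FixtureKey, response: String) {
        recorded[key, default: []].append(response)
    }

    /// Returns the next recorded response for this (method, URL).
    /// The request body is never consulted.
    mutating func replay(_ key: FixtureKey) throws -> String {
        let index = cursor[key, default: 0]
        guard let responses = recorded[key], index < responses.count else {
            throw ReplayMiss(key: key)   // unknown key, or past the end of the sequence
        }
        cursor[key] = index + 1
        return responses[index]
    }
}

// Hypothetical endpoint, called once per turn of a two-turn loop.
let messages = FixtureKey(method: "POST", url: "https://example.invalid/v1/messages")
var store = ReplayStore()
store.record(messages, response: "turn 1: tool call")
store.record(messages, response: "turn 2: final text")

let firstReplay = try store.replay(messages)
let secondReplay = try store.replay(messages)
let pastTheEnd = try? store.replay(messages)   // nil: the sequence is used up
```

Order-based matching makes replay deterministic without inspecting bodies, at the cost that a test which changes the number or order of requests needs its fixtures re-recorded.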

Recording fixtures from real responses is available behind @_spi(Experimental) import AgentKitTesting (RecordedTransport.recording(to:provider:)); the default redactor strips auth headers and cookies before anything is written. The fixture format is stable at v1; replay is the 1.0 surface, while recording remains experimental.