Send audio

Attach a WAV clip to a turn and an audio-capable model hears it alongside your text. Audio rides through AgentKit Cloud, which routes the clip to a tier whose model accepts audio. Audio is cloud-only and fail-closed: where audio input cannot be positively confirmed, the call fails loud rather than quietly dropping your clip or sending bytes a route will reject.

Attach audio to a message

send(_:audio:) takes an array of AudioRef. Build each one with a wav factory: .wav(_:) for raw bytes, or .wav(fileURL:) for a file URL.

let clip = try Data(contentsOf: recordingURL)
try await agent.send("Transcribe this voice memo.", audio: [.wav(clip)])
try await agent.send("Summarize the meeting.", audio: [.wav(fileURL: recordingURL)])

The turn appends one user message with a pinned content order: your text first, then the clips in argument order. File URLs are read exactly once, at the send boundary. The read is security-scoped, so a file vended by a picker reads correctly. The conversation stores the bytes, not the URL, so providers never touch your disk and a resumed conversation carries the real audio.
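To make the ordering and read-once behavior concrete, here is a minimal model of it. ClipSource, ContentBlock, and buildUserMessage are hypothetical names invented for this sketch, not AgentKit types, and the sketch leaves out the security-scoped access the SDK performs around file reads.

```swift
import Foundation

// Hypothetical stand-ins for illustration; these are not AgentKit's types.
enum ClipSource {
    case bytes(Data)   // analogous to .wav(clip)
    case file(URL)     // analogous to .wav(fileURL: url)
}

enum ContentBlock: Equatable {
    case text(String)
    case audio(mediaType: String, bytes: Data)
}

/// Builds the single user message for a turn: text first, then clips in argument order.
/// Each file is read exactly once, here, so the stored message holds bytes rather than paths.
func buildUserMessage(text: String, clips: [ClipSource]) throws -> [ContentBlock] {
    var blocks: [ContentBlock] = [.text(text)]
    for clip in clips {
        switch clip {
        case .bytes(let data):
            blocks.append(.audio(mediaType: "audio/wav", bytes: data))
        case .file(let url):
            // The only disk read for this clip; the real SDK also makes it security-scoped.
            let data = try Data(contentsOf: url)
            blocks.append(.audio(mediaType: "audio/wav", bytes: data))
        }
    }
    return blocks
}
```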

WAV only, within a proven profile

Before anything leaves the device, each clip is validated against the proven profile the relay accepts. The bytes must be a RIFF/WAVE container of signed 16-bit PCM, mono or stereo, at 16 kHz, 44.1 kHz, or 48 kHz. The validator reads the fmt chunk and rejects anything off-profile (IEEE float, 8/24/32-bit, more than two channels, an off-allowlist sample rate) before a byte is uploaded:

do {
    try await agent.send("Transcribe.", audio: [.wav(clip)])
} catch AgentSessionError.audioOffProfile(let detail) {
    print("re-encode to 16-bit PCM mono/stereo at 16/44.1/48 kHz: \(detail)")
}
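The fmt-chunk check can be sketched in plain Swift. This is an illustrative reimplementation, not the SDK's validator: WAVCheckError, WAVFormat, readLE, and checkWAVProfile are hypothetical, and the sketch treats only format tag 1 as PCM, so it rejects WAVE_FORMAT_EXTENSIBLE files even when they hold 16-bit PCM (the text above does not say how the SDK handles those).

```swift
import Foundation

// Sketch only: a simplified check of the profile described above, not the SDK's validator.
enum WAVCheckError: Error, Equatable {
    case notWAV               // plays the role of audioNotWAV
    case offProfile(String)   // plays the role of audioOffProfile(detail:)
}

struct WAVFormat: Equatable {
    let channels: Int
    let sampleRate: Int
    let bitsPerSample: Int
}

/// Reads an unsigned little-endian integer of `n` bytes starting at `i`.
func readLE(_ b: [UInt8], at i: Int, count n: Int) -> Int {
    var value = 0
    for k in 0..<n { value |= Int(b[i + k]) << (8 * k) }
    return value
}

func checkWAVProfile(_ data: Data) throws -> WAVFormat {
    let b = [UInt8](data)
    guard b.count >= 12,
          b[0..<4].elementsEqual("RIFF".utf8),
          b[8..<12].elementsEqual("WAVE".utf8) else { throw WAVCheckError.notWAV }
    var offset = 12
    while offset + 8 <= b.count {                        // walk the chunk headers
        let size = readLE(b, at: offset + 4, count: 4)
        let body = offset + 8
        if b[offset..<offset + 4].elementsEqual("fmt ".utf8) {
            guard size >= 16, body + 16 <= b.count else { throw WAVCheckError.notWAV }
            let tag = readLE(b, at: body, count: 2)      // 1 = integer PCM, 3 = IEEE float
            let channels = readLE(b, at: body + 2, count: 2)
            let rate = readLE(b, at: body + 4, count: 4)
            let bits = readLE(b, at: body + 14, count: 2)
            guard tag == 1 else { throw WAVCheckError.offProfile("format tag \(tag)") }
            guard bits == 16 else { throw WAVCheckError.offProfile("\(bits)-bit") }
            guard channels == 1 || channels == 2 else {
                throw WAVCheckError.offProfile("\(channels) channels")
            }
            guard [16_000, 44_100, 48_000].contains(rate) else {
                throw WAVCheckError.offProfile("\(rate) Hz")
            }
            return WAVFormat(channels: channels, sampleRate: rate, bitsPerSample: bits)
        }
        offset = body + size + size % 2                  // RIFF chunks are padded to even length
    }
    throw WAVCheckError.notWAV                           // no fmt chunk found
}
```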

Size and count are capped too. Each clip has a per-clip decoded-size cap (AudioRef.maxBytesPerAudio), one send accepts at most AudioRef.maxAudioBlocksPerRequest clips, and the whole request (images, documents, and audio, across history and this turn) is bounded by a cross-media total. AgentKit Cloud mirrors the relay's enforced limits, but the relay stays the final authority on duration and billing, so the SDK never rejects a clip purely on its estimated length.
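The three caps compose roughly as follows. The limit values are parameters because this page does not state them; MediaLimits, LimitError, and checkLimits are hypothetical, and the sketch assumes the clip count covers only this turn's clips. Nothing here estimates duration, matching the rule that length alone never causes a rejection.

```swift
// Hypothetical sketch: real limits come from AudioRef.maxBytesPerAudio,
// AudioRef.maxAudioBlocksPerRequest, and the SDK's cross-media total.
struct MediaLimits {
    let maxBytesPerClip: Int
    let maxClipsPerRequest: Int
    let maxTotalMediaBytes: Int
}

enum LimitError: Error, Equatable {
    case tooManyClips(count: Int, limit: Int)    // like tooManyAudioBlocks
    case clipTooLarge(bytes: Int, limit: Int)    // like audioTooLarge
    case totalTooLarge(bytes: Int, limit: Int)   // like totalMediaTooLarge
}

/// `clipBytes`: decoded sizes of this turn's clips (assumed to be what the count limit covers).
/// `otherMediaBytes`: every other image, document, and audio payload in the request, history included.
func checkLimits(clipBytes: [Int], otherMediaBytes: Int, limits: MediaLimits) throws {
    guard clipBytes.count <= limits.maxClipsPerRequest else {
        throw LimitError.tooManyClips(count: clipBytes.count, limit: limits.maxClipsPerRequest)
    }
    if let big = clipBytes.first(where: { $0 > limits.maxBytesPerClip }) {
        throw LimitError.clipTooLarge(bytes: big, limit: limits.maxBytesPerClip)
    }
    let total = otherMediaBytes + clipBytes.reduce(0, +)
    guard total <= limits.maxTotalMediaBytes else {
        throw LimitError.totalTooLarge(bytes: total, limit: limits.maxTotalMediaBytes)
    }
}
```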

Where audio works

| Path | Audio input |
|---|---|
| AgentKit Cloud, audio-capable tier | the WAV rides the wire as a real audio block |
| AgentKit Cloud, tier whose model cannot accept audio | fails fast before upload (audioInputUnsupported) |
| Anthropic / Gemini / OpenAI (direct) | not built; send(_:audio:) throws audioRouteUnsupported |
| Apple on-device | not built; throws audioRouteUnsupported |

Direct-provider audio is a separate, later capability. Today a direct route has no audio input, so attaching audio fails loud rather than silently degrading your clip to a text placeholder (silent substitution is exactly the behavior this SDK refuses).

Fail-closed, the opposite of documents. A document send dispatches even when the route's capability is unknown and leaves the backend as the authority. Audio is stricter: the SDK sends a clip only when the route positively confirms audio input through its pre-request capability probe. Until the probe resolves, or against a relay that predates the audio capability, audio support is unknown and the send fails closed with audioInputUnsupported. There are two reasons: the SDK cannot validate a clip against an unknown media-type allowlist, and large audio payloads should never be uploaded on an unconfirmed route. The capability is resolved fresh at send time, so a route that gains audio between turns is picked up on the next send.

If a clip already sits in your conversation history and you continue the conversation on a route that cannot represent it, the send throws audioHistoryUnsupported rather than silently pruning the clip. Seeded or persisted history audio is held to the same bar as a fresh clip: before the turn dispatches, its bytes are re-validated locally (against the per-clip size cap, and against the WAV profile for audio/wav), so a malformed or off-profile historical clip fails loud rather than reaching the relay.
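The routing rules in this section reduce to a small decision. The sketch below uses hypothetical types (RouteKind, Capability, GateError) whose cases stand in for the SDK errors named in the comments; the precedence between a fresh-clip error and a history error is an assumption. It also shows the document rule for contrast. Re-validating history bytes would reuse the same size and WAV checks as a fresh clip and is omitted here.

```swift
// Hypothetical model of the gate; not AgentKit's types.
enum RouteKind { case cloud, direct }
enum Capability { case supported, unsupported, unknown }   // .unknown: probe pending, or relay predates audio

enum GateError: Error, Equatable {
    case routeUnsupported     // stands in for audioRouteUnsupported
    case inputUnsupported     // stands in for audioInputUnsupported
    case historyUnsupported   // stands in for audioHistoryUnsupported
}

/// `capability` is passed in fresh for every send, so a route that gains audio is seen next time.
func gateAudio(route: RouteKind, capability: Capability,
               newClips: Int, historyClips: Int) throws {
    // Fail closed: only a positive confirmation on a cloud route lets audio through.
    if route == .cloud && capability == .supported { return }
    if newClips > 0 {
        throw route == .direct ? GateError.routeUnsupported : GateError.inputUnsupported
    }
    if historyClips > 0 {
        throw GateError.historyUnsupported   // never silently pruned
    }
}

/// For contrast, documents fail open: an unknown capability still dispatches.
func documentMayDispatch(_ capability: Capability) -> Bool {
    capability != .unsupported
}
```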

Media types: an open allowlist

The route advertises the exact audio media types it accepts. The SDK branches on that list rather than hard-coding audio/wav:

  1. A type the SDK knows (WAV) that the route also advertises is fully preflighted locally, then sent.
  2. A type the route advertises but the SDK cannot preflight locally (a future, non-WAV type) is sent with relay-side validation only, and only when you opt in with allowRelayValidatedUnknownAudioTypes. Without the opt-in it fails loud.
  3. A type the route does not advertise fails with unsupportedAudioMediaType(mimeType:advertised:), naming the accepted types.

Today every audio-capable tier advertises ["audio/wav"], so WAV is the path that works end to end. The allowlist is open so new types become available without an SDK change.
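The three branches map onto a short function. This is a sketch with hypothetical names: MediaTypeError.unsupported stands in for unsupportedAudioMediaType(mimeType:advertised:), and the set of locally checkable types is hard-coded to WAV for illustration. Any non-WAV type you try with it (say, audio/flac) is only a placeholder, not a type any tier is known to accept.

```swift
// Hypothetical sketch of the allowlist branching; not AgentKit's API.
enum SendPlan: Equatable {
    case preflightLocallyThenSend   // branch 1: SDK knows the type and the route advertises it
    case relayValidatedOnly         // branch 2: advertised, not locally checkable, caller opted in
}

enum MediaTypeError: Error, Equatable {
    // Stands in for unsupportedAudioMediaType(mimeType:advertised:).
    case unsupported(mimeType: String, advertised: [String])
}

func planAudioSend(mimeType: String, advertised: [String],
                   allowRelayValidatedUnknown: Bool) throws -> SendPlan {
    let preflightable: Set<String> = ["audio/wav"]     // types this sketch can check locally
    guard advertised.contains(mimeType) else {         // branch 3: not advertised by the route
        throw MediaTypeError.unsupported(mimeType: mimeType, advertised: advertised)
    }
    if preflightable.contains(mimeType) { return .preflightLocallyThenSend }
    guard allowRelayValidatedUnknown else {            // branch 2 without the opt-in fails loud
        throw MediaTypeError.unsupported(mimeType: mimeType, advertised: advertised)
    }
    return .relayValidatedOnly
}
```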

Degrade to text, only if you ask

By default an unsupported route or media type fails loud. If you would rather have a clip the route cannot take replaced by a sanitized text placeholder, so the turn still runs, opt in at session construction:

let agent = try AgentSession(
    provider: provider,
    role: role,
    registry: registry,
    degradeUnsupportedAttachmentsToText: true
)

With the opt-in, an unsupported clip is replaced by a descriptor carrying its media type when known and its decoded size when the bytes are already in hand (a file-backed clip that never loaded shows only the media type). The descriptor never includes a filename or path. A malformed or oversized clip still throws: degrading works around an unsupported route; it does not paper over bad input.
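Which failures may degrade, and what the placeholder carries, can be sketched as one function. ClipProblem, DegradeOutcome, resolve, and the placeholder wording are all hypothetical; the point is the shape: only route and media-type problems degrade, and the function takes no filename or path, so the text it produces cannot leak one.

```swift
// Hypothetical sketch; names and placeholder wording are invented for illustration.
enum ClipProblem {
    case unsupportedRoute, unsupportedMediaType          // eligible to degrade
    case malformed, offProfile, tooLarge, tooManyClips   // always thrown
}

enum DegradeOutcome: Equatable {
    case placeholder(String)
    case rethrow
}

/// There is deliberately no filename or path parameter.
func resolve(_ problem: ClipProblem, degradeOptIn: Bool,
             mediaType: String?, decodedBytes: Int?) -> DegradeOutcome {
    switch problem {
    case .malformed, .offProfile, .tooLarge, .tooManyClips:
        return .rethrow                                   // bad input is never papered over
    case .unsupportedRoute, .unsupportedMediaType:
        guard degradeOptIn else { return .rethrow }       // default: fail loud
        var details: [String] = []
        if let type = mediaType { details.append(type) }
        if let size = decodedBytes { details.append("\(size) bytes") }   // nil if never loaded
        let suffix = details.isEmpty ? "" : ": " + details.joined(separator: ", ")
        return .placeholder("[audio attachment omitted\(suffix)]")
    }
}
```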

Capture in-profile on Apple platforms

AVAudioRecorder defaults to off-profile compressed audio. AppleAudioCapture vends the recorder settings for the SDK's default capture profile (16 kHz mono 16-bit little-endian PCM in a WAV container) and self-validates the finished file through the SDK's own preflight, so a bad capture fails at capture, not at send:

let recorder = try AppleAudioCapture.makeRecorder(url: wavURL)
recorder.record()
// ... later ...
recorder.stop()
let clip = try AppleAudioCapture.audioRef(fromRecordedWAV: wavURL)
try await agent.send("Transcribe this.", audio: [clip])

AppleAudioCapture is gated on AVFoundation, independent of the on-device model, so it is available wherever you record audio.
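If you configure AVAudioRecorder yourself instead, in-profile settings look roughly like this. It is a sketch using standard AVFoundation keys, not AppleAudioCapture's source: inProfileRecorderSettings and makeInProfileRecorder are hypothetical names, the sketch relies on AVAudioRecorder choosing the container from the URL's .wav extension, and it omits microphone permission and audio-session setup. Either way, send(_:audio:) still preflights the clip before upload.

```swift
#if canImport(AVFoundation)
import AVFoundation

/// 16 kHz, mono, 16-bit, little-endian integer PCM: the default capture profile described above.
func inProfileRecorderSettings() -> [String: Any] {
    [
        AVFormatIDKey: Int(kAudioFormatLinearPCM),
        AVSampleRateKey: 16_000.0,
        AVNumberOfChannelsKey: 1,
        AVLinearPCMBitDepthKey: 16,
        AVLinearPCMIsFloatKey: false,
        AVLinearPCMIsBigEndianKey: false,
    ]
}

/// Give `url` a .wav extension: AVAudioRecorder picks the file container from the extension.
func makeInProfileRecorder(url: URL) throws -> AVAudioRecorder {
    try AVAudioRecorder(url: url, settings: inProfileRecorderSettings())
}
#endif
```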

Inspect a clip before sending

For transparency ahead of a send (decoded size, sample rate, channels, bit depth, and the worst-case reserved tokens the relay will bill), inspect a WAV locally:

let result = try AudioPreflightResult.inspectWAV(clip)
print("\(result.decodedBytes) bytes, ~\(result.wavProfile?.estimatedReservedTokens ?? 0) tokens")

The duration and token figures are best-effort estimates; the relay owns the authoritative accounting.
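The duration estimate is plain PCM arithmetic: data-chunk bytes divided by sample rate × channels × bytes per sample. A 10-second mono clip at 16 kHz and 16 bits holds 16,000 × 1 × 2 × 10 = 320,000 bytes of samples. The helper below is hypothetical; it does not reproduce the relay's token formula, which this page does not document, and it expects the data-chunk size rather than the whole file size.

```swift
/// Best-effort PCM playback length: bytes ÷ (rate × channels × bytes per sample).
/// `dataBytes` is the size of the data chunk, not the whole file (which adds header bytes).
func estimatedSeconds(dataBytes: Int, sampleRate: Int, channels: Int, bitsPerSample: Int) -> Double {
    let bytesPerSecond = sampleRate * channels * (bitsPerSample / 8)
    return Double(dataBytes) / Double(bytesPerSecond)
}
```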

A note on privacy

Audio carries no filename on the wire, unlike images and documents. A clip name can reveal who or what was recorded, so neither the name nor the path is ever sent on the wire or written into an SDK-generated descriptor or log. A .fileURL clip that enters history without going through send() degrades to a descriptor naming the canonical media type when it can be inferred from the file extension, or to a generic audio descriptor otherwise; either way it never includes the filename or path. The one place a path appears is the unreadableAudioAttachment(url:detail:) error, which carries the URL so you can debug a missing local file. That error is thrown to your handler; it is never part of a request or an SDK log line.
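The extension-only descriptor rule can be sketched as follows. canonicalAudioType and historyDescriptor are hypothetical, the descriptor wording is invented, and the mapping covers only WAV, the one type this page says is sendable today. Only URL.pathExtension is read, so the filename and directory never reach the output.

```swift
import Foundation

/// Hypothetical extension-to-media-type table; WAV is the only type sendable today.
func canonicalAudioType(forExtension ext: String) -> String? {
    switch ext.lowercased() {
    case "wav", "wave": return "audio/wav"
    default: return nil
    }
}

/// Descriptor for a file-backed clip that reached history without send().
/// Only the extension is consulted; the file's name and directory never reach the output.
func historyDescriptor(for url: URL) -> String {
    if let type = canonicalAudioType(forExtension: url.pathExtension) {
        return "[audio attachment: \(type)]"
    }
    return "[audio attachment]"   // generic fallback when the type cannot be inferred
}
```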

Error handling

Without the degrade opt-in, send(_:audio:) throws a typed AgentSessionError before any rejected audio is dispatched, and the conversation is left untouched. With degradeUnsupportedAttachmentsToText, a clip rejected for an unsupported route or media type becomes sanitized text instead of throwing; a malformed, off-profile, oversized, or over-count clip still throws. The cases:

| Error | When |
|---|---|
| audioRouteUnsupported(provider:model:mediaType:) | audio on a direct (non-cloud) route, which has no audio input |
| audioInputUnsupported | a cloud route whose audio support is false or unconfirmed (fail-closed) |
| audioHistoryUnsupported | history carries a clip the resolved route cannot represent |
| unsupportedAudioMediaType(mimeType:advertised:) | the clip's type is not sendable on the route (not advertised, or SDK-unknown without the opt-in) |
| audioOffProfile(detail:) | a WAV clip is outside the proven profile |
| audioNotWAV | the bytes are not a RIFF/WAVE container |
| audioTooLarge(bytes:limit:) | a clip exceeds the per-clip size cap |
| totalMediaTooLarge(bytes:limit:) | the request's total image + document + audio bytes exceed the cross-media cap |
| tooManyAudioBlocks(count:limit:) | more clips than the per-request limit |
| audioCapabilityContractViolation(detail:) | the route advertised a broken audio capability |
| unreadableAudioAttachment(url:detail:) | a .fileURL clip could not be read |
