Send audio
Attach a WAV clip to a turn and an audio-capable model hears it alongside your text. Audio rides through AgentKit Cloud, which routes the clip to a tier whose model accepts audio. Audio is cloud-only and fail-closed: where audio input cannot be positively confirmed, the call fails loud rather than quietly dropping your clip or sending bytes a route will reject.
Attach audio to a message
send(_:audio:) takes an array of AudioRef. Use the wav factories: raw
bytes, or a file URL.
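// Attach raw WAV bytes that are already in memory.
let clip = try Data(contentsOf: recordingURL)
try await agent.send("Transcribe this voice memo.", audio: [.wav(clip)])
// Or attach by file URL; the file is read once, at the send boundary.
try await agent.send("Summarize the meeting.", audio: [.wav(fileURL: recordingURL)])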
The turn appends one user message with a pinned content order: your text first, then the clips in argument order. File URLs are read exactly once, at the send boundary (security-scoped, so a file vended by a picker reads correctly), and the conversation stores bytes, so providers never touch your disk and a resumed conversation carries real content.
WAV only, within a proven profile
Before anything leaves the device, each clip is validated against the proven
profile the relay accepts. The bytes must be a RIFF/WAVE container of signed
16-bit PCM, mono or stereo, at 16 kHz, 44.1 kHz, or 48 kHz. The validator reads
the fmt chunk and rejects anything off-profile (IEEE float, 8/24/32-bit, more
than two channels, an off-allowlist sample rate) before a byte is uploaded:
do {
    try await agent.send("Transcribe.", audio: [.wav(clip)])
} catch AgentSessionError.audioOffProfile(let detail) {
    print("re-encode to 16-bit PCM mono/stereo at 16/44.1/48 kHz: \(detail)")
}
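To make the profile check concrete, here is a minimal sketch of reading a fmt chunk and testing it against the profile described above. It is not the SDK's validator: `WAVFormat`, `WAVCheckError`, `parseWAVFormat`, and `checkProfile` are hypothetical names invented for illustration, and the sketch treats any format tag other than 1 (integer PCM) as off-profile, which is a simplification.

```swift
import Foundation

// Hypothetical stand-ins for illustration; not the SDK's own types.
struct WAVFormat: Equatable {
    let audioFormat: UInt16    // 1 = integer PCM, 3 = IEEE float
    let channels: UInt16
    let sampleRate: UInt32
    let bitsPerSample: UInt16
}

enum WAVCheckError: Error, Equatable {
    case notWAV
    case offProfile(String)
}

/// Reads a little-endian unsigned integer of `count` bytes at `offset`.
func readLE(_ bytes: [UInt8], _ offset: Int, _ count: Int) -> UInt32 {
    var value: UInt32 = 0
    for i in 0..<count {
        value |= UInt32(bytes[offset + i]) << UInt32(8 * i)
    }
    return value
}

/// Walks the RIFF chunk list and returns the fields of the first fmt chunk.
func parseWAVFormat(_ data: Data) throws -> WAVFormat {
    let bytes = [UInt8](data)
    guard bytes.count >= 12,
          bytes[0..<4].elementsEqual("RIFF".utf8),
          bytes[8..<12].elementsEqual("WAVE".utf8) else {
        throw WAVCheckError.notWAV
    }
    var offset = 12
    while offset + 8 <= bytes.count {
        let size = Int(readLE(bytes, offset + 4, 4))
        let body = offset + 8
        if bytes[offset..<offset + 4].elementsEqual("fmt ".utf8) {
            guard size >= 16, body + 16 <= bytes.count else {
                throw WAVCheckError.notWAV
            }
            return WAVFormat(
                audioFormat: UInt16(readLE(bytes, body, 2)),
                channels: UInt16(readLE(bytes, body + 2, 2)),
                sampleRate: readLE(bytes, body + 4, 4),
                bitsPerSample: UInt16(readLE(bytes, body + 14, 2)))
        }
        offset = body + size + (size % 2)   // RIFF chunks are padded to even length
    }
    throw WAVCheckError.notWAV
}

/// Applies the profile from this guide: 16-bit integer PCM, 1-2 channels,
/// 16 kHz, 44.1 kHz, or 48 kHz.
func checkProfile(_ format: WAVFormat) throws {
    guard format.audioFormat == 1 else {
        throw WAVCheckError.offProfile("format tag \(format.audioFormat) is not integer PCM")
    }
    guard format.bitsPerSample == 16 else {
        throw WAVCheckError.offProfile("\(format.bitsPerSample)-bit samples")
    }
    guard (1...2).contains(format.channels) else {
        throw WAVCheckError.offProfile("\(format.channels) channels")
    }
    guard [16_000, 44_100, 48_000].contains(format.sampleRate) else {
        throw WAVCheckError.offProfile("\(format.sampleRate) Hz")
    }
}
```

A check like this needs only the header bytes, which is why an off-profile clip can be rejected before anything is uploaded.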
Size and count are capped too. Each clip has a per-clip decoded-size cap
(AudioRef.maxBytesPerAudio), one send accepts at most
AudioRef.maxAudioBlocksPerRequest clips, and the whole request (image plus
document plus audio, across history and this turn) is bounded by a cross-media
total. AgentKit Cloud mirrors the relay's enforced limits; the relay stays the
final authority on duration and billing, so the SDK never rejects a clip purely
on estimated length.
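The three caps can be pictured as a single local check over byte counts. The sketch below is illustrative only: `Limits`, `CapError`, and `checkCaps` are hypothetical, the limit values are supplied by the caller because this guide does not state the actual numbers, and the order in which the caps are checked is an assumption. The error case names mirror the AgentSessionError cases listed under Error handling.

```swift
// Hypothetical mirror of the cap-related errors; not the SDK's own enum.
enum CapError: Error, Equatable {
    case audioTooLarge(bytes: Int, limit: Int)
    case tooManyAudioBlocks(count: Int, limit: Int)
    case totalMediaTooLarge(bytes: Int, limit: Int)
}

// Placeholder container; real limits come from the SDK and the relay.
struct Limits {
    let maxBytesPerAudio: Int
    let maxAudioBlocksPerRequest: Int
    let maxTotalMediaBytes: Int
}

/// Checks clip count, per-clip size, then the cross-media total.
/// `otherMediaBytes` stands for image, document, and history media
/// already counted toward the request.
func checkCaps(clipSizes: [Int], otherMediaBytes: Int, limits: Limits) throws {
    guard clipSizes.count <= limits.maxAudioBlocksPerRequest else {
        throw CapError.tooManyAudioBlocks(count: clipSizes.count,
                                          limit: limits.maxAudioBlocksPerRequest)
    }
    for size in clipSizes where size > limits.maxBytesPerAudio {
        throw CapError.audioTooLarge(bytes: size, limit: limits.maxBytesPerAudio)
    }
    let total = clipSizes.reduce(otherMediaBytes, +)
    guard total <= limits.maxTotalMediaBytes else {
        throw CapError.totalMediaTooLarge(bytes: total, limit: limits.maxTotalMediaBytes)
    }
}
```

Note that the sketch works from byte counts alone and never from an estimated duration, matching the rule that length is left to the relay.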
Where audio works
| Path | Audio input |
|---|---|
| AgentKit Cloud, audio-capable tier | the WAV rides the wire as a real audio block |
| AgentKit Cloud, tier whose model cannot accept audio | fails fast before upload (audioInputUnsupported) |
| Anthropic / Gemini / OpenAI (direct) | not built; send(_:audio:) throws audioRouteUnsupported |
| Apple on-device | not built; throws audioRouteUnsupported |
Direct-provider audio is a separate, later capability. Today a direct route has no audio input, so attaching audio fails loud rather than silently degrading your clip to a text placeholder (silent substitution is exactly the behavior this SDK refuses).
Fail-closed, the opposite of documents. A document send dispatches when the
route's capability is unknown and lets the backend stay the authority. Audio is
stricter: the SDK sends a clip only when the route positively confirms audio
input through its pre-request capability probe. Until the probe resolves, or
against a relay that predates the audio capability, audio support is unknown and
the send fails closed with audioInputUnsupported, because the SDK cannot
validate the clip against an unknown media-type allowlist, and large audio bytes
should never be uploaded on an unconfirmed route. The capability is resolved
fresh at send time, so a route that gains audio between turns is picked up on the
next send.
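The contrast between the two policies comes down to how an unknown capability is treated. A minimal sketch, assuming a three-state capability; `Capability`, `AttachmentKind`, and `mayDispatch` are hypothetical names, and the handling of a positively denied document route is an assumption, since this guide only describes the unknown case for documents:

```swift
// Hypothetical tri-state; the SDK's real capability type is not shown in this guide.
enum Capability { case confirmed, denied, unknown }

enum AttachmentKind { case document, audio }

/// Returns true when an attachment of this kind may be dispatched.
func mayDispatch(_ kind: AttachmentKind, capability: Capability) -> Bool {
    switch (kind, capability) {
    case (_, .confirmed):
        return true
    case (_, .denied):
        return false                 // assumption for documents; stated for audio
    case (.document, .unknown):
        return true                  // fail-open: the backend stays the authority
    case (.audio, .unknown):
        return false                 // fail-closed: never upload on an unconfirmed route
    }
}
```

Only the last two cases differ, and that difference is the whole of the fail-closed rule: for audio, an unresolved probe is treated the same as a denial.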
If a clip already sits in your conversation history and you continue the turn on
a route that cannot represent it, the send throws audioHistoryUnsupported
rather than silently pruning the clip. Seeded or persisted history audio is held
to the same bar as a fresh clip: its bytes are re-validated locally (the per-clip
size cap, and the WAV profile for audio/wav) before the turn dispatches, so a
malformed or off-profile historical clip fails loud rather than reaching the relay.
Media types: an open allowlist
The route advertises the exact audio media types it accepts. The SDK branches on
that list rather than hard-coding audio/wav:
- A type the SDK knows (WAV) that the route also advertises is fully preflighted locally, then sent.
- A type the route advertises but the SDK cannot preflight locally (a future,
non-WAV type) is sent relay-validated only, and just when you opt in with
allowRelayValidatedUnknownAudioTypes. Without the opt-in it fails loud.
- A type the route does not advertise fails with
unsupportedAudioMediaType(mimeType:advertised:), naming the accepted types.
Today every audio-capable tier advertises ["audio/wav"], so WAV is the path
that works end to end. The allowlist is open so new types become available
without an SDK change.
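The three branches above can be sketched as a small decision function. This is an illustration under stated assumptions, not the SDK's code: `AudioTypeDecision` and `decide` are hypothetical names, and "audio/x-future" is a made-up placeholder for a type that no tier advertises today.

```swift
// Hypothetical outcome type for illustration only.
enum AudioTypeDecision: Equatable {
    case sendPreflighted                          // known locally and advertised
    case sendRelayValidatedOnly                   // advertised, unknown locally, opted in
    case rejectNeedsOptIn                         // advertised, unknown locally, no opt-in
    case rejectNotAdvertised(advertised: [String])
}

/// Branches on the route's advertised list instead of hard-coding audio/wav.
func decide(mimeType: String,
            advertised: [String],
            locallyKnown: Set<String> = ["audio/wav"],
            allowRelayValidatedUnknown: Bool = false) -> AudioTypeDecision {
    guard advertised.contains(mimeType) else {
        return .rejectNotAdvertised(advertised: advertised)
    }
    if locallyKnown.contains(mimeType) {
        return .sendPreflighted
    }
    return allowRelayValidatedUnknown ? .sendRelayValidatedOnly : .rejectNeedsOptIn
}

// Today's situation: the route advertises only WAV.
let today = decide(mimeType: "audio/wav", advertised: ["audio/wav"])

// A made-up future type that a route advertises but the SDK cannot preflight.
let future = decide(mimeType: "audio/x-future",
                    advertised: ["audio/wav", "audio/x-future"],
                    allowRelayValidatedUnknown: true)
```

Because the advertised list is an input rather than a constant, a route that starts advertising a new type changes the outcome without any change to the function.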
Degrade to text, only if you ask
By default an unsupported route or media type fails loud. If you would rather have the turn still run, with any clip the route cannot take replaced by a sanitized text placeholder, opt in at session construction:
let agent = try AgentSession(
    provider: provider,
    role: role,
    registry: registry,
    degradeUnsupportedAttachmentsToText: true
)
With the opt-in, an unsupported clip is replaced by a descriptor carrying its media type when known and its decoded size when the bytes are already in hand (a file-backed clip that never loaded shows only the media type), never a filename or path. A malformed or oversized clip still throws: degrade re-expresses an unsupported route; it does not paper over bad input.
Capture in-profile on Apple platforms
AVAudioRecorder defaults to off-profile compressed audio. AppleAudioCapture
vends the recorder settings for the SDK's default capture profile (16 kHz mono
16-bit little-endian PCM in a WAV container) and self-validates the finished file
through the SDK's own preflight, so a bad capture fails at capture, not at send:
let recorder = try AppleAudioCapture.makeRecorder(url: wavURL)
recorder.record()
// ... later ...
recorder.stop()
let clip = try AppleAudioCapture.audioRef(fromRecordedWAV: wavURL)
try await agent.send("Transcribe this.", audio: [clip])
AppleAudioCapture is gated on AVFoundation, independent of the on-device model,
so it is available wherever you record audio.
Inspect a clip before sending
For transparency ahead of a send (decoded size, sample rate, channels, bit depth, and the worst-case reserved tokens the relay will bill), inspect a WAV locally:
let result = try AudioPreflightResult.inspectWAV(clip)
print("\(result.decodedBytes) bytes, ~\(result.wavProfile?.estimatedReservedTokens ?? 0) tokens")
The duration and token figures are best-effort estimates; the relay owns the authoritative accounting.
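For uncompressed PCM the duration part of that estimate is plain arithmetic: data bytes divided by bytes per second, where bytes per second is sample rate times channels times bytes per sample. A minimal sketch with a hypothetical helper name; it does not reproduce the token estimate, whose formula this guide does not give:

```swift
/// Estimates PCM duration from the size of the data chunk.
/// Hypothetical helper for illustration; not part of the SDK.
func estimatedDurationSeconds(dataBytes: Int,
                              sampleRate: Int,
                              channels: Int,
                              bitsPerSample: Int) -> Double {
    let bytesPerSecond = sampleRate * channels * (bitsPerSample / 8)
    guard bytesPerSecond > 0 else { return 0 }
    return Double(dataBytes) / Double(bytesPerSecond)
}

// 16 kHz mono 16-bit PCM is 32,000 bytes per second,
// so 960,000 bytes of samples is 30 seconds.
let seconds = estimatedDurationSeconds(dataBytes: 960_000,
                                       sampleRate: 16_000,
                                       channels: 1,
                                       bitsPerSample: 16)
```

The figure counts sample data only, so a file with large non-audio chunks will be slightly shorter than its total size suggests.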
A note on privacy
Audio carries no filename on the wire, unlike images and documents. A clip name
can reveal who or what was recorded, so neither the name nor the path is ever
sent on the wire or written into an SDK-generated descriptor or log. A .fileURL
clip that enters history without going through send() degrades to a descriptor
that names the canonical media type when it can be inferred from the file
extension, and otherwise to a generic audio descriptor; it never includes the
filename or path. The one place a path appears is the
unreadableAudioAttachment(url:detail:) error, which carries the URL so you can
debug a missing local file. That error is thrown to your handler and is never
part of a request or an SDK log line.
Error handling
Without the degrade opt-in, send(_:audio:) raises a typed AgentSessionError
before the rejected audio is dispatched, leaving the conversation untouched. With
degradeUnsupportedAttachmentsToText, a clip rejected for an unsupported route or
an unsupported media type becomes sanitized text instead of throwing; a malformed,
off-profile, oversized, or over-count clip still throws. The cases:
| Error | When |
|---|---|
| audioRouteUnsupported(provider:model:mediaType:) | audio on a direct (non-cloud) route, which has no audio input |
| audioInputUnsupported | a cloud route whose audio support is false or unconfirmed (fail-closed) |
| audioHistoryUnsupported | history carries a clip the resolved route cannot represent |
| unsupportedAudioMediaType(mimeType:advertised:) | the clip's type is not sendable on the route (not advertised, or SDK-unknown without the opt-in) |
| audioOffProfile(detail:) | a WAV clip is outside the proven profile |
| audioNotWAV | the bytes are not a RIFF/WAVE container |
| audioTooLarge(bytes:limit:) | a clip exceeds the per-clip size cap |
| totalMediaTooLarge(bytes:limit:) | the request's total image+document+audio exceeds the cross-media cap |
| tooManyAudioBlocks(count:limit:) | more clips than the per-request limit |
| audioCapabilityContractViolation(detail:) | the route advertised a broken audio capability |
| unreadableAudioAttachment(url:detail:) | a .fileURL clip could not be read |
Next
- Send documents and send images, the sibling attachment guides.
- AgentKit Cloud, how tiers route to audio-capable models.
- Error reference, every typed error, including the audio cases.