Video Audio + Speech To Text

Hello,

Is it possible to send the audio from my AirPods to my speech-to-text service while, at the same time, the built-in mic's audio is recorded into a video?

I ask because I want my users to be able to say "CAPTURE" to start recording a video (with audio from the built-in mic) and then say "STOP" to end the recording.

Answered by captadoh in 869052022

Accepted Answer

The answer to this is no. Cursor said so, and I tried everything Cursor and I could think of, but nothing worked.

This is actually possible, though it requires a different approach than the typical single-AVAudioEngine setup.

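The first idea is to run two AVCaptureSession instances side by side, one per audio input. Be aware that both of them go through the app's single shared AVAudioSession, so whether two inputs can really be live at the same time depends on what the current route allows. Treat this as something to verify on a device, not a guarantee. The setup would be: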

  1. Use AVCaptureSession with the AirPods as the input device for your speech recognition pipeline. Set the audio session category to .playAndRecord with the .allowBluetooth option.

  2. For video recording with the built-in mic, use a second AVCaptureSession (or the camera API you are already using). The built-in mic can be explicitly selected as the audio input for this session.

The catch is that you need to manage the audio session category carefully. The .mixWithOthers option matters here: without it, one session can interrupt the other.
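For reference, here is a minimal sketch of the audio session setup described above. The function names are my own (hypothetical), and this only shows the calls involved. Since AVAudioSession.sharedInstance() is one object for the whole app, I cannot promise this gives you two live inputs at once. I also believe newer SDKs spell the Bluetooth option .allowBluetoothHFP, so check which name your SDK accepts.

```swift
import AVFoundation

/// Hypothetical helper: applies the category and options mentioned above
/// to the app-wide audio session.
func configureSharedAudioSession() throws {
    let session = AVAudioSession.sharedInstance()
    try session.setCategory(.playAndRecord,
                            mode: .default,
                            options: [.allowBluetooth, .mixWithOthers])
    try session.setActive(true)
}

/// Hypothetical helper: asks the system to prefer the built-in mic.
/// This is a preference, and it does nothing if the port is not listed.
func preferBuiltInMic() throws {
    let session = AVAudioSession.sharedInstance()
    guard let builtIn = session.availableInputs?
        .first(where: { $0.portType == .builtInMic }) else { return }
    try session.setPreferredInput(builtIn)
}

/// Hypothetical helper: a capture session that leaves the audio session
/// configuration above alone instead of applying its own.
func makeCaptureSession() -> AVCaptureSession {
    let capture = AVCaptureSession()
    capture.automaticallyConfiguresApplicationAudioSession = false
    return capture
}
```

Printing session.currentRoute.inputs after calling these is a quick way to see which input actually ended up active.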

Another approach avoids the dual-session complexity: use a single AVCaptureSession that captures from the built-in mic for video, and run SFSpeechRecognizer (or the new SpeechAnalyzer on macOS 26 / iOS 26) on the same audio buffers. Speech recognition does not need a dedicated audio route. It can process any audio buffer you feed it, including one that is simultaneously being written to a video file. The trade-off is that the commands are then heard through the built-in mic, not the AirPods.

So the architecture becomes:

  • One AVCaptureSession capturing video + built-in mic audio
  • Fork the audio buffers in the captureOutput delegate callback: one copy goes to the video writer, the other feeds SFSpeechRecognizer
  • Voice commands ("CAPTURE", "STOP") are detected from the speech recognition results
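A rough sketch of the fork, assuming you already have an AVAssetWriter with an audio AVAssetWriterInput attached and a session started on it. AudioForker, isRecording and onTranscript are names I made up for illustration. The Speech and AVFoundation calls are the standard ones, but I have left out authorization, error handling and thread safety.

```swift
import AVFoundation
import Speech

/// Hypothetical delegate that sends every audio sample buffer to speech
/// recognition and, while recording, also to the video file's audio track.
final class AudioForker: NSObject, AVCaptureAudioDataOutputSampleBufferDelegate {
    // Assumption: the caller attached this input to an AVAssetWriter and
    // called startWriting() / startSession(atSourceTime:) before recording.
    private let writerAudioInput: AVAssetWriterInput
    private let request = SFSpeechAudioBufferRecognitionRequest()
    private var task: SFSpeechRecognitionTask?

    /// Toggled by the voice commands. Not synchronized in this sketch.
    var isRecording = false
    /// Receives the best transcription each time the recognizer reports a result.
    var onTranscript: ((String) -> Void)?

    init(writerAudioInput: AVAssetWriterInput) {
        self.writerAudioInput = writerAudioInput
        super.init()
        request.shouldReportPartialResults = true
    }

    func startRecognition(with recognizer: SFSpeechRecognizer) {
        task = recognizer.recognitionTask(with: request) { [weak self] result, _ in
            guard let result = result else { return }
            self?.onTranscript?(result.bestTranscription.formattedString)
        }
    }

    func stopRecognition() {
        request.endAudio()
        task?.cancel()
        task = nil
    }

    // Called by AVCaptureAudioDataOutput for every captured audio buffer.
    func captureOutput(_ output: AVCaptureOutput,
                       didOutput sampleBuffer: CMSampleBuffer,
                       from connection: AVCaptureConnection) {
        // Branch 1: speech recognition always listens.
        request.appendAudioSampleBuffer(sampleBuffer)
        // Branch 2: the file only gets audio between "CAPTURE" and "STOP".
        if isRecording && writerAudioInput.isReadyForMoreMediaData {
            writerAudioInput.append(sampleBuffer)
        }
    }
}
```

You would set an instance of this as the sample buffer delegate of your AVCaptureAudioDataOutput and flip isRecording from the onTranscript callback.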

This avoids the Bluetooth routing problem entirely and is much more reliable in practice.
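One detail to watch when detecting the commands: with partial results turned on, the recognizer keeps sending the whole transcript so far, so "CAPTURE" shows up again in every later update. Below is a small sketch of one way to fire each command only once. VoiceCommand and CommandDetector are hypothetical names of mine, and the word-count cursor is a simplification, not how the Speech framework itself tracks words.

```swift
import Foundation

enum VoiceCommand { case capture, stop }

/// Hypothetical helper: scans growing transcripts and reports only the
/// commands found in words it has not looked at before.
struct CommandDetector {
    private var consumedWordCount = 0

    mutating func process(transcript: String) -> [VoiceCommand] {
        // Split on anything that is not a letter and compare case-insensitively.
        let words = transcript.uppercased()
            .components(separatedBy: CharacterSet.letters.inverted)
            .filter { !$0.isEmpty }

        // If a revised transcript is shorter than before, clamp the cursor
        // so the slice below stays in bounds.
        if words.count < consumedWordCount { consumedWordCount = words.count }

        let newWords = words[consumedWordCount...]
        consumedWordCount = words.count

        return newWords.compactMap { word -> VoiceCommand? in
            switch word {
            case "CAPTURE": return .capture
            case "STOP": return .stop
            default: return nil
            }
        }
    }
}
```

Feed every transcript update into process(transcript:) and start or stop the recording for each command it returns. If the recognizer rewrites earlier words, this cursor can miss or repeat a command, so test it with your own phrases.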
