Build live production tools for Apple Immersive Video
Go behind the scenes of live Apple Immersive Video production. Discover how to package immersive video, spatial audio, and scene metadata for transport over IP networks using the SMPTE 2110 standard. Harness Apple's Immersive Media Support, Video Toolbox, and AVFoundation frameworks to power real-time Apple Immersive Video workflows. To get the most out of this session, watch “Learn about Apple Immersive Video technologies” from WWDC25.
Chapters
- 0:00 - Introduction
- 2:08 - Live production overview
- 5:16 - What makes immersive live different
- 7:05 - Immersive live format
- 9:09 - Real-time media transport
- 11:25 - Recording and playback
Resources
- kVTCompressionPropertyKey_ProjectionKind
- CMVideoCodecType
- Apple ProRes RAW White Paper
- Apple ProRes White Paper
- Immersive Media Support
Hi there, and welcome to "Build live production tools for Apple Immersive Video". I'm Jared King and I lead the Apple Immersive Video Live Engineering team. Apple Immersive Video is an incredibly exciting medium, and live streaming unlocks entirely new ways for customers to experience sports, music, and entertainment events! Earlier this year, Apple did something remarkable. For the first time, fans were transported courtside to select LA Lakers games, live in Apple Vision Pro, through the Spectrum SportsNet and NBA apps.
Real games and arenas, experienced like you were there! Live immersive cameras gave fans access to unreachable seats. Data-driven graphics augmented the action, and spatial audio embedded customers inside the roaring crowd. Behind the scenes, a live broadcast platform built by Apple powered the events and transported this unique experience to customers around the world.
My goal is to inspire you to build the next generation of immersive tools, workflows, and live experiences. For many, live broadcast technology might be new. The systems are complex, but the opportunity to build for this space is enormous. So, to set the right foundation, I'll first provide a high-level overview of the systems that make up a modern live production pipeline, along with some of the creative tools used across studios, production trucks, and broadcast facilities worldwide to create content.
Understanding this foundation will be invaluable as you begin building immersive production tools of your own.
Second, I'll cover what makes live immersive broadcast different from traditional 2D production. Transporting customers from their homes directly into an event introduces an entirely new set of technical challenges. The media formats, the production tools, and the way content moves between tools within a production workflow have all been rebuilt.
I'll start today with a top-level review of a live production pipeline. Whether you're building for immersive or traditional 2D broadcast, understanding the fundamental components of the end-to-end system is crucial.
A live production pipeline is a system where video, audio, and data are captured and creatively produced at one end, in what I'll call the production domain, then encoded and streamed live to an audience through what I'll call the delivery domain.
Broadcasts can be large-scale productions, such as TV studios and broadcast trucks that manage many live cameras, audio sources, and graphics for sporting or entertainment events. Or they can be smaller systems, such as podcast studios, theaters, or local music venues, where only a few of these elements might be used within a production. Either way, the end goal is the same: live content is captured and produced in the production domain, then encoded and transmitted to the viewer in the delivery domain, broadcasting the event to an audience in real time. But to keep focused, I'll concentrate primarily on what makes the production domain unique when building for immersive live. Regardless of scale, most live pipelines rely on many of the same creative tools within the workflow, simply scaled in quantity and sophistication depending on the level of production.
Live cameras are used to capture video of the scene or event.
Oftentimes, multiple cameras are used on a production to capture different angles or points of view.
For example, in this production, here is camera 1. This is camera 2.
And that is camera 3.
Graphics may be generated and keyed onto the video to provide additional context and creative flair to the production. These can be elements like a name, which just appeared as a lower third, a scoreboard, displayed in the top right, or a complex video animation, like the one down below.
Replay systems are used to record media and replay the footage back out when called for as part of the live production. They can also archive it for later use in post production and editorial.
To assemble all these elements, video switchers let operators cut between cameras, overlay graphics, and produce the final creative stream the viewer sees.
On the audio front, microphones are used to pick up announcers, interviews, musical elements, and other sound sources on the production.
Audio consoles ingest all these sources and artfully combine them into the final product, or "mix", that the audience hears!
Finally, all of these tools need to exchange media with one another. For example, camera feeds are connected into the video switcher inputs, and microphone sources are fed into an audio console.
To do that, every tool connects through a centralized media router that handles content exchange between them. Think of it as a unified network layer that lets every device send signals to, and receive signals from, every other device.
Now that you've got a handle on the basics of live production, here's where things start to diverge for immersive, and where the real story begins! In this format, fidelity and presence, and preserving them through every step of the workflow, are everything when you are transporting customers inside of the content. And that translates into truly big numbers! Video resolution is 32 times larger than what is typically used in a 2D broadcast production, in order to match human visual acuity, and it's produced at two times the frame rate! Audio mixes supporting Apple Immersive Live are far higher in resolution than traditional stereo audio or even 5.1 surround sound.
Apple Spatial Audio Format, or ASAF, mixes can contain 64 or more channels in order to immerse the audience in a rich spatial audio experience. These massive formats and quality requirements ripple through every part of the production pipeline. Unfortunately, not all the traditional tools, transport methods, and formats support media at this scale. So, immersive live requires building an entirely different workflow.
I'll break down three key concepts that will enable you to begin building tools and exciting new workflows in the ecosystem! First, a media format standard designed to deliver the required quality while remaining efficient and practically useful when building live production tools. Second, a method for transporting immersive content in real time between production devices, over an important standard called SMPTE 2110.
Finally, saving a live stream to file and playing it out again is a fundamental function within any broadcast. I'll go over how live streams from devices can be saved to file within an application and played back out again without any compromise to quality.
I'll start at the top. Within any production, there are three classes of devices: devices that output media, for example a camera, a microphone, or a graphics generator; devices that ingest media, such as a video encoder or a color grading monitor; and devices that do both, for example a video switcher, receiving camera sources on its inputs and switching the different angles to air on its outputs! In any workflow, all devices have to agree on a unified set of media formats so they can exchange content through the media router seamlessly, in real time, like a common language.
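The three device classes described above can be modeled with a small option set. This is only an illustrative sketch: `DeviceRole`, `Device`, and `canRoute` are hypothetical names invented for this example and are not part of any Apple framework or of the SMPTE 2110 standard.

```swift
// Hypothetical model of the three device classes; not an Apple or SMPTE API.
struct DeviceRole: OptionSet {
    let rawValue: Int
    static let sendsMedia = DeviceRole(rawValue: 1 << 0)     // e.g. camera, microphone
    static let receivesMedia = DeviceRole(rawValue: 1 << 1)  // e.g. encoder, grading monitor
    static let both: DeviceRole = [.sendsMedia, .receivesMedia]  // e.g. video switcher
}

struct Device {
    let name: String
    let role: DeviceRole
}

/// A route through the media router only makes sense from a sender to a receiver.
func canRoute(from source: Device, to destination: Device) -> Bool {
    source.role.contains(.sendsMedia) && destination.role.contains(.receivesMedia)
}

let camera = Device(name: "Camera 1", role: .sendsMedia)
let switcher = Device(name: "Video switcher", role: .both)
let encoder = Device(name: "Video encoder", role: .receivesMedia)
```

With this model, `canRoute(from: camera, to: switcher)` and `canRoute(from: switcher, to: encoder)` are valid routes, while `canRoute(from: encoder, to: camera)` is not.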
To do that, three existing standards have been combined into one format to support live immersive production.
Apple Immersive Live Video is composed entirely of streamed ProRes frames, as opposed to uncompressed video frames, typical of regular broadcast cameras.
ProRes is a powerful video codec that strikes an exceptional balance between image quality and bandwidth, reducing video signals to a practical size that can be processed by tools while maintaining the high fidelity required of the image. And because Apple Silicon is optimized for ProRes processing, it's the perfect platform for building production tools and pipelines! To learn more, refer to the "Apple ProRes" developer documentation.
ASAF audio mixes are composed of standard, uncompressed PCM audio tracks carrying high-order ambisonic beds and spatial audio objects.
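To get a feel for what uncompressed PCM means at this channel count, the raw payload rate can be worked out directly. The sketch below uses the 64-channel figure mentioned earlier in the session; the 48 kHz sample rate and 24-bit sample depth are assumptions chosen for illustration, not values stated in the session, and the result excludes any packet overhead.

```swift
/// Raw PCM payload bit rate in bits per second (no RTP or IP overhead included).
func pcmBitRate(channels: Int, sampleRate: Int, bitsPerSample: Int) -> Int {
    channels * sampleRate * bitsPerSample
}

// 64 channels comes from the session; 48 kHz and 24-bit are assumed values.
let asafBitsPerSecond = pcmBitRate(channels: 64, sampleRate: 48_000, bitsPerSample: 24)
let stereoBitsPerSecond = pcmBitRate(channels: 2, sampleRate: 48_000, bitsPerSample: 24)

print(Double(asafBitsPerSecond) / 1_000_000)    // 73.728 (megabits per second)
print(asafBitsPerSecond / stereoBitsPerSecond)  // 32 (times a stereo mix)
```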
Metadata is delivered as per-frame JSON objects that contain elements describing attributes of related video and audio feeds - such as lens calibrations, creative events, spatial audio behavior, and much more.
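As a rough illustration of handling per-frame JSON, the sketch below decodes one object with Foundation's `JSONDecoder`. The session does not give the actual schema, so the `FrameMetadata` type and its `frameNumber`, `cameraID`, and `events` fields are hypothetical placeholders.

```swift
import Foundation

// Hypothetical per-frame metadata shape; the real schema is not given in the session.
struct FrameMetadata: Codable {
    let frameNumber: Int
    let cameraID: String
    let events: [String]
}

// One made-up JSON object, standing in for the payload that accompanies a single frame.
let json = """
{"frameNumber": 1200, "cameraID": "cam-1", "events": ["fade-in"]}
"""

/// Deserializes one per-frame JSON object into a typed value.
func decodeFrameMetadata(_ text: String) throws -> FrameMetadata {
    try JSONDecoder().decode(FrameMetadata.self, from: Data(text.utf8))
}

let metadata = try decodeFrameMetadata(json)
```

Decoding into a typed value early makes malformed frames fail loudly at the edge of a tool rather than deep inside it.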
Together, these three standards define the live immersive production format.
All tools and processes must be compliant with each media type to ensure interoperability with the rest of the ecosystem!
Next, devices need a standardized transport layer to exchange live video, audio, and metadata feeds between them. To achieve this, live feeds from devices are exchanged as individual SMPTE 2110 media streams, the industry standard for professional media transport over IP.
"2110", as it's commonly known, is widely deployed across broadcast facilities worldwide, and is interoperable with a broad ecosystem of professional tools.
2110 uses multicast RTP, or Real-time Transport Protocol, to move media across the network. RTP streams carry timing information, user flags, and other metadata alongside a main media payload.
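To make the timing information and flags concrete, here is a minimal sketch that parses the 12-byte fixed RTP header as laid out in RFC 3550. It ignores CSRC entries and header extensions, and the `RTPHeader` type and the sample bytes are made up for illustration (payload type 96 is an arbitrary dynamic value, not one prescribed by the session).

```swift
/// Fields of the 12-byte fixed RTP header (RFC 3550). CSRC list and extensions are not parsed.
struct RTPHeader {
    let version: UInt8
    let hasPadding: Bool
    let hasExtension: Bool
    let csrcCount: UInt8
    let marker: Bool
    let payloadType: UInt8
    let sequenceNumber: UInt16
    let timestamp: UInt32
    let ssrc: UInt32

    init?(_ bytes: [UInt8]) {
        guard bytes.count >= 12 else { return nil }
        version = bytes[0] >> 6                    // top 2 bits
        hasPadding = bytes[0] & 0x20 != 0
        hasExtension = bytes[0] & 0x10 != 0
        csrcCount = bytes[0] & 0x0F
        marker = bytes[1] & 0x80 != 0
        payloadType = bytes[1] & 0x7F
        sequenceNumber = UInt16(bytes[2]) << 8 | UInt16(bytes[3])
        // Multi-byte fields are big-endian (network byte order).
        timestamp = bytes[4...7].reduce(UInt32(0)) { $0 << 8 | UInt32($1) }
        ssrc = bytes[8...11].reduce(UInt32(0)) { $0 << 8 | UInt32($1) }
    }
}

// Made-up sample header: version 2, marker set, payload type 96, sequence 42, timestamp 90000.
let samplePacket: [UInt8] = [0x80, 0xE0, 0x00, 0x2A, 0x00, 0x01, 0x5F, 0x90, 0x12, 0x34, 0x56, 0x78]
let header = RTPHeader(samplePacket)
```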
A 2110 stream will transmit either a video, an audio, or a metadata payload, and the transport of each media type is defined by a sub-standard within the broader 2110 specification. I'll break down how each fits into that model. Immersive ProRes video is transported as 2110-22 streams on the network, the defined standard for compressed media over IP. This 2110-22 flow contains both the left and right eye of the immersive content, transmitted as two separate data essences but contained within the single stream. This means there is no need to frame pack each eye side by side into a single image raster, or to produce separate IP streams per eye. This is hugely advantageous, as it eliminates the complex management of independent left and right eye video feeds within the production architecture.
ASAF audio is transported as standard 2110-30 streams on the network.
These contain the high-order ambisonics and audio object channels that compose ASAF spatial audio mixes.
JSON objects are transmitted per frame over 2110-41, the standard for the transport of user-defined metadata over IP. This carries the important metadata, like lens calibrations, creative events, and motion data, in real time alongside the -22 video and -30 audio feeds within the production.
Lastly, the ability to record feeds, edit them, and play them back out again, for example for instant replay, is a crucial part of any live workflow.
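To recap the transport layer before turning to recording, the mapping from each media type to its 2110 sub-standard can be written down as a small lookup. The `ImmersiveEssence` type is a hypothetical name used only for this sketch; the three part numbers are the ones given in the session.

```swift
// Hypothetical enum; only the 2110 part numbers come from the session.
enum ImmersiveEssence: CaseIterable {
    case video     // stereo ProRes, both eyes in one stream
    case audio     // ASAF mix as uncompressed PCM
    case metadata  // per-frame JSON objects

    /// The SMPTE 2110 sub-standard that carries this essence.
    var st2110Part: String {
        switch self {
        case .video: return "2110-22"
        case .audio: return "2110-30"
        case .metadata: return "2110-41"
        }
    }
}

for essence in ImmersiveEssence.allCases {
    print("\(essence) is carried over \(essence.st2110Part)")
}
```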
In traditional 2D workflows, recording live video to file often introduces visual quality loss as content is encoded, decoded, and re-encoded many times throughout a typical workflow. In Apple Immersive Video, even these small reductions in quality can significantly impact the customer experience, especially as generational loss, due to multiple cycles of compression and decompression, compounds over time.
Fortunately, this is solved by the immersive format: everything is already ProRes! Because live media is natively generated in a file-friendly ProRes payload, recording to disk requires no additional encode or decode steps as the content moves through the workflow. The same ProRes frames are simply copied directly into MOV files, and read back out again, untouched, into live 2110 streams during playout.
Practically, this means that live content can be produced by a camera, transported between devices, recorded to disk, edited, and played back out live on repeat, with no impact to quality throughout the entire process!
Save video feeds to QuickTime MOV video tracks using AVFoundation's AVAssetWriter. The resulting MOV file contains the same untouched resolution, frame rate, and stereo image data as the live stream, saved to a file that can be used in editorial, replay, or post production.
When writing the MOV video track, it's important to set the kVTCompressionPropertyKey_ProjectionKind key to the constant kVTProjectionKind_AppleImmersiveVideo within the AVVideoCompressionPropertiesKey dictionary. This is a new VideoToolbox property that adds the correct video extended usage, or vexu, static metadata for Apple Immersive Video to the file, signaling it as immersive to other applications.
Save audio feeds in the usual way — uncompressed PCM carried in the 2110 stream is written directly into the MOV's audio tracks using AVAssetWriter.
Finally, the streamed JSON data is written into Metadata Box Exchange format, or MEBX, tracks within the MOV container, using AVAssetWriter.
Prior to storage, the streamed JSON data must be deserialized and parsed. Then the Immersive Media Support framework, or IMS, is used to create lens calibration objects, camera IDs, and other metadata objects that are written into the MOV, synchronized with the video and audio. IMS was first introduced in visionOS 26. It enables reading and writing the essential metadata for Apple Immersive Video, and provides capabilities for previewing content in creative workflows. While immersive video and audio are already powered by well-known technologies like AVFoundation, VideoToolbox, and Core Audio, IMS is a powerful framework, purpose-built for Apple Immersive Video. As you're building the next generation of production tools, IMS will be one of the most important frameworks to understand! Check out the "Immersive Media Support" developer documentation to learn more.
Then, during file playback, all processes are reversed. Video, audio, and metadata are read directly from their tracks within the MOV and retransmitted as 2110 output streams for use within the wider production, using all the same frameworks and libraries.
Now that you understand the foundations of live immersive production, it's the perfect time to get started! Build your own immersive tools, using frameworks like AVFoundation, VideoToolbox, AudioToolbox, and Immersive Media Support. Every layer of the stack is open for innovation.
Dive deeper into the world of 2110 and connect your tools together into a true live workflow. Level up by visiting the SMPTE website to learn more about the various standards and best practices for network implementation. Finally, be sure to watch "Learn about Apple Immersive Video technologies" and "Support immersive video playback in visionOS apps". Together, they provide valuable context around the creative and technical principles behind the format.
The future of immersive live has just started. While it builds on the foundations of traditional broadcast, it introduces entirely new creative and technical possibilities! And many of the best ideas haven't been invented yet! This is your chance to be a part of getting this new format off the ground, and on air. I'll see you next time!
13:17 - Set compression properties for vexu metadata
import VideoToolbox

let compressionProperties: [String: Any] = [
    // ...
    kVTCompressionPropertyKey_ProjectionKind as String: kVTProjectionKind_AppleImmersiveVideo
    // ...
]
- 0:00 - Introduction
Apple Immersive Video live streaming transports fans to sports, music, and entertainment events on Apple Vision Pro — illustrated by courtside LA Lakers games delivered live through the Spectrum SportsNet and NBA apps.
- 2:08 - Live production overview
A high-level overview of the fundamental components and creative tools that make up a modern live production pipeline, from cameras and graphics to video switchers and audio consoles.
- 5:16 - What makes immersive live different
Discover the unique scale and fidelity required for delivering Apple Immersive Video, including massive video resolutions, high frame rates, and rich Apple Spatial Audio Format (ASAF) mixes.
- 7:05 - Immersive live format
Learn about the core formats powering live immersive workflows, including streaming ProRes, uncompressed PCM audio, and per-frame JSON metadata.
- 9:09 - Real-time media transport
Explore how live immersive feeds are transported between devices in real time over IP using the SMPTE 2110 industry standard.
- 11:25 - Recording and playback
Learn how live streams are recorded to disk and played back using AVAssetWriter and the Immersive Media Support framework.