metal-cpp

RSS for tag

C++ games and apps can tap into the power of Metal by bridging with metal-cpp.

Posts under metal-cpp tag

71 Posts

Post

Replies

Boosts

Views

Activity

Metal GPU Driver Crash on M5 Pro + macOS 26.5 — kIOGPUCommandBufferCallbackErrorOutOfMemory with <2GB working sets
Metal GPU Driver Crash on M5 Pro + macOS 26.5 — kIOGPUCommandBufferCallbackErrorOutOfMemory with <2GB working sets Summary The Metal driver AGXMetalG17X 351.2 on macOS 26.5 (25F71) for the M5 Pro chip crashes with kIOGPUCommandBufferCallbackErrorOutOfMemory (00000008) when running LLM inference workloads with working sets as small as ~1.5GB, despite 24GB of unified memory being available and Apple Diagnostics confirming the hardware is fully functional. This affects multiple tools: MLX, llama.cpp (Metal backend), and native apps using Metal for inference. System Component Value Model MacBook Pro (Mac17,9) Chip Apple M5 Pro (applegpu_g17s) GPU Cores 16 RAM 24 GB LPDDR5 macOS 26.5 (25F71) Metal Metal 4 GPU Driver AGXMetalG17X 351.2 Xcode 26.5 (17F42) Reproduction MLX (Python) pip install mlx mlx-lm python -m mlx_lm.generate \ --model mlx-community/Qwen2.5-3B-Instruct-4bit \ --max-tokens 10 \ --prompt "Hello" Expected: Normal text generation Actual: Crash with: libc++abi: terminating due to uncaught exception of type std::runtime_error: [METAL] Command buffer execution failed: Insufficient Memory (00000008:kIOGPUCommandBufferCallbackErrorOutOfMemory) llama.cpp brew install llama.cpp llama-cli --model model.gguf --prompt "Hello" --n-predict 20 --n-gpu-layers 99 Expected: Fast GPU generation Actual: Process hangs indefinitely Test Results Tool Model Peak Memory Result MLX Qwen2.5-0.5B-4bit 0.36 GB ✅ Works MLX Qwen2.5-1.5B-4bit 0.98 GB ✅ Works MLX Qwen3-1.7B-4bit 1.01 GB ✅ Works MLX Qwen2.5-3B-4bit ~1.5 GB ❌ Metal OOM crash MLX Qwen3-4B-4bit ~2.1 GB ❌ Metal OOM crash MLX Qwen3-8B-4bit ~4.5 GB ❌ Metal OOM crash llama.cpp Qwen2.5-0.5B GGUF ~0.5 GB ❌ Hangs with GPU llama.cpp Qwen2.5-0.5B GGUF ~0.5 GB ✅ Works with CPU only Key Evidence Hardware is healthy — Apple Diagnostics passed all tests Basic Metal works — matmul, array ops work fine CPU inference works — llama.cpp with -ngl 0 runs correctly The error is NOT about actual memory exhaustion — kIOGPUCommandBufferCallbackErrorOutOfMemory means the kernel rejects the Metal memory commit, not that physical memory is full. The system reports 17.76GB available for Metal working set. Crash Log Extract Thread 31 Crashed: 0 libsystem_kernel.dylib __pthread_kill + 8 1 libsystem_pthread.dylib pthread_kill + 296 2 libsystem_c.dylib abort + 148 3 Metal MTLReportFailure.cold.1 + 48 4 Metal MTLReportFailure + 576 5 Metal -[_MTLCommandBuffer addCompletedHandler:] + 104 ... Exception Type: EXC_CRASH (SIGABRT) Termination Reason: Namespace SIGNAL, Code 6, Abort trap: 6 Related Issues ml-explore/mlx#3586 — Metal compiler regression on macOS 26.5 ml-explore/mlx#3534 — M5 float32 precision issue ml-explore/mlx#3568 — M5 random divergence ml-explore/mlx#3539 — Metal residency OOM (M4 Max) Request Please investigate the AGXMetalG17X driver for M5 Pro on macOS 26.5. The driver appears to incorrectly reject Metal memory commits for LLM inference workloads, even when the working set is well within the system's reported limits (1.5GB requested vs 17.76GB available). Happy to provide full crash logs, sysdiagnose archives, or run additional tests.
0
0
306
May ’26
Optimizing HZB Mip-Chain Generation and Bindless Argument Tables in a Custom Metal Engine
Hi everyone, I’ve been developing a custom, end-to-end 3D rendering engine called Crescent from scratch using C++20 and Metal-cpp (targeting macOS and visionOS). My primary goal is to build a zero-bottleneck, GPU-driven pipeline that maximizes the potential of Apple Silicon’s Unified Memory and TBDR architecture. While the fundamental systems are stable, I am looking for architectural feedback from Metal framework engineers regarding specific synchronization and latency challenges. Current Core Implementations: GPU-Driven Instance Culling: High-performance occlusion culling using a Hierarchical Z-Buffer (HZB) approach via Compute Shaders. Clustered Forward Shading: Support for high-count dynamic lights through view-space clustering. Temporal Stability: Custom TAA with history rejection and Motion Blur resolve. Asset Infrastructure: Robust GUID-based scene serialization and a JSON-driven ECS hierarchy. The Architectural Challenge: I am currently seeing slight synchronization overhead when generating the HZB mip-chain. On Apple Silicon, I am evaluating the cost of encoder transitions versus cache-friendly barriers. && m_hzbInitPipeline && m_hzbDownsamplePipeline && !m_hzbMipViews.empty(); if (canBuildHzb) { MTL::ComputeCommandEncoder* hzbInit = commandBuffer->computeCommandEncoder(); hzbInit->setComputePipelineState(m_hzbInitPipeline); hzbInit->setTexture(m_depthTexture, 0); hzbInit->setTexture(m_hzbMipViews[0], 1); if (m_pointClampSampler) { hzbInit->setSamplerState(m_pointClampSampler, 0); } else if (m_linearClampSampler) { hzbInit->setSamplerState(m_linearClampSampler, 0); } const uint32_t hzbWidth = m_hzbMipViews[0]->width(); const uint32_t hzbHeight = m_hzbMipViews[0]->height(); const uint32_t threads = 8; MTL::Size tgSize = MTL::Size(threads, threads, 1); MTL::Size gridSize = MTL::Size((hzbWidth + threads - 1) / threads * threads, (hzbHeight + threads - 1) / threads * threads, 1); hzbInit->dispatchThreads(gridSize, tgSize); hzbInit->endEncoding(); for (size_t mip = 1; mip < m_hzbMipViews.size(); ++mip) { MTL::Texture* src = m_hzbMipViews[mip - 1]; MTL::Texture* dst = m_hzbMipViews[mip]; if (!src || !dst) { continue; } MTL::ComputeCommandEncoder* downEncoder = commandBuffer->computeCommandEncoder(); downEncoder->setComputePipelineState(m_hzbDownsamplePipeline); downEncoder->setTexture(src, 0); downEncoder->setTexture(dst, 1); const uint32_t mipWidth = dst->width(); const uint32_t mipHeight = dst->height(); MTL::Size downGrid = MTL::Size((mipWidth + threads - 1) / threads * threads, (mipHeight + threads - 1) / threads * threads, 1); downEncoder->dispatchThreads(downGrid, tgSize); downEncoder->endEncoding(); } if (m_instanceCullHzbPipeline) { dispatchInstanceCulling(m_instanceCullHzbPipeline, true); } } My Questions: Encoder Synchronization: Would you recommend moving this loop into a single ComputeCommandEncoder using MTLBarrier between dispatches to maintain L2 cache residency, or is the overhead of separate encoders negligible for depth-downsampling on TBDR? visionOS Bindless Latency: For stereo rendering on visionOS, what are the best practices for managing MTL4ArgumentTable updates at 90Hz+? I want to ensure that updating bindless resources for each eye doesn't introduce unnecessary CPU-to-GPU latency. Memory Management: Are there specific hints for Memoryless textures that could be applied to intermediate HZB levels to save bandwidth during this process? I’ve attached a screenshot of a scene rendered with the engine (PBR, SSR, and IBL).
0
0
645
Feb ’26
Xcode Metal Trace
Code is download from apple official metal4 sample [https://developer.apple.com/documentation/metal/drawing-a-triangle-with-metal-4?language=objc] enable metal gpu trace in macOS schema and trace a frame in Xcode. Xcode may show segment fault on App from some 'GTTrace' function when click trace button. When replay a .gputrace file, Xcode may crash , throw an internal error or a XPC error. The example code using old metal-renderer can trace without any problem and everything works fine. Test Environment: Xcode Version 26.2 (17C52) macOS 26.2 (25C56) M1 Pro 16GB A2442
2
0
614
Jan ’26
# [CRITICAL] Metal RHI Memory Leak - Resource exhaustion vulnerability (CWE-400) - Bug Report
[CRITICAL] Metal API Memory Leak - Heap Memory Never Released to OS (CWE-400) Security Classification This issue constitutes a resource exhaustion vulnerability (CWE-400): Aspect Details Type Uncontrolled Resource Consumption CWE CWE-400 Vector Local (any Metal application) Impact System instability, denial of service User Control None - no mitigation available Recovery Requires application restart Summary Metal heap allocations are never released back to macOS, even when the memory is entirely unused. This causes continuous, unbounded memory growth until system instability or crash. The issue affects any application using Metal API heap allocation. This was discovered in Unreal Engine 5, but reproduces in a completely blank UE5 project with zero application code - confirming this is Metal framework behavior, not application-level. Environment OS: macOS Tahoe 26.2 Hardware: Apple Silicon M4 Max (also reproduced on M1, M2, M3) API: Metal Reproduction Steps Run any Metal application that allocates and deallocates GPU buffers via Metal heaps Open Activity Monitor and observe the application's memory usage Let the application run idle (no user interaction required) Observe memory growing continuously at ~1-2 MB per second Memory never plateaus or stabilizes Eventually system becomes unstable For testing: Any Unreal Engine 5.4+ project on macOS will reproduce this. Even a blank project with no gameplay code exhibits the leak. (Tested on UE 5.7.1) Observed Behavior Memory Analysis Using Unreal's memreport -full command, two reports taken 86 seconds apart: Metric Report 1 (183s) Report 2 (269s) Delta Process Physical 4373.64 MB 4463.39 MB +89.75 MB Metal Heap Buffer 7168 MB 8192 MB +1024 MB Unused Heap 3453 MB 4477 MB +1024 MB Object Count 73,840 73,840 0 (no change) Key Finding Metal Heap grew by exactly 1 GB while "Unused Heap" also grew by 1 GB. This demonstrates: Metal is allocating new heap blocks in ~1 GB increments Previously allocated heap memory becomes "unused" but is never released The unused memory accumulates indefinitely No application-level objects are leaking (count remains constant) Memory Growth Pattern Continuous growth while idle (no user interaction) Growth rate: approximately 1-2 MB per second No plateau or stabilization occurs Metal allocates new 1 GB heap blocks rather than reusing freed space Eventually leads to system instability and crash What is NOT Causing This We verified the following are NOT the source: Application objects - Object count remains constant Application code - Blank project with no code reproduces the issue Texture streaming - Disabling texture streaming had no effect CPU garbage collection - Running GC has no effect (this is GPU memory) Mitigations Attempted (None Worked) setPurgeableState Setting resources to purgeable state before release: [buffer setPurgeableState:MTLPurgeableStateEmpty]; Result: Metal ignores this hint and does not reclaim heap memory. Avoiding Heap Pooling Forcing individual buffer allocations instead of heap-based pooling. Result: Leak persists - Metal still manages underlying allocations. Aggressive Buffer Compaction Attempting to compact/defragment buffers within heaps every frame. Result: Only moves data between existing heaps. Does NOT release heaps back to OS. Reducing Pool Sizes Minimizing all buffer pool sizes to force more frequent reuse. Result: Slightly slows the leak rate but does not stop it. Root Cause Analysis How Metal Heap Allocation Appears to Work Metal allocates GPU heap blocks in large chunks (~1 GB observed) Application requests buffers from these heaps When application releases buffers, memory becomes "unused" within the heap Metal does NOT release heap blocks back to macOS, even when entirely unused When fragmentation prevents reuse, Metal allocates new heap blocks Result: Continuous memory growth with no upper bound The Core Problem There appears to be no Metal API to force heap memory release. The only way to reclaim this memory is to destroy the Metal device entirely, which requires restarting the application. Expected Behavior Metal should: Release unused heaps - When a heap block is entirely unused, release it back to macOS Respect purgeable hints - Honor setPurgeableState calls from applications Compact allocations - Defragment heap allocations to reduce fragmentation Provide control APIs - Allow applications to request heap compaction or release Enforce limits - Have configurable maximum heap memory consumption Security Implications Local Denial of Service - Any Metal application can exhaust system memory, causing instability affecting all running applications Memory Pressure Attack - Forces other applications to swap to disk, degrading system-wide performance No Upper Bound - Memory consumption continues until system failure Unmitigable - End users have no way to prevent or limit the leak Affects All Metal Apps - Any application using Metal heaps is potentially affected Impact Applications become unstable after extended use System-wide performance degrades as memory pressure increases Users must periodically restart applications Developers cannot work around this at the application level Long-running applications (games, creative tools, servers) are particularly affected Request Investigate Metal heap memory management behavior Implement heap release when blocks become entirely unused Honor setPurgeableState hints from applications Consider providing an API for applications to request heap compaction Document any intended behavior or workarounds Additional Notes This issue has been observed across multiple Unreal Engine versions (5.4, 5.7) and multiple Apple Silicon generations (M1 through M4). The behavior is consistent and reproducible. The Unreal Engine team has implemented various CVars to attempt mitigation (rhi.Metal.HeapBufferBytesToCompact, rhi.Metal.ResourcePurgeInPool, etc.) but none successfully address the issue because the root cause is at the Metal framework level. Tested: January 2026 Platform: macOS Tahoe 26.2, Apple Silicon (M1/M2/M3/M4)
5
2
1.3k
Jan ’26
Basic c++xcodeproj call to swift code
I have c++ macOs app(Xcode +14) and I try to add call to swift code. I can't find any simple c++ xcodeproj call to swift code. I create new simple project and fail to build it with error when I try to include #include <SwiftMixTester/SwiftMixTester-Swift.h>: main.m:9:10: error: 'SwiftMixTester/SwiftMixTester-Swift.h' file not found (in target 'CppCallSwift' from project 'CppCallSwift') note: Did not find header 'SwiftMixTester-Swift.h' in framework 'SwiftMixTester' (loaded from '/Users/yanivsmacm4/Library/Developer/Xcode/DerivedData/CppCallSwift-exdxjvwdcczqntbkksebulvfdolq/Build/Products/Debug') . Please help.
5
0
1.1k
Oct ’25
Slow compilation
Hi, I am working with a large project. We are compiling each material to its own .metallib. They all include many common files full of inline functions. Finally we link it all together at the end with a single big pathtrace kernel. Everything works as expected, however the compile times have gotten completely out of hand and it takes multiple minutes to compile at runtime (to native code). I have gathered that I can do this offline by using metal-tt however if I am wondering if there is a way to reduce the compile times in such a scenario, and how to investigate what the root cause of the problem is. I suspect it could have to do with the fact that every materials metallib contains duplications of all the inline functions. Any ideas on how to profile and debug this? Thanks, Rasmus
0
1
216
Mar ’25
How to properly pass a Metal layer from SwiftUI MTKView to C++ for use with metal-cpp?
Hello! I'm currently porting a videogame console emulator to iOS and I'm trying to make the renderer (tested on MacOS) work on iOS as well. The emulator core is written in C++ and uses metal-cpp for rendering, whereas the iOS frontend is written in Swift with SwiftUI. I have an Objective-C++ bridging header for bridging the Swift and C++ sides. On the Swift side, I create an MTKView. Inside the MTKView delegate, I run the emulator for 1 video frame and pass it the view's backing layer for it to render the final output image with. The emulator runs and returns, but when it returns I get a crash in Swift land (callstack attached below), inside objc_release, which indicates I'm doing something wrong with memory management. My bridging interface (ios_driver.h): #pragma once #include <Foundation/Foundation.h> #include <QuartzCore/QuartzCore.h> void iosCreateEmulator(); void iosRunFrame(CAMetalLayer* layer); Bridge implementation (ios_driver.mm): #import <Foundation/Foundation.h> extern "C" { #include "ios_driver.h" } <...> #define IOS_EXPORT extern "C" __attribute__((visibility("default"))) std::unique_ptr<Emulator> emulator = nullptr; IOS_EXPORT void iosCreateEmulator() { ... } // Runs 1 video frame of the emulator and IOS_EXPORT void iosRunFrame(CAMetalLayer* layer) { void* layerBridged = (__bridge void*)layer; // Pass the CAMetalLayer to the emulator emulator->getRenderer()->setMTKLayer(layerBridged); // Runs the emulator for 1 frame and renders the output image using our layer emulator->runFrame(); } My MTKView delegate: class Renderer: NSObject, MTKViewDelegate { var parent: ContentView var device: MTLDevice! init(_ parent: ContentView) { self.parent = parent if let device = MTLCreateSystemDefaultDevice() { self.device = device } super.init() } func mtkView(_ view: MTKView, drawableSizeWillChange size: CGSize) {} func draw(in view: MTKView) { var metalLayer = view.layer as! CAMetalLayer // Run the emulator for 1 frame & display the output image iosRunFrame(metalLayer) } } Finally, the emulator's render function that interacts with the layer: void RendererMTL::setMTKLayer(void* layer) { metalLayer = (CA::MetalLayer*)layer; } void RendererMTL::display() { CA::MetalDrawable* drawable = metalLayer->nextDrawable(); if (!drawable) { return; } MTL::Texture* texture = drawable->texture(); <rest of rendering follows here using the drawable & its texture> } This is the Swift callstack at the time of the crash: To my understanding, I shouldn't be violating ARC rules as my bridging header uses CAMetalLayer* instead of void* and Swift will automatically account for ARC when passing CoreFoundation objects to Objective-C. However I don't have any other idea as to what might be causing this. I've been trying to debug this code for a couple of days without much success. If you need more info, the emulator code is also on Github Metal renderer: https://github.com/wheremyfoodat/Panda3DS/blob/ios/src/core/renderer_mtl/renderer_mtl.cpp#L58-L68 Bridge implementation: https://github.com/wheremyfoodat/Panda3DS/blob/ios/src/ios_driver.mm Bridging header: https://github.com/wheremyfoodat/Panda3DS/blob/ios/include/ios_driver.h Any help is more than appreciated. Thank you for your time in advance.
0
0
664
Mar ’25
Question about metal-cpp resource allocation
I notice some metal-cpp classes have static funtion like static URL* fileURLWithPath(const class String* pPath); static class ComputePassDescriptor* computePassDescriptor(); static class AccelerationStructurePassDescriptor* accelerationStructurePassDescriptor(); which return a new object. these classes also provide 'alloc' and 'init' function to create object by default. for object created by 'alloc' and 'init', I use something like NS::Shaderd_Ptr or call release directly to free memory. Because 'alloc' and 'init' not explicit call on these static function. I wonder how to correctly free object created by these static function? did they managed by autorelease pool?
2
0
633
Mar ’25
3D Skeletal animation in metal-cpp?
Hey all! I'm got my hands on a refurbished mac mini m1 and already diving into metal. At the moment, i'm currently studying graphics programming with opengl and got to a point where I can almost create a 3d cube. However, I noticed there aren't many tutorials for metal cpp but rather demos. One thing I love about graphic programming, is skinning/skeletal animation. At the moment, I can't find any sources or tutorials on how to load skeletal animations into metal-cpp. So, if I create my character in blender and had all types of animations all loaded into a .FBX or maybe .DAE and load this into metal api with metal-cpp, how can I go on about how this works?
1
0
622
Mar ’25
Metal-CPP Errors
After following the instructions here: https://developer.apple.com/metal/cpp/ I attempted building my project and Xcode presented several errors. In essence it's complaining about some redeclarations in the Metal-CPP headers. NSBundle.hpp and NSError.hpp are included in the metal-cpp/foundation directory from the metal-cpp download. Any help in getting these issues resolved is appreciated. Thanks!
2
0
641
Feb ’25
Can't link metal-cpp to Modern Rendering With Metal sample
There is a sample project from Apple here. It has a scene of a city at night and you can move in it. It basically has 2 parts: application code written in what looks like Objective-C (I am more familiar with C++), which inherits from things like NSObject, MTKView, NSViewController and so on - it processes input and all app-related and window-related stuff. rendering code that also looks like Objective-C. Btw both parts are mostly in .mm files (Obj-C++ AFAIK). The application part directly uses only one class from the rendering part - AAPLRenderer. I want to move the rendering part to C++ using metal-cpp. For that I need to link metal-cpp to the project. I did it successfully with blank projects several times before using this tutorial. But with this sample project Xcode can't find Foundation/Foundation.hpp (and other metal-cpp headers). The error says this: Did not find header 'Foundation.hpp' in framework 'Foundation' (loaded from '/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX15.0.sdk/System/Library/Frameworks') Pls help
2
0
928
Feb ’25
After updating CAMetalLayer.drawableSize, [CAMetalLayer nextDrawable:] frequently takes ~1s
I have a bare-bones Metal app setup where I attach a CAMetalLayer to a window that inherits from a NSWindow with a custom delegate. Everything else is vanilla. I'm also using metal-cpp and metal shader converter. I'm running into a issue where the application runs fine in the beginning, but once I resize the window, it starts hitching. It turns out that [CAMetalLayer nextDrawable:] frequently (but not always) takes around a full second (plus or minus a few milliseconds) to return once drawableSize has been updated. I've tried setting allowsNextDrawableTimeout to false which doesn't work; it returns a valid drawable after a second instead of nil. Setting displaySyncEnabled to false reduces the likelihood of this happening to around 50% from 90%+ but does not eliminate it. Setting maximumDrawableCount to 2 or 3 does not seem to make a difference. By dumping the resource IDs of the returned textures I've noticed something interesting: Before resizing, the layer seems to shuffle between 2 textures or at least 2 resource IDs, but after resizing it starts to create new textures for each returned drawable. Occasionally it seems to reuse a previous resource ID, but it does not seem to have anything to do with whether the method returns quickly or not. Why does this happen, and how can I fix it? Should I create a new CAMetalLayer when resizing the window instead of updating drawableSize?
3
0
886
Jan ’25
Learn Metal
I am interested in learning the Metal framework for rendering development. However, most of Apple’s official documentation uses Objective-C code. Therefore, I am seeking guidance on whether it is more advantageous for me to focus solely on learning Swift to gain proficiency in Metal.
2
0
915
Jan ’25
Metal-cpp-extensions isn't working inside frameworks
I am making a framework in C++ using metal-cpp, basically a small game engine. I am also consequently using metal-cpp-extensions provided in LearnMetalCPP to make applications work. For one of my classes, I needed to add AppKit.hpp inside a public header file, so I moved it and its associate headers(NSApplication.hpp, NSMenu.hpp, etc.) from Project headers to Public in Build Phases' Headers, however, it started giving me the error "cast of C pointer type 'void *' to Objective-C pointer type 'Class' requires a bridged cast" at several points in the AppKit headers. They don't appear when AppKit and its associates are in the Project headers, or when they are in the Private headers and no headers import it. I imagined that disabling Objective-C ARC and Using __bridge casts outside of ARC in Build Settings would solve it, but it didn't budge. I imagined it wouldn't involve actively changing the headers would be the answer, but even if I try to put __bridge before the problematic casts, it didn't recognize __bridge. How do I solve this? And why is it only happening in Public and not Project headers?
2
0
908
Jan ’25
Metal GPU Driver Crash on M5 Pro + macOS 26.5 — kIOGPUCommandBufferCallbackErrorOutOfMemory with <2GB working sets
Metal GPU Driver Crash on M5 Pro + macOS 26.5 — kIOGPUCommandBufferCallbackErrorOutOfMemory with <2GB working sets Summary The Metal driver AGXMetalG17X 351.2 on macOS 26.5 (25F71) for the M5 Pro chip crashes with kIOGPUCommandBufferCallbackErrorOutOfMemory (00000008) when running LLM inference workloads with working sets as small as ~1.5GB, despite 24GB of unified memory being available and Apple Diagnostics confirming the hardware is fully functional. This affects multiple tools: MLX, llama.cpp (Metal backend), and native apps using Metal for inference. System Component Value Model MacBook Pro (Mac17,9) Chip Apple M5 Pro (applegpu_g17s) GPU Cores 16 RAM 24 GB LPDDR5 macOS 26.5 (25F71) Metal Metal 4 GPU Driver AGXMetalG17X 351.2 Xcode 26.5 (17F42) Reproduction MLX (Python) pip install mlx mlx-lm python -m mlx_lm.generate \ --model mlx-community/Qwen2.5-3B-Instruct-4bit \ --max-tokens 10 \ --prompt "Hello" Expected: Normal text generation Actual: Crash with: libc++abi: terminating due to uncaught exception of type std::runtime_error: [METAL] Command buffer execution failed: Insufficient Memory (00000008:kIOGPUCommandBufferCallbackErrorOutOfMemory) llama.cpp brew install llama.cpp llama-cli --model model.gguf --prompt "Hello" --n-predict 20 --n-gpu-layers 99 Expected: Fast GPU generation Actual: Process hangs indefinitely Test Results Tool Model Peak Memory Result MLX Qwen2.5-0.5B-4bit 0.36 GB ✅ Works MLX Qwen2.5-1.5B-4bit 0.98 GB ✅ Works MLX Qwen3-1.7B-4bit 1.01 GB ✅ Works MLX Qwen2.5-3B-4bit ~1.5 GB ❌ Metal OOM crash MLX Qwen3-4B-4bit ~2.1 GB ❌ Metal OOM crash MLX Qwen3-8B-4bit ~4.5 GB ❌ Metal OOM crash llama.cpp Qwen2.5-0.5B GGUF ~0.5 GB ❌ Hangs with GPU llama.cpp Qwen2.5-0.5B GGUF ~0.5 GB ✅ Works with CPU only Key Evidence Hardware is healthy — Apple Diagnostics passed all tests Basic Metal works — matmul, array ops work fine CPU inference works — llama.cpp with -ngl 0 runs correctly The error is NOT about actual memory exhaustion — kIOGPUCommandBufferCallbackErrorOutOfMemory means the kernel rejects the Metal memory commit, not that physical memory is full. The system reports 17.76GB available for Metal working set. Crash Log Extract Thread 31 Crashed: 0 libsystem_kernel.dylib __pthread_kill + 8 1 libsystem_pthread.dylib pthread_kill + 296 2 libsystem_c.dylib abort + 148 3 Metal MTLReportFailure.cold.1 + 48 4 Metal MTLReportFailure + 576 5 Metal -[_MTLCommandBuffer addCompletedHandler:] + 104 ... Exception Type: EXC_CRASH (SIGABRT) Termination Reason: Namespace SIGNAL, Code 6, Abort trap: 6 Related Issues ml-explore/mlx#3586 — Metal compiler regression on macOS 26.5 ml-explore/mlx#3534 — M5 float32 precision issue ml-explore/mlx#3568 — M5 random divergence ml-explore/mlx#3539 — Metal residency OOM (M4 Max) Request Please investigate the AGXMetalG17X driver for M5 Pro on macOS 26.5. The driver appears to incorrectly reject Metal memory commits for LLM inference workloads, even when the working set is well within the system's reported limits (1.5GB requested vs 17.76GB available). Happy to provide full crash logs, sysdiagnose archives, or run additional tests.
Replies
0
Boosts
0
Views
306
Activity
May ’26
Optimizing HZB Mip-Chain Generation and Bindless Argument Tables in a Custom Metal Engine
Hi everyone, I’ve been developing a custom, end-to-end 3D rendering engine called Crescent from scratch using C++20 and Metal-cpp (targeting macOS and visionOS). My primary goal is to build a zero-bottleneck, GPU-driven pipeline that maximizes the potential of Apple Silicon’s Unified Memory and TBDR architecture. While the fundamental systems are stable, I am looking for architectural feedback from Metal framework engineers regarding specific synchronization and latency challenges. Current Core Implementations: GPU-Driven Instance Culling: High-performance occlusion culling using a Hierarchical Z-Buffer (HZB) approach via Compute Shaders. Clustered Forward Shading: Support for high-count dynamic lights through view-space clustering. Temporal Stability: Custom TAA with history rejection and Motion Blur resolve. Asset Infrastructure: Robust GUID-based scene serialization and a JSON-driven ECS hierarchy. The Architectural Challenge: I am currently seeing slight synchronization overhead when generating the HZB mip-chain. On Apple Silicon, I am evaluating the cost of encoder transitions versus cache-friendly barriers. && m_hzbInitPipeline && m_hzbDownsamplePipeline && !m_hzbMipViews.empty(); if (canBuildHzb) { MTL::ComputeCommandEncoder* hzbInit = commandBuffer->computeCommandEncoder(); hzbInit->setComputePipelineState(m_hzbInitPipeline); hzbInit->setTexture(m_depthTexture, 0); hzbInit->setTexture(m_hzbMipViews[0], 1); if (m_pointClampSampler) { hzbInit->setSamplerState(m_pointClampSampler, 0); } else if (m_linearClampSampler) { hzbInit->setSamplerState(m_linearClampSampler, 0); } const uint32_t hzbWidth = m_hzbMipViews[0]->width(); const uint32_t hzbHeight = m_hzbMipViews[0]->height(); const uint32_t threads = 8; MTL::Size tgSize = MTL::Size(threads, threads, 1); MTL::Size gridSize = MTL::Size((hzbWidth + threads - 1) / threads * threads, (hzbHeight + threads - 1) / threads * threads, 1); hzbInit->dispatchThreads(gridSize, tgSize); hzbInit->endEncoding(); for (size_t mip = 1; mip < m_hzbMipViews.size(); ++mip) { MTL::Texture* src = m_hzbMipViews[mip - 1]; MTL::Texture* dst = m_hzbMipViews[mip]; if (!src || !dst) { continue; } MTL::ComputeCommandEncoder* downEncoder = commandBuffer->computeCommandEncoder(); downEncoder->setComputePipelineState(m_hzbDownsamplePipeline); downEncoder->setTexture(src, 0); downEncoder->setTexture(dst, 1); const uint32_t mipWidth = dst->width(); const uint32_t mipHeight = dst->height(); MTL::Size downGrid = MTL::Size((mipWidth + threads - 1) / threads * threads, (mipHeight + threads - 1) / threads * threads, 1); downEncoder->dispatchThreads(downGrid, tgSize); downEncoder->endEncoding(); } if (m_instanceCullHzbPipeline) { dispatchInstanceCulling(m_instanceCullHzbPipeline, true); } } My Questions: Encoder Synchronization: Would you recommend moving this loop into a single ComputeCommandEncoder using MTLBarrier between dispatches to maintain L2 cache residency, or is the overhead of separate encoders negligible for depth-downsampling on TBDR? visionOS Bindless Latency: For stereo rendering on visionOS, what are the best practices for managing MTL4ArgumentTable updates at 90Hz+? I want to ensure that updating bindless resources for each eye doesn't introduce unnecessary CPU-to-GPU latency. Memory Management: Are there specific hints for Memoryless textures that could be applied to intermediate HZB levels to save bandwidth during this process? I’ve attached a screenshot of a scene rendered with the engine (PBR, SSR, and IBL).
Replies
0
Boosts
0
Views
645
Activity
Feb ’26
Xcode Metal Trace
Code is download from apple official metal4 sample [https://developer.apple.com/documentation/metal/drawing-a-triangle-with-metal-4?language=objc] enable metal gpu trace in macOS schema and trace a frame in Xcode. Xcode may show segment fault on App from some 'GTTrace' function when click trace button. When replay a .gputrace file, Xcode may crash , throw an internal error or a XPC error. The example code using old metal-renderer can trace without any problem and everything works fine. Test Environment: Xcode Version 26.2 (17C52) macOS 26.2 (25C56) M1 Pro 16GB A2442
Replies
2
Boosts
0
Views
614
Activity
Jan ’26
# [CRITICAL] Metal RHI Memory Leak - Resource exhaustion vulnerability (CWE-400) - Bug Report
[CRITICAL] Metal API Memory Leak - Heap Memory Never Released to OS (CWE-400) Security Classification This issue constitutes a resource exhaustion vulnerability (CWE-400): Aspect Details Type Uncontrolled Resource Consumption CWE CWE-400 Vector Local (any Metal application) Impact System instability, denial of service User Control None - no mitigation available Recovery Requires application restart Summary Metal heap allocations are never released back to macOS, even when the memory is entirely unused. This causes continuous, unbounded memory growth until system instability or crash. The issue affects any application using Metal API heap allocation. This was discovered in Unreal Engine 5, but reproduces in a completely blank UE5 project with zero application code - confirming this is Metal framework behavior, not application-level. Environment OS: macOS Tahoe 26.2 Hardware: Apple Silicon M4 Max (also reproduced on M1, M2, M3) API: Metal Reproduction Steps Run any Metal application that allocates and deallocates GPU buffers via Metal heaps Open Activity Monitor and observe the application's memory usage Let the application run idle (no user interaction required) Observe memory growing continuously at ~1-2 MB per second Memory never plateaus or stabilizes Eventually system becomes unstable For testing: Any Unreal Engine 5.4+ project on macOS will reproduce this. Even a blank project with no gameplay code exhibits the leak. (Tested on UE 5.7.1) Observed Behavior Memory Analysis Using Unreal's memreport -full command, two reports taken 86 seconds apart: Metric Report 1 (183s) Report 2 (269s) Delta Process Physical 4373.64 MB 4463.39 MB +89.75 MB Metal Heap Buffer 7168 MB 8192 MB +1024 MB Unused Heap 3453 MB 4477 MB +1024 MB Object Count 73,840 73,840 0 (no change) Key Finding Metal Heap grew by exactly 1 GB while "Unused Heap" also grew by 1 GB. This demonstrates: Metal is allocating new heap blocks in ~1 GB increments Previously allocated heap memory becomes "unused" but is never released The unused memory accumulates indefinitely No application-level objects are leaking (count remains constant) Memory Growth Pattern Continuous growth while idle (no user interaction) Growth rate: approximately 1-2 MB per second No plateau or stabilization occurs Metal allocates new 1 GB heap blocks rather than reusing freed space Eventually leads to system instability and crash What is NOT Causing This We verified the following are NOT the source: Application objects - Object count remains constant Application code - Blank project with no code reproduces the issue Texture streaming - Disabling texture streaming had no effect CPU garbage collection - Running GC has no effect (this is GPU memory) Mitigations Attempted (None Worked) setPurgeableState Setting resources to purgeable state before release: [buffer setPurgeableState:MTLPurgeableStateEmpty]; Result: Metal ignores this hint and does not reclaim heap memory. Avoiding Heap Pooling Forcing individual buffer allocations instead of heap-based pooling. Result: Leak persists - Metal still manages underlying allocations. Aggressive Buffer Compaction Attempting to compact/defragment buffers within heaps every frame. Result: Only moves data between existing heaps. Does NOT release heaps back to OS. Reducing Pool Sizes Minimizing all buffer pool sizes to force more frequent reuse. Result: Slightly slows the leak rate but does not stop it. Root Cause Analysis How Metal Heap Allocation Appears to Work Metal allocates GPU heap blocks in large chunks (~1 GB observed) Application requests buffers from these heaps When application releases buffers, memory becomes "unused" within the heap Metal does NOT release heap blocks back to macOS, even when entirely unused When fragmentation prevents reuse, Metal allocates new heap blocks Result: Continuous memory growth with no upper bound The Core Problem There appears to be no Metal API to force heap memory release. The only way to reclaim this memory is to destroy the Metal device entirely, which requires restarting the application. Expected Behavior Metal should: Release unused heaps - When a heap block is entirely unused, release it back to macOS Respect purgeable hints - Honor setPurgeableState calls from applications Compact allocations - Defragment heap allocations to reduce fragmentation Provide control APIs - Allow applications to request heap compaction or release Enforce limits - Have configurable maximum heap memory consumption Security Implications Local Denial of Service - Any Metal application can exhaust system memory, causing instability affecting all running applications Memory Pressure Attack - Forces other applications to swap to disk, degrading system-wide performance No Upper Bound - Memory consumption continues until system failure Unmitigable - End users have no way to prevent or limit the leak Affects All Metal Apps - Any application using Metal heaps is potentially affected Impact Applications become unstable after extended use System-wide performance degrades as memory pressure increases Users must periodically restart applications Developers cannot work around this at the application level Long-running applications (games, creative tools, servers) are particularly affected Request Investigate Metal heap memory management behavior Implement heap release when blocks become entirely unused Honor setPurgeableState hints from applications Consider providing an API for applications to request heap compaction Document any intended behavior or workarounds Additional Notes This issue has been observed across multiple Unreal Engine versions (5.4, 5.7) and multiple Apple Silicon generations (M1 through M4). The behavior is consistent and reproducible. The Unreal Engine team has implemented various CVars to attempt mitigation (rhi.Metal.HeapBufferBytesToCompact, rhi.Metal.ResourcePurgeInPool, etc.) but none successfully address the issue because the root cause is at the Metal framework level. Tested: January 2026 Platform: macOS Tahoe 26.2, Apple Silicon (M1/M2/M3/M4)
Replies
5
Boosts
2
Views
1.3k
Activity
Jan ’26
Basic c++xcodeproj call to swift code
I have c++ macOs app(Xcode +14) and I try to add call to swift code. I can't find any simple c++ xcodeproj call to swift code. I create new simple project and fail to build it with error when I try to include #include <SwiftMixTester/SwiftMixTester-Swift.h>: main.m:9:10: error: 'SwiftMixTester/SwiftMixTester-Swift.h' file not found (in target 'CppCallSwift' from project 'CppCallSwift') note: Did not find header 'SwiftMixTester-Swift.h' in framework 'SwiftMixTester' (loaded from '/Users/yanivsmacm4/Library/Developer/Xcode/DerivedData/CppCallSwift-exdxjvwdcczqntbkksebulvfdolq/Build/Products/Debug') . Please help.
Replies
5
Boosts
0
Views
1.1k
Activity
Oct ’25
Can't find arm_neon.h in macOS 15.6.1
I can't find the arm_neon.h header file in macOS 15.6.1 This header file defines NEON intrinsics.
Replies
1
Boosts
0
Views
1.7k
Activity
Sep ’25
View C++ std::string as char array
How do I make std::string look like a char array? The expression viewer for std::strings is horrible.
Replies
0
Boosts
0
Views
330
Activity
Jun ’25
Where is the `Metal-cpp-extensions` download website?
Hi, I really appreciate the C++ binding provided. I got the metal-cpp source code from the website at Getting Started. However, I could not find the same for metal-cpp-extensions. Is it not available or do we have to always extract it from the sample code? Thanks.
Replies
1
Boosts
0
Views
312
Activity
May ’25
Slow compilation
Hi, I am working with a large project. We are compiling each material to its own .metallib. They all include many common files full of inline functions. Finally we link it all together at the end with a single big pathtrace kernel. Everything works as expected, however the compile times have gotten completely out of hand and it takes multiple minutes to compile at runtime (to native code). I have gathered that I can do this offline by using metal-tt however if I am wondering if there is a way to reduce the compile times in such a scenario, and how to investigate what the root cause of the problem is. I suspect it could have to do with the fact that every materials metallib contains duplications of all the inline functions. Any ideas on how to profile and debug this? Thanks, Rasmus
Replies
0
Boosts
1
Views
216
Activity
Mar ’25
How to properly pass a Metal layer from SwiftUI MTKView to C++ for use with metal-cpp?
Hello! I'm currently porting a videogame console emulator to iOS and I'm trying to make the renderer (tested on MacOS) work on iOS as well. The emulator core is written in C++ and uses metal-cpp for rendering, whereas the iOS frontend is written in Swift with SwiftUI. I have an Objective-C++ bridging header for bridging the Swift and C++ sides. On the Swift side, I create an MTKView. Inside the MTKView delegate, I run the emulator for 1 video frame and pass it the view's backing layer for it to render the final output image with. The emulator runs and returns, but when it returns I get a crash in Swift land (callstack attached below), inside objc_release, which indicates I'm doing something wrong with memory management. My bridging interface (ios_driver.h): #pragma once #include <Foundation/Foundation.h> #include <QuartzCore/QuartzCore.h> void iosCreateEmulator(); void iosRunFrame(CAMetalLayer* layer); Bridge implementation (ios_driver.mm): #import <Foundation/Foundation.h> extern "C" { #include "ios_driver.h" } <...> #define IOS_EXPORT extern "C" __attribute__((visibility("default"))) std::unique_ptr<Emulator> emulator = nullptr; IOS_EXPORT void iosCreateEmulator() { ... } // Runs 1 video frame of the emulator and IOS_EXPORT void iosRunFrame(CAMetalLayer* layer) { void* layerBridged = (__bridge void*)layer; // Pass the CAMetalLayer to the emulator emulator->getRenderer()->setMTKLayer(layerBridged); // Runs the emulator for 1 frame and renders the output image using our layer emulator->runFrame(); } My MTKView delegate: class Renderer: NSObject, MTKViewDelegate { var parent: ContentView var device: MTLDevice! init(_ parent: ContentView) { self.parent = parent if let device = MTLCreateSystemDefaultDevice() { self.device = device } super.init() } func mtkView(_ view: MTKView, drawableSizeWillChange size: CGSize) {} func draw(in view: MTKView) { var metalLayer = view.layer as! CAMetalLayer // Run the emulator for 1 frame & display the output image iosRunFrame(metalLayer) } } Finally, the emulator's render function that interacts with the layer: void RendererMTL::setMTKLayer(void* layer) { metalLayer = (CA::MetalLayer*)layer; } void RendererMTL::display() { CA::MetalDrawable* drawable = metalLayer->nextDrawable(); if (!drawable) { return; } MTL::Texture* texture = drawable->texture(); <rest of rendering follows here using the drawable & its texture> } This is the Swift callstack at the time of the crash: To my understanding, I shouldn't be violating ARC rules as my bridging header uses CAMetalLayer* instead of void* and Swift will automatically account for ARC when passing CoreFoundation objects to Objective-C. However I don't have any other idea as to what might be causing this. I've been trying to debug this code for a couple of days without much success. If you need more info, the emulator code is also on Github Metal renderer: https://github.com/wheremyfoodat/Panda3DS/blob/ios/src/core/renderer_mtl/renderer_mtl.cpp#L58-L68 Bridge implementation: https://github.com/wheremyfoodat/Panda3DS/blob/ios/src/ios_driver.mm Bridging header: https://github.com/wheremyfoodat/Panda3DS/blob/ios/include/ios_driver.h Any help is more than appreciated. Thank you for your time in advance.
Replies
0
Boosts
0
Views
664
Activity
Mar ’25
Question about metal-cpp resource allocation
I notice some metal-cpp classes have static funtion like static URL* fileURLWithPath(const class String* pPath); static class ComputePassDescriptor* computePassDescriptor(); static class AccelerationStructurePassDescriptor* accelerationStructurePassDescriptor(); which return a new object. these classes also provide 'alloc' and 'init' function to create object by default. for object created by 'alloc' and 'init', I use something like NS::Shaderd_Ptr or call release directly to free memory. Because 'alloc' and 'init' not explicit call on these static function. I wonder how to correctly free object created by these static function? did they managed by autorelease pool?
Replies
2
Boosts
0
Views
633
Activity
Mar ’25
3D Skeletal animation in metal-cpp?
Hey all! I'm got my hands on a refurbished mac mini m1 and already diving into metal. At the moment, i'm currently studying graphics programming with opengl and got to a point where I can almost create a 3d cube. However, I noticed there aren't many tutorials for metal cpp but rather demos. One thing I love about graphic programming, is skinning/skeletal animation. At the moment, I can't find any sources or tutorials on how to load skeletal animations into metal-cpp. So, if I create my character in blender and had all types of animations all loaded into a .FBX or maybe .DAE and load this into metal api with metal-cpp, how can I go on about how this works?
Replies
1
Boosts
0
Views
622
Activity
Mar ’25
Metal-CPP Errors
After following the instructions here: https://developer.apple.com/metal/cpp/ I attempted building my project and Xcode presented several errors. In essence it's complaining about some redeclarations in the Metal-CPP headers. NSBundle.hpp and NSError.hpp are included in the metal-cpp/foundation directory from the metal-cpp download. Any help in getting these issues resolved is appreciated. Thanks!
Replies
2
Boosts
0
Views
641
Activity
Feb ’25
some functions from NS::Bundle in metal-cppnot
*** Terminating app due to uncaught exception 'NSInvalidArgumentException', reason: '-[NSBundle allFrameworks]: unrecognized selector sent to instance NS::Bundle* Bundle = NS::Bundle::mainBundle(); Bundle->allFrameworks(); call to allFrameworks() and allBundles() will throw exception, but other functions work well.
Replies
3
Boosts
0
Views
728
Activity
Feb ’25
Can't link metal-cpp to Modern Rendering With Metal sample
There is a sample project from Apple here. It has a scene of a city at night and you can move in it. It basically has 2 parts: application code written in what looks like Objective-C (I am more familiar with C++), which inherits from things like NSObject, MTKView, NSViewController and so on - it processes input and all app-related and window-related stuff. rendering code that also looks like Objective-C. Btw both parts are mostly in .mm files (Obj-C++ AFAIK). The application part directly uses only one class from the rendering part - AAPLRenderer. I want to move the rendering part to C++ using metal-cpp. For that I need to link metal-cpp to the project. I did it successfully with blank projects several times before using this tutorial. But with this sample project Xcode can't find Foundation/Foundation.hpp (and other metal-cpp headers). The error says this: Did not find header 'Foundation.hpp' in framework 'Foundation' (loaded from '/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX15.0.sdk/System/Library/Frameworks') Pls help
Replies
2
Boosts
0
Views
928
Activity
Feb ’25
CAMetalDisplayLink in metal-cpp
The title is self-exploratory. I wasn't able to find the CAMetalDisplayLink on the most recent metal-cpp release (metal-cpp_macOS15_iOS18-beta). Are there any plans to include it in the next release?
Replies
1
Boosts
1
Views
1.1k
Activity
Feb ’25
Any example about using metal-cpp on iOS?
Now the examples of metal-cpp are target on desktop and using AppKit which is not supported on iOS. Is there any tips for developing with metal-cpp on mobile device?
Replies
2
Boosts
0
Views
1.3k
Activity
Feb ’25
After updating CAMetalLayer.drawableSize, [CAMetalLayer nextDrawable:] frequently takes ~1s
I have a bare-bones Metal app setup where I attach a CAMetalLayer to a window that inherits from a NSWindow with a custom delegate. Everything else is vanilla. I'm also using metal-cpp and metal shader converter. I'm running into a issue where the application runs fine in the beginning, but once I resize the window, it starts hitching. It turns out that [CAMetalLayer nextDrawable:] frequently (but not always) takes around a full second (plus or minus a few milliseconds) to return once drawableSize has been updated. I've tried setting allowsNextDrawableTimeout to false which doesn't work; it returns a valid drawable after a second instead of nil. Setting displaySyncEnabled to false reduces the likelihood of this happening to around 50% from 90%+ but does not eliminate it. Setting maximumDrawableCount to 2 or 3 does not seem to make a difference. By dumping the resource IDs of the returned textures I've noticed something interesting: Before resizing, the layer seems to shuffle between 2 textures or at least 2 resource IDs, but after resizing it starts to create new textures for each returned drawable. Occasionally it seems to reuse a previous resource ID, but it does not seem to have anything to do with whether the method returns quickly or not. Why does this happen, and how can I fix it? Should I create a new CAMetalLayer when resizing the window instead of updating drawableSize?
Replies
3
Boosts
0
Views
886
Activity
Jan ’25
Learn Metal
I am interested in learning the Metal framework for rendering development. However, most of Apple’s official documentation uses Objective-C code. Therefore, I am seeking guidance on whether it is more advantageous for me to focus solely on learning Swift to gain proficiency in Metal.
Replies
2
Boosts
0
Views
915
Activity
Jan ’25
Metal-cpp-extensions isn't working inside frameworks
I am making a framework in C++ using metal-cpp, basically a small game engine. I am also consequently using metal-cpp-extensions provided in LearnMetalCPP to make applications work. For one of my classes, I needed to add AppKit.hpp inside a public header file, so I moved it and its associate headers(NSApplication.hpp, NSMenu.hpp, etc.) from Project headers to Public in Build Phases' Headers, however, it started giving me the error "cast of C pointer type 'void *' to Objective-C pointer type 'Class' requires a bridged cast" at several points in the AppKit headers. They don't appear when AppKit and its associates are in the Project headers, or when they are in the Private headers and no headers import it. I imagined that disabling Objective-C ARC and Using __bridge casts outside of ARC in Build Settings would solve it, but it didn't budge. I imagined it wouldn't involve actively changing the headers would be the answer, but even if I try to put __bridge before the problematic casts, it didn't recognize __bridge. How do I solve this? And why is it only happening in Public and not Project headers?
Replies
2
Boosts
0
Views
908
Activity
Jan ’25