使用 Metal 张量优化自定机器学习运算

使用 Metal Tensor API 和 Metal Performance Primitives (MPP) Tensor Ops 资源库，解锁强大的机器学习性能。探索如何利用 Apple M5 和 A19 GPU 中的神经网络加速器构建可移植的运算。了解如何为你的 Core AI 应用程序构建自定机器学习内核，以及如何高效地处理量化数据格式并优化 GPU 内存。

章节

0:00 - Introduction
0:21 - Apple's ML software stack
2:25 - Managing quantized data
4:23 - Multi-plane tensors
5:17 - Quantized matrix multiplication
9:31 - Building advanced ops
13:35 - Integrating custom ops into Core AI
15:25 - Next steps

资源

相关视频

Tech Talks

使用 M5 和 A19 GPU 加速机器学习工作负载

WWDC25

实现 Metal 4 机器学习与图形应用程序的完美融合

你好，我叫Shiyao，我是一名GPU软件工程师。

今天，我很高兴带领大家探索Metal张量，并向你们展示如何使用TensorOps 编写经过优化的自定义ML内核。

Apple平台提供一流的支持，可在软件栈的每一层运行ML模型。 Core AI和MLX等高层框架使你能以极少的代码轻松部署模型，而Metal Performance Shaders等低层API 则提供对高性能Metal内核的访问。这些层都构建在低层加速之上，由Metal Performance Primitives 和TensorOps库提供。在Metal层面工作有几个原因。 ML研究发展迅速，你可能需要实现自定义操作，这些操作可以接入Core AI等更高层次的框架。如果你正在为ML框架做贡献，也可能需要编写Metal内核，比如MLX或llama.cpp，或者你正在开发基于Metal的应用。入门最简单的方式是使用TensorOps库。 TensorOps是一个Metal着色语言API，可在GPU上加速张量操作，包括矩阵乘法和卷积。它会自动利用任何可用的硬件加速，跨所有Apple Silicon GPU世代，因此你无需担心不同硬件世代之间的差异。特别是，它充分利用了神经加速器，即M5芯片系列中的神经加速器。

神经加速器是M5中全新的硬件模块，直接位于每个着色器核心内。它与其他GPU流水线并列，专为加速密集型计算密集型工作而设计，例如LLM的预填充阶段。

你可以查阅相关讲座，了解TensorOps入门基础知识。在本session中，我将在这些基础上进行扩展，从处理量化数据的最佳实践开始。然后，我会展示如何构建高级自定义操作，例如FlashAttention。

让我们深入第一个主题—— 处理量化数据。

众所周知，最先进的机器学习模型正在变得越来越大。推理阶段通常受内存带宽限制，因此压缩权重变得必不可少，既为了让模型更好地适应内存，也为了节省内存带宽。

压缩权重的标准方法是量化。思路很简单—— 取较高精度的权重，将其降低为较低精度的数据类型。例如，16位半精度权重可以压缩至仅4位。这些量化权重与缩放因子配对，缩放因子让我们能将量化值在计算时缩放回原始范围。

除了16位和32位浮点类型之外， TensorOps现在原生支持量化数据类型。我们在macOS和iOS 26的更新中新增了对4位和8位整数类型的支持，在macOS和iOS 26的更新中，并在macOS和iOS 27中扩展支持更多数据类型。这包括4位和8位浮点类型，以及2位整数类型。你只需创建量化张量并传递给TensorOps，它将自动利用任何可用的硬件加速。

创建具有量化数据类型的张量与创建普通张量非常相似。像其他张量一样填写描述符的属性，但只需指定量化的dataType。然后通过在Metal设备上调用 newTensorWithDescriptor 来创建张量。

这就是存储量化元素数据的方式。接下来，我们来谈谈缩放因子。在macOS和iOS 27中，

单个MTLTensor对象现在可以将你的缩放因子与张量的量化数据一起表示，作为额外的缩放平面。该平面支持流行的FP8 E8M0分块缩放因子格式。缩放平面的每个元素应用于数据平面中的一个元素块。声明缩放平面类似于声明张量。首先，为缩放平面创建一个描述符对象。然后填入dataType 和blockFactors。最后，创建一个辅助平面映射，以指定该平面用于缩放。

然后只需将辅助平面映射附加到原始tensorDescriptor即可。量化数据、缩放因子，以及元数据，都将被打包到单个张量对象中。

现在让我们将其付诸实践，通过扩展一个基本的矩阵乘法内核来支持量化。矩阵乘法是机器学习工作负载中的核心操作。例如，LLM在推理过程中执行数百万次矩阵乘法。

我们介绍了如何使用TensorOps 编写高性能矩阵乘法内核的基础知识，在M5机器学习讲座中。基本方法是将输入矩阵切分为较小的分块，然后使用TensorOps 执行按分块的矩阵乘法。这最大化了并行性并使数据保留在缓存中。

我们可以使用量化来进一步减少内存流量，并将更大的模型装入内存。在内核中，预先定义类型别名很有帮助，在绑定张量之前。这里我们声明一个缩放因子平面，使用 fp8_e8m0_ 数据类型，块大小为32×1。这意味着数据平面中每32个元素共享缩放平面中的单个元素。然后我们声明完整的张量类型，指定FP8数据类型，以及 scales_plane 。你只需将这些张量绑定到缓冲区绑定点。内核将可以访问张量，即你在主机端分配的张量。或者，如果你不想在主机端创建完整的MTLTensor，你可以直接在着色器栈上创建临时张量。语法几乎完全相同，只需将标签tensor_handle 替换为tensor_inline即可。然后将你的缓冲区指针和其他元数据传递给张量构造函数，即可在栈上创建张量。

如前所述，我们将把问题分配到多个线程组中以获得更好的并行性。首先，我们为每个线程组切分分块，然后使用TensorOps 执行乘法。

为此，只需在输入和输出张量上调用slice，使用线程组ID。数据平面和缩放平面将同时被切分，根据块大小。使用量化张量设置矩阵乘法与普通张量完全相同。首先，设置matmul2d_descriptor，指定分块大小和其他参数。然后创建一个matmul2d运算，指定线程组中 simdgroup的数量。然后只需传入你的量化张量， TensorOps将为你处理反量化。

在大多数情况下，你应该将量化数据直接馈送给TensorOps，这样它就能自动利用任何可用的硬件加速。但是，如果你需要对自定义格式进行反量化， TensorOps仍然可以满足你的需求。最简单的方法是让每个线程从设备内存加载一块量化数据，并将其反量化为线程组内存中的f16值。然后你可以将其作为内联线程组张量传递给TensorOps。但是，这种方法需要额外的加载和存储，通过线程组内存。理想情况下，我们应该将所有数据保留在线程寄存器中。你可以通过将数据反量化到协作张量来实现这一点，该张量现在可以作为输入传递给matmul2d运算。协作张量将其存储分布在线程私有内存中，即参与matmul操作的各线程的私有内存。因此，如果无法直接使用量化张量，你仍然可以跳过经由线程组内存的往返。总结一下——Metal张量原生支持各种量化数据类型，包括新的MX缩放格式，以及即将在iOS和macOS 27中推出的 E8M0缩放因子。请注意，这些新数据类型具有额外的对齐要求，与较大的数据类型相比，因此请务必查阅 Metal文档以获取详细信息。

现在让我们更进一步—— 使用TensorOps构建完整、更复杂的自定义操作。注意力机制是每个 Transformer网络的核心，包括LLM。为了计算注意力，首先将两个矩阵相乘，称为Q和K。接下来，使用对中间矩阵各行的归约来计算SoftMax。

最后，乘以第三个矩阵，称为V。流行的FlashAttention算法将所有这些操作融合在一起，成为单个内核。

要使用TensorOps实现这一点，你首先需要设置自定义simd组映射，使每个simd组拥有中间矩阵的完整行。这使你能够计算SoftMax，而无需在simd组之间交换数据。你可以使用 execution_simdgroup操作作用域来实现这一点。这意味着每个simd组将并行执行独立的矩阵乘法。

你可以使用simd组ID 来切分输入分块。我们将使用协作张量来存储中间矩阵，这样我们就可以将其作为输入传递给下一步，而无需将其写入内存。我们将对结果计算SoftMax。

为此，我们需要计算几次归约，对协作张量。 TensorOps包含一个reduce_rows函数来帮助完成这项工作。线程将相互之间交换数据，以计算每行的最大值。结果在另一个协作张量中返回。

让我们来设置一下。首先，创建一个协作张量来存储归约输出。然后将源和目标传递给reduce_rows函数。这里我们将使用max归约操作，初始值为负无穷。

这两个协作张量具有不同的形状，因此为了在它们之间进行映射， TensorOps还包含一个map_iterator函数。给定一个指向2D张量中某元素的迭代器，它返回一个指向对应元素的迭代器，在归约目标中。

首先，使用迭代器设置遍历2D协作张量的循环。然后调用map_iterator，将每个元素映射到其对应的行最大值。最后，解引用这些迭代器以计算SoftMax，并将结果存回协作张量中。

现在我们准备好将这个协作张量乘以V了。在macOS 26中，你必须先将其存储到线程组内存中。但现在可以直接使用协作张量作为matmul操作的输入。

为此，调用 get_left_input_cooperative_tensor方法，将源协作张量作为参数传递。然后你可以将结果作为输入传递给第二个matmul操作。有一点需要注意：并非每个协作张量都能重新用作输入。布局可能因数据类型和其他因素而有所不同。因此，在这样做之前，调用is_compatible_as_left 或right_input方法，以检查兼容性。如果返回true，则可以继续。如果不是，你需要存储并重新加载数据，通过线程组内存将其转换为正确的布局。无论哪种方式，调用op.run的方式都相同。这些是你构建高级操作所需的关键TensorOps功能，例如使用TensorOps实现FlashAttention。现在我们已经了解了如何构建此操作，让我们看看它在使用Core AI的真实模型中如何运行。

Core AI为Python开发者提供工具，将PyTorch模型转换为Core AI模型，包括对自定义Metal内核的支持。请查看《深入了解 Core AI 模型创作与优化》讲座，了解如何将Metal内核集成到Core AI模型的详细信息。

我已按照该session 中概述的步骤，将我们的自定义 FlashAttention内核集成到Sam3图像分割模型中。我们将自定义注意力内核的主体定义为Python中的字符串，并注册如图所示的TorchMetalKernel对象。

然后，我们替换默认的 huggingface注意力实现，改为调用我们内核的实现，如图所示。

最后，我们从huggingface 加载模型，并将其从PyTorch导出为优化的Core AI资产。导出需要一点时间才能完成。

现在我们可以进行推理了。 Sam3执行可提示的概念分割，因此我们向模型提供图像和文本，然后它会以分割掩码进行响应，指示对象在图像中的位置。这里，我让模型标记图中所有包含汽车的像素。

好，现在我将运行分割。

查看最终结果，我们可以看到模型正确地分割了图像。汽车以蓝色高亮显示，因此我们的注意力内核已按预期完全集成到模型中。

今天，我介绍了你可以使用的所有工具，用于在Apple Silicon上构建优化的自定义ML内核。从量化数据类型，到高级TensorOps功能，例如协作张量和归约，再到与Core AI的集成。要进一步探索，请查阅Metal Performance Primitives 文档以获取完整API参考，以及编程指南以获取更多性能优化指南。你还可以下载 TensorOps示例代码，查看我在这里无法涵盖的详细内容。并务必查阅相关sessions，了解更多关于Core AI和Metal的内容。谢谢！

3:53 - Create a quantized MTLTensor

// Creating a tensor with a quantized data type from device

#define RANK 2

MTLTensorDescriptor *tensorDesc = [MTLTensorDescriptor new];

tensorDesc.dataType = MTLTensorDataTypeMetalFloat8E4M3;
tensorDesc.usage = MTLTensorUsageCompute;

NSInteger dimensions[RANK] = {NumCols, NumRows};
tensorDesc.dimensions = [[MTLTensorExtents alloc] initWithRank:RANK values:dimensions];

NSError *err = nil;
id <MTLTensor> tensor = [device newTensorWithDescriptor:tensorDesc error:&err];

4:48 - Declare a multi-plane tensor with scale factors

// Creating a tensor with a scales auxiliary plane from device

#define RANK 2

MTLTensorAuxiliaryPlaneDescriptor *planeDesc = [MTLTensorAuxiliaryPlaneDescriptor new];
planeDesc.dataType = MTLTensorDataTypeMetalFloat8UE8M0;

NSInteger blockFactors[RANK] = {32, 1};
planeDesc.blockFactors = [[MTLTensorExtents alloc] initWithRank:RANK values:blockFactors];

MTLTensorAuxiliaryPlaneDescriptorMap *auxiliaryPlanes =
    [MTLTensorAuxiliaryPlaneDescriptorMap new];
[auxiliaryPlanes setDescriptor:planeDesc forPlane:MTLTensorPlaneTypeScales];

MTLTensorDescriptor *tensorDesc = [MTLTensorDescriptor new];
tensorDesc.dataType = MTLTensorDataTypeMetalFloat8E4M3;
tensorDesc.usage = MTLTensorUsageCompute;

NSInteger dimensions[RANK] = {NumCols, NumRows};
tensorDesc.dimensions = [[MTLTensorExtents alloc] initWithRank:RANK values:dimensions];
tensorDesc.auxiliaryPlanes = auxiliaryPlanes;

NSError *err = nil;
id <MTLTensor> tensor = [device newTensorWithDescriptor:tensorDesc error:&err];

6:07 - MSL type aliases for an MXFP8 tensor handle

// Type aliases for a MXFP8 multi-plane tensor handle

#include <metal_tensor>

using namespace metal;

using scales_plane = tensor_blockwise<tensor_plane_scales,
                                      device metal_fp8_ue8m0_format,
                                      32, 1>;

using mxfp8_tensor = tensor<device metal_fp8_e4m3_format,
                            dextents<int, 2>,
                            tensor_handle,
                            scales_plane>;

kernel void matmul(mxfp8_tensor matrixA [[buffer(0)]],
                   mxfp8_tensor matrixB [[buffer(1)]],
                   tensor<device half, dextents<int, 2>> matrixC [[buffer(2)]])
{
    // ...
}

6:51 - Declare an inline MXFP8 tensor on the stack

// Type aliases for a MXFP8 multi-plane tensor inline

#include <metal_tensor>

using namespace metal;

using scales_plane = tensor_blockwise<tensor_plane_scales,
                                      device metal_fp8_ue8m0_format,
                                      32, 1>;

using mxfp8_tensor_inline = tensor<device metal_fp8_e4m3_format,
                                   dextents<int, 2>,
                                   tensor_inline,
                                   scales_plane>;

// Construct tensor on the stack from buffer pointers
mxfp8_tensor_inline matrixA(dataBufferA,
                             dextents<int, 2>(K, M),
                             array<int, 2>({ 1, K }),
                             scales_plane(scalesBufferA));

7:19 - Slice tensors and run a quantized matmul

// Slice the tensors to extract the relevant tile
auto tA = matrixA.slice(0, tgid.y * TILEM);
auto tB = matrixB.slice(tgid.x * TILEN, 0);
auto tC = matrixC.slice(tgid.x * TILEN, tgid.y * TILEM);

// Set up the matmul descriptor
constexpr auto descriptor = matmul2d_descriptor(TILEM,                  // M
                                                TILEN,                  // N
                                                dynamic_length_v<int>,  // K
                                                false,   // Left matrix transposed
                                                false);  // Right matrix transposed

matmul2d<descriptor, execution_simdgroups<4>> op;

// Run the op — TensorOps handles dequantization automatically
op.run(tA, tB, tC);

10:27 - Set up simdgroup-scoped QxK multiplication

// Setup QxK matrix multiplication op
constexpr auto mul_qk_op_desc = matmul2d_descriptor(/* ... */);
matmul2d<mul_qk_op_desc, execution_simdgroups> mul_qk_op;

// Slice Q, K, V
auto tQSlice = tQ.slice<D, ROWS_PER_SIMD>(0, sgid * ROWS_PER_SIMD);
auto tKSlice = tK.slice<D, BK>(0, k);
auto tVSlice = tV.slice<D, BK>(0, k);

// Create cooperative tensor to store tile of QxK
auto ctQK = mul_qk_op.get_destination_cooperative_tensor<decltype(tQSlice),
                                                         decltype(tKSlice),
                                                         float>();

// Multiply QxK
mul_qk_op.run(tQSlice, tKSlice, ctQK);

11:18 - Compute row-wise reduction for SoftMax

// Create a cooperative tensor to store row reduction output
auto ctTileRowMax = mul_qk_op.get_row_reduction_destination_cooperative_tensor<
                        decltype(tQSlice),
                        decltype(tKSlice),
                        float>();

// Compute max over each row of QxK tile
reduce_rows(ctQK, ctTileRowMax, reduction_operation::max, -INFINITY);

11:56 - Compute element-wise SoftMax with map_iterator

// Iterate over elements of QxK tile
#pragma clang loop unroll(full)
for (auto it = ctQK.begin(); it != ctQK.end(); it++) {
    // Fetch row max corresponding to this element
    auto row_it = ctRowMax.map_iterator(it);

    // Subtract row max from each element and compute exponent
    *it = exp(*it - *row_it);
}

12:33 - Reuse cooperative tensor as matmul input

constexpr auto mul_sv_op_desc = matmul2d_descriptor(/* ... */);
matmul2d<mul_sv_op_desc, metal::execution_simdgroup> mul_sv_op;

if (mul_sv_op.is_compatible_as_left_input<float, half, float>(ctQK)) {
    // Directly reuse cooperative tensor as input
    auto ctQKIn = mul_sv_op.get_left_input_cooperative_tensor<float, half, float>(ctQK);
    mul_sv_op.run(ctQKIn, tVSlice, ctO);
} else {
    // Store and reload through threadgroup memory if layout is not compatible
    ctQK.store(tgTensor);
    simdgroup_barrier(mem_flags::mem_threadgroup);

    auto ctQKIn = mul_sv_op.get_left_input_cooperative_tensor<float, half, float>();
    ctQKIn.load(tgTensor);
    mul_sv_op.run(ctQKIn, tVSlice, ctO);
}

0:00 - Introduction
Overview of how Metal tensors and TensorOps enable you to write optimized custom ML kernels on Apple Silicon.
0:21 - Apple's ML software stack
A tour of Apple's ML software stack, from high-level frameworks like Core AI and MLX down to Metal Performance Shaders, Metal Performance Primitives, and the TensorOps library — and why you might want to work at the Metal level.
2:25 - Managing quantized data
How quantization reduces memory bandwidth requirements for large models, and the new quantized data types natively supported by TensorOps, including MX scaling formats.
4:23 - Multi-plane tensors
How a single MTLTensor object can now represent both quantized element data and scale factors as separate planes, and how to configure multi-plane tensor descriptors in your Metal shaders.
5:17 - Quantized matrix multiplication
How to extend a tiled matrix multiplication kernel to support quantized inputs, including binding scales planes, using inline tensors, slicing with threadgroup IDs, and handling custom dequantization formats.
9:31 - Building advanced ops
How to implement Flash Attention with TensorOps, covering custom SIMD group mappings, cooperative tensors, row reductions, SoftMax, and the new API for passing cooperative tensors directly as matrix multiplication inputs — eliminating the threadgroup memory round-trip.
13:35 - Integrating custom ops into Core AI
How to integrate a custom Metal TensorOps kernel into a Core AI application, using Core AI's Python tools to convert PyTorch models and plug in custom Metal operations.
15:25 - Next steps
A summary of the TensorOps features covered — quantized types, multi-plane tensors, Flash Attention, and Core AI integration — with pointers to sample code and related sessions on Core AI and Metal.

探索“入门汇总”

及时了解最新动态

探索“平台”

精选

探索“技术”

精选

探索“社区”

精选

探索“文档”

发布说明

探索“下载”

精选

探索“支持”

精选

快速链接

使用 Metal 张量优化自定机器学习运算

章节

资源

相关视频

Tech Talks

WWDC25