Mixed-input matrix multiplication performance optimizations

Posted by Manish Gupta, Employees Software program Engineer, Google Analysis

AI-driven applied sciences are weaving themselves into the material of our every day routines, with the potential to reinforce our entry to data and enhance our general productiveness. The spine of those purposes lies in giant language fashions (LLMs). LLMs are memory-intensive and sometimes require specialised {hardware} accelerators to effectively ship tens of exaflops of computing energy. This weblog publish exhibits how we will begin addressing the computational challenges by using reminiscence extra successfully.

The majority of an LLM’s reminiscence and compute are consumed by weights in matrix multiplication operations. Utilizing narrower information varieties reduces reminiscence consumption. For instance, storing weights within the 8-bit integer (i.e., U8 or S8) information kind reduces the reminiscence footprint by 4× relative to single-precision (F32) and a pair of× relative to half-precision (F16) or bfloat16 (BF16). Moreover, earlier work has proven that LLM fashions working matrix multiplications with weights in S8 and enter in F16 (preserving larger precision of the user-input) is an efficient methodology for growing the effectivity with acceptable trade-offs in accuracy. This system is called weight-only quantization and requires environment friendly implementation of matrix multiplication with mixed-inputs, e.g., half-precision enter multiplied with 8-bits integer. {Hardware} accelerators, together with GPUs, help a set set of information varieties, and thus, mixed-input matrix multiplication requires software program transformations to map to the {hardware} operations.

To that finish, on this weblog we concentrate on mapping mixed-input matrix multiplication onto the NVIDIA Ampere structure. We current software program methods addressing information kind conversion and format conformance to map mixed-input matrix multiplication effectively onto hardware-supported information varieties and layouts. Our outcomes present that the overhead of extra work in software program is minimal and allows efficiency near the height {hardware} capabilities. The software program methods described listed here are launched within the open-source NVIDIA/CUTLASS repository.

Reminiscence footprint for an 175B parameter LLM mannequin with numerous information varieties codecs.

The matrix-multiply-accumulate operation

Trendy AI {hardware} accelerators comparable to Google’s TPU and NVIDIA’s GPU multiply matrices natively within the {hardware} by focusing on Tensor Cores, that are specialised processing components to speed up matrix operations, significantly for AI workloads. On this weblog, we concentrate on NVIDIA Ampere Tensor Cores, which offer the matrix-multiply-accumulate (mma) operation. For the remainder of the weblog the reference to mma is for Ampere Tensor Cores. The supported information varieties, shapes, and information format of the 2 enter matrices (known as operands) for the mma operation are fastened in {hardware}. Which means matrix multiplications with numerous information varieties and bigger shapes are carried out within the software program by tiling the issue onto hardware-supported information varieties, shapes, and layouts.

The Tensor Core mma operation is outlined by specifying two enter matrices (e.g., A & B, proven under) to supply a end result matrix, C. The mma operation natively helps mixed-precision. Combined-precision Tensor Cores enable mixing enter (A and B) information kind with the end result (C) information kind. In distinction, mixed-input matrix multiplication includes mixing the enter information varieties, and it’s not supported by the {hardware}, so it must be carried out within the software program.

Tensor Core operation of M-by-N-by-Okay on enter matrix A of M-by-Okay and matrix B of Okay-by-N produces output matrix C of M-by-N.

Challenges of mixed-input matrix multiplication

To simplify the dialogue, we limit to a particular instance of mixed-input matrix multiplication: F16 for person enter and U8 for the mannequin weights (written as F16 * U8). The methods described right here work for numerous combos of mixed-input information varieties.

A GPU programmer can entry a hierarchy of reminiscence, together with international reminiscence, shared reminiscence, and registers, that are organized so as of lowering capability however growing pace. NVIDIA Ampere Tensor Core mma operations eat enter matrices from registers. Moreover, enter and output matrices are required to adapt to a format of information inside a bunch of 32 threads often known as a warp. The supported information kind and format inside a warp are fastened for an mma operation, so to implement mixed-input multiplication effectively, it’s essential to unravel the challenges of information kind conversion and format conformance in software program.

Knowledge kind conversion

The mma operation requires two enter matrices with the identical information kind. Thus, mixed-input matrix multiplication, the place one of many operands is saved in U8 in international reminiscence and different in F16, requires a knowledge kind conversion from U8 to F16. The conversion will carry two operands to F16, mapping the mixed-input matrix multiplication to hardware-supported mixed-precision Tensor Cores. Given the massive variety of weights, there are numerous such operations, and our methods present tips on how to cut back their latency and enhance efficiency.

Format conformance

The mma operation additionally requires the format of two enter matrices, throughout the registers of a warp, to be conformat with {hardware} specification. The format for the enter matrix B of U8 information kind in mixed-input matrix multiplication (F16 * U8) wants to adapt with the transformed F16 information kind. That is known as format conformance and must be achieved within the software program.

The determine under exhibits an mma operation consuming matrix A and matrix B from registers to supply matrix C in registers, distributed throughout one warp. The thread T0 is highlighted and zoomed in to indicate the load matrix B goes by information kind conversion and wishes a format conformance to have the ability to map to the hardware-supported Tensor Core operation.

Software program methods addressing challenges

A typical information kind conversion includes a sequence of operations on 32-bit registers, proven under. Every rectangular block represents a register and the adjoining textual content are the operations. The complete sequence exhibits the conversion from 4xU8 to 2x(2xF16). The sequence includes roughly 10 operations.

There are numerous methods of attaining format conformance. Two of the prevailing options are:

Narrower bitwidth shared reminiscence masses: On this method, threads concern slender bitwidth reminiscence masses shifting the U8 information from shared reminiscence to registers. This leads to two 32-bit registers, with every register containing 2xF16 values (proven above for the matrix B’s thread T0). The narrower shared reminiscence load achieves format conformance straight into registers without having any shuffles; nonetheless, it doesn’t make the most of the complete shared reminiscence bandwidth.
Pre-processing in international reminiscence: Another technique includes rearranging the info throughout the international reminiscence (one degree above the shared reminiscence in reminiscence hierarchy), permitting wider shared reminiscence masses. This method maximizes the shared reminiscence bandwidth utilization and ensures that the info is loaded in a conformant format straight within the registers. Though the rearrangement course of may be executed offline previous to the LLM deployment, making certain no impression on the appliance efficiency, it introduces an extra, non-trivial hardware-specific pre-processing step that requires an additional program to rearrange the info. NVIDIA/FasterTransformer adopts this methodology to successfully handle format conformance challenges.

Optimized software program methods

To additional optimize and cut back the overhead of information kind conversion and format conformance, we now have carried out FastNumericArrayConvertor and FragmentShuffler, respectively.

FastNumericArrayConvertor operates on 4xU8 in 32-bit registers with out unpacking particular person 1xU8 values. Moreover, it makes use of cheaper arithmetic operations which reduces the variety of directions and will increase the pace of the conversion.

The conversion sequence for U8-to-F16 is proven under. The operations use packed 32b registers, avoiding express unpacking and packing. FastNumericArrayConvertor makes use of the permute byte to rearrange bytes of 4xU8 into two registers. Moreover, FastNumericArrayConvertor doesn’t use costly integer to floating-point conversion directions and employs vectorized operations to acquire the packed leads to two 32-bit registers containing 2x(2xF16) values. The FastNumericArrayConvertor for U8-to-F16 roughly makes use of six operations, a 1.6× discount relative to the method proven above.

FastNumericArrayConvertor makes use of permute bytes and packed arithmetic, lowering the variety of directions within the information kind conversion.

FragmentShuffler handles the format conformance by shuffling information in a approach that permits the usage of wider bitwidth load operation, growing shared reminiscence bandwidth utilization and lowering the full variety of operations.

NVIDIA Ampere structure supplies a load matrix instruction (ldmatrix). The ldmatrix is a warp-level operation, the place 32 threads of a warp transfer the info from shared reminiscence to registers within the form and format that mma matrix A and B eat. The usage of ldmatrix reduces the variety of load directions and will increase the reminiscence bandwidth utilization. Because the ldmatrix instruction strikes U8 information to registers, the format after the load conforms with U8*U8 mma operation, and never with F16*F16 mma operation. We carried out FragmentShuffler to rearrange the info inside registers utilizing shuffle (shfl.sync) operations to realize the format conformance.

Probably the most vital contribution of this work is to realize format conformance by register shuffles, avoiding offline pre-processing in international reminiscence or narrower bitwidth shared reminiscence masses. Moreover, we offer implementations for FastNumericArrayConvertor masking information kind conversion from U8-to-F16, S8-to-F16, U8-to-BF16, and S8-to-BF16.

Efficiency outcomes

We measured the efficiency of eight mixed-input variants of our methodology (proven under in blue and pink; various the info kinds of matrix A and B) and two mixed-precision information varieties (proven in inexperienced) on an NVIDIA A100 SXM chip. The efficiency outcomes are proven in FLOPS (larger is healthier). Notably, the primary eight matrix-multipications require extra operations relative to the final two, as a result of the mixed-precision variants straight goal hardware-accelerated Tensor Core operations and don’t want information kind conversion and format conformance. Even so, our method demonstrates mixed-input matrix multiplication efficiency solely barely under or on par with mixed-precision.

Combined-input matrix multiplication efficiency on NVIDIA A100 40GB SMX4 chip for a compute-bound matrix drawback form m=3456, n=4096, okay=2048.

Acknowledgements

We wish to point out a number of of us who’ve contributed by technical brainstorming and enhancing the weblog publish together with, Quentin Colombet, Jacques Pienaar, Allie Culp, Calin Cascaval, Ashish Gondimalla, Matt Walsh, Marek Kolodziej, and Aman Bhatia. We wish to thank our NVIDIA companions Rawn Henry, Pradeep Ramani, Vijay Thakkar, Haicheng Wu, Andrew Kerr, Matthew Properly, and Vartika Singh.

Important Pages:

Mixed-input matrix multiplication performance optimizations – Google Research Blog

AI could help people find common ground during deliberations

Katanemo Open Sources Arch-Function: A Set of Large Language Models (LLMs) Promising Ultra-Fast Speeds at Function-Calling Tasks for Agentic Workflows

Artificial intelligence meets “blisk” in new DARPA-funded collaboration

Intro to AI: a beginner’s guide to artificial intelligence from MIT Technology Review

Meissonic: A Non-Autoregressive Mask Image Modeling Text-to-Image Synthesis Model that can Generate High-Resolution Images

Combining next-token prediction and video diffusion in computer vision and robotics | KryptoCoinz

OpenAI says ChatGPT treats us all the same (most of the time)

AutoDAN-Turbo: A Black-Box Jailbreak Method for LLMs with a Lifelong Agent