A few days ago, a team of researchers published details of the first Rowhammer attack successfully mounted against the GDDR6 video memory of a GPU, specifically an NVIDIA A6000.
The technique, dubbed GPUHammer, makes it possible to flip individual bits in the GPU's DRAM, and altering just a single bit of a machine learning model's parameters can drastically degrade its accuracy. These bit flips allow a malicious GPU user to tamper with another user's GPU data in shared, time-sliced environments.
Until now, applying Rowhammer to video memory was considered impractical because of several technical limitations. The physical layout of memory cells in GDDR chips is difficult to map, access latencies are up to four times higher than in conventional DRAM, and refresh rates are significantly higher. On top of this, GDDR chips include proprietary protection mechanisms against premature charge loss, and reverse engineering them required specialized equipment.
To overcome these obstacles, the researchers developed a new reverse engineering technique targeting GDDR DRAM. Using low-level CUDA code, they carried out the attack with specific optimizations that intensified accesses to certain memory cells, creating conditions conducive to bit flips. The key to success was highly organized parallel computation, which amplified the pressure on adjacent cells.
How does the attack work?
The attack exploits a physical weakness in DRAM, where intensive access to a memory row (known as “hammering”) can induce alterations in adjacent rows. Although this vulnerability was identified in 2014 and has been extensively studied in CPU DDR memory, porting it to GPUs had until now been a challenge due to:
- The high access latency of GDDR6 (up to 4 times higher than DDR4).
- The complexity of the physical memory mapping.
- The presence of proprietary and poorly documented mitigations, such as TRR (Target Row Refresh).
Rowhammer is a hardware vulnerability in which rapidly activating one row of memory introduces bit flips in adjacent rows. Since 2014, this vulnerability has been widely studied in CPUs and CPU-based memory such as DDR3, DDR4, and LPDDR4. However, as critical AI and machine learning workloads now run on discrete GPUs in the cloud, assessing the vulnerability of GPU memory to Rowhammer attacks is critical.
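The disturbance idea described above can be illustrated with a deliberately simplified toy model. This is only a sketch: the class `ToyDramBank`, its `flip_threshold` parameter, and the choice to flip the lowest bit of each neighbour are all hypothetical and do not reflect real GDDR6 behaviour or the researchers' code. It only shows why the number of activations reached between two refreshes matters.

```python
from dataclasses import dataclass, field


@dataclass
class ToyDramBank:
    """Toy model of one DRAM bank (hypothetical, not real hardware behaviour)."""

    num_rows: int
    # Hypothetical number of activations within one refresh window
    # that is enough to disturb the neighbouring rows.
    flip_threshold: int
    rows: list = field(init=False)
    activations: list = field(init=False)

    def __post_init__(self):
        self.rows = [0xFF] * self.num_rows       # one byte of data per row
        self.activations = [0] * self.num_rows   # activations since last refresh

    def activate(self, row: int) -> None:
        """Open a row; at the threshold, flip one bit in each adjacent row."""
        self.activations[row] += 1
        if self.activations[row] == self.flip_threshold:
            for victim in (row - 1, row + 1):
                if 0 <= victim < self.num_rows:
                    self.rows[victim] ^= 0x01

    def refresh(self) -> None:
        """A refresh recharges the cells, so the counters start over."""
        self.activations = [0] * self.num_rows


# Hammering row 3 without interruption corrupts rows 2 and 4.
bank = ToyDramBank(num_rows=8, flip_threshold=1000)
for _ in range(1000):
    bank.activate(3)
print([hex(r) for r in bank.rows])
```

In this toy model, calling `refresh()` before the threshold is reached prevents the flips, which mirrors why higher refresh rates and slower accesses make the attack harder on GPUs.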
Despite these obstacles, the researchers managed to reverse engineer the virtual-to-physical memory mapping in CUDA. They developed a method to identify specific DRAM banks and optimized parallel access using multiple threads and warps, maximizing the hammering rate without adding latency.
The proof of concept showed how a single bit flip in deep neural network (DNN) model weights, specifically in FP16 exponents, can degrade the top-1 accuracy of image classification models on ImageNet from 80% to 0.1%. This finding is alarming for data centers and cloud services running AI workloads in shared environments with GPUs.
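To see why a single flipped exponent bit is so damaging, here is a minimal sketch using Python's standard `struct` module, which supports the IEEE 754 half-precision format through the `e` format character. The helper `flip_bit_fp16` and the example weight of 0.5 are illustrative assumptions, not taken from the researchers' work. In FP16, bit 15 is the sign, bits 14 to 10 are the exponent, and bits 9 to 0 are the mantissa.

```python
import struct


def flip_bit_fp16(value: float, bit: int) -> float:
    """Return `value` (stored as FP16) with one bit of its encoding flipped."""
    # Pack as half precision, then reinterpret the two bytes as an integer.
    (raw,) = struct.unpack("<H", struct.pack("<e", value))
    flipped = raw ^ (1 << bit)
    # Reinterpret the modified bits as a half-precision float again.
    (result,) = struct.unpack("<e", struct.pack("<H", flipped))
    return result


weight = 0.5  # hypothetical model weight, encoded as 0x3800 in FP16

# Flipping the lowest mantissa bit barely changes the value.
print(flip_bit_fp16(weight, 0))   # 0.50048828125

# Flipping the most significant exponent bit (bit 14) changes it by orders of magnitude.
print(flip_bit_fp16(weight, 14))  # 32768.0
```

A weight that jumps from 0.5 to 32768 dominates every computation it takes part in, which is consistent with accuracy collapsing after a single flip.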
Mitigations and limitations
NVIDIA has confirmed the vulnerability and recommends enabling ECC (Error-Correcting Code) support using the command nvidia-smi -e 1. Although this measure can correct single-bit errors, it comes with a performance loss of up to 10% and a 6.25% reduction in available memory. It also doesn't protect against future attacks involving multiple bit flips.
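As a rough illustration of what the memory overhead means in practice, the following sketch applies the 6.25% figure to a card with 48 GB of memory, which is the capacity of the A6000. The function name `memory_with_ecc` is hypothetical, and the result is simple arithmetic rather than a measured value.

```python
def memory_with_ecc(total_gb: float, ecc_overhead: float = 0.0625) -> float:
    """Estimate usable memory once ECC reserves a fraction of the capacity."""
    return total_gb * (1.0 - ecc_overhead)


# Assumed 48 GB card: 6.25% (one sixteenth) is reserved for ECC.
print(memory_with_ecc(48))  # 45.0
```

In other words, enabling ECC on such a card would leave about 45 GB for workloads, before taking the performance cost into account.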
We confirmed Rowhammer bit flips on NVIDIA A6000 GPUs with GDDR6 memory. Other GDDR6 GPUs, such as the RTX 3080, did not exhibit bit flips in our testing, possibly due to variations in DRAM vendor, chip characteristics, or operating conditions such as temperature. We also did not observe any bit flips on an A100 GPU with HBM memory.
The team highlights that GPUHammer has so far been verified only on the A6000 GPU with GDDR6, and not on models like the A100 (HBM) or RTX 3080. However, since the attack can be extended, other researchers are encouraged to replicate and expand the analysis on different GPU architectures and models.
Finally, if you are interested in learning more about it, you can check the details in the following link