DDR5 Spec Update Has All It Needs to End Rowhammer: Will It?

Note: This article makes heavy use of Rowhammer terminology, which may be challenging for non-experts. For assistance, refer to my Rowhammer terminology cheat sheet and watch my video presentation on DRAM and Rowhammer. You can also explore other online resources, such as Prof. Mutlu's lectures.


An updated DDR5 spec (JESD79-5C) is out, and it includes a brand-new chapter -- Chapter 16: "DDR5 Per Row Activation Counting (PRAC)". PRAC introduces two key mechanisms for comprehensive Rowhammer defenses: an Activation Counter for every DRAM row and a mechanism that triggers when an Activation Counter reaches a specific threshold. This second mechanism lets the DRAM pause the memory controller from issuing new commands, giving the DRAM time to refresh potential victim rows.

In the words of a DRAM industry veteran who will remain nameless, PRAC is the biggest change to DRAM in decades. 😁 Thus, I thought I should write up a brief article summarizing the change and its potential to solve Rowhammer once and for all.


Background

As many researchers have shown, current in-DRAM Rowhammer defenses, also known as Target Row Refresh (TRR), have significant shortcomings. There are two reasons for this. First, these defenses attempt to track aggressor rows with a small set of counters, and sophisticated Rowhammer attack patterns can overwhelm the tracking ability of these counters. Second, even when an aggressor row is correctly identified, the DRAM must find time to refresh its corresponding victim rows. Unfortunately, DRAM protocols are synchronous, so the memory controller alone dictates how time is spent. DRAM protocols have never adopted a provision that allows the DRAM itself to control how time is allocated or, at the very least, to instruct the memory controller to pause issuing new commands so that refreshes have time to complete.
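To make the first shortcoming concrete, here is a minimal toy sketch. It is not how any vendor's TRR works (those designs are proprietary); the function name track, the evict-the-coldest policy, and the row labels are all assumptions I made up for illustration. It only shows how a tracker with fewer counters than aggressor rows can lose sight of a heavily hammered row.

```python
def track(pattern, num_counters):
    """Feed activations through a tracker with a fixed number of counters.

    Hypothetical policy: when the table is full, evict the entry with the
    lowest count and start the newcomer at 1. Returns the highest count the
    tracker ever observed for any row.
    """
    table = {}
    peak = 0
    for row in pattern:
        if row in table:
            table[row] += 1
        elif len(table) < num_counters:
            table[row] = 1
        else:
            coldest = min(table, key=table.get)  # lowest tracked count
            del table[coldest]
            table[row] = 1
        peak = max(peak, table[row])
    return peak


# Two aggressors fit in two counters, so the tracker sees every activation.
two_sided = track(["A", "B"] * 300, num_counters=2)

# Three aggressors keep evicting one another; no tracked count ever grows,
# even though each row is activated 300 times.
three_sided = track(["A", "B", "C"] * 300, num_counters=2)
```

A per-row counter, by construction, has no eviction step to exploit, which is the point of the next paragraph.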

PRAC solves both these problems. Each row in DRAM is equipped with Activation Counter bits that track the number of activations a row receives. When this counter reaches a threshold, known as the Rowhammer threshold, the DRAM will internally attempt to find time to refresh its corresponding victim rows. However, if the DRAM cannot find enough time to refresh the victim rows, it can leverage a back-off protocol based on the ALERTn signal. The spec deliberately avoids being prescriptive about PRAC implementation and configuration details.

Instead, I will discuss Panopticon (https://stefan.t8k2.com/publications/dramsec/2021/panopticon.pdf), a research paper that introduced these exact two mechanisms and influenced the design of PRAC. Panopticon was published at DRAMSec in 2021 and is openly available. You can also watch a 15-minute presentation or experiment with the code.


High-Level Overview of Per Row Activation Counters in Panopticon

Each DRAM row is equipped with its own counter. When a counter reaches the Rowhammer threshold, a signal is sent to a service queue to enqueue the row address. Once enqueued, Panopticon must refresh potential victim rows in a timely manner to avoid the possibility of Rowhammer bit flips. One option is to provide extra time for mitigations during regular background refresh operations. With this design, Panopticon can service the queue when it receives a REF command (each tREFI). However, should the DRAM have no extra time or should the queue be full, the DRAM must find a way to signal the memory controller that it needs time to perform the Rowhammer remedies. Unfortunately, DRAM protocols do not specify a way for the DRAM to ask for free time.
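The flow described above can be sketched as a small toy model. This is my own simplification under stated assumptions, not the paper's code: the class name ToyPanopticonBank, the modulo-based threshold check, the two-neighbor victim set, and all parameter values are hypothetical, and real row adjacency does not have to follow row numbers.

```python
from collections import deque


class ToyPanopticonBank:
    """Toy model of one bank: a counter per row plus a service queue."""

    def __init__(self, num_rows, threshold, queue_capacity):
        self.counters = [0] * num_rows
        self.threshold = threshold            # hypothetical Rowhammer threshold
        self.queue_capacity = queue_capacity  # hypothetical queue size
        self.service_queue = deque()
        self.refreshed = []                   # log of refreshed victim rows

    def activate(self, row):
        """Count one ACT. Returns True when the queue is full, meaning the
        DRAM would have to ask the memory controller for time."""
        self.counters[row] += 1
        if self.counters[row] % self.threshold == 0:
            self.service_queue.append(row)
        return len(self.service_queue) >= self.queue_capacity

    def on_ref(self, budget=1):
        """Service up to `budget` queued aggressors during a REF command."""
        for _ in range(min(budget, len(self.service_queue))):
            aggressor = self.service_queue.popleft()
            # Assumption: victims are the two numerically adjacent rows.
            for victim in (aggressor - 1, aggressor + 1):
                if 0 <= victim < len(self.counters):
                    self.refreshed.append(victim)


bank = ToyPanopticonBank(num_rows=8, threshold=4, queue_capacity=2)
needs_time = [bank.activate(3) for _ in range(4)]  # row 3 gets enqueued
bank.on_ref()                                      # rows 2 and 4 refreshed
```

In this sketch, the return value of activate is the stand-in for the missing protocol feature: the moment it turns True, the DRAM has no standard way to say so.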

Panopticon retrofits an existing signal in the DDR specification, called ALERTn, to effectively “trick” the memory controller into pausing new DDR commands. DRAM uses ALERTn to signal errors to the memory controller. Upon receiving this signal, the memory controller stops issuing new DRAM commands and instead re-issues the old memory access. By making use of ALERTn, Panopticon requires no modifications to any hardware other than the DRAM itself.
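Seen from the memory controller's side, the back-off behaves roughly like a retry loop. The sketch below is a heavily simplified illustration, not real DDR timing or signaling: ToyDram, run_controller, the alert_every parameter, and the "one slot per pause" assumption are all made up for this example.

```python
class ToyDram:
    """Asks for time after every `alert_every` accepted commands."""

    def __init__(self, alert_every):
        self.alert_every = alert_every  # hypothetical mitigation cadence
        self.accepted = 0
        self.mitigation_pending = False

    def accepts(self, command):
        """Returns False to model ALERTn being asserted for this command."""
        if self.mitigation_pending:
            # The DRAM uses this slot for victim refreshes instead of
            # executing the command.
            self.mitigation_pending = False
            return False
        self.accepted += 1
        if self.accepted % self.alert_every == 0:
            self.mitigation_pending = True
        return True


def run_controller(commands, dram):
    """Issue commands in order; on an alert, pause and re-issue the same one."""
    trace = []
    for command in commands:
        while not dram.accepts(command):
            trace.append(("PAUSE", command))
        trace.append(("DONE", command))
    return trace


trace = run_controller(["ACT a", "ACT b", "ACT c"], ToyDram(alert_every=2))
```

The controller needs no notion of Rowhammer here. It only needs to honor the alert and retry, which is why the retrofit leaves the rest of the hardware untouched.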

Panopticon makes several key contributions toward implementing per-row activation counters efficiently inside the DRAM itself.

  • Counter Mats. Panopticon leverages an open-space design to place the counter mats in a staggered pattern.
  • Incrementer. The incrementer makes use of the read and writeback cycle inherent in DRAM row activation to perform its logic.
  • Service Queue. When a high-order bit of a counter toggles, Panopticon sends a signal to the refresh logic to enqueue the row address in a service queue. Once enqueued, a row must be serviced in a timely manner. Servicing a row requires refreshing multiple corresponding victim rows.
  • Threshold Bit rather than Threshold Value. Panopticon does not maintain a Rowhammer threshold value, but a threshold bit. Whenever this bit is toggled during a counter increment, Panopticon enqueues the row address into a service queue.
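The threshold-bit idea from the last bullet can be illustrated in a few lines. This is a sketch under my own assumptions: the constant THRESHOLD_BIT, the 8-bit counter width, and the function name increment are hypothetical and are not taken from the paper or the spec.

```python
THRESHOLD_BIT = 3   # hypothetical: bit 3 flips once every 8 activations
COUNTER_WIDTH = 8   # hypothetical counter width in bits


def increment(counter, threshold_bit=THRESHOLD_BIT, width=COUNTER_WIDTH):
    """Increment a free-running counter and report whether the threshold
    bit toggled, which is the condition for enqueueing the row."""
    new_value = (counter + 1) & ((1 << width) - 1)  # wrap at the counter width
    toggled = bool((counter ^ new_value) & (1 << threshold_bit))
    return new_value, toggled


counter = 0
hits = []
for activation in range(1, 17):
    counter, toggled = increment(counter)
    if toggled:
        hits.append(activation)
# hits holds the activation numbers at which the row would be enqueued
```

In this sketch, the check is a single XOR and mask on the value the incrementer already holds, so no comparator against a stored threshold value is needed, and the counter can keep running after it wraps.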


Why Does PRAC Have the Potential to Put Rowhammer to Rest?

PRAC has the potential to make two strong guarantees. First, PRAC can guarantee that no aggressor row can escape tracking. For the first time, DRAM has a mechanism to track every single activation of every single row. Second, PRAC can guarantee that every victim row is refreshed in a timely manner. This is because the DRAM can now signal the memory controller to pause issuing new commands whenever it needs time to refresh potential victim rows.

A PRAC implementation that makes these two guarantees could eliminate Rowhammer once and for all. This is a huge step forward for the DRAM industry and for the security community. Kudos to them!

However, the devil is in the details. First, a small pet peeve of mine: the spec is written in the typical ambiguous language that has permeated all of JEDEC's DDR specs. For example, the per-row counter stores a count associated with the number of activations received by a row. What does associated mean? Does it mean the counter stores the actual count, or some other value that is somehow vaguely "associated" with the actual count? Ugh!

Second, there is a long list of potential pitfalls that could make PRAC stop short of delivering on its potential. Here's a brief laundry list:

  • Correct configuration. The spec lacks details on how PRAC is configured. For example, what if the Rowhammer threshold is set incorrectly?
  • Counter reset. The spec indicates that the per-row counters are periodically reset. Unfortunately, deciding when to reset a counter value is fraught with peril. For example, what if the counter is reset too early?
  • Accounting for fudge factors. The spec does not guarantee that an aggressor row will stop being activated even after its counter has reached the Rowhammer threshold. This is typically addressed through a fudge factor -- the threshold is conservatively set lower than necessary to accommodate this extra time. Other similar factors exist. What if they are not properly accounted for?
  • RowPress. Counting row activations is not effective against RowPress. Perhaps that's what the spec refers to when it says that count values are associated with the number of activations rather than being the actual number of activations.
  • Blast Radius. Since an aggressor row can affect distant victim rows, PRAC must refresh these distant victims as well. This is challenging to handle correctly.

I will leave a more thorough analysis of these pitfalls for a future article.
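To give a flavor of two of these pitfalls, the fudge factor and the blast radius, here is a small arithmetic sketch. All numbers and names (configured_threshold, victim_rows, the threshold of 1000, the slack of 100) are hypothetical, and the sketch assumes that row numbers reflect physical adjacency, which real DRAM does not guarantee.

```python
def configured_threshold(device_threshold, max_acts_before_mitigation):
    """Threshold to program so that (programmed threshold + activations that
    can still arrive before mitigation) stays within the device's tolerance."""
    programmed = device_threshold - max_acts_before_mitigation
    if programmed <= 0:
        raise ValueError("slack exceeds the device's Rowhammer threshold")
    return programmed


def victim_rows(aggressor, blast_radius, num_rows):
    """Rows within `blast_radius` of the aggressor, clipped to the bank."""
    victims = []
    for distance in range(1, blast_radius + 1):
        for row in (aggressor - distance, aggressor + distance):
            if 0 <= row < num_rows:
                victims.append(row)
    return victims


# Hypothetical device: bits may flip after 1000 activations, and up to 100
# more activations can arrive before the mitigation actually runs.
threshold = configured_threshold(1000, 100)

# With a blast radius of 2, one aggressor implies up to four refreshes.
victims = victim_rows(aggressor=5, blast_radius=2, num_rows=8)
```

Getting either input wrong breaks the guarantee silently: underestimating the slack lets an aggressor exceed the real threshold, and underestimating the blast radius leaves distant victims unrefreshed.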

For now, I am excited to see the DRAM industry implement PRAC and to see the first DDR5 devices shipped with it. Despite my complaints, with PRAC, the DRAM industry has taken a huge step forward in addressing Rowhammer. I am excited to see what the future holds.


Thanks to Alec Wolman for reading drafts of this document.
April 17th, 2024