DRAM Modeling

August 20, 2018

Can queuing models be used to model main memory (memory controller and DRAM)? Why does DRAM latency increase with utilized bandwidth? What is the correct way to model main memory? To answer these questions, this post uses an example from everyday life.

Consider a cashier at a store. Our aim, as store manager, is to process as many customers as possible, i.e., keep the cashier as busy as possible. Assume that the cashier has a service rate (customers per unit time), $\mu$, and that customers arrive with an average arrival rate, $\lambda$. Assume that the customer arrivals are random, i.e., they form a Poisson process. In reality, customer arrivals might be bursty, e.g., Pareto distributed. Also assume that the maximum number of customers in the queue, $L$, is finite, i.e., the queue cannot grow indefinitely. This assumption reflects the fact that on-chip buffers have finite capacity.
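For concreteness, here is a minimal sketch, in Python, of how such arrivals can be generated; the function name is mine, not from the post. For a Poisson process with rate $\lambda$, the inter-arrival times are independent exponentials with mean $\frac{1}{\lambda}$.

```python
import random

def poisson_arrivals(lam, n, seed=0):
    """Arrival timestamps of a Poisson process with rate lam.

    Inter-arrival times of a Poisson process are i.i.d. exponential
    with mean 1/lam, which is what expovariate() draws.
    """
    rng = random.Random(seed)
    t, times = 0.0, []
    for _ in range(n):
        t += rng.expovariate(lam)  # exponential inter-arrival gap
        times.append(t)
    return times
```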

Given a service rate, $\mu$, and maximum queue length, $L$, we want to understand the relationship between the arrival rate $\lambda$, cashier utilization (bandwidth), and average customer wait time (latency). Now consider three cases.

Case 1

In this case, assume that the cashier processes the customers in FIFO (First In First Out) order. Also assume that the service time is the same for all customers.
[Interactive controls: Service Time ($s = \frac{1}{\mu}$), Queue Length ($L$)]
The plot on the right shows the cashier utilization (bandwidth) as a function of $\frac{\lambda}{\mu}$. As the arrival rate increases, the cashier utilization increases. Once the ratio $\frac{\lambda}{\mu}$ exceeds 1, i.e., the arrival rate exceeds the service rate, there is at least one customer in the queue at all times, so the cashier is always busy and his utilization is 100%. Naturally, when the arrival rate exceeds the service rate, the queue (on-chip buffer) fills up.
The plot on the right shows the average customer latency as a function of cashier utilization (bandwidth). The average latency seen by a customer is fairly small until utilization approaches 100%. We observed in the plot above that utilization approaches 100% when the arrival rate exceeds the service rate and the queue fills up. At that point, the latency seen by every customer is $L \times s$.
The plot on the right shows the average customer latency as a function of $\frac{\lambda}{\mu}$. As above, the average latency is fairly small until the ratio approaches 1. As the ratio exceeds 1, the queue fills up and the average latency approaches $L \times s$. We measure latency from the time the customer enters the queue, so it can never exceed $L \times s$.
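This behavior is easy to reproduce with a small simulation. Below is a minimal sketch of case 1 under the stated assumptions: Poisson arrivals, deterministic service time $s = \frac{1}{\mu}$, FIFO order, and at most $L$ customers in the system, with arrivals that find it full turned away. The structure and names are mine, not the post's.

```python
import random

def simulate_fifo(lam, mu, L, n=100_000, seed=0):
    """Case 1: FIFO cashier, fixed service time, finite queue.

    Returns (utilization, average latency). Latency is measured from
    the time a customer enters the queue, so it is capped near L * s.
    """
    rng = random.Random(seed)
    s = 1.0 / mu
    t = 0.0            # arrival clock
    done = []          # departure times of customers still in the system
    busy = 0.0         # total time the cashier spends serving
    latencies = []
    for _ in range(n):
        t += rng.expovariate(lam)            # Poisson arrival
        done = [d for d in done if d > t]    # served customers leave
        if len(done) >= L:                   # queue full: customer turned away
            continue
        start = done[-1] if done else t      # wait for the customer ahead
        done.append(start + s)
        busy += s
        latencies.append(start + s - t)
    total_time = max(done[-1] if done else t, t)
    return busy / total_time, sum(latencies) / max(len(latencies), 1)
```

With $\mu = 1$ and $L = 16$, `simulate_fifo(0.8, 1.0, 16)` reports a utilization near 0.8 with a modest average latency, while `simulate_fifo(2.0, 1.0, 16)` saturates near 100% utilization with an average latency near $L \times s = 16$.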

Case 2

In this case, assume there are three kinds of customers, viz. those paying with cash, card, and cheque. As before, assume that the cashier processes the customers in FIFO order and that the service time is the same for all customers. However, there is an overhead in switching the payment processing system from cash to card, card to cheque, and so on.
[Interactive controls: Service Time ($s = \frac{1}{\mu}$), Switch Overhead ($o$), Queue Length ($L$), Customer Types ($t$)]
The plot on the right shows the cashier utilization (bandwidth) as a function of $\frac{\lambda}{\mu}$. As the arrival rate increases, the cashier utilization increases. In the best case, if there are, say, only cash-paying customers, there is no switching overhead and the cashier utilization reaches 100%. In the worst case, there is a switch between every pair of consecutive customers and the utilization falls to $\frac{s}{s + o}$. On average, the utilization lies between these two values. If $o = 0$ or $t = 1$, the cashier utilization reaches 100%. The service time seen by a customer is either $s$ or $s + o$, and on average somewhere in between, so the effective service rate $\mu_{\text{eff}}$ is less than $\mu$. As a result, the cashier utilization saturates when $\frac{\lambda}{\mu_{\text{eff}}} \rightarrow 1$, i.e., before $\frac{\lambda}{\mu} \rightarrow 1$.
The plot on the right shows the average customer latency as a function of cashier utilization (bandwidth). The average latency seen by a customer is fairly small until we reach the saturation bandwidth, at which point the queue fills up and the latency seen by every customer lies in $[L \times s,\ L \times (s+o)]$.
The plot on the right shows the average customer latency as a function of $\frac{\lambda}{\mu}$. The average latency is small until the ratio $\frac{\lambda}{\mu_{\text{eff}}}$ approaches 1. As it exceeds 1, the queue fills up and the average latency lies in $[L \times s,\ L \times (s+o)]$.
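Under the same assumptions, case 2 can be sketched by adding customer types and a switch overhead to the simulation; again this is my own sketch and the names are hypothetical. Only the service time $s$ is counted as useful work, which is why utilization tops out at $\frac{s}{s+o}$ when every customer triggers a switch.

```python
import random

def simulate_fifo_types(lam, mu, L, o, t=3, n=100_000, seed=0):
    """Case 2: FIFO cashier with t customer types and switch overhead o.

    Only the service time s counts toward utilization; the switch
    overhead o is wasted time, so utilization saturates below 100%.
    """
    rng = random.Random(seed)
    s = 1.0 / mu
    now = 0.0          # arrival clock
    done = []          # departure times of customers still in the system
    useful = 0.0       # time spent actually serving (excludes switches)
    latencies = []
    last_type = None
    for _ in range(n):
        now += rng.expovariate(lam)          # Poisson arrival
        done = [d for d in done if d > now]  # served customers leave
        if len(done) >= L:                   # queue full: customer turned away
            continue
        ctype = rng.randrange(t)
        cost = s + (o if ctype != last_type else 0.0)  # switch if type changes
        start = done[-1] if done else now
        done.append(start + cost)
        useful += s
        latencies.append(start + cost - now)
        last_type = ctype
    total_time = max(done[-1] if done else now, now)
    return useful / total_time, sum(latencies) / max(len(latencies), 1)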

Case 3

In case 2, the cashier processed customers in FIFO order and there was no way to reduce the switching overhead. However, if the store manager has the ability to choose any customer from the queue, he can reduce the switching overhead and increase the cashier's utilization. The manager can, say, first serve all cash-paying customers, switch the payment system once, serve all card-paying customers, switch once, serve all cheque-paying customers, and so on. As in case 2, assume there are three kinds of customers (cash, card, and cheque), the service time is the same for all customers, and there is an overhead in switching the payment processing system.
[Interactive controls: Service Time ($s = \frac{1}{\mu}$), Switch Overhead ($o$), Queue Length ($L$), Customer Types ($t$)]
The plot on the right shows the cashier utilization (bandwidth) as a function of $\frac{\lambda}{\mu}$. As the arrival rate increases, the cashier utilization increases. As the ratio exceeds 1, i.e., the arrival rate exceeds the service rate, the queue grows longer and the manager has more opportunity to choose customers intelligently, minimize the switching overhead, and increase the cashier's utilization. In such systems, long queues are therefore a requirement for achieving high utilization.
The plot on the right shows the average customer latency as a function of cashier utilization (bandwidth). The average latency seen by a customer is fairly small until utilization approaches 100%. We saw in the plot above that utilization approaches 100% only when the queue holds many customers and the switching overhead is minimal. At that point, the average latency approaches $L \times s$. In such a system, some customers see very short latencies and some see very long ones, because the manager can pick customers from anywhere in the queue.
The plot on the right shows the average customer latency as a function of $\frac{\lambda}{\mu}$. As above, the latency is fairly small until the ratio approaches 1. As it exceeds 1, the queue fills up, the switching overhead shrinks, and the average latency approaches $L \times s$.
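A sketch of case 3 needs an explicit queue, since the manager serves customers out of order. The toy policy below prefers a waiting customer whose type matches the last one served and falls back to the oldest waiting customer otherwise; it is my own minimal stand-in for the manager's scheduling, not a policy from the post.

```python
import random

def simulate_scheduled(lam, mu, L, o, t=3, n=100_000, seed=0):
    """Case 3: the manager serves any waiting customer, preferring one
    whose type matches the last served type to avoid the switch
    overhead o. Falls back to the oldest waiting customer otherwise.
    """
    rng = random.Random(seed)
    s = 1.0 / mu
    arrivals, now = [], 0.0             # (arrival time, type), in order
    for _ in range(n):
        now += rng.expovariate(lam)
        arrivals.append((now, rng.randrange(t)))
    queue, i = [], 0                    # waiting customers, next arrival
    clock = useful = 0.0
    latencies, last_type = [], None
    while i < len(arrivals) or queue:
        # Admit customers who have arrived by now, up to capacity L.
        while i < len(arrivals) and arrivals[i][0] <= clock and len(queue) < L:
            queue.append(arrivals[i]); i += 1
        if i < len(arrivals) and arrivals[i][0] <= clock:
            i += 1                      # queue full: customer turned away
            continue
        if not queue:                   # idle until the next arrival
            clock = arrivals[i][0]
            continue
        # Prefer a waiting customer of the current type (no switch).
        cust = next((c for c in queue if c[1] == last_type), queue[0])
        queue.remove(cust)
        clock += s + (0.0 if cust[1] == last_type else o)
        useful += s
        latencies.append(clock - cust[0])
        last_type = cust[1]
    return useful / clock, sum(latencies) / max(len(latencies), 1)
```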

Putting it all together

The plot below shows the average customer latency as a function of cashier utilization (bandwidth) for the three cases. In case 1, there is no switching overhead, so utilization can reach 100%. In case 2, there is switching overhead, so utilization cannot reach 100%. Case 3 has the same switching overhead as case 2, but its impact can be minimized by selecting customers intelligently, so its utilization is better than case 2's. At a given utilization, the average latency is smallest in case 1 and largest in case 2. Needless to say, the three cases are very distinct and should be modeled accordingly. It would be incorrect to approximate case 3 with a model of case 1.
[Interactive controls: Service Time ($s = \frac{1}{\mu}$), Switch Overhead ($o$), Queue Length ($L$), Customer Types ($t$)]
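Assuming the three sketches above are in scope, a quick sweep over the offered load reproduces this qualitative picture; the parameter values are illustrative choices, not from the post.

```python
# Compare the three cases at the same offered load (mu = 1, L = 16, o = 0.5).
for lam in (0.5, 0.9, 1.5, 3.0):
    u1, l1 = simulate_fifo(lam, mu=1.0, L=16)
    u2, l2 = simulate_fifo_types(lam, mu=1.0, L=16, o=0.5)
    u3, l3 = simulate_scheduled(lam, mu=1.0, L=16, o=0.5)
    print(f"lam/mu={lam:.1f}  utilization: {u1:.2f} {u2:.2f} {u3:.2f}  "
          f"avg latency: {l1:.1f} {l2:.1f} {l3:.1f}")
```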

Let us now translate the cashier analogy to a DRAM subsystem.

  • Cashier = DRAM bank
  • Store (queue) manager = Memory controller
  • Service Time = Read/write time (under row hit)
  • Switching overhead = Row miss (precharge and activate delay)
  • Queue = Memory controller queue
  • Customer types = Number of rows
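As a purely illustrative tie-in, the case-3 sketch can be read with DRAM names; all timing values below are assumed round numbers, not figures from the post.

```python
# Hypothetical DDR-like timings in nanoseconds (assumed values).
tCL, tRP, tRCD = 15.0, 15.0, 15.0
s = tCL          # "service time": column access on a row-buffer hit
o = tRP + tRCD   # "switch overhead": precharge + activate on a row miss
# Real banks have tens of thousands of rows; a small t here crudely
# stands in for the row-buffer locality of a real access stream.
util, lat = simulate_scheduled(lam=0.05, mu=1.0 / s, L=32, o=o, t=8)
print(f"utilization={util:.2f}, average latency={lat:.1f} ns")
```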

A DRAM subsystem is like case 3, albeit an order of magnitude more complicated. There are many banks, many queues, many sources of overhead (e.g., refresh, bus read/write turnarounds, activate windows, row cycle time, etc.) and many constraints (e.g., request priorities, maximum age, etc.). To summarize, a DRAM subsystem cannot be modeled or approximated with simple queuing models. Bruce Jacob's book [1] on the subject is aptly named "The Memory System: You Can't Avoid It, You Can't Ignore It, You Can't Fake It".

References

  1. Bruce Jacob, "The Memory System: You Can't Avoid It, You Can't Ignore It, You Can't Fake It", Synthesis Lectures on Computer Architecture, Morgan & Claypool, 2009.