Can queuing models be used to model main memory (memory controller and DRAM)? Why does DRAM latency increase with utilized bandwidth? What is the correct way to model main memory? To answer these questions in this post, we will use an example from everyday life.

Consider a cashier at a store. Our aim, as store manager, is to
process as many customers as possible *i.e.* keep the cashier as busy
as possible. Assume that the cashier has a service rate (customers per
unit time), $\mu$, and that customers arrive with an average arrival rate,
$\lambda$. Assume that the customer arrivals are random *i.e.* we have
a Poisson process. In reality, customer arrivals might be bursty *i.e.*
Pareto distributed. Also assume that the maximum number of customers in
the queue, $L$, is finite *i.e.* the queue cannot grow indefinitely. This
assumption reflects the fact that on-chip buffers have finite capacity.

Given a service rate, $\mu$, and maximum queue length, $L$, we want to understand the relationship between arrival rate, $\lambda$, cashier utilization (bandwidth) and average customer wait time (latency). Now consider three cases.

In this case, assume that the cashier processes the customers in FIFO
(First In First Out) order. Also assume that the service time is the
same for all customers.

Service Time ($s = \frac{1}{\mu}$)

Queue Length ($L$)

The plot on the right shows the cashier utilization (bandwidth) as a
function of $\frac{\lambda}{\mu}$. As the arrival rate increases,
the cashier utilization increases. As soon as the fraction
$\frac{\lambda}{\mu}$ exceeds 1 *i.e.* the arrival rate exceeds
the service rate *i.e. * there is atleast 1 customer in the queue
at all times, the cashier is always busy and his utilization is 100%.
As is obvious, when the arrival rate exceeds the service rate, the queue
(on chip buffer) will get full.

The plot on the right shows the average customer latency as a function of
cashier utilization (bandwidth). The average latency seen by the customer
is fairly small until the utilization approaches 100%. We observed
in the plot above, that utilization approaches 100% when the arrival
rate exceeds the service rate and the queue will get full. At that time,
the latency seen by every customer will be $L \times s$.

The plot on the right shows the average customer latency as a function
of $\frac{\lambda}{\mu}$. As above, the average latency is fairly small
until the fraction approaches 1. As the fraction exceeds 1, the queue
gets full and the average latency approaches $L \times s$. We estimate
latency from the time the customer enters the queue and so it will never
exceed $L \times s$.

In this case, assume there are three kinds of customers viz. customers
paying with cash, card and cheque. As before, assume that the cashier
processes the customers in FIFO order and that the service time is the
same for all customers. However, there is an overhead in switching the
payment processing system from cash to card, card to cheque and so on.

Service Time ($s = \frac{1}{\mu}$)

Switch Overhead ($o$)

Queue Length ($L$)

Customer Types ($t$)

The plot on the right shows the cashier utilization (bandwidth) as a
function of $\frac{\lambda}{\mu}$. As the arrival rate increases, the
cashier utilization increases. In the best case if there are, say, only
cash paying customers, there will be no switching overhead and the cashier
utilization will reach 100%. In the worst case, there can be switching
overhead between every customer and the cashier utilization will fall
to $\frac{s}{s + o}$. On average, the utilization will be between these
two values. If $o=0$ or $t=1$, the cashier utilization will reach 100%.
The service time as seen by a customer is either $s$ or $s + o$ and on
average somewhere in between. Therefore, the effective service rate is
less than $\mu$. As a result, the cashier utilization saturates when
$\frac{\lambda}{\mu_{eff}} \rightarrow 1$ *i.e.*
before $\frac{\lambda}{\mu} \rightarrow 1$.

The plot on the right shows the average customer latency as a function of
cashier utilization (bandwidth). The average latency seen by the customer
is fairly small until we reach the saturation bandwidth at which point
the queue gets full and the latency seen by every customer
$\in [L \times s, L \times (s+o)]$.

The plot on the right shows the average customer latency as a function of
$\frac{\lambda}{\mu}$. The average latency is small until the fraction
$\frac{\lambda}{\mu_{eff}}$ approaches 1. As the fraction exceeds 1,
the queue gets full and the average latency $\in [L \times s, L \times
(s+o)]$.

In case 2, the cashier processed the customers in a FIFO order and
there was no way to reduce switching overhead. However, if the store
manager has the ability to choose any customer from the queue, he
can reduce the switching overhead and increase the cashier's utilization. The
manager can, say, first select all cash paying customers, switch the
payment system once, select all card paying customers, switch the
payment system once, select all cheque paying customers and so on.
As in case 2, assume there are three kinds of customers viz. customers
paying with cash, card and cheque, the service time is the same for all
customers and there is an overhead in switching the payment processing
system from cash to card, card to cheque and so on. However, the manager
can choose any customer from the queue to reduce switching overhead and
maximize the cashier's utilization.

Service Time ($s = \frac{1}{\mu}$)

Switch Overhead ($o$)

Queue Length ($L$)

Customer Types ($t$)

The plot on the right shows the cashier bandwidth (utilization) as a
function of $\frac{\lambda}{\mu}$. As the arrival rate increases, the
cashier utilization increases. As the fraction exceeds 1 *i.e.*
the arrival rate exceeds the service rate, the queue gets longer and
the manager has more opportunity to intelligently choose customers to
minimize switching overhead, and increase the cashier's utilization. In
such systems, long queues are, therefore, a requirement for achieving
high utilization.

The plot on the right shows the average customer latency as a function
of cashier utilization (bandwidth). The average latency as seen by the
customer is fairly small until the utilization approaches 100%. We saw
in the plot above that utilization approaches 100% only when the queue
has a lot of customers and the switching overhead is minimum. At that
time, the average latency approaches $L\times s$. In such a system,
some customers will see very short latencies and some will see very
high latencies. This is because the manager can choose customers from
anywhere in the queue.

The plot on the right shows the average customer latency as a function of
$\frac{\lambda}{\mu}$. As above, the latency is fairly small until the
fraction approaches 1. As the fraction exceeds 1, the queue gets full,
switching overhead reduces and the average latency approaches $L\times s$.

The plot below shows the average customer latency as a function of
cashier utilization (bandwidth) for the three cases,
case 1 in ,
case 2 in , and
case 3 in .
In case 1, there is no switching overhead and hence it can achieve 100%
utilization. In case 2, there is switching overhead and hence it cannot
achieve 100% utilization. In case 3, as in case 2, there is switching
overhead. However, its impact can be minimized, by selecting customers
intelligently, and hence its utilization is better than case 2.
At a given utilization, the average latency is smallest in case 1
and largest in case 2.
Needless to say, the three cases are very distinct and they should be
modeled accordingly. It would be incorrect to approximate case 3 with
a model of case 1.

Service Time ($s = \frac{1}{\mu}$)

Switch Overhead ($o$)

Queue Length ($L$)

Customer Types ($t$)

Let us now translate the cashier analogy to a DRAM subsystem.

- Cashier = DRAM bank
- Store (queue) manager = Memory controller
- Service Time = Read/write time (under row hit)
- Switching overhead = Row miss (precharge and activate delay)
- Queue = Memory controller queue
- Customer types = Number of rows

A DRAM subsystem is like case 3, albeit an order of magnitude more
complicated. There are many banks, many queues, many sources of overheads
(*e.g.* refresh, bus read/write turns, activate windows, row cycle time
etc.) and many constraints (e.g. request priorities, maximum age etc.).
To summarize, a DRAM subsystem cannot be modeled or approximated with
simple queuing models. Bruce Jacob's book [1] on the subject is
aptly named ''The Memory System: You Can't Avoid It, You Can't Ignore It,
You Can't Fake It''.