Memory Access Patterns
######################

If you have not figured this out by now, computer architecture is tied to the
software it runs. Designers study programming patterns, and use that data to
find ways to improve performance. Let's look at some code, and see how we
might process data.

.. note::

    Some of this code is silly. We could just as easily do a few calculations
    and come up with the data we will uncover. However, building the code
    shown here is helpful in exploring aspects of modern architecture.

Processing Big Data
*******************

The rage today is working on "big data". We are not quite sure what that data
is, but we are told it is "BIG". In my career, I have processed a lot of "big
data", long before that term had even been coined. In my work as a
``Computational Fluid Dynamicist``, I routinely worked on piles of data,
gigabytes big. So big, in fact, that we had a hard time moving it back and
forth into the machine to do the calculations.

Let's build a simple model of such work. Our model will be like all models, a
small version of the real thing. In these experiments, the actual processing
is not important. The fetching of data from some storage area for processing
is what matters. Therefore, we will set up our model data storage areas as
arrays of data items, say 64 bits wide. Our arrays can be any size, the
"bigger" the better, but we will keep things under control for our model.

We will need three different storage areas for this work:

* Registers - small
* Memory - fairly big, but slower
* Disk - huge, but very slow

Just for fun, let's size these data areas based on the number of bits it
takes to address all of them. (Hey, we are studying computer architecture,
and the address bus is an important part of this!)

Data Stores
===========

Here is a start on setting up the system's storage areas:

.. literalinclude:: code/caches.cpp
    :lines: 1-16

I wonder how big they are!

.. literalinclude:: code/caches.cpp
    :lines: 75-78,135

Here is the output:

.. code-block:: text

    Register Size: 64
    Memory Size  : 65536
    Data Size    : 262144

(Hey, these are not really big! They are tiny by today's standards, but back
in 1975, when I started building my own computers, this was what we had
available.)
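If you are not sure those sizes follow from the address-bus view, here is a
minimal standalone sketch (not part of ``caches.cpp``; the names here are my
own) that derives each capacity from the width of the address needed to reach
every item in the store:

.. code-block:: cpp

    #include <cstdint>
    #include <iostream>

    // A store addressed by n bits can hold 2^n items.
    static uint64_t capacity(unsigned address_bits) {
        return uint64_t(1) << address_bits;
    }

    int main() {
        std::cout << "Register Size: " << capacity(6)  << std::endl;  //  6 bits ->     64
        std::cout << "Memory Size  : " << capacity(16) << std::endl;  // 16 bits ->  65536
        std::cout << "Data Size    : " << capacity(18) << std::endl;  // 18 bits -> 262144
        return 0;
    }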
Access Patterns
***************

In reviewing Gauss's Busy Work code, we see that we have several access
patterns to look at. Actually, there are only two. One involves strictly
sequential access to memory, and the other is fairly random access. We hop
all over the data store fetching our data.

If we want to model a real computer in these experiments, we need to consider
how the different devices work. Each one has some total capacity, that much
should be obvious. But there are two other characteristics we need to
consider as well:

Access Time
===========

Every device takes some time to get ready. We deliver an address to the
store, and we wait for the result. That delay is called the ``access time``.

For a simple device, that time may be all we need. But other devices work
differently. For example, if our data store is a rotating hard disk, the
address reaches the device, and we need to wait while the read-write head
positions to the right track, and then wait for the disk to spin around to
the right spot on that track. We can read our data at that moment.

Stream Time
===========

If we access totally random data stored on this disk, the access time is the
delay we will experience (although it will not be a fixed time, since it
depends on where the next data item lives).

However, if the next fetch is located next to the first fetch, we can do
better. In fact, we can "stream" data from sequential storage locations much
faster than what we will see for random accesses. The reads happen as the
disk spins; we never move the read-write head. This faster access rate can be
measured by a different term, which we will call the ``stream_time``, a
measure of the time between sequential fetches after the first fetch is
ready.

Modeling a Storage Device
=========================

Let's create a simple data structure (a C++ ``struct``) to store all of this
information about a store:

.. literalinclude:: code/caches.cpp
    :lines: 18-27

.. note::

    In case you have not seen this feature of C++, a ``struct`` is basically
    a class with no methods. In fact, some programmers never use this old
    pattern, inherited from "C", and instead set up classes with no methods.
    Internally, they are the same thing.

After doing some intense "Googling", I came up with this set of values for
the delays in accessing our three data storage areas:

.. literalinclude:: code/caches.cpp
    :lines: 29-35

Here is the code needed to set up the first two of our data stores using
these structures:

.. literalinclude:: code/caches.cpp
    :lines: 102-117

Utility Routines
================

To assist in our work, we need a utility routine to initialize a data area
with a sequence of numbers. Here is code to do that:

.. literalinclude:: code/caches.cpp
    :lines: 37-42

We will hand this routine a management structure, and it will initialize that
array for us.

Modeling Memory Access
**********************

We will skip the first version of Gauss's Busy Work code we showed earlier,
and start off with the one-dimensional array version. The heart of this code
was a simple loop that accessed each data item in order to do the work.

Experiment 1: Modeling Random Access
====================================

If our data fetches are random, all fetches will result in the delay
specified by the ``access_time`` variable. To model this, we set up a simple
loop that looks like this:

.. literalinclude:: code/caches.cpp
    :lines: 88-98

We do not really need to fetch the data; each fetch will happen in the time
specified by the access time, so we just calculate that time here to get a
baseline number for reference. The code examines the time it would take to
fetch each item from each data store. Here is the output:

.. code-block:: text

    Time to process data in registers: 262144
    Time to process data in memory   : 2621440
    Time to process data on disk     : 26214400

Experiment 2: Moving from Memory to Registers
=============================================

In this next experiment, we need to work harder. We want to process all of
the numbers stored in memory, using the registers. Since we have more memory
than registers, we need to pull the data in from the memory in blocks. Once
the data is in our registers, we can do the work.

Our program code is unaware of all of this; it is simply running through a
loop adding up numbers. In doing that, it generates a sequence of addresses
that head off to the controller. In this experiment, the raw data is in
memory, so our addresses represent locations in memory. Obviously, those
addresses will be beyond anything available for the registers, so we need to
translate the addresses. Here is the idea. Break each address up into two
parts:

* offset - the low bits (as many bits as it takes to address the registers)
* tag - all of the other bits

If you view the memory area as a set of blocks, each one exactly the same
size as our register area, then a memory address ends up looking like this
(see the sketch after this list):

* offset - index into any block
* tag - block number
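To make that split concrete, here is a minimal sketch (my own illustration,
not the lecture code) of pulling the tag and offset out of an address when
the register store holds 64 items, which gives us a 6-bit offset:

.. code-block:: cpp

    #include <cstdint>
    #include <iostream>

    const int OFFSET_BITS = 6;                        // 64 registers -> 6-bit offset
    const uint64_t OFFSET_MASK = (1 << OFFSET_BITS) - 1;

    int main() {
        uint64_t address = 65345;                     // some memory address
        uint64_t offset  = address & OFFSET_MASK;     // low bits: index into the block
        uint64_t tag     = address >> OFFSET_BITS;    // high bits: block number
        std::cout << "tag = " << tag
                  << ", offset = " << offset << std::endl;
        // prints: tag = 1021, offset = 1
        // (block 1021 starts at memory address 1021 * 64 = 65344)
        return 0;
    }

That tag and block address will show up again in the program output later in
this experiment.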
We can watch the tag part of an address and check to see if that tag is
currently loaded in the registers. If so, we add in the number from the
register indicated by the ``offset`` part of the address. If the tag does not
match, we need to load that block into the register store.

In all of this, we will use the access time to measure our time. We will add
in the streaming improvement later.

Here is a routine to load a block:

.. literalinclude:: code/caches.cpp
    :lines: 44-71

The interesting part of this code is checking to see that the transfer will
work. If we step out of bounds in either block, we generate a fault message,
and exit our program.

Processing Loop
---------------

The processing loop is simple. We will not do any actual processing, just set
up the data fetches. We can figure out the result we will get easily. If we
are going to move n bytes from slow memory into faster memory, then process
the data from the faster memory, the total time to do the work is:

* time = n * access_time1 + n * access_time2

As an example, if we move 65536 bytes from memory, which has an access time
of 10 clocks, it will take 655360 clocks to complete the data transfers. If
the data must all be processed through the registers, which have an access
time of 1 clock tick, it will take 65536 clock ticks to do the work. The
total is 655360 + 65536 = 720896. Let's hope our code works properly!

.. note::

    This seems silly. The time it takes to do this processing is longer than
    just accessing the data directly from the slower of the two stores. In
    our example, we are modeling moving data from memory into registers,
    where we will do the actual work. Registers are faster than the memory,
    but we have very few registers, and much more memory. The real story is a
    bit more complex; this is just a start.

Let's build some code and see if we get the right numbers:

.. literalinclude:: code/caches.cpp
    :lines: 100-134

Running this code gives a lot of output, most of which is generated by the
loads of 64 bytes at a time into the registers. The last few lines tell the
story:

.. code-block:: text

    loading tag 1020 MEM address = 65280
    loading tag 1021 MEM address = 65344
    loading tag 1022 MEM address = 65408
    loading tag 1023 MEM address = 65472
    EX2: process time: 720896

Hey, we got the right number. That was sure a lot of work to generate a
number we could just figure out by hand. But, hey, we are programmers here;
this was much more interesting.

The Real Story
**************

Storage devices do not work in such a simple way. The ``access time`` is the
time it takes to get that first byte ready to move across a bus to the
destination. If our access into this device was totally random, then that
access time might be the number we need to calculate throughput. However, if
we fetch data sequentially, the device can do much better. It can stream data
at a much higher rate.

RAM has its own clock. A typical memory module is clocked at a rate of
1.333 GHz, which works out to about half the speed of the processor.
Therefore, the burst transfer rate works out to one byte every two clock
ticks (twice as slow as transfers inside the chip). Of course, this only
works if the data access is sequential. Introduce random access, and we are
back to using the latency numbers.

Here are some numbers we can work with:

* Registers:

  * Latency: 1 clock
  * Stream rate: 1 clock tick (no improvement here)

* Memory (DDR3-1333):

  * Latency: 13 clocks
  * Stream rate: 2 clock ticks

* Disk (7200rpm):

  * Latency: 100 clocks
  * Stream rate: 23 clock ticks

Let's see what this does to our code.

Adding Stream Rate
******************

The stream rate improves performance in our test. When we move a block of
data, say n bytes large, the total time needed works out as follows:

* time = 1 * access_time + (n - 1) * stream_time

If the stream time is the same as the access time, the time is just
``n * access_time``, which is what we used in the previous example code.

.. note::

    There is no new code for this experiment. The stream delay calculation is
    already in the code; I just made it equal to the access time for the last
    experiment! Lazy me!
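Before running the modified program, we can predict the new number with a
quick standalone sketch (my own check, using the access and stream times the
experiments have been using: memory at 10 and 2, registers at 1 and 1):

.. code-block:: cpp

    #include <cstdint>
    #include <iostream>

    // Time to move an n-item block: the full access delay for the first
    // item, then the faster stream delay for each remaining item.
    static uint64_t block_time(uint64_t n, uint64_t access, uint64_t stream) {
        return access + (n - 1) * stream;
    }

    int main() {
        const uint64_t items  = 65536;          // data items in memory
        const uint64_t block  = 64;             // items per register-sized block
        const uint64_t blocks = items / block;  // 1024 block loads

        uint64_t load_time = blocks * block_time(block, 10, 2);  // from memory
        uint64_t work_time = items * 1;         // registers: no streaming gain

        std::cout << "load: "   << load_time
                  << " work: "  << work_time
                  << " total: " << (load_time + work_time) << std::endl;
        // prints: load: 139264 work: 65536 total: 204800
        return 0;
    }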
Now, considering this improvement, our total processing time looks like this:

.. code-block:: text

    EX3: process time: 204800

Looks like we improved things by more than a factor of three. Not bad!

Non-Sequential Processing
*************************

Unfortunately, programmers do not always do the "right thing". Let's consider
a two-dimensional array that we need to process:

.. literalinclude:: code/stride1.cpp

That code looks a bit strange. I am building up ``sum1`` by placing a one in
each memory location, then adding it into the sum. This is just a count of
the cells. Look at that silly code that calculates ``sum2``. What in the
world is going on there?

.. note::

    I am cheating here; I am over-indexing the column index on purpose. I am
    also always aiming the processor at column zero. The effect is to access
    each item in the array using the address I calculate from the row number
    and the column number. This is exactly what the compiler does, after
    laying out your array in the correct form. The accesses always use an
    address to fetch data inside the processor! If you are not convinced that
    this will work, run the code!

Compare that to this code:

.. literalinclude:: code/stride2.cpp

Do you see the difference? The first example accesses the data column by
column within a single row, then moves on to the next row. Based on how an
array is stored in memory (row major), this is back to our sequential memory
access time. On the other hand, the second example accesses data row by row,
working down one column at a time. This is not going to be as efficient.

The reason why should be clear by now. If we are not going to fetch the next
byte in sequence from memory, we lose the advantage of that stream rate for
data transfer. Every fetch is going to incur the full access time delay.

The Final Code
**************

Just for reference, here is the complete program I used for this lecture:

.. literalinclude:: code/caches.cpp
    :linenos:
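If you want to see the stride effect in the timing model itself, here is one
last standalone sketch (my own illustration, reusing the memory access time
of 10 clocks and stream time of 2 from the experiments; the array size is
made up) that estimates the fetch time for both traversal orders:

.. code-block:: cpp

    #include <cstdint>
    #include <iostream>

    // Fetch time for walking a rows x cols array (row-major storage) in
    // storage order: the first fetch pays the access time, the rest stream.
    static uint64_t sequential_time(uint64_t rows, uint64_t cols,
                                    uint64_t access, uint64_t stream) {
        return access + (rows * cols - 1) * stream;
    }

    // Fetch time for walking the same array down the columns: every fetch
    // lands a full row away from the last one, so every fetch pays the
    // full access time and we never stream.
    static uint64_t strided_time(uint64_t rows, uint64_t cols,
                                 uint64_t access) {
        return rows * cols * access;
    }

    int main() {
        const uint64_t rows = 256, cols = 256;
        std::cout << "row by row      : "
                  << sequential_time(rows, cols, 10, 2) << std::endl;
        std::cout << "column by column: "
                  << strided_time(rows, cols, 10) << std::endl;
        // prints: row by row      : 131080
        //         column by column: 655360
        return 0;
    }

The exact numbers are not the point, and this simple model ignores the block
loading we built earlier. The five-to-one gap between the two traversal
orders is the story: stride matters.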