Memory Access Patterns

If you have not figured this out by now, computer architecture is tied to the software it runs. Designers study programming patterns, and use that data to find ways to improve performance.

Let’s look at some code, and see how we might process data.

Note

Some of this code is silly. We could just as easily do a few calculations and come up with the data we will uncover. However, building the code shown here is helpful in exploring aspects of modern architecture.

Processing Big Data

The rage today is working on “big data”. We are not quite sure what that data is, but we are told it is “BIG”.

In my career, I have processed a lot of “big data”, long before that term had even been mentioned. In my work as a Computational Fluid Dynamicist, I routinely worked on piles of data, Gigabytes big. So big, in fact, that we had a hard time moving it back and forth into the machine to do the calculations.

Let’s build a simple model of such work. Our model will be like all models, a small version of the real thing.

In these experiments, the actual processing is not important. The fetching of data from some storage area for processing is important. Therefore, we will set up our model data storage areas as arrays of data items, each one byte (8 bits) wide. Our arrays can be any size, the “bigger” the better, but we will keep things under control for our model.

We will need three different storage areas for this work:

  • Registers - small
  • Memory - fairly big, but slower
  • Disk - huge but very slow

Just for fun, let’s size these data areas based on the number of bits it takes to address all of them. (Hey, we are studying computer architecture, and the address bus is an important part of this!)

Data Stores

Here is a start on setting up the system’s storage areas:

#include <iostream>
#include <cstdint>
#include <string>

// set up data areas (each size is a power of two, set by the address width)
const uint64_t raddr_bits = 6;
const uint64_t reg_size = uint64_t{1} << raddr_bits;
const uint64_t maddr_bits = 16;
const uint64_t mem_size = uint64_t{1} << maddr_bits;
const uint64_t daddr_bits = 18;
const uint64_t data_size = uint64_t{1} << daddr_bits;

// byte arrays
uint8_t registers[reg_size];
uint8_t memory[mem_size];
uint8_t data[data_size];

I wonder how big they are!

int main( void ) {
    std::cout << "Register Size: " << reg_size << std::endl;
    std::cout << "Memory Size  : " << mem_size << std::endl;
    std::cout << "Data Size    : " << data_size << std::endl;
}

Here is the output:

Register Size: 64
Memory Size  : 65536
Data Size    : 262144

(Hey, these are not really big! They are tiny by today’s standards, but back in 1975, when I started building my own computers, this was what we had available.)

Access Patterns

In reviewing Gauss’s Busy Work code, we see that we have several access patterns to look at. Actually, there are only two. One involves strictly sequential access to memory, and the other is fairly random access. We hop all over the data store fetching our data.

If we want to model a real computer in these experiments, we need to consider how the different devices work. Each one has some total capacity, that much should be obvious. But there are two other characteristics we need to consider as well:

Access Time

Every device takes some time to get ready. We deliver an address to the store, and we wait for the result. The delay is called the access time.

For a simple device, that time may be all we need. But other devices work differently. For example, if our data store is a rotating hard disk, the address reaches the device, and we need to wait while the read-write head positions itself over the right track, and then wait for the disk to spin around to the right spot on that track. Only then can we read our data.

Stream Time

If we access totally random data stored on this disk, the access time is the delay we will experience (although it will not be a fixed time, since it depends on where the next data item lives). However, if the next fetch is located next to the first fetch, we can do better. In fact, we can “stream” data from sequential storage locations much faster than what we will see for random accesses. The reads happen as the disk spins, and we never move the read-write head.

This faster access rate can be measured by a different term. We will call it the stream_time, a measure of the time between sequential fetches after the first fetch is ready.

Modeling a Storage Device

Let’s create a simple data structure (a C++ struct) to store all of this information about a store:

// data area management structure
struct Store {
    std::string name;
    uint8_t *data;
    uint64_t addr_bits;
    uint64_t size;
    uint64_t start_address;
    uint8_t access_time;
    uint8_t stream_time;
};

Note

In case you have not seen this feature of C++, a struct is basically a class whose members are public by default. It is most often used, as it is here, to hold plain data with no methods. Some programmers never use this old pattern, inherited from “C”, and set up classes with no methods instead. Internally, they are the same thing.

After doing some intense “Googling”, I came up with this set of values for the delays in accessing our three data storage areas:

// access times
const uint8_t reg_access = 1;
const uint8_t reg_stream = 1;
const uint8_t mem_access = 10;
const uint8_t mem_stream = 2;
const uint8_t disk_access = 100;
const uint8_t disk_stream = 24;

Here is the code needed to set up the first two of our data stores using these structures:

    Store a1 = {
        "REG",
        registers,
        raddr_bits,
        reg_size,
        0,
        reg_access,
        reg_stream};
    Store a2 = {
        "MEM",
        memory,
        maddr_bits,
        mem_size,
        0,
        mem_access,
        mem_stream};

Utility Routines

To assist in our work, we need a utility routine to initialize a data area with a sequence of numbers. Here is code to do that:

// initialize an array with a sequence of numbers, starting at index n
void init( uint8_t *array, uint64_t size, uint64_t n) {
    for(uint64_t i = n; i < size; i++) {
        array[i] = static_cast<uint8_t>(i);  // byte values wrap around at 256
    }
}

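We hand this routine an array, its size, and a starting index n, and it fills the array from n onward with a sequence of numbers. Since each item is a single byte, the values wrap around after 255.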

Modeling Memory Access

We will skip the first version of Gauss’s Busy Work code we showed earlier, and start off with the one-dimensional array version. The heart of this code was a simple loop that accessed each data item in order to do the work.

Experiment 1: Modeling Random Access

If our data fetches are random, all fetches will result in the delay specified by the access_time variable. To model this, we set up a simple loop that looks like this:

    // experiment 1: process data in different areas:
    rtime = mtime = dtime = 0;
    for(uint64_t i=0; i < data_size; i++) {
        rtime += reg_access;
        mtime += mem_access;
        dtime += disk_access;
    }

    std::cout << "Time to process data in registers: " << rtime << std::endl; 
    std::cout << "Time to process data in memory   : " << mtime << std::endl; 
    std::cout << "Time to process data on disk     : " << dtime << std::endl; 

We do not really need to fetch the data. Each fetch will happen in the time specified by the access time, so we just calculate that time here to get a baseline number for reference. The code adds up the time it would take to fetch every item from each data store.

Here is the output:

Time to process data in registers: 262144
Time to process data in memory   : 2621440
Time to process data on disk     : 26214400

Experiment 2: Moving from Memory to Registers

In this next experiment, we need to work harder. We want to process all of the numbers stored in memory, using the registers. Since we have more memory than registers, we need to pull the data in from the memory in blocks. Once the data is in our registers, we can do the work.

Our program code is unaware of all of this. It is simply running through a loop adding up numbers. In doing that, it generates a sequence of addresses that head off to the controller.

In this experiment, the raw data is in memory, so our addresses represent locations in memory. Obviously those addresses will be beyond anything available for the registers, so we need to translate the addresses.

Here is the idea:

Break each address up into two parts:

  • offset: the low bits (as many as there are register address bits)
  • tag: all of the other bits

If you view the memory area as a set of blocks, each one exactly the same size as our register area, then a memory address ends up looking like this:

  • offset: index into any block
  • tag: block number

We can watch the tag part of an address and check to see if that tag is currently loaded in the registers. If so, we add in the number from the register indicated by the offset part. If the tag does not match, we need to load that block into the register store.

For now, we will charge the full access time for every byte we move. We will add in the streaming improvement later.

Here is a routine to load a block:

// general purpose load routine (from area2 to area1)
uint64_t load(
        Store area1, 
        Store area2,
        uint64_t block_size) {
    uint64_t time = 0;
    uint64_t a1_addr = area1.start_address;
    uint64_t a2_addr = area2.start_address;

    for(uint64_t i=0; i < block_size; i++) {
        // check for transfer faults (test the address this pass will touch)
        if(a1_addr + i >= area1.size) { //area 1 fault
            std::cout << area1.name << " fault" << std::endl;
            std::exit(1);
        }
        if(a2_addr + i >= area2.size) { //area 2 fault
            std::cout << area2.name << " fault" << std::endl;
            std::exit(1);
        }
        // no faults, do the transfer
        area1.data[a1_addr + i] = area2.data[a2_addr + i];
        if(i == 0) // access time delay
            time += area2.access_time;  // a2 is always slower
        else
            time += area2.stream_time;
    }
    return time;
}

The interesting part of this code is checking to see that the transfer will work. If we step out of bounds in either block, we generate a fault message, and exit our program.

Processing Loop

The processing loop is simple. We will not do any actual processing, just set up the data fetches. The time we expect to see is easy to work out by hand.

If we are going to process n bytes from slow memory into faster memory, then process the data from faster memory, the total time to do the work is:

  • n * access_time1 + n * access_time2

As an example, if we move 65536 bytes from memory, which has an access time of 10 clocks, it will take 655360 clocks to complete the data transfers.

If the data must all be processed through the registers, which have an access time of 1 clock tick, it will take 65536 clock ticks to do the work.

The total is 655360+65536 = 720896.

Let’s hope our code works properly!

Note

This seems silly. The time it takes to do this processing is longer than just accessing the data directly from the slower of the two stores. In our example, we are modeling moving data from memory into registers, where we will do the actual work. Registers are faster than memory, but we have very few registers and much more memory. The real story is a bit more complex. This is just a start.

Let’s build some code and see if we get the right numbers:

    // experiment 2: process data memory array through registers
    rtime = 0;
    Store a1 = {
        "REG",
        registers,
        raddr_bits,
        reg_size,
        0,
        reg_access,
        reg_stream};
    Store a2 = {
        "MEM",
        memory,
        maddr_bits,
        mem_size,
        0,
        mem_access,
        mem_stream};
    uint32_t tag;
    uint32_t offset;
    uint32_t current_tag = UINT32_MAX; // not a valid tag, forces initial load
    // process the data
    for(uint64_t i = 0; i < a2.size; i++) {
        tag = i >> a1.addr_bits;        // block number
        offset = i & (a1.size - 1);     // index within the block
        if(tag != current_tag) {
            a2.start_address = i - offset;  // first address of the new block
            rtime += load(a1,a2,a1.size);
            current_tag = tag;
            std::cout << "\t\tloading tag " << tag << " ";
            std::cout << a2.name << " address = " << i << std::endl;
        }
        a1.data[offset] = 0;            // touch the register, never index past 63
        rtime += a1.access_time;
    }
    std::cout << "EX2: process time: " << rtime << std::endl;

Running this code gives a lot of output, most of which is generated by the loads of 64 bytes at a time into the registers.

The last few lines tell the story:

    loading tag 1020 MEM address = 65280
    loading tag 1021 MEM address = 65344
    loading tag 1022 MEM address = 65408
    loading tag 1023 MEM address = 65472

EX2: process time: 720896

Hey, we got the right number. That was sure a lot of work to generate a number we could just figure out by hand. But, hey, we are programmers here, this was much more interesting.

The Real Story

Storage devices do not work in such a simple way. The access time is the time it takes to get that first byte ready to move across a bus to the destination. If our access into this device was totally random, then that access time might be the number we need to calculate throughput. However, if we fetch data sequentially, the device can do much better. It can stream data at a much higher rate.

RAM has its own clock. A typical memory module (DDR3-1333, for example) moves data at an effective rate of 1.333 GHz, which works out to about half the speed of the processor. Therefore, the burst transfer rate works out to one byte every two processor clock ticks (half the speed of transfers inside the chip). Of course, this only works if the data access is sequential. Introduce random access, and we are back to using the latency numbers.

Here are some numbers we can work with:

  • Registers:
    • Latency: 1 clock
    • Stream rate: 1 clock (no improvement here)
  • Memory (DDR3-1333):
    • Latency: 13 clocks
    • Stream rate: 2 clocks
  • Disk (7200 rpm):
    • Latency: 100 clocks
    • Stream rate: 23 clocks

Let’s see what this does to our code.

Adding Stream Rate

The stream rate improves performance in our test.

When we move a block of data, say n bytes large, the total time needed works out as follows:

  • time = 1 * access_time + (n - 1) * stream_time

If the stream time is the same as the access time, the time is just n * access_time, which is what we used in the previous example code.

Note

There is no new code for this experiment. The stream delay calculation is already in the code. I just made it equal to the access time for the last experiment! Lazy me!

Now, considering this improvement, our total processing time looks like this:

EX3: process time: 204800

Looks like we improved things by a factor of about 3.5 (720896 / 204800). Not bad!

Non-Sequential processing

Unfortunately, programmers do not always do the “right thing”.

Let’s consider a two dimensional array that we need to process:

#include <iostream>

const int nrows = 5;
const int ncols = 10;
int addr;

int data[nrows][ncols];
int sum1 = 0, sum2 = 0;  // zero both sums explicitly

int main( void ) {
    for(int i = 0; i < nrows; i++) {
        for (int j = 0; j < ncols; j++) {
            data[i][j] = 1;        
            sum1 += data[i][j];
            addr = i * ncols + j; 
            sum2 += data[0][addr];
        }
    }
    std::cout << sum1 << std::endl;
    std::cout << sum2 << std::endl;
}

That code looks a bit strange. I am building up a sum1 by placing a one in each memory location, then adding it to the sum variable. This is just a count of the cells.

Look at that silly code that calculates sum2. What in the world is going on there?

Note

I am cheating here. I am over-indexing the column index on purpose, and I am always aiming the processor at row zero. The effect is to access each item in the array using the address I calculate from the row number and the column number. This is exactly what the compiler does, after laying out your array in the correct form. Strictly speaking, C++ makes no promise about indexing past the end of a row like this, so treat it as a demonstration only. The accesses are always using an address to fetch data inside the processor!

If you are not convinced that this will work, run the code!

Compare that to this code:

#include <iostream>

const int nrows = 5;
const int ncols = 10;
int addr;

int data[nrows][ncols];
int sum1 = 0, sum2 = 0;  // zero both sums explicitly

int main( void ) {
    for(int j = 0; j < ncols; j++) {
        for (int i = 0; i < nrows; i++) {
            data[i][j] = 1;        
            sum1 += data[i][j];
            addr = i * ncols + j; 
            sum2 += data[0][addr];
        }
    }
    std::cout << sum1 << std::endl;
    std::cout << sum2 << std::endl;
}

Do you see the difference?

The first example accesses the data column by column within a single row, then moves on to the next row. Based on how an array is stored in memory (row major), this gives us the sequential memory access pattern we want.

On the other hand, the second example accesses the data row by row, working down one column at a time. This is not going to be as efficient.

The reason why should be clear by now. If we are not going to fetch the next byte in sequence from memory, we lose the advantage of that stream rate for data transfer. Every fetch is going to incur the full access time delay.

The Final Code

Just for reference, here is the complete program I used for this lecture:

#include <iostream>
#include <cstdint>
#include <cstdlib>   // std::exit
#include <string>

// set up data areas (each size is a power of two, set by the address width)
const uint64_t raddr_bits = 6;
const uint64_t reg_size = uint64_t{1} << raddr_bits;
const uint64_t maddr_bits = 16;
const uint64_t mem_size = uint64_t{1} << maddr_bits;
const uint64_t daddr_bits = 18;
const uint64_t data_size = uint64_t{1} << daddr_bits;

// byte arrays
uint8_t registers[reg_size];
uint8_t memory[mem_size];
uint8_t data[data_size];

// data area management structure
struct Store {
    std::string name;
    uint8_t *data;
    uint64_t addr_bits;
    uint64_t size;
    uint64_t start_address;
    uint8_t access_time;
    uint8_t stream_time;
};

// access times
const uint8_t reg_access = 1;
const uint8_t reg_stream = 1;
const uint8_t mem_access = 10;
const uint8_t mem_stream = 2;
const uint8_t disk_access = 100;
const uint8_t disk_stream = 24;

// initialize an array with a sequence of numbers, starting at index n
void init( uint8_t *array, uint64_t size, uint64_t n) {
    for(uint64_t i = n; i < size; i++) {
        array[i] = static_cast<uint8_t>(i);  // byte values wrap around at 256
    }
}

// general purpose load routine (from area2 to area1)
uint64_t load(
        Store area1, 
        Store area2,
        uint64_t block_size) {
    uint64_t time = 0;
    uint64_t a1_addr = area1.start_address;
    uint64_t a2_addr = area2.start_address;

    for(uint64_t i=0; i < block_size; i++) {
        // check for transfer faults (test the address this pass will touch)
        if(a1_addr + i >= area1.size) { //area 1 fault
            std::cout << area1.name << " fault" << std::endl;
            std::exit(1);
        }
        if(a2_addr + i >= area2.size) { //area 2 fault
            std::cout << area2.name << " fault" << std::endl;
            std::exit(1);
        }
        // no faults, do the transfer
        area1.data[a1_addr + i] = area2.data[a2_addr + i];
        if(i == 0) // access time delay
            time += area2.access_time;  // a2 is always slower
        else
            time += area2.stream_time;
    }
    return time;
}



int main( void ) {
    std::cout << "Register Size: " << reg_size << std::endl;
    std::cout << "Memory Size  : " << mem_size << std::endl;
    std::cout << "Data Size    : " << data_size << std::endl;

    // initialize the data area
    init(data, data_size, 0);  // start filling at index 0

    // variable to track time data
    uint64_t rtime;
    uint64_t mtime;
    uint64_t dtime;

    // experiment 1: process data in different areas:
    rtime = mtime = dtime = 0;
    for(uint64_t i=0; i < data_size; i++) {
        rtime += reg_access;
        mtime += mem_access;
        dtime += disk_access;
    }

    std::cout << "Time to process data in registers: " << rtime << std::endl; 
    std::cout << "Time to process data in memory   : " << mtime << std::endl; 
    std::cout << "Time to process data on disk     : " << dtime << std::endl; 

    // experiment 2: process data memory array through registers
    rtime = 0;
    Store a1 = {
        "REG",
        registers,
        raddr_bits,
        reg_size,
        0,
        reg_access,
        reg_stream};
    Store a2 = {
        "MEM",
        memory,
        maddr_bits,
        mem_size,
        0,
        mem_access,
        mem_stream};
    uint32_t tag;
    uint32_t offset;
    uint32_t current_tag = UINT32_MAX; // not a valid tag, forces initial load
    // process the data
    for(uint64_t i = 0; i < a2.size; i++) {
        tag = i >> a1.addr_bits;        // block number
        offset = i & (a1.size - 1);     // index within the block
        if(tag != current_tag) {
            a2.start_address = i - offset;  // first address of the new block
            rtime += load(a1,a2,a1.size);
            current_tag = tag;
            std::cout << "\t\tloading tag " << tag << " ";
            std::cout << a2.name << " address = " << i << std::endl;
        }
        a1.data[offset] = 0;            // touch the register, never index past 63
        rtime += a1.access_time;
    }
    std::cout << "EX2: process time: " << rtime << std::endl;
}