Memory Access Patterns¶
If you have not figured this out by now, computer architecture is tied to the software it runs. Designers study programming patterns, and use that data to find ways to improve performance.
Let’s look at some code, and see how we might process data.
Note
Some of this code is silly. We could just as easily do a few calculations and come up with the data we will uncover. However, building the code shown here is helpful in exploring aspects of modern architecture.
Processing Big Data¶
The rage today is working on “big data”. We are not quite sure what that data is, but we are told it is “BIG”.
In my career, I have processed a lot of “big data”, long before that term was ever coined. In my work as a Computational Fluid Dynamicist, I routinely worked on piles of data, gigabytes big. So big, in fact, that we had a hard time moving it back and forth into the machine to do the calculations.
Let’s build a simple model of such work. Our model will be like all models, a small version of the real thing.
In these experiments, the actual processing is not important. The fetching of data from some storage area for processing is important. Therefore, we will set up our model data storage areas as arrays of data items, say 64-bits wide. Our arrays can be any size, the “bigger” the better, but we will keep things under control for our model.
We will need three different storage areas for this work:
Registers - small
Memory - fairly big, but slower
Disk - huge but very slow
Just for fun, let’s size these data areas based on the number of bits it takes to address all of them. (Hey we are studying computer architecture, and the address bus is an important part of this!)
Data Stores¶
Here is a start on setting up the system’s storage areas:
#include <iostream>
#include <cstdint>
#include <string>
//set up data areas
const uint64_t raddr_bits = 6;
const uint64_t reg_size = 1 << raddr_bits;
const uint64_t maddr_bits = 16;
const uint64_t mem_size = 1 << maddr_bits;
const uint64_t daddr_bits = 18;
const uint64_t data_size = 1 << daddr_bits;
// byte arrays
uint8_t registers[reg_size];
uint8_t memory[mem_size];
uint8_t data[data_size];
I wonder how big they are!
int main( void ) {
    std::cout << "Register Size: " << reg_size << std::endl;
    std::cout << "Memory Size : " << mem_size << std::endl;
    std::cout << "Data Size : " << data_size << std::endl;
}
Here is the output:
Register Size: 64
Memory Size : 65536
Data Size : 262144
Hey, these are not really big! They are tiny by today’s standards, but back in 1975, when I started building my own computers, this was what we had available.
Access Patterns¶
In reviewing Gauss’s Busy Work code, we see that we have several access patterns to look at. Actually, there are only two. One involves strictly sequential access to memory, and the other is fairly random access. We hop all over the data store fetching our data.
If we want to model a real computer in these experiments, we need to consider how the different devices work. Each one has some total capacity, that much should be obvious. But there are two other characteristics we need to consider as well:
Access Time¶
Every device takes some time to get ready. We deliver an address to the store, and we wait for the result. This delay is called the access time.
For a simple device that time may be all we need. But other devices work differently. For example if our data store is a rotating hard disk, the address reaches the device, and we need to wait while the read-write head positions to the right track, and then wait for the disk to spin around into the right spot on that track. We can read our data at that moment.
Stream Time¶
If we access totally random data stored on this disk, the access time is the delay we will experience (although it will not be a fixed time, since it depends on where the next data item lives). However, if the next fetch is located next to the first fetch, we can do better. In fact, we can “stream” data from sequential storage locations much faster than we can for random accesses. The reads happen as the disk spins; we never move the read-write head.
This faster access rate can be measured by a different term, which we will call the stream_time: a measure of the time between sequential fetches after the first fetch is ready.
Modeling a Storage Device¶
Let’s create a simple data structure (a C++ struct) to store all of this information about a store:
// data area management structure
struct Store {
    std::string name;        // label for messages
    uint8_t *data;           // the backing byte array
    uint64_t addr_bits;      // number of address bits
    uint64_t size;           // total capacity in bytes
    uint64_t start_address;  // where this area begins
    uint8_t access_time;     // delay for the first fetch
    uint8_t stream_time;     // delay for sequential fetches
};
Note
In case you have not seen this feature of C++, a struct is basically a class with no methods and all members public. In fact, some programmers never use this old pattern, inherited from “C”, and instead set up classes with no methods. Internally, they are the same thing.
After doing some intense “Googling”, I came up with this set of values for the delays in accessing our three data storage areas:
// access times
const uint8_t reg_access = 1;
const uint8_t reg_stream = 1;
const uint8_t mem_access = 10;
const uint8_t mem_stream = 2;
const uint8_t disk_access = 100;
const uint8_t disk_stream = 24;
Here is the code needed to set up the first two of our data stores using these structures:
Store a1 = {
    "REG",
    registers,
    raddr_bits,
    reg_size,
    0,
    reg_access,
    reg_stream};
Store a2 = {
    "MEM",
    memory,
    maddr_bits,
    mem_size,
    0,
    mem_access,
    mem_stream};
Utility Routines¶
To assist in our work, we need a utility routine to initialize a data area with a sequence of numbers. Here is code to do that:
// initialize an array with a sequence of numbers
void init( uint8_t *array, uint64_t size, int n) {
    for(uint64_t i=n; i<size; i++) {
        array[i] = i;   // byte array, so values wrap at 256
    }
}
We hand this routine a pointer to a data area and its size, and it fills that array with a sequence of byte values (the n parameter sets the starting index).
Modeling Memory Access¶
We will skip the first version of Gauss’s Busy Work code we showed earlier, and start off with the one-dimensional array version. The heart of this code was a simple loop that accessed each data item in order to do the work.
Experiment 1: Modeling Random Access¶
If our data fetches are random, all fetches will result in the delay specified by the access_time variable. To model this, we set up a simple loop that looks like this:
// experiment 1: process data in different areas:
rtime = mtime = dtime = 0;
for(uint64_t i=0; i < data_size; i++) {
    rtime += reg_access;
    mtime += mem_access;
    dtime += disk_access;
}
std::cout << "Time to process data in registers: " << rtime << std::endl;
std::cout << "Time to process data in memory : " << mtime << std::endl;
std::cout << "Time to process data on disk : " << dtime << std::endl;
We do not really need to fetch the data; each fetch will happen in the time specified by the access time, so we just calculate that time here to get a baseline number for reference. The code computes the time it would take to fetch each item from each data store.
Here is the output:
Time to process data in registers: 262144
Time to process data in memory : 2621440
Time to process data on disk : 26214400
Experiment 2: Moving from Memory to Registers¶
In this next experiment, we need to work harder. We want to process all of the numbers stored in memory, using the registers. Since we have more memory than registers, we need to pull the data in from the memory in blocks. Once the data is in our registers, we can do the work.
Our program code is unaware of all of this; it is simply running through a loop adding up numbers. In doing that, it generates a sequence of addresses that head off to the controller.
In this experiment, the raw data is in memory, so our addresses represent locations in memory. Obviously those addresses will be beyond anything available for the registers, so we need to translate the addresses.
Here is the idea:
Break each address up into two parts:
offset - the low bits (as many bits as the register address size)
tag - all of the other bits
If you view the memory area as a set of blocks, each one exactly the same size as our register area, then a memory address ends up looking like this:
offset - index into any block
tag - block number
We can watch the tag part of an address and check to see if that tag is currently loaded in the registers. If so, we add in the number from the register indicated by the offset part. If the tag does not match, we need to load that block into the register store.
In all of this, we will use the access time needed to measure our time. We will add in the streaming improvement later.
Here is a routine to load a block:
// general purpose load routine (from area2 to area1)
uint64_t load(
        Store area1,
        Store area2,
        uint64_t block_size) {
    uint64_t time = 0;
    uint64_t a1_addr = area1.start_address;
    uint64_t a2_addr = area2.start_address;
    for(uint64_t i=0; i < block_size; i++) {
        // check for transfer faults
        if(a1_addr + i >= area1.size) { // area 1 fault
            std::cout << area1.name << " fault" << std::endl;
            std::exit(1);
        }
        if(a2_addr + i >= area2.size) { // area 2 fault
            std::cout << area2.name << " fault" << std::endl;
            std::exit(1);
        }
        // no faults, do the transfer
        area1.data[a1_addr + i] = area2.data[a2_addr + i];
        if(i == 0)                     // first fetch pays the full access delay
            time += area2.access_time; // a2 is always the slower store
        else
            time += area2.stream_time; // later fetches stream
    }
    return time;
}
The interesting part of this code is checking to see that the transfer will work. If we step out of bounds in either block, we generate a fault message, and exit our program.
Processing Loop¶
The processing loop is simple. We will not do any actual processing, just set up the data fetches. We can figure out the data we will get easily.
If we are going to process n bytes from slow memory into faster memory, then process the data from faster memory, the total time to do the work is:
n * access_time1 + n * access_time2
As an example, if we move 65536 bytes from memory, which has an access time of 10 clocks, it will take 655360 clocks to complete the data transfers.
If the data must all be processed through the registers, which have an access time of 1 clock tick, it will take 65536 clock ticks to do the work.
The total is 655360+65536 = 720896.
Let’s hope our code works properly!
Note
This seems silly. The time it takes to do this processing is longer than just accessing the data directly from the slower of the two stores. In our example, we are modeling moving data from memory into registers where we will do the actual work. Registers are faster than the memory, but we have very few registers, and much more memory. The real story is a bit more complex; this is just a start.
Let's build some code and see if we get the right numbers:
// experiment 2: process data memory array through registers
rtime = 0;
Store a1 = {
    "REG",
    registers,
    raddr_bits,
    reg_size,
    0,
    reg_access,
    reg_stream};
Store a2 = {
    "MEM",
    memory,
    maddr_bits,
    mem_size,
    0,
    mem_access,
    mem_stream};
uint32_t tag;
uint32_t offset;
uint32_t current_tag = 99;  // force initial load
// process the data
for(uint64_t i = 0; i < a2.size; i++) {
    tag = i >> a1.addr_bits;
    offset = i & (a1.size - 1);
    if(tag != current_tag) {
        rtime += load(a1, a2, a1.size);
        current_tag = tag;
        std::cout << "\t\tloading tag " << tag << " ";
        std::cout << a2.name << " address = " << i << std::endl;
    }
    a1.data[offset] = 0;    // touch the register copy (offset, not i!)
    rtime += a1.access_time;
}
std::cout << "EX2: process time: " << rtime << std::endl;
Running this code gives a lot of output, most of which is generated by the loads of 64 bytes at a time into the registers.
The last few lines tell the story:
loading tag 1020 MEM address = 65280
loading tag 1021 MEM address = 65344
loading tag 1022 MEM address = 65408
loading tag 1023 MEM address = 65472
EX2: process time: 720896
Hey, we got the right number. That was sure a lot of work to generate a number we could just figure out by hand. But, hey, we are programmers here; this was much more interesting.
The Real Story¶
Storage devices do not work in such a simple way. The access time is the time it takes to get that first byte ready to move across a bus to the destination. If our access into this device was totally random, then that access time might be the number we need to calculate throughput. However, if we fetch data sequentially, the device can do much better. It can stream data at a much higher rate.
RAM has its own clock. A typical memory module runs at an effective rate of 1.333 GHz, which works out to about half the speed of the processor. Therefore, the burst transfer rate works out to one byte every two clock ticks (twice as slow as transfers inside the chip). Of course, this only works if the data access is sequential. Introduce random access, and we are back to using the latency numbers.
Here are some numbers we can work with:
- Registers:
Latency: 1 clock
Stream rate: 1 (no improvement here)
- Memory (DDR3-1333):
Latency: 13 clocks
Stream rate: 2 clock ticks
- Disk (7200 rpm):
Latency: 100 clocks
Stream rate: 23 clock ticks
Let’s see what this does to our code.
Adding Stream Rate¶
The stream rate improves performance in our test.
When we move a block of data, say n bytes large, the total time needed works out as follows:
time = 1 * access_time + (n-1) * stream_time
If the stream_time is the same as the access_time, the time is just n * access_time, which is what we used in the previous example code.
Note
There is no new code for this experiment. The stream delay calculation is already in the code, I just made it equal to the access time for the last experiment! Lazy me!
Now, considering this improvement, our total processing time looks like this:
EX3: process time: 204800
Looks like we improved things by roughly a factor of three and a half. Not bad!
Non-Sequential processing¶
Unfortunately, programmers do not always do the “right thing”.
Let’s consider a two dimensional array that we need to process:
#include <iostream>
const int nrows = 5;
const int ncols = 10;
int addr;
int data[nrows][ncols];
int sum1 = 0, sum2 = 0;
int main( void ) {
    for(int i = 0; i < nrows; i++) {
        for (int j = 0; j < ncols; j++) {
            data[i][j] = 1;
            sum1 += data[i][j];
            addr = i * ncols + j;
            sum2 += data[0][addr];
        }
    }
    std::cout << sum1 << std::endl;
    std::cout << sum2 << std::endl;
}
That code looks a bit strange. I am building up sum1 by placing a one in each memory location, then adding it to the sum1 variable. This is just a count of the cells.
Look at that silly code that calculates sum2. What in the world is going on there?
Note
I am cheating here: I am over-indexing the column index on purpose, and I am always aiming the processor at row zero. The effect is to access each item in the array using the address I calculate from the row number and the column number. This is exactly what the compiler does after laying out your array in the correct form. The accesses always use an address to fetch data inside the processor!
If you are not convinced that this will work, run the code!
Compare that to this code:
#include <iostream>
const int nrows = 5;
const int ncols = 10;
int addr;
int data[nrows][ncols];
int sum1 = 0, sum2 = 0;
int main( void ) {
    for(int j = 0; j < ncols; j++) {
        for (int i = 0; i < nrows; i++) {
            data[i][j] = 1;
            sum1 += data[i][j];
            addr = i * ncols + j;
            sum2 += data[0][addr];
        }
    }
    std::cout << sum1 << std::endl;
    std::cout << sum2 << std::endl;
}
Do you see the difference?
The first example accesses the data along each row, one column at a time, then moves on to the next row. Based on how an array is stored in memory (row major), this is back to our sequential memory access pattern.
On the other hand, the second example works down one column at a time, hopping from row to row. This is not going to be as efficient.
The reason should be clear by now. If we are not going to fetch the next byte in sequence from memory, we lose the advantage of the stream rate for data transfer. Every fetch is going to incur the full access time delay.
The Final Code¶
Just for reference, here is the complete program I used for this lecture:
#include <iostream>
#include <cstdint>
#include <cstdlib>
#include <string>
//set up data areas
const uint64_t raddr_bits = 6;
const uint64_t reg_size = 1 << raddr_bits;
const uint64_t maddr_bits = 16;
const uint64_t mem_size = 1 << maddr_bits;
const uint64_t daddr_bits = 18;
const uint64_t data_size = 1 << daddr_bits;
// byte arrays
uint8_t registers[reg_size];
uint8_t memory[mem_size];
uint8_t data[data_size];
// data area management structure
struct Store {
    std::string name;        // label for messages
    uint8_t *data;           // the backing byte array
    uint64_t addr_bits;      // number of address bits
    uint64_t size;           // total capacity in bytes
    uint64_t start_address;  // where this area begins
    uint8_t access_time;     // delay for the first fetch
    uint8_t stream_time;     // delay for sequential fetches
};
// access times
const uint8_t reg_access = 1;
const uint8_t reg_stream = 1;
const uint8_t mem_access = 10;
const uint8_t mem_stream = 2;
const uint8_t disk_access = 100;
const uint8_t disk_stream = 24;
// initialize an array with a sequence of numbers
void init( uint8_t *array, uint64_t size, int n) {
    for(uint64_t i=n; i<size; i++) {
        array[i] = i;   // byte array, so values wrap at 256
    }
}
// general purpose load routine (from area2 to area1)
uint64_t load(
        Store area1,
        Store area2,
        uint64_t block_size) {
    uint64_t time = 0;
    uint64_t a1_addr = area1.start_address;
    uint64_t a2_addr = area2.start_address;
    for(uint64_t i=0; i < block_size; i++) {
        // check for transfer faults
        if(a1_addr + i >= area1.size) { // area 1 fault
            std::cout << area1.name << " fault" << std::endl;
            std::exit(1);
        }
        if(a2_addr + i >= area2.size) { // area 2 fault
            std::cout << area2.name << " fault" << std::endl;
            std::exit(1);
        }
        // no faults, do the transfer
        area1.data[a1_addr + i] = area2.data[a2_addr + i];
        if(i == 0)                     // first fetch pays the full access delay
            time += area2.access_time; // a2 is always the slower store
        else
            time += area2.stream_time; // later fetches stream
    }
    return time;
}
int main( void ) {
    std::cout << "Register Size: " << reg_size << std::endl;
    std::cout << "Memory Size : " << mem_size << std::endl;
    std::cout << "Data Size : " << data_size << std::endl;
    // initialize the data area
    init(data, data_size, 0);
    // variables to track time data
    uint64_t rtime;
    uint64_t mtime;
    uint64_t dtime;
    // experiment 1: process data in different areas:
    rtime = mtime = dtime = 0;
    for(uint64_t i=0; i < data_size; i++) {
        rtime += reg_access;
        mtime += mem_access;
        dtime += disk_access;
    }
    std::cout << "Time to process data in registers: " << rtime << std::endl;
    std::cout << "Time to process data in memory : " << mtime << std::endl;
    std::cout << "Time to process data on disk : " << dtime << std::endl;
    // experiment 2: process data memory array through registers
    rtime = 0;
    Store a1 = {
        "REG",
        registers,
        raddr_bits,
        reg_size,
        0,
        reg_access,
        reg_stream};
    Store a2 = {
        "MEM",
        memory,
        maddr_bits,
        mem_size,
        0,
        mem_access,
        mem_stream};
    uint32_t tag;
    uint32_t offset;
    uint32_t current_tag = 99;  // force initial load
    // process the data
    for(uint64_t i = 0; i < a2.size; i++) {
        tag = i >> a1.addr_bits;
        offset = i & (a1.size - 1);
        if(tag != current_tag) {
            rtime += load(a1, a2, a1.size);
            current_tag = tag;
            std::cout << "\t\tloading tag " << tag << " ";
            std::cout << a2.name << " address = " << i << std::endl;
        }
        a1.data[offset] = 0;    // touch the register copy (offset, not i!)
        rtime += a1.access_time;
    }
    std::cout << "EX2: process time: " << rtime << std::endl;
}