Memory Access Patterns
######################

If you have not figured this out by now, computer architecture is tied to the
software it runs. Designers study programming patterns, and use that data to
find ways to improve performance. Let's look at some code, and see how we
might process data.

.. note::

    Some of this code is silly. We could just as easily do a few calculations
    and come up with the data we will uncover. However, building the code
    shown here is helpful in exploring aspects of modern architecture.

Processing Big Data
*******************

The rage today is working on "big data". We are not quite sure what that data
is, but we are told it is "BIG". In my career, I have processed a lot of "big
data", long before that term had even been coined. In my work as a
``Computational Fluid Dynamicist``, I routinely worked on piles of data,
gigabytes big. So big, in fact, that we had a hard time moving it back and
forth into the machine to do the calculations.

Let's build a simple model of such work. Our model will be like all models, a
small version of the real thing. In these experiments, the actual processing
is not important. The fetching of data from some storage area for processing
is what matters. Therefore, we will set up our model data storage areas as
arrays of data items, say 64 bits wide. Our arrays can be any size, the
"bigger" the better, but we will keep things under control for our model.

We will need three different storage areas for this work:

* Registers - small
* Memory - fairly big, but slower
* Disk - huge, but very slow

Just for fun, let's size these data areas based on the number of bits it
takes to address all of them. (Hey, we are studying computer architecture,
and the address bus is an important part of this!)

Data Stores
===========

Here is a start on setting up the system's storage areas:

.. literalinclude:: code/caches.cpp
    :lines: 1-16

I wonder how big they are!

.. literalinclude:: code/caches.cpp
    :lines: 75-78,135

Here is the output:

.. code-block:: text

    Register Size: 64
    Memory Size  : 65536
    Data Size    : 262144

(Hey, these are not really big! They are tiny by today's standards, but back
in 1975, when I started building my own computers, this was what we had
available.)
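If you are not sure those sizes follow from the address-bus view, here is a
minimal standalone sketch (not part of ``caches.cpp``; the names here are my
own) that derives each capacity from the width of the address needed to reach
every item in the store:

.. code-block:: cpp

    #include <cstdint>
    #include <iostream>

    // A store addressed by n bits can hold 2^n items.
    static uint64_t capacity(unsigned address_bits) {
        return uint64_t(1) << address_bits;
    }

    int main() {
        std::cout << "Register Size: " << capacity(6)  << std::endl;  //  6 bits ->     64
        std::cout << "Memory Size  : " << capacity(16) << std::endl;  // 16 bits ->  65536
        std::cout << "Data Size    : " << capacity(18) << std::endl;  // 18 bits -> 262144
        return 0;
    }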
Access Patterns
***************

In reviewing Gauss's Busy Work code, we see that we have several access
patterns to look at. Actually, there are only two. One involves strictly
sequential access to memory, and the other is fairly random access. We hop
all over the data store fetching our data.

If we want to model a real computer in these experiments, we need to consider
how the different devices work. Each one has some total capacity, that much
should be obvious. But there are two other characteristics we need to
consider as well:

Access Time
===========

Every device takes some time to get ready. We deliver an address to the
store, and we wait for the result. That delay is called the ``access time``.

For a simple device, that time may be all we need. But other devices work
differently. For example, if our data store is a rotating hard disk, the
address reaches the device, and we need to wait while the read-write head
positions to the right track, and then wait for the disk to spin around to
the right spot on that track. We can read our data at that moment.

Stream Time
===========

If we access totally random data stored on this disk, the access time is the
delay we will experience (although it will not be a fixed time, since it
depends on where the next data item lives).

However, if the next fetch is located next to the first fetch, we can do
better. In fact, we can "stream" data from sequential storage locations much
faster than what we will see for random accesses. The reads happen as the
disk spins; we never move the read-write head. This faster access rate can be
measured by a different term, which we will call the ``stream_time``, a
measure of the time between sequential fetches after the first fetch is
ready.

Modeling a Storage Device
=========================

Let's create a simple data structure (a C++ ``struct``) to store all of this
information about a store:

.. literalinclude:: code/caches.cpp
    :lines: 18-27

.. note::

    In case you have not seen this feature of C++, a ``struct`` is basically
    a class with no methods. In fact, some programmers never use this old
    pattern, inherited from "C", and instead set up classes with no methods.
    Internally, they are the same thing.

After doing some intense "Googling", I came up with this set of values for
the delays in accessing our three data storage areas:

.. literalinclude:: code/caches.cpp
    :lines: 29-35

Here is the code needed to set up the first two of our data stores using
these structures:

.. literalinclude:: code/caches.cpp
    :lines: 102-117

Utility Routines
================

To assist in our work, we need a utility routine to initialize a data area
with a sequence of numbers. Here is code to do that:

.. literalinclude:: code/caches.cpp
    :lines: 37-42

We will hand this routine a management structure, and it will initialize that
array for us.

Modeling Memory Access
**********************

We will skip the first version of Gauss's Busy Work code we showed earlier,
and start off with the one-dimensional array version. The heart of this code
was a simple loop that accessed each data item in order to do the work.

Experiment 1: Modeling Random Access
====================================

If our data fetches are random, all fetches will result in the delay
specified by the ``access_time`` variable. To model this, we set up a simple
loop that looks like this:

.. literalinclude:: code/caches.cpp
    :lines: 88-98

We do not really need to fetch the data; each fetch will happen in the time
specified by the access time, so we just calculate that time here to get a
baseline number for reference. The code examines the time it would take to
fetch each item from each data store. Here is the output:

.. code-block:: text

    Time to process data in registers: 262144
    Time to process data in memory   : 2621440
    Time to process data on disk     : 26214400

Experiment 2: Moving from Memory to Registers
=============================================

In this next experiment, we need to work harder. We want to process all of
the numbers stored in memory, using the registers. Since we have more memory
than registers, we need to pull the data in from the memory in blocks. Once
the data is in our registers, we can do the work.

Our program code is unaware of all of this; it is simply running through a
loop adding up numbers. In doing that, it generates a sequence of addresses
that head off to the controller. In this experiment, the raw data is in
memory, so our addresses represent locations in memory. Obviously, those
addresses will be beyond anything available for the registers, so we need to
translate the addresses. Here is the idea. Break each address up into two
parts:

* offset - the low bits (as many bits as it takes to address the registers)
* tag - all of the other bits

If you view the memory area as a set of blocks, each one exactly the same
size as our register area, then a memory address ends up looking like this
(see the sketch after this list):

* offset - index into any block
* tag - block number
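To make that split concrete, here is a minimal sketch (my own illustration,
not the lecture code) of pulling the tag and offset out of an address when
the register store holds 64 items, which gives us a 6-bit offset:

.. code-block:: cpp

    #include <cstdint>
    #include <iostream>

    const int OFFSET_BITS = 6;                        // 64 registers -> 6-bit offset
    const uint64_t OFFSET_MASK = (1 << OFFSET_BITS) - 1;

    int main() {
        uint64_t address = 65345;                     // some memory address
        uint64_t offset  = address & OFFSET_MASK;     // low bits: index into the block
        uint64_t tag     = address >> OFFSET_BITS;    // high bits: block number
        std::cout << "tag = " << tag
                  << ", offset = " << offset << std::endl;
        // prints: tag = 1021, offset = 1
        // (block 1021 starts at memory address 1021 * 64 = 65344)
        return 0;
    }

That tag and block address will show up again in the program output later in
this experiment.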
We can watch the tag part of an address and check to see if that tag is
currently loaded in the registers. If so, we add in the number from the
register indicated by the ``offset`` part of the address. If the tag does not
match, we need to load that block into the register store.

In all of this, we will use the access time to measure our time. We will add
in the streaming improvement later.

Here is a routine to load a block:

.. literalinclude:: code/caches.cpp
    :lines: 44-71

The interesting part of this code is checking to see that the transfer will
work. If we step out of bounds in either block, we generate a fault message,
and exit our program.

Processing Loop
---------------

The processing loop is simple. We will not do any actual processing, just set
up the data fetches. We can figure out the result we will get easily. If we
are going to move n bytes from slow memory into faster memory, then process
the data from the faster memory, the total time to do the work is:

* time = n * access_time1 + n * access_time2

As an example, if we move 65536 bytes from memory, which has an access time
of 10 clocks, it will take 655360 clocks to complete the data transfers. If
the data must all be processed through the registers, which have an access
time of 1 clock tick, it will take 65536 clock ticks to do the work. The
total is 655360 + 65536 = 720896. Let's hope our code works properly!

.. note::

    This seems silly. The time it takes to do this processing is longer than
    just accessing the data directly from the slower of the two stores. In
    our example, we are modeling moving data from memory into registers,
    where we will do the actual work. Registers are faster than the memory,
    but we have very few registers, and much more memory. The real story is a
    bit more complex; this is just a start.

Let's build some code and see if we get the right numbers:

.. literalinclude:: code/caches.cpp
    :lines: 100-134

Running this code gives a lot of output, most of which is generated by the
loads of 64 bytes at a time into the registers. The last few lines tell the
story:

.. code-block:: text

    loading tag 1020 MEM address = 65280
    loading tag 1021 MEM address = 65344
    loading tag 1022 MEM address = 65408
    loading tag 1023 MEM address = 65472
    EX2: process time: 720896

Hey, we got the right number. That was sure a lot of work to generate a
number we could just figure out by hand. But, hey, we are programmers here;
this was much more interesting.

The Real Story
**************

Storage devices do not work in such a simple way. The ``access time`` is the
time it takes to get that first byte ready to move across a bus to the
destination. If our access into this device was totally random, then that
access time might be the number we need to calculate throughput. However, if
we fetch data sequentially, the device can do much better. It can stream data
at a much higher rate.

RAM has its own clock. A typical memory module is clocked at a rate of
1.333 GHz, which works out to about half the speed of the processor.
Therefore, the burst transfer rate works out to one byte every two clock
ticks (twice as slow as transfers inside the chip). Of course, this only
works if the data access is sequential. Introduce random access, and we are
back to using the latency numbers.

Here are some numbers we can work with:

* Registers:

  * Latency: 1 clock
  * Stream rate: 1 clock tick (no improvement here)

* Memory (DDR3-1333):

  * Latency: 13 clocks
  * Stream rate: 2 clock ticks

* Disk (7200rpm):

  * Latency: 100 clocks
  * Stream rate: 23 clock ticks

Let's see what this does to our code.

Adding Stream Rate
******************

The stream rate improves performance in our test. When we move a block of
data, say n bytes large, the total time needed works out as follows:

* time = 1 * access_time + (n - 1) * stream_time

If the stream time is the same as the access time, the time is just
``n * access_time``, which is what we used in the previous example code.

.. note::

    There is no new code for this experiment. The stream delay calculation is
    already in the code; I just made it equal to the access time for the last
    experiment! Lazy me!
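Before running the modified program, we can predict the new number with a
quick standalone sketch (my own check, using the access and stream times the
experiments have been using: memory at 10 and 2, registers at 1 and 1):

.. code-block:: cpp

    #include <cstdint>
    #include <iostream>

    // Time to move an n-item block: the full access delay for the first
    // item, then the faster stream delay for each remaining item.
    static uint64_t block_time(uint64_t n, uint64_t access, uint64_t stream) {
        return access + (n - 1) * stream;
    }

    int main() {
        const uint64_t items  = 65536;          // data items in memory
        const uint64_t block  = 64;             // items per register-sized block
        const uint64_t blocks = items / block;  // 1024 block loads

        uint64_t load_time = blocks * block_time(block, 10, 2);  // from memory
        uint64_t work_time = items * 1;         // registers: no streaming gain

        std::cout << "load: "   << load_time
                  << " work: "  << work_time
                  << " total: " << (load_time + work_time) << std::endl;
        // prints: load: 139264 work: 65536 total: 204800
        return 0;
    }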
Now, considering this improvement, our total processing time looks like this:

.. code-block:: text

    EX3: process time: 204800

Looks like we improved things by more than a factor of three. Not bad!

Non-Sequential Processing
*************************

Unfortunately, programmers do not always do the "right thing". Let's consider
a two-dimensional array that we need to process:

.. literalinclude:: code/stride1.cpp

That code looks a bit strange. I am building up ``sum1`` by placing a one in
each memory location, then adding it into the sum. This is just a count of
the cells. Look at that silly code that calculates ``sum2``. What in the
world is going on there?

.. note::

    I am cheating here; I am over-indexing the column index on purpose. I am
    also always aiming the processor at column zero. The effect is to access
    each item in the array using the address I calculate from the row number
    and the column number. This is exactly what the compiler does, after
    laying out your array in the correct form. The accesses always use an
    address to fetch data inside the processor! If you are not convinced that
    this will work, run the code!

Compare that to this code:

.. literalinclude:: code/stride2.cpp

Do you see the difference? The first example accesses the data column by
column within a single row, then moves on to the next row. Based on how an
array is stored in memory (row major), this is back to our sequential memory
access time. On the other hand, the second example accesses data row by row,
working down one column at a time. This is not going to be as efficient.

The reason why should be clear by now. If we are not going to fetch the next
byte in sequence from memory, we lose the advantage of that stream rate for
data transfer. Every fetch is going to incur the full access time delay.

The Final Code
**************

Just for reference, here is the complete program I used for this lecture:

.. literalinclude:: code/caches.cpp
    :linenos:
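If you want to see the stride effect in the timing model itself, here is one
last standalone sketch (my own illustration, reusing the memory access time
of 10 clocks and stream time of 2 from the experiments; the array size is
made up) that estimates the fetch time for both traversal orders:

.. code-block:: cpp

    #include <cstdint>
    #include <iostream>

    // Fetch time for walking a rows x cols array (row-major storage) in
    // storage order: the first fetch pays the access time, the rest stream.
    static uint64_t sequential_time(uint64_t rows, uint64_t cols,
                                    uint64_t access, uint64_t stream) {
        return access + (rows * cols - 1) * stream;
    }

    // Fetch time for walking the same array down the columns: every fetch
    // lands a full row away from the last one, so every fetch pays the
    // full access time and we never stream.
    static uint64_t strided_time(uint64_t rows, uint64_t cols,
                                 uint64_t access) {
        return rows * cols * access;
    }

    int main() {
        const uint64_t rows = 256, cols = 256;
        std::cout << "row by row      : "
                  << sequential_time(rows, cols, 10, 2) << std::endl;
        std::cout << "column by column: "
                  << strided_time(rows, cols, 10) << std::endl;
        // prints: row by row      : 131080
        //         column by column: 655360
        return 0;
    }

The exact numbers are not the point, and this simple model ignores the block
loading we built earlier. The five-to-one gap between the two traversal
orders is the story: stride matters.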