HW6: How Fast is My Code?

For some reason, we all want our computers to be FAST! So, as programmers, we need a way to figure out just how fast our programs (or parts of our programs) are. While we could pull out a stop-watch and time things ourselves, it is much more efficient to let the machine help you get this information.

Note

For this assignment, you do not need to submit any code, just a document in your homework repo showing the results you got on your system. You will run the code shown here and record your resuts in a file in homework/hw6.

Tick Tick

As we discussed earlier, the processor is driven by a simple clock circuit that generates an on-off (0-1) pattern at some pretty fast rate. The manufacturer of your computer decides how fast that will be, based on how fast the actual chip they select can go. (Geeks have been known to over-clock their processors to make those space aliens go away faster, but that has its own risks in terms of potential damage to the chip!)

How can a machine tell us how fast it is going?

Determining Clock Speed

All modern computers, at least those desktop or laptop systems) have several timing systems active in them. The processor clock is probably the most important, but it stops when power is pulled from the system. To enable the computer to track the time of day, there is another clock in the system, this one run by batteries so it will continue to tick away while the machine is powered off.

Real-Time Clock

On the motherboard of your system, you will find a crystal controlled chip that ticks at a rate of 32768 Hz (a common rate, used in many watches). This chip provides a signal to the processor at a rate of 1024 Hz. The chip has registers that keep track of the current time and date used by your system in many ways. Unfortunately, this chip is known to “drift” off of the real time, so most systems use Network Time Synchronization (or NTP) to update the registers in this chip using more accurate sources provided by a number of servers on the Internet.

Counter/Timer Chip

There is another chip on the motherboard generating a signal that is sent to the processor for time keeping purposes, but this one, like the main system clock, only runs when power is available.

The CTC chip is clocked at a rate of 14.31818 MHz, and gets divided down to generate a signal at a rate of 1.193182 MHz. This gets further divided to give a final signal at a rate of 18.2065 Hz.

Signals generated by this chip are used by the operating system in deciding what job gets the processor’s attention. We will learn more about how this happens when we study interrupts later in the course.

Time-Stamp Counter

The most accurate time source is the chip, itself. The Pentium processor maintains a 64-bit counter that is incremented every time the system clock ticks. When the system is powered on, or reset, this counter returns to zero. Even with a fast system clock, with 64-bits it will take a while to get to the biggest number you can hold in 64-bits. (See HW 3: Time for a Time Out!) to find out how long!)

There is an instruction you can use to access this counter: RDTSC. When you process this instruction, two internal “registers” get set (see the Intel docs):

  • RAX = low 32 bits of counter (high 32 bits are all zero)
  • RDX = high 32 bits of counter (high 32 bits are all zero)

Both of these “registers” hold 64 bits.

Note

It is possible for the operating system to disable this instruction, and it is possible that each processor in a multi-code chip might have different counter values. Synchronizing the counters in all cores can be done with a suitable program (we will not explore that issue here).

There are a few issues when using this instruction to see how fast things are going. For one, the magic in the processor allows it to be working on more than one instruction at a time, making the count less useful for calculating the time one instruction takes. We can defeat the chip by forcing it to flush out the places where this work is happening, effectively rendering it a simple machine where we can time single instructions. A common way to do this is to use the CPUID instruction right before calling the RTDSC instruction. A side effect of CPUID is to do the flushing job we want.

Here is a bit of C code we can use to get the processor to return the TSC values:

rdtcs.c
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
// rdtsc.c - read the RDTSC register on 64-bit processors

#include "rdtsc.h"

unsigned long long rdtsc(void) {
    unsigned long long hi = 0, lo = 0;
    asm volatile("rdtsc" : "=a"(lo), "=d"(hi));
    return ((unsigned long long)lo | (((unsigned long long)hi) << 32));
}

rdtsc.h
// rdtsc.h - read the RDTSC register on 64-bit processors

unsigned long long rdtsc(void);

Note

This bit of trickery shows you that you can drop from C into assembly language without moving completely to assembly language. You need to know what is going on, though, otherwise you might really mess up your program!

As a demonstration of using this instruction to time a single instruction, let’s try this code:

text_mov.c
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
// time_move.c - simple experiment to time a single mov instruction
#include <stdio.h>
#include "rdtsc.h"

int main(int argc, char *argv[]) {
    unsigned long long start, stop;

    start = rdtsc();
    asm volatile(
        "mov $0, %rax"
    );
    stop = rdtsc();
    printf("MOV took %lld cycles\n", stop-start);
}

And here is what we get:

$ gcc -o test_mov test_mov.c rtdcs.c
$ ./test_mov
$ MOV took 27 cycles

That does not seem right, the processor should be able to do this job much more quickly. Actually, it does. It should be able to do this job in just one clock cycle. So, what is wrong with our experiment?

The answer is a bit complicated. If we disassembly the code produced by this test, we see this:

timetest_mov.s
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
	.cstring
LC0:
	.ascii "MOV took %lld cycles\12\0"
	.text
	.globl _main
_main:
LFB1:
	pushq	%rbp
LCFI0:
	movq	%rsp, %rbp
LCFI1:
	subq	$32, %rsp
	movl	%edi, -20(%rbp)
	movq	%rsi, -32(%rbp)
	movl	$0, %eax
	call	_rtdsc
	cltq
	movq	%rax, -8(%rbp)
# 9 "test_mov.c" 1
	mov $0, %rax
# 0 "" 2
	movl	$0, %eax
	call	_rtdsc
	cltq
	movq	%rax, -16(%rbp)
	movq	-16(%rbp), %rax
	subq	-8(%rbp), %rax
	movq	%rax, %rsi
	leaq	LC0(%rip), %rdi
	movl	$0, %eax
	call	_printf
	leave
LCFI2:
	ret
LFE1:
	.section __TEXT,__eh_frame,coalesced,no_toc+strip_static_syms+live_support
EH_frame1:
	.set L$set$0,LECIE1-LSCIE1
	.long L$set$0
LSCIE1:
	.long	0
	.byte	0x1
	.ascii "zR\0"
	.byte	0x1
	.byte	0x78
	.byte	0x10
	.byte	0x1
	.byte	0x10
	.byte	0xc
	.byte	0x7
	.byte	0x8
	.byte	0x90
	.byte	0x1
	.align 3
LECIE1:
LSFDE1:
	.set L$set$1,LEFDE1-LASFDE1
	.long L$set$1
LASFDE1:
	.long	LASFDE1-EH_frame1
	.quad	LFB1-.
	.set L$set$2,LFE1-LFB1
	.quad L$set$2
	.byte	0
	.byte	0x4
	.set L$set$3,LCFI0-LFB1
	.long L$set$3
	.byte	0xe
	.byte	0x10
	.byte	0x86
	.byte	0x2
	.byte	0x4
	.set L$set$4,LCFI1-LCFI0
	.long L$set$4
	.byte	0xd
	.byte	0x6
	.byte	0x4
	.set L$set$5,LCFI2-LCFI1
	.long L$set$5
	.byte	0xc
	.byte	0x7
	.byte	0x8
	.align 3
LEFDE1:
	.subsections_via_symbols

That is a lot more code than I wanted to test. It looks like the test, itself, generated some “overhead” that we need to account for. One way to do this is to run the test a second time, this time doing everything except the one instruction we are interested in timing. Here is a modified version of the program that does this:

test_mov2.c
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
// time_move.c - simple experiment to time a single mov instruction
#include <stdio.h>
#include "rdtsc.h"

int main(int argc, char *argv[]) {
    unsigned long long start, stop, basic, overhead;

    start = rdtsc();
    asm volatile(
        "mov $0, %rax"
    );
    stop = rdtsc();
    basic = stop - start;

    start = rdtsc();
    stop = rdtsc();
    overhead = stop - start;
    printf("MOV took %lld cycles\n", basic-overhead);
}

Running this give the following:

$ gcc -o test_mov2 test_mov2.c rdtsc.c
$ ./test_mov2
MOV took 6 cycles

Well, that is better, but it is still wrong! We should be seeing numbers closer to one. In fact, running this a number of times, I saw values ranging from 0 to 2753 cycles. What is going on?

Multi-tasking

The answer is simple. It is all those “other” programs running in your system that are causing all the pain. In the simple code we ran, the operating system might interrupt our code and hand control to a totally different program, one that consumes a bunch of cycles to get done. If this happens, the count we finally get will be totally wrong. Is there any way to avoid this problem?

Note

Probably, the master of this kind of timing is Agner Fog (AgnerFog) who has worked very hard to produce accurate timing data for a bunch of processors. The code he uses is available from his web site.

In order to do this job accurately, we have to force the operating system to stop interrupting our job, and deal with a number of internal issues such as the systems ability to do a form of parallel processing as it processes individual instructions. Rather than go through all of that now, I will refer you to Agner’s website, and we will use his data as a source for the number of cycles each instruction takes to complete. We will use our RDTSC routine when we need to do timing calculations in our example code.

Checking Your System Clock

Using the RDTSC routine we presented, we can get the frequency of our system clock with reasonable accuracy using this code:

freq.c
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
#include <stdio.h>

#include "rdtsc.h"

int main(void) {
    unsigned long long int start, stop;    
    start = rdtsc();
    sleep(1);
    stop = rdtsc();    
    printf("clock frequency = %llu\n", stop - start);
}

This routine assumes that the system sleep command, which uses the CTC counter to do its job, is accurate enough to actually stall the program for exactly one second. On my system, this is what I got:

gcc -o freq freq.c rdtsc.c
./freq
clock frequency = 2641482962

According to Apple, my MacBook Pro is running at 2.7GHz, which is close enough for me.

The Makefile

Just to complete this lecture, here is a Makefile that will build all the code.

Makefile
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
FILES = test_mov.c test_mov2.c rdtsc.c freq.c
EXECS = test_mov test_mov2 freq

OBJS	= $(FILES:.c=.o)

all: $(EXECS)

test_mov:	$(OBJS)
	gcc -o $@ test_mov.o rdtsc.o

test_mov2:	$(OBJS)
	gcc -o $@ test_mov2.o rdtsc.o

freq:	$(OBJS)
	gcc -o $@ freq.o rdtsc.o

%.o:	%.c
	gcc -c -o $@ $<

clean:
	rm -f $(EXECS) $(OBJS)

What to Turn In

Obviously, you do not need to create any new code for this project, but you do need to set up this experiment and see what you get on your development machine. To complete the homework, just record your results in a text file and include that in your homework repo. Be sure to run the instruction timing code a few times to see how it varies.