Name: A. Very-Good-Student
Student ID: u5053499
Lab Group: Afterhours

Type your answers to the following questions WITHIN this document.
-----------------------------------------------------------------------
Q1: What is your estimate for the resolution (Tr) and overhead (Tc) of
gettimeofday?

** Resolution of 1us; overhead much less than that. The overhead is the
minimum time observed - in this case we see timings of "0", which
indicates that the overhead is so small that the resolution is not high
enough to measure it. The resolution is the smallest non-zero
difference between the times observed. Occasionally you may see higher
numbers; this is because the program is being executed on a multiuser
machine and may get swapped out or interrupted between two (wall)
timing calls.
-----------------------------------------------------------------------
Q2: Copy walltime.c to mywalltime.c and modify it to determine the
overhead and resolution of elapsetime. What are the values of the
resolution (Tr) and overhead (Tc) for this routine? (Note the makefile
will compile and link your file mywalltime.c with libunknown.a
correctly.)

** Possibly machine-dependent, but you should see something like:

1 0 1 1 1 0 1 1 1 0
1 1 1 0 1 1 0 0 1 1
0 0 1 1 0 1 1 1 0 1
1 1 0 1 1 1 0 1 1 0
0 1 1 0 0 1 1 0 0 1
1 0 1 1 1 0 1 1 1 0

indicating that the resolution is 1us (the smallest non-zero number)
and the overhead is much less than 1us (thus observed as 0). This is a
similar result to that seen on partch.
-----------------------------------------------------------------------
Q3: What modifications to walltime.c must be made in order to estimate
the resolution and overhead of the process time? Can we simply replace
gettimeofday with something else?

** The resolution of the CPU time and SYS time is set by the frequency
of the timer interrupts, tick = sysconf(_SC_CLK_TCK). This value is
usually 100 on x86 systems, so the resolution is 1/100 seconds (the
Linux scheduler relies on timer interrupts). The overhead is much
smaller than the resolution, so it will come out as zero if we simply
reuse the wall-time test. Since the wall time is much more accurate, it
can instead be used to estimate the overhead of times():

    while (i < ASIZE) {
        gettimeofday(tp1, NULL);
        cpu_time(&cpu1, &sys1);
        gettimeofday(tp2, NULL);
        ...
    }

This reports an overhead of around 1 microsecond.
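For reference, here is a minimal, self-contained sketch of the probes
used in Q1-Q3. Note that cpu_time() below is a local stand-in for the
lab's library routine (assumed here to be a thin wrapper around
times()); its real signature in libunknown.a may differ.

    /* probe.c - wall-time resolution/overhead, and times() overhead */
    #include <stdio.h>
    #include <sys/time.h>
    #include <sys/times.h>
    #include <unistd.h>

    #define ASIZE 60

    /* Stand-in for the lab's cpu_time(): user and system CPU time in
       seconds, derived from times(). */
    static void cpu_time(double *user, double *sys)
    {
        struct tms t;
        double tick = (double) sysconf(_SC_CLK_TCK);
        times(&t);
        *user = t.tms_utime / tick;
        *sys  = t.tms_stime / tick;
    }

    static long usec_diff(struct timeval *a, struct timeval *b)
    {
        return (b->tv_sec - a->tv_sec) * 1000000L
             + (b->tv_usec - a->tv_usec);
    }

    int main(void)
    {
        struct timeval tp1, tp2;
        long d[ASIZE];
        double u, s;
        int i;

        /* Back-to-back calls: the smallest difference approximates
           the overhead (Tc); the smallest non-zero difference gives
           the resolution (Tr). */
        for (i = 0; i < ASIZE; i++) {
            gettimeofday(&tp1, NULL);
            gettimeofday(&tp2, NULL);
            d[i] = usec_diff(&tp1, &tp2);
        }
        for (i = 0; i < ASIZE; i++)
            printf("%ld%c", d[i], (i % 10 == 9) ? '\n' : ' ');

        /* Bracket cpu_time() with wall-clock calls to estimate its
           overhead, as described in Q3. */
        gettimeofday(&tp1, NULL);
        cpu_time(&u, &s);
        gettimeofday(&tp2, NULL);
        printf("cpu_time() overhead ~ %ld us\n", usec_diff(&tp1, &tp2));
        return 0;
    }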
-----------------------------------------------------------------------
Q4: What units are the CPU, SYS and CPU+SYS time reported in? (You
should make an educated guess based on the sort of time the program
takes to run - e.g. does it seem to take hours to run! Then look at the
code and read the relevant man pages.)

** Seconds.
-----------------------------------------------------------------------
Q5: Briefly state what you understand to be the meaning of CPU time and
SYS time.

** CPU time is the time that your process is executing on the CPU,
running instructions that are part of your code. SYS time is the time
that your process is executing on the CPU but running instructions that
are part of the O/S - for example, executing system calls to do I/O
operations. A large system time compared to user CPU time would be an
indication of some sort of problem - maybe your code is swapping, or
executing traps to the O/S because of some error. NOTE that both of
these are different from the elapsed or wallclock time, as they only
represent time during which your executable is using the CPU - not time
that it may spend swapped out, waiting.
-----------------------------------------------------------------------
Q6: We can time programs using the Unix time utility (do man -s1 time
for details). Use this utility and complete the following table
comparing the results from time with the CPU+SYS time. Run 3
experiments for each data size, or more if your results are
inconsistent between runs. (You might want to consider doing something
like "repeat 7 time linear 128 | egrep '(CPU\+SYS|%)'".)

** Timings on partch (Dual-Core AMD Opteron), as reported via the code:

-------------------------------------------------------------------
               ------------- Input Dimension -------------
                 80    80    80     160   160   160
-------------------------------------------------------------------
CPU time     =  0.00  0.00  0.00    0.02  0.02  0.02
SYS time     =  0.00  0.00  0.00    0.00  0.00  0.00
CPU+SYS time =  0.00  0.00  0.00    0.02  0.02  0.02
-------------------------------------------------------------------
               ------------- Input Dimension -------------
                320   320   320     640   640   640
-------------------------------------------------------------------
CPU time     =  0.20  0.16  0.16    1.33  1.33  1.35
SYS time     =  0.00  0.00  0.00    0.01  0.01  0.01
CPU+SYS time =  0.20  0.16  0.16    1.34  1.34  1.36
-------------------------------------------------------------------
               ------------- Input Dimension -------------
                1280  1280  1280
-------------------------------------------------------------------
CPU time     = 10.49 10.51 10.48
SYS time     =  0.01  0.00  0.03
CPU+SYS time = 10.50 10.51 10.51
-------------------------------------------------------------------
-----------------------------------------------------------------------
Q7: Why are the results from CPU+SYS different from the time results
(you may need to inspect the code to confirm your thoughts)?

** Because the time utility records the entire time taken by your
command, from when you press the enter key until everything ends -
including process startup and matrix set-up. The CPU+SYS time, however,
is just the time required to execute the gauss_elim routine - see the
code:

    ttot0 = cpu_time(&tcpu0, &tsys0);
    gauss_elim(amatrix, bvector, nsize, xvector, szvector, ordvector);
    ttot1 = cpu_time(&tcpu1, &tsys1);
-----------------------------------------------------------------------
Q8: Using the above data, plus perhaps additional timing runs, deduce
how the computational cost of linear varies with the input dimension n.
You must include experimental justification for your answer. (We are
looking for an expression of the form O(n^k) where k is an integer.)

** The time for n=80 is below the resolution of the timer. From 160 to
320 the time goes up by 0.16/0.02, close to a factor of 8. From 320 to
640, 1.33/0.16 ~ 8. From 640 to 1280, 10.48/1.33 ~ 8. If we assume the
cost scales as O(n^k), then doubling the problem size should increase
the time by a factor of 2^k. Since 2^3 = 8, it looks like O(n^3)
scaling. (The short program below checks this arithmetic.)
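As a quick check of the Q8 arithmetic, this small program takes
representative CPU+SYS times from the table above and computes the
implied exponent k = log2(t2/t1) for each doubling of n (compile with
-lm):

    /* scaling.c - estimate k in O(n^k) from the measured timings */
    #include <stdio.h>
    #include <math.h>

    int main(void)
    {
        double n[] = {  160,  320,  640,  1280 };  /* input dimensions */
        double t[] = { 0.02, 0.16, 1.33, 10.48 };  /* CPU+SYS seconds  */
        int i;

        for (i = 1; i < 4; i++) {
            double ratio = t[i] / t[i - 1];
            /* doubling n multiplies the time by 2^k, so k = log2(ratio) */
            printf("n: %4.0f -> %4.0f  ratio = %5.2f  k = %.2f\n",
                   n[i - 1], n[i], ratio, log2(ratio));
        }
        return 0;
    }

Each estimate of k comes out close to 3, consistent with O(n^3).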
-----------------------------------------------------------------------
Q9: Which routine is the dominant routine and what fraction of the
total CPU is it taking?

** Typical flat profile running on partch with a problem size of 800
(each sample counts as 0.01 seconds):

  %    cumulative   self              self    total
 time    seconds   seconds   calls   s/call   s/call  name
99.53      2.14      2.14        1     2.14     2.14  elim
 0.47      2.15      0.01        1     0.01     0.01  largest
 0.00      2.15      0.00      799     0.00     0.00  pivot
 0.00      2.15      0.00        2     0.00     0.00  cpu_time
 0.00      2.15      0.00        1     0.00     2.15  gauss_elim
 0.00      2.15      0.00        1     0.00     0.00  subst

From this it is evident that elim takes 99.53% of the total time.
-----------------------------------------------------------------------
Q10: How many times are the routines elim, pivot and largest called?

** From the section of the flat profile shown above we see the number
of calls:

    elim    - 1
    pivot   - 799
    largest - 1
-----------------------------------------------------------------------
Q11: Use the profile output to construct a calling graph for those
routines contained in linear.c, i.e. show how each routine is called
and approximately how much time is spent in a given routine when it is
called from the parent routine. You should verify your calling graph by
inspecting the code.

** main (0.04)
     gauss_elim (0.00)
       elim (4.45)
         pivot (0.05)
           largest (0.01)
       subst (0.00)
     cpu_time (0.00)
-----------------------------------------------------------------------
Q12: Which line of the code is executed the most times (cut out the
source code and report the number of times this line is executed)?

** For ./linear_gcov 800:

    170346800:  144:  a[ord[i]*n+j]=a[ord[i]*n+j]-factor*a[ord[k]*n+j];
-----------------------------------------------------------------------
Q13: Is the location of this line consistent with your gprof profiles?

** Yes - this is in routine elim, which gprof showed to be taking
nearly all of the time.
-----------------------------------------------------------------------
Q14: How does the number of times this line is executed change as you
vary the problem size? (NOTE, between gcov runs you should remove any
.gcda data files that relate to the previous run - else your results
will be a combination of all previous gcov runs.) Is this consistent
with your earlier conclusions concerning the overall scaling of the
code? Is it also consistent with the loop structure around this line?
(You may want to run gcov profiling for a variety of problem sizes.)

** Consider the following 3 problem sizes:

n=100     328350 -> a[ord[i]*n+j]=a[ord[i]*n+j]-factor*a[ord[k]*n+j];
n=200    2646700 -> a[ord[i]*n+j]=a[ord[i]*n+j]-factor*a[ord[k]*n+j];
n=400   21253400 -> a[ord[i]*n+j]=a[ord[i]*n+j]-factor*a[ord[k]*n+j];

We deduced earlier that doubling the problem size increases the time by
a factor of 8. Thus, multiplying the line counts by 8, we get:

     328350*8 =  2626800, close to the observed n=200 value of  2646700
    2646700*8 = 21173600, close to the observed n=400 value of 21253400

This is all consistent with the loop structure, which shows three
nested loops each running over something that relates to n:

    LOOP1  for (k=0; k<n-1; k++)
    LOOP2      for (i=k+1; i<n; i++)
    LOOP3          for (j=k+1; j<n; j++)
                       a[ord[i]*n+j]=a[ord[i]*n+j]-factor*a[ord[k]*n+j];

With these bounds the line executes (n-1-k)^2 times for each k, i.e.
n(n-1)(2n-1)/6 times in total, which reproduces the gcov counts exactly
(see the check below).
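The following check evaluates the predicted count n(n-1)(2n-1)/6 for
the problem sizes used above (the loop bounds were inferred from the
counts, so this is a consistency check rather than a proof):

    /* counts.c - predicted execution counts for line 144 */
    #include <stdio.h>

    int main(void)
    {
        long sizes[] = { 100, 200, 400, 800 };
        int i;

        for (i = 0; i < 4; i++) {
            long n = sizes[i];
            /* sum over k of (n-1-k)^2 = n(n-1)(2n-1)/6 */
            long count = n * (n - 1) * (2 * n - 1) / 6;
            printf("n = %4ld  predicted count = %10ld\n", n, count);
        }
        return 0;
    }

This prints 328350, 2646700 and 21253400 for n = 100, 200 and 400,
matching the gcov counts, and 170346800 for n = 800, matching the
count reported in Q12.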
-----------------------------------------------------------------------
** Collect Experiment > Data to collect [Tab] > Hardware Counter
Overflow Profiling [Select] > Add Counter > Counter Name > D$Misses.
However, if you run the experiment there will be an error. To get this
to work, you need to click "Preview Command" after you have selected
D$Misses. The GUI will generate the necessary "collect" command for
gathering performance data on the command line. You will then need to
run "collect ..." on the command line (deleting the part which says
"-M CT8.1" - this is very important). Once collect has run, you can
open the experiment that was generated and view the D$Misses.
-----------------------------------------------------------------------
Q19: What is the name of the machine you are considering?

** partch
-----------------------------------------------------------------------
Q20: What Operating system and what version of it is running? How did
you tell?

** Linux 2.6.32-24 (Ubuntu), found with uname:

u3965617@partch:~$ uname -a
Linux partch 2.6.32-24-server #43-Ubuntu SMP Thu Sep 16 16:05:42 UTC
2010 x86_64 GNU/Linux
-----------------------------------------------------------------------
Q21: What speed processor (MHZ) is on this machine and how did you
determine this information?

** 4 x Dual-Core AMD Opteron(tm) Processor 2220 entries at a nominal
2.8 GHz, using "cat /proc/cpuinfo". (Note that lscpu below reports the
current clock of 1000 MHz because of frequency scaling, not the nominal
speed.)
-----------------------------------------------------------------------
Q22: How much memory is on this machine and how did you determine this?

** Using "cat /proc/meminfo":

MemTotal: 8265096 kB (~8GB)
-----------------------------------------------------------------------
Q23: How many CPUs does it have and how did you determine this?

** Four CPUs (two dual-core sockets), using "cat /proc/cpuinfo" or
lscpu.
-----------------------------------------------------------------------
Q24: What is the size of the level 1 data cache and level 2 cache (also
known as ecache or external cache) and how did you gain this
information?

** L1 Instr Cache - 64K (64 bytes/line)
   L1 Data Cache  - 64K (64 bytes/line)
   L2 Cache       - 1MB (64 bytes/line)

Using "dmesg | grep CPU":

[    0.002107] CPU: L1 I Cache: 64K (64 bytes/line), D cache 64K (64 bytes/line)
[    0.002109] CPU: L2 Cache: 1024K (64 bytes/line)
[    0.002111] CPU: Physical Processor ID: 0
[    0.002112] CPU: Processor Core ID: 0
[    0.002115] mce: CPU supports 5 MCE banks
etc ...

OR by typing "lscpu":

Architecture:          x86_64
CPU op-mode(s):        64-bit
CPU(s):                4
Thread(s) per core:    1
Core(s) per socket:    2
CPU socket(s):         2
NUMA node(s):          2
Vendor ID:             AuthenticAMD
CPU family:            15
Model:                 65
Stepping:              3
CPU MHz:               1000.000
Virtualization:        AMD-V
L1d cache:             64K
L1i cache:             64K
L2 cache:              1024K
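The machine parameters queried in Q21-Q24 can also be read from a
program. Below is a minimal sketch using sysconf; the
_SC_LEVEL*_CACHE_* names are glibc extensions and may report 0 on
systems where the information is unavailable:

    /* machineinfo.c - query CPU count, memory and cache sizes */
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        long ncpu  = sysconf(_SC_NPROCESSORS_ONLN);
        long pages = sysconf(_SC_PHYS_PAGES);
        long psize = sysconf(_SC_PAGESIZE);

        printf("online CPUs : %ld\n", ncpu);
        printf("memory      : %ld MB\n", pages / 1024 * psize / 1024);
    #ifdef _SC_LEVEL1_DCACHE_SIZE
        /* glibc-specific cache queries */
        printf("L1 dcache   : %ldK (line %ld bytes)\n",
               sysconf(_SC_LEVEL1_DCACHE_SIZE) / 1024,
               sysconf(_SC_LEVEL1_DCACHE_LINESIZE));
        printf("L2 cache    : %ldK (line %ld bytes)\n",
               sysconf(_SC_LEVEL2_CACHE_SIZE) / 1024,
               sysconf(_SC_LEVEL2_CACHE_LINESIZE));
    #endif
        return 0;
    }

On partch this should agree with the /proc, dmesg and lscpu output
quoted above (4 CPUs, ~8GB, 64K L1d, 1024K L2).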