1.2

A) Graph of net speedup vs. percentage of vectorization:

   % vectorized   net speedup
        0             1.0
       10             1.1
       20             1.22
       30             1.37
       40             1.56
       50             1.82
       60             2.17
       70             2.70
       80             3.57
       90             5.26
      100            10.0

B) 55.5555...%

C) 11.1111...%

D) 88.8888...%

E) 1.0/((1-.7)+(.7/10)) = 2.7027027027   (current speedup from 70% vectorization at x10)
   1.0/((1-.7)+(.7/20)) = 2.98507462687  (speedup from 70% vectorization at x20)
   1.0/((1-.74)+(.74/10)) = 2.99401197605 (speedup from 74% vectorization at x10)

   Going to 74% vectorization beats the hardware improvement, so try the
   compiler approach first. (Then again, that's converting almost 1/5 of the
   currently non-vectorized code to vectorized, which isn't necessarily
   trivial. But it's probably still cheaper than tweaking the hardware.)

1.3

run(enhanced) = 1.0
run(unenhanced) = 0.5 + (10 * 0.5) = 5.5
fraction(enhanced) = 1 - (0.5/5.5) = 10/11
speedup = 1.0/((1.0/11) + ((10.0/11)/10)) = 5.5

A) Speedup = 5.5

B) 10/11

1.6

Potentially different instruction sets: CISC vs. RISC, FPU vs. emulation, etc.
The average work done per instruction, and hence the instruction count needed
to perform a task, could vary wildly.

1.7

Um, don't I need a clock speed to do this? Assuming 100 MHz for both processors:

A) 1.08 * 100M / 10 = 10.8 million instructions on the RISC.
   13.6 * 100M / 6 = 226.7 million instructions on the embedded processor.

B) 10 MIPS and 16.67 MIPS respectively.

C) (226.7 - 10.8) * 1000000 / 195578 = 1104 instructions.

1.11

There's a reason I'm not a math major, but if you really want me to
gratuitously whip up a proof of an abstract property of an equation...

(a+b)/2 > sqrt(a*b)        (square both sides)
(a+b)(a+b)/4 > a*b         (times 4, and expand the square)
a^2 + 2ab + b^2 > 4ab      (subtract 2ab)
a^2 + b^2 > 2ab            (divide by ab)
a/b + b/a > 2

When a and b differ, one of those two ratios is bigger than one and the other
is smaller than one, but the bigger one is above 1 by more than the smaller
one is below it: if a/b = 1+x for some x > 0, then b/a = 1/(1+x), which is
greater than 1-x because (1+x)(1-x) = 1-x^2 < 1. So adding the two ratios
together gives something bigger than 2.

The geometric and arithmetic means are equal when all the numbers being
averaged are the same. 2, 2, and 2, for example.

1.17

Using M for "million"...

A) 120M = (I + (F*Y))/W
   80M = (I + F)/B

B) 120M = (I + (8M*50))/4
   480M - 400M = I
   I = 80M

C) 80M = (80M + 8M)/B
   B = 88M/80M
   B = 1.1

D) 80M/120M = 2/3       (time spent on integer instructions)
   1.1 - (2/3) = 13/30  (time spent on floating point instructions)
   8M/(13/30) = 18.46M  (floating point instructions done in that time, normalized to 1 second)
   18.46 MFLOPS.

E) Yes. It does the work in 1.1 seconds instead of 4.

1.19

#include <stdio.h>

int main(int argc, char *argv[])
{
    int i;

    /* 100 million empty loop iterations */
    for (i = 0; i < 100000000; i++);
}

[landley@localhost arch]$ time ./a.out

real    0m1.246s
user    0m1.139s
sys     0m0.004s

100M/1.246 = 80.26M

[landley@grelber landley]$ time ./a.out

real    0m0.443s
user    0m0.440s
sys     0m0.000s

100M/0.443 = 225.73M

According to the SPEC install guide for Unix, I need to get an install CD from
somewhere. (It's not downloadable anywhere that I can find...) But I'd guess
that any real-world test wouldn't stay in L1 cache so nicely, and wouldn't
give the branch predictor such an easy time either.

1.20

#include <stdio.h>

int main(int argc, char *argv[])
{
    int i;
    float f = 0.0;

    /* 100 million floating point additions */
    for (i = 0; i < 100000000; i++) f += 1.0;
}

[landley@localhost arch]$ time ./a.out

real    0m1.376s
user    0m1.259s
sys     0m0.006s

Should I normalize out the integer padding? Page 72 doesn't say...
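Here's the arithmetic done both ways as a quick C sanity check. The 1.246
second empty-loop time from 1.19 is what I'm treating as the integer padding;
whether that's what page 72 intends is a guess:

#include <stdio.h>

int main(int argc, char *argv[])
{
    double flops = 100000000.0;  /* 100 million floating point adds */
    double run = 1.376;          /* measured run time with the float add */
    double pad = 1.246;          /* empty-loop time from 1.19 */

    /* Raw rating: charge the whole run to floating point. */
    printf("raw: %.2f MFLOPS\n", flops / run / 1e6);

    /* Normalized rating: subtract the integer loop overhead first. */
    printf("normalized: %.2f MFLOPS\n", flops / (run - pad) / 1e6);
    return 0;
}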
Either: 100 MFLOP / 1.376 seconds = 72.67 MFLOPS
or: 100 MFLOP / (1.376 - 1.246) seconds = 769.23 MFLOPS

[landley@grelber landley]$ time ./a.out

real    0m1.002s
user    0m0.980s
sys     0m0.000s

100 MFLOP / 1.002 seconds = 99.80 MFLOPS
or 100 MFLOP / (1.002 - 0.443) seconds = 178.89 MFLOPS

And again, my laptop doesn't have this benchmark, and the install CD isn't
downloadable. I need to go get a login for a university Unix account...

Problem 9 has a similar problem: it compiles fine on my machine right up until
it tries to run "condor_compile", which isn't part of Red Hat 9...
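Since the same Amdahl's law formula keeps showing up in 1.2 and 1.3 above,
here's the little C snippet I'd use to double-check those numbers (just a
sketch; the fractions and speedup factors are the ones from those answers):

#include <stdio.h>

/* Amdahl's law: overall speedup when a fraction f of the original run time
 * is sped up by a factor s. */
static double speedup(double f, double s)
{
    return 1.0 / ((1.0 - f) + f / s);
}

int main(int argc, char *argv[])
{
    printf("1.2: 70%% vectorized at x10: %f\n", speedup(0.70, 10.0));
    printf("1.2: 70%% vectorized at x20: %f\n", speedup(0.70, 20.0));
    printf("1.2: 74%% vectorized at x10: %f\n", speedup(0.74, 10.0));
    printf("1.3: 10/11 enhanced at x10:  %f\n", speedup(10.0 / 11.0, 10.0));
    return 0;
}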