Apple A9
Apple A9 (Twister), 1850 MHz, RAM: 2 GB. iPhone SE.
- L1 Data cache = 64 KB, 64 B/line, 4-WAY.
- L1 Instruction cache = ? KB, 64 B/line, ?-WAY.
- L2 Cache = 3 MB (per 2 cores), 64 B/line, 12-WAY.
- L3 Cache = 4 MB (per 2 cores), 64 B/line, ?-WAY. Exclusive (Victim) with L2 Cache.
- L1 Data Cache Latency = 3 cycles for simple access via pointer
- L1 Data Cache Latency (arm64) = 4 cycles for access with complex address calculation (size_t n, *p; n = p[n]).
- L1 Data Cache Latency (arm32) = 6 cycles for access with complex address calculation (size_t n, *p; n = p[n]).
- L2 Cache Latency = 16 cycles
- L2 Cache Latency = 16.5 cycles (random)
- L3 Cache Latency = 45 cycles + 24 ns = 89 cycles
- RAM Latency = 45 cycles + 120 ns (sequential)
- RAM Latency = 45 cycles + 120 ns (random, more than 1 chain)
- RAM Latency = 45 cycles + 160 ns (random, 1 chain)
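The access pattern behind these latency figures, n = p[n] over an array, can be sketched in C as follows. This is only an illustration and not the benchmark that produced the numbers above: the names build_single_chain and chase, the xorshift generator, and the use of Sattolo's algorithm to get one random chain are all assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* xorshift64 step; any reasonable generator would do for this sketch. */
static uint64_t next_random(uint64_t *state) {
    uint64_t x = *state;
    x ^= x << 13;
    x ^= x >> 7;
    x ^= x << 17;
    *state = x;
    return x;
}

/* Fill p[0..n-1] with a random permutation that forms a single cycle
   (Sattolo's algorithm), so that n = p[n] visits every slot once. */
void build_single_chain(size_t *p, size_t n, uint64_t seed) {
    assert(p != NULL || n == 0);
    uint64_t state = seed ? seed : 1;      /* xorshift state must be non-zero */
    for (size_t i = 0; i < n; i++)
        p[i] = i;
    for (size_t i = n; i > 1; i--) {
        size_t k = i - 1;                               /* slot being finalized */
        size_t j = (size_t)(next_random(&state) % k);   /* j in [0, k-1] */
        size_t tmp = p[k];
        p[k] = p[j];
        p[j] = tmp;
    }
}

/* Follow the chain for a number of steps and return the final index. */
size_t chase(const size_t *p, size_t start, size_t steps) {
    size_t n = start;
    while (steps-- > 0)
        n = p[n];          /* each load depends on the previous one */
    return n;
}
```

Timing chase() over a buffer of a given size and dividing by the number of steps gives a per-access latency for that size. Walking several independent chains at once lets misses overlap, which is presumably why the "more than 1 chain" RAM figure is lower than the single-chain one.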
Notes:
arm32: The instruction “ldr r1, [r2, r3, lsl #2]” is slow (6 cycles). The Apple A9 is probably not optimized for arm32 code.
arm64: CLANG can produce the slow instruction “ldr w1, [x2, w3, uxtw #2]” (6 cycles on the A9) for array accesses such as v = array[uint32_index]. See the section “CLANG uxtw->lsl hack” for details.
16 KB pages mode (ARM64)
- Data TLB L1: 128 items. 4-way? Miss penalty = 7 cycles. Parallel miss: 3 cycles per access
- Data TLB L2: 1024 items. ?-way. Miss penalty = 29 cycles (page walk to L2 cache?). Parallel miss: 12 cycles per access (including L2 cache access for data and page walk).
Size     Latency        Increase       Description
64 K     3
128 K    10             7              + 13 (L2)
256 K    13             3
512 K    15             2
1 M      16             1
2 M      16             0
4 M      29 + 7 ns      13 + 7 ns      + 29 + 24 ns (L3) + 7 (L1 TLB miss)
8 M      41 + 35 ns     12 + 28 ns
16 M     47 + 91 ns     6 + 56 ns      + 136 ns (RAM)
32 M     64 + 125 ns    17 + 34 ns     + 29 (L2 TLB miss)
64 M     73 + 142 ns    9 + 17 ns
128 M    77 + 150 ns    4 + 8 ns
256 M    79 + 155 ns    2 + 5 ns
512 M    80 + 158 ns    1 + 3 ns
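Several latencies here are given as a mix of core cycles and nanoseconds. The L3 line above equates 45 cycles + 24 ns with 89 cycles, which matches the 1850 MHz clock (24 ns is about 44.4 cycles). A minimal sketch of that conversion, with hypothetical helper names total_cycles and total_ns:

```c
#include <assert.h>

/* Convert a "cycles + ns" latency into core cycles at a given clock.
   freq_mhz / 1000 is the number of cycles per nanosecond. */
double total_cycles(double cycles, double ns, double freq_mhz) {
    assert(freq_mhz > 0.0);
    return cycles + ns * freq_mhz / 1000.0;
}

/* Convert the same mixed latency into nanoseconds. */
double total_ns(double cycles, double ns, double freq_mhz) {
    assert(freq_mhz > 0.0);
    return cycles * 1000.0 / freq_mhz + ns;
}
```

For example, total_cycles(45, 24, 1850) gives roughly 89.4 cycles for the L3 latency, and total_ns(45, 160, 1850) expresses the single-chain RAM latency purely in nanoseconds.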
MISC
- Branch misprediction penalty = 14 cycles (arm64)
- Branch misprediction penalty = 16 cycles (arm32)
- Branch history table = 16K items or more (for code with 8 branches).
- arm32 : 32-bit loads: penalty for crossing a 4-byte boundary = 1 cycle
- arm64 : 64-bit loads: penalty for crossing a 64-byte boundary = 3.75 cycles
- L1 B/W (Parallel Random Read) = 0.54 cycles per access
- L1 Write throughput = 0.5 cycles per write
- L2->L1 B/W (Parallel Random Read) = 3.0 cycles per cache line
- L2->L1 B/W (Read, 64 bytes step) = 2.9 cycles per cache line
- L2->L1 B/W (Read, 64 Bytes step – pointer chasing) = 4.3 cycles per cache line (HW prefetch)
- L2 Write (Write, 64 bytes step) = 2.7 cycles per write
- L3->L1 B/W (Parallel Random Read) = 11 cycles per cache line
- L3->L1 B/W (Read, 64 bytes step) = 7.2 cycles per cache line
- L3->L1 B/W (Read, 64 Bytes step – pointer chasing) = 6.9 cycles per cache line (HW prefetch)
- L3 Write (Write, 64 bytes step) = 7 cycles per write
- RAM Read B/W (Parallel Random Read) = 11 ns / access line
- RAM Read B/W (Read, 8-64 Bytes step) = 14-16 GB/s
- RAM Read B/W (Read, 64 Bytes step – pointer chasing) = 12 GB/s (HW prefetch)
- RAM Write B/W (Write, 8-64 Bytes step) = 7500 – 7900 MB/s
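The cache figures above are in cycles per cache line while the RAM figures are in GB/s, so comparing them needs a unit conversion. The sketch below assumes the 64-byte line size and 1850 MHz clock stated earlier, and takes 1 GB as 10^9 bytes. The helper names are hypothetical.

```c
#include <assert.h>

#define CACHE_LINE_BYTES 64.0   /* from the cache descriptions: 64 B/line */

/* Bandwidth in GB/s for a transfer cost given in cycles per cache line. */
double gb_per_s_from_cycles(double cycles_per_line, double freq_mhz) {
    assert(cycles_per_line > 0.0);
    double bytes_per_cycle = CACHE_LINE_BYTES / cycles_per_line;
    return bytes_per_cycle * freq_mhz * 1e6 / 1e9;   /* GB = 10^9 bytes */
}

/* Bandwidth in GB/s for a cost given in ns per cache line.
   Bytes per nanosecond is numerically the same as GB/s. */
double gb_per_s_from_ns(double ns_per_line) {
    assert(ns_per_line > 0.0);
    return CACHE_LINE_BYTES / ns_per_line;
}
```

With these helpers, the L2->L1 parallel random read figure of 3.0 cycles per line corresponds to gb_per_s_from_cycles(3.0, 1850), and the RAM parallel random read figure of 11 ns per line to gb_per_s_from_ns(11.0), assuming each access fetches one full line.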
7-Zip Benchmark
Notes:
7z b : MIPS values are normalized to an Intel Core 2 CPU.
7z b -mm=* : MIPS and Effectiveness values are normalized to an AMD K8 CPU.
## iOS 10.2
## vanilla 16.04 + {CpuArch.h,7zCrcOpt.c,XzCrc64.c,XzCrc64Opt.c,Sha1.c,Sha256.c,Aes.c} from 17.00
## + __builtin_bswap{16,32,64} + CrcUpdateT8
# clang-4 -arch arm64 -mcpu=cyclone -O3

7z b -mmt1

7-Zip (a) [64] 16.04 : Copyright (c) 1999-2016 Igor Pavlov : 2016-10-04
p7zip Version 16.04-hash17+crct8 (locale=C,Utf16=off,HugeFiles=on,64 bits,2 CPUs LE)

LE
CPU Freq: 504 909 1180 1350 1613 1805 1824 1834 1838

RAM size: 2009 MB, # CPU hardware threads: 2
RAM usage: 435 MB, # Benchmark threads: 1

                     Compressing    |              Decompressing
Dict     Speed Usage    R/U Rating  |       Speed Usage    R/U Rating
         KiB/s     %   MIPS   MIPS  |       KiB/s     %   MIPS   MIPS

22:       2558    99   2520   2489  |       29396    99   2530   2510
23:       2365   100   2423   2411  |       28876    99   2519   2500
24:       2246    99   2429   2415  |       28215   100   2489   2477
25:       2168    99   2489   2476  |       27281    99   2445   2428
----------------------------------  |  ------------------------------
Avr:              99   2465   2448  |                99   2496   2479
Tot:              99   2480   2463

7z b -mmt2

7-Zip (a) [64] 16.04 : Copyright (c) 1999-2016 Igor Pavlov : 2016-10-04
p7zip Version 16.04-hash17+crct8 (locale=C,Utf16=off,HugeFiles=on,64 bits,2 CPUs LE)

LE
CPU Freq: 499 898 1188 1389 1641 1833 1845 1835 1842

RAM size: 2009 MB, # CPU hardware threads: 2
RAM usage: 441 MB, # Benchmark threads: 2

                     Compressing    |              Decompressing
Dict     Speed Usage    R/U Rating  |       Speed Usage    R/U Rating
         KiB/s     %   MIPS   MIPS  |       KiB/s     %   MIPS   MIPS

22:       5505   172   3116   5356  |       57155   197   2471   4880
23:       5087   164   3160   5183  |       55978   198   2450   4846
24:       5123   169   3254   5508  |       54363   198   2415   4773
25:       5043   169   3405   5759  |       52711   198   2373   4692
----------------------------------  |  ------------------------------
Avr:             169   3234   5452  |               198   2427   4797
Tot:             183   2830   5125

7z b -mmt4

7-Zip (a) [64] 16.04 : Copyright (c) 1999-2016 Igor Pavlov : 2016-10-04
p7zip Version 16.04-hash17+crct8 (locale=C,Utf16=off,HugeFiles=on,64 bits,2 CPUs LE)

LE
CPU Freq: 513 916 1196 1407 1650 1830 1843 1844 1843

RAM size: 2009 MB, # CPU hardware threads: 2
RAM usage: 882 MB, # Benchmark threads: 4

                     Compressing    |              Decompressing
Dict     Speed Usage    R/U Rating  |       Speed Usage    R/U Rating
         KiB/s     %   MIPS   MIPS  |       KiB/s     %   MIPS   MIPS

22:       6141   195   3060   5974  |       55483   197   2405   4734
23:       5930   197   3064   6042  |       54274   197   2380   4696
24:       5723   196   3136   6154  |       52653   197   2347   4622
25:       5590   197   3248   6383  |       51315   197   2314   4567
----------------------------------  |  ------------------------------
Avr:             196   3127   6138  |               197   2361   4655
Tot:             197   2744   5397

7z b -mm=* -mmt1

7-Zip (a) [64] 16.04 : Copyright (c) 1999-2016 Igor Pavlov : 2016-10-04
p7zip Version 16.04-hash17+crct8 (locale=C,Utf16=off,HugeFiles=on,64 bits,2 CPUs LE)

LE
CPU Freq: 492 889 1184 1388 1641 1832 1844 1835 1843

RAM size: 2009 MB, # CPU hardware threads: 2
RAM usage: 225 MB, # Benchmark threads: 1

Method           Speed Usage    R/U Rating   E/U Effec
                 KiB/s     %   MIPS   MIPS     %     %

CPU                      100   1845   1843
CPU                      100   1840   1844
CPU                      100   1845   1843   100   100

LZMA:x1          13026    99   4794   4762   260   258
                 29111   100   2378   2370   129   129
LZMA:x5:mt1       2250   100   2822   2812   153   153
                 28349   100   2401   2391   130   130
LZMA:x5:mt2       5152   169   3811   6436   207   349
                 28425    99   2411   2398   131   130
Deflate:x1       30030   100   3820   3813   207   207
                 90040   100   2803   2798   152   152
Deflate:x5        9445   100   3655   3637   198   197
                 89790   100   2794   2787   152   151
Deflate:x7        3352   100   3722   3715   202   202
                 90721    99   2833   2815   154   153
Deflate64:x5      8757   100   3797   3784   206   205
                 89858   100   2820   2811   153   153
BZip2:x1          5101   100   3094   3082   168   167
                 24657   100   2682   2673   146   145
BZip2:x5          4354   100   3643   3634   198   197
                 20641   100   4065   4051   221   220
BZip2:x5:mt2      7782   192   3383   6495   184   352
                 32565   173   3692   6391   200   347
BZip2:x7          1314   100   3417   3405   185   185
                 20490   100   4029   4018   219   218
PPMD:x1           3942   100   4092   4077   222   221
                  3259   100   3845   3839   209   208
PPMD:x5           2575    99   4388   4364   238   237
                  2175   100   4089   4076   222   221
Delta:4         530765   100   3273   3261   178   177
                347619   100   2142   2136   116   116
BCJ            1214521   100   4999   4975   271   270
               1218895   100   4982   4993   270   271
AES256CBC:1     113274   100   2792   2784   152   151
                116539   100   2868   2864   156   155
AES256CBC:2
CRC32:1         177702   100   1298   1294    70    70
CRC32:4         602622   100   1348   1345    73    73
CRC32:8         902418   100   1227   1224    67    66
CRC64           531344   100   1091   1088    59    59
SHA256          166773   100   3410   3402   185   185
SHA1            376511   100   3532   3524   192   191
BLAKE2sp        243593   100   5384   5359   292   291

CPU                      100   1833   1829
------------------------------------------------------
Tot:                     110   3129   3491   173   189

7z b -mm=* -mmt2

7-Zip (a) [64] 16.04 : Copyright (c) 1999-2016 Igor Pavlov : 2016-10-04
p7zip Version 16.04-hash17+crct8 (locale=C,Utf16=off,HugeFiles=on,64 bits,2 CPUs LE)

LE
CPU Freq: 503 901 1194 1390 1646 1833 1825 1845 1844

RAM size: 2009 MB, # CPU hardware threads: 2
RAM usage: 450 MB, # Benchmark threads: 2

Method           Speed Usage    R/U Rating   E/U Effec
                 KiB/s     %   MIPS   MIPS     %     %

CPU                      198   1791   3552
CPU                      198   1793   3557
CPU                      198   1790   3539   101   200

LZMA:x1          25117   197   4653   9182   263   519
                 56757   198   2335   4622   132   261
LZMA:x5:mt1       4710   197   2982   5885   169   333
                 54343   198   2318   4583   131   259
LZMA:x5:mt2       5700   196   3626   7121   205   402
                 54251   198   2316   4575   131   259
Deflate:x1       58068   198   3730   7373   211   417
                175370   198   2757   5449   156   308
Deflate:x5       18566   198   3613   7149   204   404
                175615   198   2760   5452   156   308
Deflate:x7        6505   198   3646   7208   206   407
                176203   197   2774   5468   157   309
Deflate64:x5     17054   197   3732   7370   211   416
                175343   198   2774   5485   157   310
BZip2:x1          9878   198   3020   5968   171   337
                 48153   198   2636   5220   149   295
BZip2:x5          7959   198   3361   6643   190   375
                 32866   197   3267   6451   185   365
BZip2:x5:mt2      7599   197   3220   6342   182   358
                 28021   197   2794   5500   158   311
BZip2:x7          2396   194   3205   6209   181   351
                 32661   197   3258   6405   184   362
PPMD:x1           7656   198   4008   7919   226   447
                  6326   198   3771   7450   213   421
PPMD:x5           4779   197   4109   8099   232   458
                  3992   197   3800   7482   215   423
Delta:4        1032456   198   3200   6343   181   358
                583218   196   1829   3583   103   202
BCJ            2360323   197   4900   9668   277   546
               2379895   198   4930   9748   279   551
AES256CBC:1     218206   198   2707   5363   153   303
                227263   198   2817   5585   159   316
AES256CBC:2
CRC32:1         346198   198   1273   2520    72   142
CRC32:4        1175244   198   1323   2623    75   148
CRC32:8        1759603   198   1203   2386    68   135
CRC64          1033592   198   1070   2117    60   120
SHA256          326090   198   3354   6652   190   376
SHA1            727875   198   3448   6813   195   385
BLAKE2sp        474195   198   5261  10432   297   589

CPU                      198   1793   3552
------------------------------------------------------
Tot:                     197   3013   5945   170   336
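In these tables the R/U column appears to be the Rating scaled to 100% CPU usage. For the -mmt2 run at dictionary 22, a Rating of 5356 MIPS at 172% usage is reported with R/U 3116, and 5356 * 100 / 172 is about 3114. The small difference is consistent with the Usage column being rounded. The sketch below only illustrates that relation and the thread scaling between runs. The function names are hypothetical and not part of 7-Zip.

```c
#include <assert.h>

/* Rating normalized to 100% CPU usage (the apparent meaning of R/U). */
double ru_mips(double rating_mips, double usage_percent) {
    assert(usage_percent > 0.0);
    return rating_mips * 100.0 / usage_percent;
}

/* Ratio of a multi-thread rating to a single-thread rating. */
double thread_scaling(double rating_single, double rating_multi) {
    assert(rating_single > 0.0);
    return rating_multi / rating_single;
}
```

For instance, thread_scaling(2463, 5125) compares the total ratings of the -mmt1 and -mmt2 runs above.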
CLANG uxtw->lsl hack
If C code uses a 32-bit unsigned integer variable as an array index:
v = array[uint32_index];
CLANG 3.7 / 4.0 can produce an instruction like this:
ldr w1, [x2, w3, uxtw #2] // 6 cycles on Apple A9
But that instruction can be replaced with a similar one:
ldr w1, [x2, x3, lsl #2] // 4 cycles on Apple A9
These instructions are not 100% equivalent, but all of 7-Zip’s benchmark tests still work correctly after the hack.
Some 7-Zip benchmarks run 10-20% faster on Apple A9 after the hack, and the average gain is 2-3%.
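The difference between the two addressing forms can be modelled in plain C. The uxtw form uses only the low 32 bits of the index register, zero-extended, while the lsl form uses all 64 bits of x3. The functions below are an illustrative model of the address calculation, not compiler or CPU code, and their names are made up for this sketch.

```c
#include <assert.h>
#include <stdint.h>

/* Address used by: ldr w1, [x2, w3, uxtw #shift]
   Only the low 32 bits of the index register take part. */
uint64_t addr_uxtw(uint64_t base, uint64_t index_reg, unsigned shift) {
    assert(shift < 64);
    return base + ((uint64_t)(uint32_t)index_reg << shift);
}

/* Address used by: ldr w1, [x2, x3, lsl #shift]
   The full 64-bit index register takes part. */
uint64_t addr_lsl(uint64_t base, uint64_t index_reg, unsigned shift) {
    assert(shift < 64);
    return base + (index_reg << shift);
}
```

The two agree whenever the upper 32 bits of x3 are zero, which is the case after any instruction that writes w3, and they differ otherwise. For example, with index register value 0x100000005 the uxtw form still indexes element 5 while the lsl form does not. That is the sense in which the replacement is not fully equivalent.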
# clang-4 -arch arm64 -mcpu=cyclone -O3
# sed -i -e '/\(st\|ld\)r.*[xw].*x.*w.* uxtw #/ {s/w\([^,]*\), uxtw/x\1, lsl/}'
# -e '/\(st\|ld\)rb.*w[^,]*, uxtw\]/ {s/w\([^,]*\), uxtw/x\1/}' *.s

7z b -mmt1

7-Zip (a) [64] 16.04 : Copyright (c) 1999-2016 Igor Pavlov : 2016-10-04
p7zip Version 16.04-hash17+crct8-lsl-v3 (locale=C,Utf16=off,HugeFiles=on,64 bits,2 CPUs LE)

LE
CPU Freq: 495 884 1180 1380 1641 1831 1822 1843 1844

RAM size: 2009 MB, # CPU hardware threads: 2
RAM usage: 435 MB, # Benchmark threads: 1

                     Compressing    |              Decompressing
Dict     Speed Usage    R/U Rating  |       Speed Usage    R/U Rating
         KiB/s     %   MIPS   MIPS  |       KiB/s     %   MIPS   MIPS

22:       2594    99   2543   2524  |       30455    99   2614   2600
23:       2380    99   2438   2426  |       30022   100   2612   2599
24:       2264   100   2446   2435  |       29202   100   2575   2564
25:       2188   100   2511   2499  |       28167   100   2519   2507
----------------------------------  |  ------------------------------
Avr:              99   2485   2471  |               100   2580   2567
Tot:              99   2532   2519

7z b -mm=* -mmt1

7-Zip (a) [64] 16.04 : Copyright (c) 1999-2016 Igor Pavlov : 2016-10-04
p7zip Version 16.04-hash17+crct8-lsl-v3 (locale=C,Utf16=off,HugeFiles=on,64 bits,2 CPUs LE)

LE
CPU Freq: 505 904 1197 1394 1646 1833 1843 1837 1843

RAM size: 2009 MB, # CPU hardware threads: 2
RAM usage: 225 MB, # Benchmark threads: 1

Method           Speed Usage    R/U Rating   E/U Effec
                 KiB/s     %   MIPS   MIPS     %     %

CPU                      100   1841   1839
CPU                      100   1845   1843
CPU                      100   1843   1843   100   100

LZMA:x1          13202    99   4868   4826   264   262
                 29969   100   2445   2440   133   132
LZMA:x5:mt1       2270   100   2843   2836   154   154
                 29049   100   2461   2450   134   133
LZMA:x5:mt2       5236   170   3839   6542   208   355
                 29242   100   2473   2466   134   134
Deflate:x1       30416   100   3863   3862   210   210
                 93075   100   2903   2892   158   157
Deflate:x5        9489   100   3668   3654   199   198
                 93232   100   2906   2894   158   157
Deflate:x7        3382   100   3764   3748   204   203
                 93939   100   2920   2915   158   158
Deflate64:x5      8769   100   3803   3790   206   206
                 92910   100   2908   2906   158   158
BZip2:x1          5265   100   3193   3181   173   173
                 25094    99   2736   2720   148   148
BZip2:x5          4513   100   3773   3766   205   204
                 20853   100   4110   4093   223   222
BZip2:x5:mt2      7996   192   3482   6674   189   362
                 32607   170   3766   6400   204   347
BZip2:x7          1347   100   3503   3491   190   189
                 21122   100   4143   4142   225   225
PPMD:x1           3959   100   4108   4095   223   222
                  3279   100   3872   3861   210   210
PPMD:x5           2578   100   4388   4370   238   237
                  2197   100   4132   4117   224   223
Delta:4         530891   100   3275   3262   178   177
                356465   100   2193   2190   119   119
BCJ            1204041    99   4970   4932   270   268
               1221370   100   5009   5003   272   271
AES256CBC:1     128023   100   3146   3146   171   171
                131847   100   3251   3240   176   176
AES256CBC:2
CRC32:1         224780   100   1637   1636    89    89
CRC32:4         698401   100   1563   1559    85    85
CRC32:8         994911   100   1353   1349    73    73
CRC64           631082   100   1295   1292    70    70
SHA256          167135   100   3426   3410   186   185
SHA1            374018   100   3512   3501   191   190
BLAKE2sp        242960   100   5326   5345   289   290

CPU                      100   1846   1843
------------------------------------------------------
Tot:                     110   3196   3567   176   194
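The gain from the hack can be read off by comparing ratings between the original -mm=* -mmt1 run and this one. A small helper for that comparison, with a hypothetical name, might look like this:

```c
#include <assert.h>

/* Relative change from one rating to another, in percent. */
double gain_percent(double before, double after) {
    assert(before > 0.0);
    return (after - before) / before * 100.0;
}
```

Using the figures from the two runs, gain_percent(3491, 3567) covers the total rating and lands in the 2-3% range quoted above, while gain_percent(2784, 3146) covers the AES256CBC:1 encoding rating, one of the tests that benefits more.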
Apple A9 at Wikipedia