Apple A9

Apple A9 (Twister), 1850 MHz, RAM: 2 GB. iPhone SE.

  • L1 Data cache = 64 KB, 64 B/line, 4-WAY.
  • L1 Instruction cache = ? KB, 64 B/line, ?-WAY.
  • L2 Cache = 3 MB (per 2 cores), 64 B/line, 12-WAY.
  • L3 Cache = 4 MB (per 2 cores), 64 B/line, ?-WAY. Exclusive (Victim) with L2 Cache.
  • L1 Data Cache Latency = 3 cycles for simple access via pointer
  • L1 Data Cache Latency (arm64) = 4 cycles for access with complex address calculation (size_t n, *p; n = p[n]).
  • L1 Data Cache Latency (arm32) = 6 cycles for access with complex address calculation (size_t n, *p; n = p[n]).
  • L2 Cache Latency = 16 cycles
  • L2 Cache Latency = 16.5 cycles (random)
  • L3 Cache Latency = 45 cycles + 24 ns = 89 cycles
  • RAM Latency = 45 cycles + 120 ns (sequential)
  • RAM Latency = 45 cycles + 120 ns (random, more than 1 chain)
  • RAM Latency = 45 cycles + 160 ns (random, 1 chain)

Notes:

arm32: The instruction “ldr r1, [r2, r3, lsl #2]” is slow (6 cycles). The Apple A9 is probably not optimized for arm32 code.

arm64: CLANG can produce the slow instruction “ldr w1, [x2, w3, uxtw #2]” (6 cycles on A9)
for array accesses of the form v = array[uint32_index]. See the “CLANG uxtw->lsl hack” section below.

16 KB pages mode (ARM64)

  • Data TLB L1: 128 items. 4-way? Miss penalty = 7 cycles. Parallel miss: 3 cycles per access
  • Data TLB L2: 1024 items. ?-way. Miss penalty = 29 cycles (page walk to L2 cache?).
    Parallel miss: 12 cycles per access (including L2 cache access for data and page walk).

  Size    Latency          Increase     Description

  64 K     3
 128 K    10                7           + 13 (L2)
 256 K    13                3
 512 K    15                2
   1 M    16                1
   2 M    16                0
   4 M    29 +  7 ns       13 +  7 ns   + 29 + 24 ns (L3)  + 7 (L1 TLB miss)
   8 M    41 + 35 ns       12 + 28 ns
  16 M    47 + 91 ns        6 + 56 ns   + 136 ns (RAM)
  32 M    64 +125 ns       17 + 34 ns   + 29 (L2 TLB miss)
  64 M    73 +142 ns        9 + 17 ns
 128 M    77 +150 ns        4 +  8 ns
 256 M    79 +155 ns        2 +  5 ns
 512 M    80 +158 ns        1 +  3 ns

MISC

  • Branch misprediction penalty = 14 cycles (arm64)
  • Branch misprediction penalty = 16 cycles (arm32)
  • Branch history table = 16K items or more (for code with 8 branches).
  • arm32 : 32-bit loads: penalty for crossing a 4-byte boundary = 1 cycle
  • arm64 : 64-bit loads: penalty for crossing a 64-byte boundary = 3.75 cycles
  • L1 B/W (Parallel Random Read) = 0.54 cycles per access
  • L1 Write throughput = 0.5 cycles per write
  • L2->L1 B/W (Parallel Random Read) = 3.0 cycles per cache line
  • L2->L1 B/W (Read, 64 bytes step) = 2.9 cycles per cache line
  • L2->L1 B/W (Read, 64 Bytes step – pointer chasing) = 4.3 cycles per cache line (HW prefetch)
  • L2 Write (Write, 64 bytes step) = 2.7 cycles per write
  • L3->L1 B/W (Parallel Random Read) = 11 cycles per cache line
  • L3->L1 B/W (Read, 64 bytes step) = 7.2 cycles per cache line
  • L3->L1 B/W (Read, 64 Bytes step – pointer chasing) = 6.9 cycles per cache line (HW prefetch)
  • L3 Write (Write, 64 bytes step) = 7 cycles per write
  • RAM Read B/W (Parallel Random Read) = 11 ns / access line
  • RAM Read B/W (Read, 8-64 Bytes step) = 14-16 GB/s
  • RAM Read B/W (Read, 64 Bytes step – pointer chasing) = 12 GB/s (HW prefetch)
  • RAM Write B/W (Write, 8-64 Bytes step) = 7.5-7.9 GB/s

7-Zip Benchmark

Notes:

7z b : MIPS values are normalized to an Intel Core 2 CPU.

7z b -mm=* : MIPS and Effectiveness values are normalized to an AMD K8 CPU.


## iOS 10.2

## vanilla 16.04 + {CpuArch.h,7zCrcOpt.c,XzCrc64.c,XzCrc64Opt.c,Sha1.c,Sha256.c,Aes.c} from 17.00 
## + __builtin_bswap{16,32,64} + CrcUpdateT8

# clang-4 -arch arm64 -mcpu=cyclone -O3


7z b -mmt1

7-Zip (a) [64] 16.04 : Copyright (c) 1999-2016 Igor Pavlov : 2016-10-04
p7zip Version 16.04-hash17+crct8 (locale=C,Utf16=off,HugeFiles=on,64 bits,2 CPUs LE)

LE
CPU Freq:   504   909  1180  1350  1613  1805  1824  1834  1838

RAM size:    2009 MB,  # CPU hardware threads:   2
RAM usage:    435 MB,  # Benchmark threads:      1

                       Compressing  |                  Decompressing
Dict     Speed Usage    R/U Rating  |      Speed Usage    R/U Rating
         KiB/s     %   MIPS   MIPS  |      KiB/s     %   MIPS   MIPS

22:       2558    99   2520   2489  |      29396    99   2530   2510
23:       2365   100   2423   2411  |      28876    99   2519   2500
24:       2246    99   2429   2415  |      28215   100   2489   2477
25:       2168    99   2489   2476  |      27281    99   2445   2428
----------------------------------  | ------------------------------
Avr:              99   2465   2448  |               99   2496   2479
Tot:              99   2480   2463


7z b -mmt2

7-Zip (a) [64] 16.04 : Copyright (c) 1999-2016 Igor Pavlov : 2016-10-04
p7zip Version 16.04-hash17+crct8 (locale=C,Utf16=off,HugeFiles=on,64 bits,2 CPUs LE)

LE
CPU Freq:   499   898  1188  1389  1641  1833  1845  1835  1842

RAM size:    2009 MB,  # CPU hardware threads:   2
RAM usage:    441 MB,  # Benchmark threads:      2

                       Compressing  |                  Decompressing
Dict     Speed Usage    R/U Rating  |      Speed Usage    R/U Rating
         KiB/s     %   MIPS   MIPS  |      KiB/s     %   MIPS   MIPS

22:       5505   172   3116   5356  |      57155   197   2471   4880
23:       5087   164   3160   5183  |      55978   198   2450   4846
24:       5123   169   3254   5508  |      54363   198   2415   4773
25:       5043   169   3405   5759  |      52711   198   2373   4692
----------------------------------  | ------------------------------
Avr:             169   3234   5452  |              198   2427   4797
Tot:             183   2830   5125


7z b -mmt4

7-Zip (a) [64] 16.04 : Copyright (c) 1999-2016 Igor Pavlov : 2016-10-04
p7zip Version 16.04-hash17+crct8 (locale=C,Utf16=off,HugeFiles=on,64 bits,2 CPUs LE)

LE
CPU Freq:   513   916  1196  1407  1650  1830  1843  1844  1843

RAM size:    2009 MB,  # CPU hardware threads:   2
RAM usage:    882 MB,  # Benchmark threads:      4

                       Compressing  |                  Decompressing
Dict     Speed Usage    R/U Rating  |      Speed Usage    R/U Rating
         KiB/s     %   MIPS   MIPS  |      KiB/s     %   MIPS   MIPS

22:       6141   195   3060   5974  |      55483   197   2405   4734
23:       5930   197   3064   6042  |      54274   197   2380   4696
24:       5723   196   3136   6154  |      52653   197   2347   4622
25:       5590   197   3248   6383  |      51315   197   2314   4567
----------------------------------  | ------------------------------
Avr:             196   3127   6138  |              197   2361   4655
Tot:             197   2744   5397


7z b -mm=* -mmt1

7-Zip (a) [64] 16.04 : Copyright (c) 1999-2016 Igor Pavlov : 2016-10-04
p7zip Version 16.04-hash17+crct8 (locale=C,Utf16=off,HugeFiles=on,64 bits,2 CPUs LE)

LE
CPU Freq:   492   889  1184  1388  1641  1832  1844  1835  1843

RAM size:    2009 MB,  # CPU hardware threads:   2
RAM usage:    225 MB,  # Benchmark threads:      1


Method           Speed Usage    R/U Rating   E/U Effec
                 KiB/s     %   MIPS   MIPS     %     %

CPU                      100   1845   1843
CPU                      100   1840   1844
CPU                      100   1845   1843   100   100

LZMA:x1          13026    99   4794   4762   260   258
                 29111   100   2378   2370   129   129
LZMA:x5:mt1       2250   100   2822   2812   153   153
                 28349   100   2401   2391   130   130
LZMA:x5:mt2       5152   169   3811   6436   207   349
                 28425    99   2411   2398   131   130
Deflate:x1       30030   100   3820   3813   207   207
                 90040   100   2803   2798   152   152
Deflate:x5        9445   100   3655   3637   198   197
                 89790   100   2794   2787   152   151
Deflate:x7        3352   100   3722   3715   202   202
                 90721    99   2833   2815   154   153
Deflate64:x5      8757   100   3797   3784   206   205
                 89858   100   2820   2811   153   153
BZip2:x1          5101   100   3094   3082   168   167
                 24657   100   2682   2673   146   145
BZip2:x5          4354   100   3643   3634   198   197
                 20641   100   4065   4051   221   220
BZip2:x5:mt2      7782   192   3383   6495   184   352
                 32565   173   3692   6391   200   347
BZip2:x7          1314   100   3417   3405   185   185
                 20490   100   4029   4018   219   218
PPMD:x1           3942   100   4092   4077   222   221
                  3259   100   3845   3839   209   208
PPMD:x5           2575    99   4388   4364   238   237
                  2175   100   4089   4076   222   221
Delta:4         530765   100   3273   3261   178   177
                347619   100   2142   2136   116   116
BCJ            1214521   100   4999   4975   271   270
               1218895   100   4982   4993   270   271
AES256CBC:1     113274   100   2792   2784   152   151
                116539   100   2868   2864   156   155
AES256CBC:2 

CRC32:1         177702   100   1298   1294    70    70
CRC32:4         602622   100   1348   1345    73    73
CRC32:8         902418   100   1227   1224    67    66
CRC64           531344   100   1091   1088    59    59
SHA256          166773   100   3410   3402   185   185
SHA1            376511   100   3532   3524   192   191
BLAKE2sp        243593   100   5384   5359   292   291

CPU                      100   1833   1829
------------------------------------------------------
Tot:                     110   3129   3491   173   189


7z b -mm=* -mmt2

7-Zip (a) [64] 16.04 : Copyright (c) 1999-2016 Igor Pavlov : 2016-10-04
p7zip Version 16.04-hash17+crct8 (locale=C,Utf16=off,HugeFiles=on,64 bits,2 CPUs LE)

LE
CPU Freq:   503   901  1194  1390  1646  1833  1825  1845  1844

RAM size:    2009 MB,  # CPU hardware threads:   2
RAM usage:    450 MB,  # Benchmark threads:      2


Method           Speed Usage    R/U Rating   E/U Effec
                 KiB/s     %   MIPS   MIPS     %     %

CPU                      198   1791   3552
CPU                      198   1793   3557
CPU                      198   1790   3539   101   200

LZMA:x1          25117   197   4653   9182   263   519
                 56757   198   2335   4622   132   261
LZMA:x5:mt1       4710   197   2982   5885   169   333
                 54343   198   2318   4583   131   259
LZMA:x5:mt2       5700   196   3626   7121   205   402
                 54251   198   2316   4575   131   259
Deflate:x1       58068   198   3730   7373   211   417
                175370   198   2757   5449   156   308
Deflate:x5       18566   198   3613   7149   204   404
                175615   198   2760   5452   156   308
Deflate:x7        6505   198   3646   7208   206   407
                176203   197   2774   5468   157   309
Deflate64:x5     17054   197   3732   7370   211   416
                175343   198   2774   5485   157   310
BZip2:x1          9878   198   3020   5968   171   337
                 48153   198   2636   5220   149   295
BZip2:x5          7959   198   3361   6643   190   375
                 32866   197   3267   6451   185   365
BZip2:x5:mt2      7599   197   3220   6342   182   358
                 28021   197   2794   5500   158   311
BZip2:x7          2396   194   3205   6209   181   351
                 32661   197   3258   6405   184   362
PPMD:x1           7656   198   4008   7919   226   447
                  6326   198   3771   7450   213   421
PPMD:x5           4779   197   4109   8099   232   458
                  3992   197   3800   7482   215   423
Delta:4        1032456   198   3200   6343   181   358
                583218   196   1829   3583   103   202
BCJ            2360323   197   4900   9668   277   546
               2379895   198   4930   9748   279   551
AES256CBC:1     218206   198   2707   5363   153   303
                227263   198   2817   5585   159   316
AES256CBC:2 

CRC32:1         346198   198   1273   2520    72   142
CRC32:4        1175244   198   1323   2623    75   148
CRC32:8        1759603   198   1203   2386    68   135
CRC64          1033592   198   1070   2117    60   120
SHA256          326090   198   3354   6652   190   376
SHA1            727875   198   3448   6813   195   385
BLAKE2sp        474195   198   5261  10432   297   589

CPU                      198   1793   3552
------------------------------------------------------
Tot:                     197   3013   5945   170   336



CLANG uxtw->lsl hack

If C code uses a 32-bit unsigned integer variable as an array index:

  v = array[uint32_index];

CLANG 3.7 / 4.0 can produce an instruction like this:

  ldr w1, [x2, w3, uxtw #2]    // 6 cycles at Apple A9

But we can replace that instruction with a similar one:

  ldr w1, [x2, x3, lsl #2]     // 4 cycles at Apple A9

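These instructions are not exactly equivalent: uxtw zero-extends the 32-bit register w3, while the lsl form uses all 64 bits of x3, so the results match only when the upper 32 bits of x3 are zero. All 7-Zip benchmark tests still work correctly after the hack.

Some 7-Zip benchmarks run 10-20% faster on Apple A9 after the hack; the average gain is 2-3%.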


# clang-4 -arch arm64 -mcpu=cyclone -O3

# sed -i -e '/\(st\|ld\)r.*[xw].*x.*w.* uxtw #/ {s/w\([^,]*\), uxtw/x\1, lsl/}'
#        -e '/\(st\|ld\)rb.*w[^,]*, uxtw\]/ {s/w\([^,]*\), uxtw/x\1/}' *.s

7z b -mmt1

7-Zip (a) [64] 16.04 : Copyright (c) 1999-2016 Igor Pavlov : 2016-10-04
p7zip Version 16.04-hash17+crct8-lsl-v3 (locale=C,Utf16=off,HugeFiles=on,64 bits,2 CPUs LE)

LE
CPU Freq:   495   884  1180  1380  1641  1831  1822  1843  1844

RAM size:    2009 MB,  # CPU hardware threads:   2
RAM usage:    435 MB,  # Benchmark threads:      1

                       Compressing  |                  Decompressing
Dict     Speed Usage    R/U Rating  |      Speed Usage    R/U Rating
         KiB/s     %   MIPS   MIPS  |      KiB/s     %   MIPS   MIPS

22:       2594    99   2543   2524  |      30455    99   2614   2600
23:       2380    99   2438   2426  |      30022   100   2612   2599
24:       2264   100   2446   2435  |      29202   100   2575   2564
25:       2188   100   2511   2499  |      28167   100   2519   2507
----------------------------------  | ------------------------------
Avr:              99   2485   2471  |              100   2580   2567
Tot:              99   2532   2519


7z b -mm=* -mmt1

7-Zip (a) [64] 16.04 : Copyright (c) 1999-2016 Igor Pavlov : 2016-10-04
p7zip Version 16.04-hash17+crct8-lsl-v3 (locale=C,Utf16=off,HugeFiles=on,64 bits,2 CPUs LE)

LE
CPU Freq:   505   904  1197  1394  1646  1833  1843  1837  1843

RAM size:    2009 MB,  # CPU hardware threads:   2
RAM usage:    225 MB,  # Benchmark threads:      1


Method           Speed Usage    R/U Rating   E/U Effec
                 KiB/s     %   MIPS   MIPS     %     %

CPU                      100   1841   1839
CPU                      100   1845   1843
CPU                      100   1843   1843   100   100

LZMA:x1          13202    99   4868   4826   264   262
                 29969   100   2445   2440   133   132
LZMA:x5:mt1       2270   100   2843   2836   154   154
                 29049   100   2461   2450   134   133
LZMA:x5:mt2       5236   170   3839   6542   208   355
                 29242   100   2473   2466   134   134
Deflate:x1       30416   100   3863   3862   210   210
                 93075   100   2903   2892   158   157
Deflate:x5        9489   100   3668   3654   199   198
                 93232   100   2906   2894   158   157
Deflate:x7        3382   100   3764   3748   204   203
                 93939   100   2920   2915   158   158
Deflate64:x5      8769   100   3803   3790   206   206
                 92910   100   2908   2906   158   158
BZip2:x1          5265   100   3193   3181   173   173
                 25094    99   2736   2720   148   148
BZip2:x5          4513   100   3773   3766   205   204
                 20853   100   4110   4093   223   222
BZip2:x5:mt2      7996   192   3482   6674   189   362
                 32607   170   3766   6400   204   347
BZip2:x7          1347   100   3503   3491   190   189
                 21122   100   4143   4142   225   225
PPMD:x1           3959   100   4108   4095   223   222
                  3279   100   3872   3861   210   210
PPMD:x5           2578   100   4388   4370   238   237
                  2197   100   4132   4117   224   223
Delta:4         530891   100   3275   3262   178   177
                356465   100   2193   2190   119   119
BCJ            1204041    99   4970   4932   270   268
               1221370   100   5009   5003   272   271
AES256CBC:1     128023   100   3146   3146   171   171
                131847   100   3251   3240   176   176
AES256CBC:2 

CRC32:1         224780   100   1637   1636    89    89
CRC32:4         698401   100   1563   1559    85    85
CRC32:8         994911   100   1353   1349    73    73
CRC64           631082   100   1295   1292    70    70
SHA256          167135   100   3426   3410   186   185
SHA1            374018   100   3512   3501   191   190
BLAKE2sp        242960   100   5326   5345   289   290

CPU                      100   1846   1843
------------------------------------------------------
Tot:                     110   3196   3567   176   194

