Cortex-M7 instruction cycle counts, timings, and dual-issue combinations

Introduction

This page gives cycle counts and timing information for various combinations of instructions executed on an ARM Cortex-M7 core. It also indicates which combinations of instructions can be ‘dual issued’, that is, executed simultaneously by the core. ARM publishes this information for the Cortex-M0, -M3 and -M4 cores but does not appear to do so for the Cortex-M7. This is unfortunate because the complexity of the core means that without these details it is very difficult to write optimised assembler code. Simply reordering the instructions in your code according to the table below may speed it up by a factor of two or, in some cases, even more. Although the information below is far from complete it is, I hope, sufficient to assist in the optimisation of code in many common cases. There may be some inaccuracies; use at your own risk.

Impatient?

The conclusions are at the bottom of the page.

Method

Experiments were carried out on an STM32H730 low-cost microcontroller running at 480 MHz with data and instruction caches enabled. For each instruction or pair of instructions listed in the table below two functions were created, the first containing four consecutive copies of the instructions, the second fourteen copies. Each function was bracketed by a pair of isb (‘instruction synchronisation barrier’) instructions. In each case the address of the opening isb was aligned to a 32-byte boundary. The execution times of the two functions were measured using the SysTick counter, and the difference was divided by ten (the number of extra copies in the second function) to obtain a cycle count per copy. The average cycle count over a thousand runs is reported in the table. This approach minimises the effect of surrounding instructions on the sequence of interest and ensures that all execution is from cache. As some kind of confirmation of this, all results were integers when rounded to two decimal places. Comments on the results, and in particular evidence contradicting any of them, are welcome at the e-mail address on the home page.
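The arithmetic behind the method can be sketched as follows. This is only an illustration of the calculation, not the code used for the measurements: the function names and the example timings are hypothetical, and the wrap-around helper assumes SysTick is used as a free-running 24-bit down-counter with a reload value of 0xFFFFFF.

```python
from statistics import mean

SYSTICK_MASK = 0xFFFFFF  # SysTick is a 24-bit down-counter


def elapsed(start, end):
    """Cycles elapsed between two SysTick readings.

    The counter counts down, so start >= end unless it wrapped.
    Assumes a reload value of 0xFFFFFF (an assumption, not from the page).
    """
    return (start - end) & SYSTICK_MASK


def cycles_per_copy(t_short, t_long, n_short=4, n_long=14):
    """Cycles attributable to one copy of the sequence under test.

    Subtracting the two timings cancels the fixed overhead
    (call, isb barriers, reading the timer) common to both functions.
    """
    return (t_long - t_short) / (n_long - n_short)


def average_cycles(runs):
    """Average the per-copy cycle count over (t_short, t_long) pairs."""
    return mean([cycles_per_copy(s, l) for s, l in runs])


# Hypothetical timings: 12 cycles of fixed overhead, 2 cycles per copy.
t4 = 12 + 4 * 2    # function with four copies
t14 = 12 + 14 * 2  # function with fourteen copies
print(cycles_per_copy(t4, t14))
```

Because the fixed overhead appears in both timings, it drops out of the difference, which is why two functions of different lengths are timed rather than one.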

Results

Instruction sequence                    Cycle count (see above)  Remarks  Dual issue?
Arithmetic and logical
eor r0,r1,r2       ; eor r3,r4,r5       1 logical instructions with no dependencies Y
adc r0,r1,r2       ; adc r3,r4,r5       1 arithmetic instructions with no dependencies Y
eor r0,r1,r2       ; eor r1,r2,r0       2 mutual register dependency N
adc r0,r1,r2       ; adc r1,r2,r0       2 mutual register dependency N
adcs r0,r1,r2      ; adcs r3,r4,r5      2 mutual carry dependency N
eor r0,r1,r2       ; eor r3,r4,r5,ror#7 1 one shifted operand Y
adc r0,r1,r2       ; adc r3,r4,r5,ror#7 1 Y
adc r0,r1,r2,ror#7 ; adc r3,r4,r5       1 Y
eor r0,r1,r2,ror#3 ; eor r3,r4,r5,ror#7 2 two shifted operands N
adc r0,r1,r2,ror#3 ; adc r3,r4,r5,ror#7 2 N
adc r0,r1,#2       ; adc r3,r4,r5,ror#7 1 one simple immediate, one shifted operand Y
adc r0,r1,#0x124   ; adc r3,r4,r5,ror#7 2 one shifted immediate, one shifted operand N
adc r0,r1,#0x124   ; adc r3,r4,r5,#2    1 one shifted immediate, one simple immediate Y
adc r0,r1,#0x124   ; adc r3,r4,r5,#0x2382 two shifted immediates N
eor r0,r1,r2,ror#3 ; eor r1,r2,r0,ror#7 3 shifting result of previous operation adds one cycle N
eor r0,r1,r2,ror#3 ; eor r2,r1,r0,ror#7 4 shifting result of previous operation adds one cycle N
Multiply and divide
mul r0,r1,r2       ; eor r3,r4,r5       1 multiply + logical Y
mul r0,r1,r2       ; adc r3,r4,r5       1 multiply + arithmetic Y
mul r0,r1,r2       ; adc r3,r4,r5,ror#7 1 multiply + arithmetic with shift Y
mul r0,r1,r2       ; mul r3,r4,r5       2 multiply + multiply N
mul r0,r1,r2       ; adc r3,r4,r0       2 multiply + register dependency N
mul r0,r1,r2       ; adc r1,r4,r0       3 multiply + mutual register dependency N
mla r0,r1,r2,r3    ; eor r4,r5,r6       1 multiply-accumulate + logical Y
mla r0,r1,r2,r3    ; adc r4,r5,r6       1 multiply-accumulate + arithmetic Y
mla r0,r1,r2,r3    ; adc r4,r5,r6,ror#7 1 multiply-accumulate + arithmetic with shift Y
mla r0,r1,r2,r0                         1 multiply-accumulate + accumulate dependency -
mla r0,r1,r0,r2                         2 multiply-accumulate + multiplier/multiplicand dependency -
mla r0,r1,r2,r3    ; adc r1,r4,r0       3 multiply-accumulate + mutual register dependency N
umull r0,r1,r2,r3                       1 long multiply -
umull r0,r1,r2,r3  ; adc r4,r5,r6       1 long multiply + arithmetic Y
umull r0,r1,r0,r2                       2 long multiply + register dependency -
umlal r0,r1,r2,r3                       1 long multiply-accumulate + accumulate dependency -
umlal r0,r1,r2,r3  ; umlal r2,r3,r0,r1  4 long multiply-accumulate + multiplier/multiplicand dependency N
umlal r0,r1,r2,r3  ; adc r4,r5,r6       1 long multiply-accumulate + accumulate dependency + arithmetic Y
udiv r0,r1,r2                           3+⌈s/2⌉ s=number of significant bits in result; divide by 0 takes 3 cycles -
udiv r0,r1,r2      ; adc r3,r4,r5       4+⌈s/2⌉ N
Load
ldr r1,[r0]                             1 -
ldr r2,[r0]        ; ldr r3,[r1]        2 N
ldr r1,[r0]        ; adc r2,#1          1 load + independent arithmetic Y
ldr r1,[r0]        ; mul r2,r3,r4       1 load + independent multiply Y
ldr r1,[r0]        ; mla r2,r3,r4,r2    1 load + independent multiply-accumulate Y
ldr r1,[r0,#4]!    ; mla r2,r3,r4,r2    1 load + independent multiply-accumulate Y
ldr r1,[r0]        ; mla r2,r1,r3,r2    2 load + dependent multiply-accumulate N
ldr r1,[r0]        ; add r0,r1          3 data-to-address dependency N
ldr r1,[r0,r2]     ; add r0,r1          3 data-to-address dependency N
ldm r0,{r1-r2}                          1 two words per cycle -
ldm r0,{r1-r3}                          2 -
ldm r0,{r1-r4}                          2 -
ldm r0,{r1-r5}                          3 -
ldm r0,{r1-r6}                          3 -
ldm r0,{r1-r7}                          4 -
ldm r0,{r1-r8}                          4 -
ldm r0,{r1-r2}     ; adc r3,#1          2 N
ldm r0,{r1-r2}     ; mul r3,r4,r5       2 N
Single-precision floating point
vadd.f s0,s1,s2                         1 no dependencies -
vadd.f s0,s1,s0                         3 dependency -
vadd.f s0,s1,s2    ; vadd.f s3,s4,s5    2 N
vadd.f s0,s1,s2    ; vmul.f s3,s4,s5    2 N
vadd.f s0,s1,s2    ; ldr r1,[r0]        1 Y
vadd.f s0,s1,s2    ; ldm r0,{r1-r2}     2 N
vadd.f s0,s1,s2    ; vldr.f s3,[r0]     1 Y
vadd.f s0,s1,s2    ; vldr.f s0,[r0]     2 data dependency N
vmul.f s0,s1,s2                         1 no dependencies -
vmul.f s0,s1,s0                         3 dependency -
vdiv.f s0,s1,s2                         16 no dependencies -
vdiv.f s0,s1,s0                         18 dependency -
vdiv.f s0,s0,s1                         18 dependency -
vsqrt.f s0,s1                           14 no dependencies -
vsqrt.f s0,s0                           16 dependency -
vmla.f s0,s1,s2                         3 accumulate dependency only -
vmla.f s0,s1,s0                         6 multiplier/multiplicand dependency -
vfma.f s0,s1,s2                         3 accumulate dependency only -
vfma.f s0,s1,s0                         5 multiplier/multiplicand dependency -
Double-precision floating point
vadd.d d0,d1,d2                         2 no dependencies -
vadd.d d0,d1,d0                         4 dependency -
vadd.d d0,d1,d2    ; vadd.d d3,d4,d5    4 N
vadd.d d0,d1,d2    ; vmul.d d3,d4,d5    7 N
vadd.d d0,d1,d2    ; ldr r1,[r0]        2 Y
vadd.d d0,d1,d2    ; ldm r0,{r1-r2}     3 N
vadd.d d0,d1,d2    ; vldr.d d3,[r0]     3 N
vadd.d d0,d1,d2    ; vldr.d d0,[r0]     3 data dependency N
vmul.d d0,d1,d2                         5 no dependencies -
vmul.d d0,d1,d0                         7 dependency -
vdiv.d d0,d1,d2                         30 no dependencies -
vdiv.d d0,d1,d0                         32 dependency -
vdiv.d d0,d0,d1                         32 dependency -
vsqrt.d d0,d1                           28 no dependencies -
vsqrt.d d0,d0                           30 dependency -
vmla.d d0,d1,d2                         11 accumulate dependency only -
vmla.d d0,d1,d0                         11 multiplier/multiplicand dependency -
vfma.d d0,d1,d2                         10 accumulate dependency only -
vfma.d d0,d1,d0                         10 multiplier/multiplicand dependency -
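
The two data-dependent entries in the table, udiv and ldm, can be expressed as small formulas. The sketch below is an illustrative model only: the function names are hypothetical, and the ldm formula simply generalises the ‘two words per cycle’ pattern measured for two to eight registers, so values outside that range are an extrapolation.

```python
def udiv_cycles(dividend, divisor):
    """Estimated cycles for an isolated udiv, following 3 + ceil(s/2),
    where s is the number of significant bits in the quotient."""
    if divisor == 0:
        return 3  # the table reports 3 cycles for divide by 0
    s = (dividend // divisor).bit_length()
    return 3 + (s + 1) // 2  # (s + 1) // 2 is ceil(s / 2) for s >= 0


def ldm_cycles(num_registers):
    """Estimated cycles for an ldm of num_registers words,
    at two words per cycle (measured for 2 to 8 registers only)."""
    return (num_registers + 1) // 2


print(udiv_cycles(100, 7))                    # quotient 14 has 4 significant bits
print([ldm_cycles(n) for n in range(2, 9)])
```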

Conclusions

Because the following rules are inferred from a relatively small number of tests, they are necessarily approximate and incomplete.
  1. Two integer arithmetic, logical or single-register load operations will be dual issued if:
    • the second does not use the result (including flags) of the first;
    • they do not both involve a shift (or use a large immediate constant that implies a shift);
    • they are not both loads;
    • they are not both multiplies; and
    • neither is a division.
  2. An integer load can dual-issue with a floating-point arithmetic operation.
  3. A floating-point load can dual-issue with a single-precision floating-point arithmetic operation.
  4. Shifting the result of the previous instruction incurs a one-cycle result delay.
  5. Integer multiplications and multiply-accumulate operations can be issued on every cycle but have a result delay of two cycles.
  6. Integer multiply-accumulate operations into the same accumulator can run at one per cycle.
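
Rule 1 above can be written as a predicate, which may be handy when checking a proposed instruction ordering by hand or in a script. This is a rough sketch under the same caveats as the rules themselves: the `Insn` record and its fields are hypothetical, it covers only the instruction kinds named in rule 1, and it returns False for anything outside them.

```python
from dataclasses import dataclass

KINDS = {"alu", "mul", "div", "load"}  # "load" means a single-register load


@dataclass(frozen=True)
class Insn:
    kind: str                      # one of KINDS
    dst: frozenset = frozenset()   # registers written
    src: frozenset = frozenset()   # registers read
    shifted: bool = False          # shifted operand or 'large' immediate
    sets_flags: bool = False       # e.g. adcs
    reads_flags: bool = False      # e.g. adc, adcs


def may_dual_issue(first, second):
    """Apply rule 1 to two adjacent integer instructions."""
    if first.kind not in KINDS or second.kind not in KINDS:
        return False  # outside the scope of rule 1
    if "div" in (first.kind, second.kind):
        return False  # neither may be a division
    if first.kind == second.kind and first.kind in ("load", "mul"):
        return False  # not both loads, not both multiplies
    if first.shifted and second.shifted:
        return False  # not both involving a shift
    if first.dst & second.src:
        return False  # second uses a register result of the first
    if first.sets_flags and second.reads_flags:
        return False  # second uses the flags produced by the first
    return True


def regs(*names):
    return frozenset(names)


# eor r0,r1,r2 ; eor r3,r4,r5
a = Insn("alu", regs("r0"), regs("r1", "r2"))
b = Insn("alu", regs("r3"), regs("r4", "r5"))
print(may_dual_issue(a, b))
```

The predicate looks only at one adjacent pair; it does not model result delays such as those in rules 4 and 5.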

Throughput and latency summary

Instruction             Throughput (cycles)  Latency (cycles)  Remarks
arithmetic and logical  0.5 or 1             1
mul, umull              1                    2
mla, umlal              1                    1                 from addend
                                             2                 from multiplier/multiplicand
udiv                    3+⌈s/2⌉              3+⌈s/2⌉           s=number of significant bits in result
vadd.f                  1                    3
vmul.f                  1                    3
vdiv.f                  16                   18
vsqrt.f                 14                   16
vmla.f                  3                    6                 from multiplier/multiplicand
vfma.f                  3                    5                 from multiplier/multiplicand
vadd.d                  2                    4
vmul.d                  5                    7
vdiv.d                  30                   32
vsqrt.d                 28                   30
vmla.d                  11                   11                from multiplier/multiplicand
vfma.d                  10                   10                from multiplier/multiplicand
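
One practical use of the throughput and latency figures is to estimate how many independent dependency chains must be interleaved so that an instruction's result delay is hidden. The sketch below is a simplified model, not a measured result: it assumes instructions from k chains are issued round-robin, ignores dual issue and all other hazards, and uses a hypothetical `TIMINGS` dictionary holding a few rows copied from the table above.

```python
from math import ceil

# op: (throughput, latency), both in cycles, copied from the summary table
TIMINGS = {
    "mul": (1, 2),
    "vadd.f": (1, 3),
    "vmul.f": (1, 3),
    "vadd.d": (2, 4),
    "vmul.d": (5, 7),
}


def chains_to_hide_latency(op):
    """Smallest k such that k * throughput >= latency.

    With k independent chains interleaved, each chain sees a new
    instruction every k * throughput cycles, so its previous result
    is ready in time once that interval reaches the latency.
    """
    throughput, latency = TIMINGS[op]
    return ceil(latency / throughput)


for op in TIMINGS:
    print(op, chains_to_hide_latency(op))
```

For example, under this model a running sum built with vadd.f would need three separate accumulators to approach one addition per cycle.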

This page most recently updated Wed 24 Feb 16:34:42 GMT 2021

Copyright ©2004–2021.
All trademarks used are hereby acknowledged.