Cortex-M7 instruction cycle counts, timings, and dual-issue combinations
Introduction
This page gives cycle counts and timing information for various combinations
of instructions executed on an ARM Cortex-M7 core. It also indicates
which combinations of instructions can be ‘dual issued’—that is,
executed simultaneously by the core.
Unlike for the Cortex-M0, -M3 and -M4 cores, ARM does not appear
to have made this information public for the Cortex-M7 core.
This is unfortunate because the complexity of the core means that without
these details it is very difficult to
write optimised assembler code. Simply reordering the instructions in your code
according to the table below may speed it up by a factor
of two or, in some cases, even more.
Although the information below is far from complete it is, I hope,
sufficient to assist in the optimisation of code in many common cases. There may
be some inaccuracies; use at your own risk.
Impatient?
The conclusions are at the bottom of the page.
Method
Experiments were carried out on an STM32H730 low-cost microcontroller running
at 480 MHz with data and instruction caches enabled. For each instruction or
pair of instructions listed in the table below two functions were created, the
first containing four consecutive copies of the instructions, the second fourteen copies.
Each function was bracketed by a pair of isb (‘instruction synchronisation
barrier’) instructions. In each case the address of the opening
isb was aligned to a 32-byte boundary. The execution times of the two
functions were measured using the SysTick counter, and the difference was divided by ten
(the number of extra copies in the second function) to obtain a cycle count per copy.
The average cycle count over a thousand runs is reported
in the table. This approach minimises the effect of surrounding instructions on
the sequence of interest and ensures that all execution is from cache. As some
kind of confirmation of this, all results were integers when rounded to
two decimal places.
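The arithmetic behind this measurement can be sketched as follows. This is an illustrative model in Python, not the code that ran on the microcontroller; the function names are assumptions, as is the premise that one SysTick tick equals one core clock cycle. It relies on SysTick being a 24-bit down-counter, so elapsed time is the start value minus the end value, modulo 2^24.

```python
SYSTICK_BITS = 24  # SysTick is a 24-bit down-counter
SYSTICK_MASK = (1 << SYSTICK_BITS) - 1


def elapsed_ticks(start, end):
    """Ticks elapsed between two SysTick reads, allowing for one wrap-around."""
    return (start - end) & SYSTICK_MASK


def cycles_per_copy(t_short, t_long, n_short=4, n_long=14):
    """Cycle count for one copy of the instruction sequence.

    Subtracting the two timings cancels the fixed overhead (the isb pair,
    the counter reads, call and return), leaving only the extra copies.
    """
    return (t_long - t_short) / (n_long - n_short)


def mean_cycles(runs):
    """Average over repeated runs; each run is a (t_short, t_long) pair."""
    counts = [cycles_per_copy(t_short, t_long) for t_short, t_long in runs]
    return sum(counts) / len(counts)
```

For example, if the four-copy function took 30 ticks and the fourteen-copy function took 50 ticks, the result would be (50 - 30) / 10 = 2 cycles per copy.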
Comments on the results, and in particular evidence contradicting any of them,
are welcome at the e-mail address on the home page.
Results
Instruction sequence | Cycle count (see above) | Remarks | Dual issue? |
Arithmetic and logical |
eor r0,r1,r2 ; eor r3,r4,r5 | 1 | logical instructions with no dependencies | Y |
adc r0,r1,r2 ; adc r3,r4,r5 | 1 | arithmetic instructions with no dependencies | Y |
eor r0,r1,r2 ; eor r1,r2,r0 | 2 | mutual register dependency | N |
adc r0,r1,r2 ; adc r1,r2,r0 | 2 | mutual register dependency | N |
adcs r0,r1,r2 ; adcs r3,r4,r5 | 2 | mutual carry dependency | N |
eor r0,r1,r2 ; eor r3,r4,r5,ror#7 | 1 | one shifted operand | Y |
adc r0,r1,r2 ; adc r3,r4,r5,ror#7 | 1 | | Y |
adc r0,r1,r2,ror#7 ; adc r3,r4,r5 | 1 | | Y |
eor r0,r1,r2,ror#3 ; eor r3,r4,r5,ror#7 | 2 | two shifted operands | N |
adc r0,r1,r2,ror#3 ; adc r3,r4,r5,ror#7 | 2 | | N |
adc r0,r1,#2 ; adc r3,r4,r5,ror#7 | 1 | one simple immediate, one shifted operand | Y |
adc r0,r1,#0x124 ; adc r3,r4,r5,ror#7 | 2 | one shifted immediate, one shifted operand | N |
adc r0,r1,#0x124 ; adc r3,r4,#2 | 1 | one shifted immediate, one simple immediate | Y |
adc r0,r1,#0x124 ; adc r3,r4,#0x238 | 2 | two shifted immediates | N |
eor r0,r1,r2,ror#3 ; eor r1,r2,r0,ror#7 | 3 | shifting result of previous operation adds one cycle | N |
eor r0,r1,r2,ror#3 ; eor r2,r1,r0,ror#7 | 4 | shifting result of previous operation adds one cycle | N |
Multiply and divide |
mul r0,r1,r2 ; eor r3,r4,r5 | 1 | multiply + logical | Y |
mul r0,r1,r2 ; adc r3,r4,r5 | 1 | multiply + arithmetic | Y |
mul r0,r1,r2 ; adc r3,r4,r5,ror#7 | 1 | multiply + arithmetic with shift | Y |
mul r0,r1,r2 ; mul r3,r4,r5 | 2 | multiply + multiply | N |
mul r0,r1,r2 ; adc r3,r4,r0 | 2 | multiply + register dependency | N |
mul r0,r1,r2 ; adc r1,r4,r0 | 3 | multiply + mutual register dependency | N |
mla r0,r1,r2,r3 ; eor r4,r5,r6 | 1 | multiply-accumulate + logical | Y |
mla r0,r1,r2,r3 ; adc r4,r5,r6 | 1 | multiply-accumulate + arithmetic | Y |
mla r0,r1,r2,r3 ; adc r4,r5,r6,ror#7 | 1 | multiply-accumulate + arithmetic with shift | Y |
mla r0,r1,r2,r0 | 1 | multiply-accumulate + accumulate dependency | - |
mla r0,r1,r0,r2 | 2 | multiply-accumulate + multiplier/multiplicand dependency | - |
mla r0,r1,r2,r3 ; adc r1,r4,r0 | 3 | multiply-accumulate + mutual register dependency | N |
umull r0,r1,r2,r3 | 1 | long multiply | - |
umull r0,r1,r2,r3 ; adc r4,r5,r6 | 1 | long multiply + arithmetic | Y |
umull r0,r1,r0,r2 | 2 | long multiply + register dependency | - |
umlal r0,r1,r2,r3 | 1 | long multiply-accumulate + accumulate dependency | - |
umlal r0,r1,r2,r3 ; umlal r2,r3,r0,r1 | 4 | long multiply-accumulate + multiplier/multiplicand dependency | N |
umlal r0,r1,r2,r3 ; adc r4,r5,r6 | 1 | long multiply-accumulate + accumulate dependency + arithmetic | Y |
udiv r0,r1,r2 | 3+⌈s/2⌉ | s=number of significant bits in result; divide by 0 takes 3 cycles | - |
udiv r0,r1,r2 ; adc r3,r4,r5 | 4+⌈s/2⌉ | | N |
Load |
ldr r1,[r0] | 1 | | - |
ldr r2,[r0] ; ldr r3,[r1] | 2 | | N |
ldr r1,[r0] ; adc r2,#1 | 1 | load + independent arithmetic | Y |
ldr r1,[r0] ; mul r2,r3,r4 | 1 | load + independent multiply | Y |
ldr r1,[r0] ; mla r2,r3,r4,r2 | 1 | load + independent multiply-accumulate | Y |
ldr r1,[r0,#4]! ; mla r2,r3,r4,r2 | 1 | load + independent multiply-accumulate | Y |
ldr r1,[r0] ; mla r2,r1,r3,r2 | 2 | load + dependent multiply-accumulate | N |
ldr r1,[r0] ; add r0,r1 | 3 | data-to-address dependency | N |
ldr r1,[r0,r2] ; add r0,r1 | 3 | data-to-address dependency | N |
ldm r0,{r1-r2} | 1 | two words per cycle | - |
ldm r0,{r1-r3} | 2 | | - |
ldm r0,{r1-r4} | 2 | | - |
ldm r0,{r1-r5} | 3 | | - |
ldm r0,{r1-r6} | 3 | | - |
ldm r0,{r1-r7} | 4 | | - |
ldm r0,{r1-r8} | 4 | | - |
ldm r0,{r1-r2} ; adc r3,#1 | 2 | | N |
ldm r0,{r1-r2} ; mul r3,r4,r5 | 2 | | N |
Single-precision floating point |
vadd.f s0,s1,s2 | 1 | no dependencies | - |
vadd.f s0,s1,s0 | 3 | dependency | - |
vadd.f s0,s1,s2 ; vadd.f s3,s4,s5 | 2 | | N |
vadd.f s0,s1,s2 ; vmul.f s3,s4,s5 | 2 | | N |
vadd.f s0,s1,s2 ; ldr r1,[r0] | 1 | | Y |
vadd.f s0,s1,s2 ; ldm r0,{r1-r2} | 2 | | N |
vadd.f s0,s1,s2 ; vldr.f s3,[r0] | 1 | | Y |
vadd.f s0,s1,s2 ; vldr.f s0,[r0] | 2 | data dependency | N |
vmul.f s0,s1,s2 | 1 | no dependencies | - |
vmul.f s0,s1,s0 | 3 | dependency | - |
vdiv.f s0,s1,s2 | 16 | no dependencies | - |
vdiv.f s0,s1,s0 | 18 | dependency | - |
vdiv.f s0,s0,s1 | 18 | dependency | - |
vsqrt.f s0,s1 | 14 | no dependencies | - |
vsqrt.f s0,s0 | 16 | dependency | - |
vmla.f s0,s1,s2 | 3 | accumulate dependency only | - |
vmla.f s0,s1,s0 | 6 | multiplier/multiplicand dependency | - |
vfma.f s0,s1,s2 | 3 | accumulate dependency only | - |
vfma.f s0,s1,s0 | 5 | multiplier/multiplicand dependency | - |
Double-precision floating point |
vadd.d d0,d1,d2 | 2 | no dependencies | - |
vadd.d d0,d1,d0 | 4 | dependency | - |
vadd.d d0,d1,d2 ; vadd.d d3,d4,d5 | 4 | | N |
vadd.d d0,d1,d2 ; vmul.d d3,d4,d5 | 7 | | N |
vadd.d d0,d1,d2 ; ldr r1,[r0] | 2 | | Y |
vadd.d d0,d1,d2 ; ldm r0,{r1-r2} | 3 | | N |
vadd.d d0,d1,d2 ; vldr.d d3,[r0] | 3 | | N |
vadd.d d0,d1,d2 ; vldr.d d0,[r0] | 3 | data dependency | N |
vmul.d d0,d1,d2 | 5 | no dependencies | - |
vmul.d d0,d1,d0 | 7 | dependency | - |
vdiv.d d0,d1,d2 | 30 | no dependencies | - |
vdiv.d d0,d1,d0 | 32 | dependency | - |
vdiv.d d0,d0,d1 | 32 | dependency | - |
vsqrt.d d0,d1 | 28 | no dependencies | - |
vsqrt.d d0,d0 | 30 | dependency | - |
vmla.d d0,d1,d2 | 11 | accumulate dependency only | - |
vmla.d d0,d1,d0 | 11 | multiplier/multiplicand dependency | - |
vfma.d d0,d1,d2 | 10 | accumulate dependency only | - |
vfma.d d0,d1,d0 | 10 | multiplier/multiplicand dependency | - |
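Two of the patterns in the table reduce to simple formulas, sketched here in Python. The function names are assumptions. So is the reading of "significant bits in result" as the bit length of the quotient, and the `ldm` formula is only supported by the table for two to eight registers.

```python
def udiv_cycles(numerator, denominator):
    """Cycles for udiv per the table: 3 + ceil(s / 2).

    Inputs are assumed to be unsigned 32-bit values. s is taken to be the
    bit length of the quotient (an interpretation of the table's remark).
    """
    if denominator == 0:
        return 3  # the table reports 3 cycles for division by zero
    s = (numerator // denominator).bit_length()
    return 3 + (s + 1) // 2  # (s + 1) // 2 is ceil(s / 2) for integers


def ldm_cycles(n_registers):
    """Cycles for ldm loading n_registers words: two words per cycle."""
    return (n_registers + 1) // 2
```

Under this reading a division with a small quotient is much cheaper than one with a full 32-bit quotient, which the formula puts at 3 + 16 = 19 cycles.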
Conclusions
Because the following rules are inferred from a relatively small number of tests, they are
necessarily approximate and incomplete.
- Two integer arithmetic, logical or single-register load operations will be dual issued if:
  - the second does not use the result (including flags) of the first;
  - they do not both involve a shift (or use a large immediate constant that implies a shift);
  - they are not both loads;
  - they are not both multiplies; and
  - neither is a division.
- An integer load can dual-issue with a floating-point arithmetic operation.
- A floating-point load can dual-issue with a single-precision floating-point arithmetic operation.
- Shifting the result of the previous instruction incurs a one-cycle result delay.
- Integer multiplications and multiply-accumulate operations can be issued on every cycle but have a result delay of two cycles.
- Integer multiply-accumulate operations into the same accumulator can run at one per cycle.
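The integer dual-issue rules above can be written as a small predicate, which may be handy for checking a hand-scheduled instruction sequence. This is a simplified sketch in Python: the `Op` type, its fields and the `can_dual_issue` name are all hypothetical, it covers only the integer rules listed (not `ldm` or floating point), and it inherits the approximate nature of those rules.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Op:
    kind: str                  # "alu", "mul", "load" or "div"
    dst: tuple = ()            # registers written
    src: tuple = ()            # registers read
    shifted: bool = False      # shifted operand or large (shifted) immediate
    sets_flags: bool = False   # e.g. adcs
    reads_flags: bool = False  # e.g. adc, adcs


def can_dual_issue(first, second):
    """Apply the integer dual-issue rules to two adjacent instructions."""
    if "div" in (first.kind, second.kind):
        return False  # neither may be a division
    if first.kind == second.kind and first.kind in ("load", "mul"):
        return False  # not both loads, not both multiplies
    if first.shifted and second.shifted:
        return False  # not both involving a shift
    if set(first.dst) & set(second.src):
        return False  # second uses a register result of the first
    if first.sets_flags and second.reads_flags:
        return False  # second uses flags produced by the first
    return True


# eor r0,r1,r2 ; eor r3,r4,r5  (table: dual issued)
independent = can_dual_issue(
    Op("alu", dst=("r0",), src=("r1", "r2")),
    Op("alu", dst=("r3",), src=("r4", "r5")),
)

# adcs r0,r1,r2 ; adcs r3,r4,r5  (table: not dual issued, carry dependency)
carry_chain = can_dual_issue(
    Op("alu", ("r0",), ("r1", "r2"), sets_flags=True, reads_flags=True),
    Op("alu", ("r3",), ("r4", "r5"), sets_flags=True, reads_flags=True),
)
```

Note that a plain `adc` reads the carry flag but does not set it, so two independent `adc` instructions still pair, in agreement with the table.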
Throughput and latency summary
Instruction | Throughput (in cycles) | Latency (in cycles) | Remarks |
arithmetic and logical | 0.5 or 1 | 1 | |
mul, umull | 1 | 2 | |
mla, umlal | 1 | 1 | from addend |
| | 2 | from multiplier/multiplicand |
udiv | 3+⌈s/2⌉ | 3+⌈s/2⌉ | s=number of significant bits in result |
vadd.f | 1 | 3 | |
vmul.f | 1 | 3 | |
vdiv.f | 16 | 18 | |
vsqrt.f | 14 | 16 | |
vmla.f | 3 | 6 | from multiplier/multiplicand |
vfma.f | 3 | 5 | from multiplier/multiplicand |
vadd.d | 2 | 4 | |
vmul.d | 5 | 7 | |
vdiv.d | 30 | 32 | |
vsqrt.d | 28 | 30 | |
vmla.d | 11 | 11 | from multiplier/multiplicand |
vfma.d | 10 | 10 | from multiplier/multiplicand |
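One way to use the throughput and latency figures is to estimate how many independent accumulators are needed to hide the latency of a dependent chain, for example when summing an array with `vadd.f`. The sketch below is a first-order estimate in Python, not something measured on this page: the function name is an assumption, and the model ignores loads, loop overhead, pipeline fill and register pressure.

```python
def reduction_cycles(n_ops, throughput, latency, accumulators=1):
    """Rough cycle estimate for n_ops accumulating operations.

    The operations are assumed to be interleaved round-robin over the given
    number of independent accumulators, so each chain sees a new operation
    only every `accumulators` issue slots.
    """
    per_op = max(throughput, latency / accumulators)
    return n_ops * per_op


# vadd.f figures from the summary table: throughput 1, latency 3.
single = reduction_cycles(12, throughput=1, latency=3, accumulators=1)
triple = reduction_cycles(12, throughput=1, latency=3, accumulators=3)
```

With a single accumulator every addition waits for the previous one, so the estimate is bound by latency. With three accumulators the model becomes bound by throughput, and adding a fourth gives no further gain.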
This page most recently updated
Wed 24 Feb 16:34:42 GMT 2021
Copyright ©2004–2021.
All trademarks used are hereby acknowledged.