Previously, our pattern-matching for generating load/store addresses was
somewhat limited. For example, it could not use a register-extend
addressing mode to handle the following CLIF:
```
v2760 = uextend.i64 v985
v2761 = load.i64 notrap aligned readonly v1
v1018 = iadd v2761, v2760
store v1017, v1018
```
This PR adds more general support for address expressions made up of
additions and extensions. In particular, it pattern-matches a tree of
64-bit `iadd`s, optionally with `uextend`/`sextend` from 32-bit values
at the leaves, to collect the list of all addends that form the address.
It also collects all offsets at leaves, combining them.
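
As a rough illustration of the collection step described above, the following Rust sketch walks a toy expression tree and gathers the 64-bit addends, the extended 32-bit addends, and a single combined offset. The `Expr`, `Extend`, and `Addends` types and the `collect_addends` function are hypothetical stand-ins invented for this example; they are not Cranelift's actual IR or lowering API.

```rust
/// Toy address expression (hypothetical; not Cranelift's real IR types).
/// Registers are identified by plain `u32` value numbers.
#[derive(Debug, Clone)]
enum Expr {
    /// An opaque 64-bit value.
    Reg64(u32),
    /// A 64-bit constant.
    Const(i64),
    /// `uextend.i64` of a 32-bit value.
    Uextend32(u32),
    /// `sextend.i64` of a 32-bit value.
    Sextend32(u32),
    /// A 64-bit `iadd` of two subexpressions.
    Add(Box<Expr>, Box<Expr>),
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Extend {
    Zero,
    Sign,
}

/// Everything that must be summed to form the address.
#[derive(Debug, Default, PartialEq, Eq)]
struct Addends {
    regs64: Vec<u32>,
    regs32: Vec<(u32, Extend)>,
    offset: i64,
}

/// Recursively flatten a tree of adds into its leaves.
fn collect(expr: &Expr, out: &mut Addends) {
    match expr {
        Expr::Add(lhs, rhs) => {
            collect(lhs, out);
            collect(rhs, out);
        }
        // Constant leaves are folded into one combined offset.
        Expr::Const(c) => out.offset = out.offset.wrapping_add(*c),
        Expr::Uextend32(r) => out.regs32.push((*r, Extend::Zero)),
        Expr::Sextend32(r) => out.regs32.push((*r, Extend::Sign)),
        Expr::Reg64(r) => out.regs64.push(*r),
    }
}

fn collect_addends(expr: &Expr) -> Addends {
    let mut out = Addends::default();
    collect(expr, &mut out);
    out
}
```

In this toy model, the CLIF example above corresponds to `Expr::Add(Box::new(Expr::Reg64(2761)), Box::new(Expr::Uextend32(985)))`, which flattens to one 64-bit addend, one zero-extended 32-bit addend, and an offset of zero.
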
It then applies a series of heuristics to make the best use of the
available addressing modes: it folds as many addends as it can (64-bit
registers, zero/sign-extended 32-bit registers, and/or an offset) into
the load/store itself, then computes the rest with add instructions as
necessary. It attempts to use immediate forms (add-immediate or
subtract-immediate) whenever possible, and also uses the built-in extend
operators on add instructions when possible. There are certainly cases
where this is not optimal (i.e., does not generate the strictly shortest
sequence of instructions), but it should be good enough for most code.
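
To make the shape of such heuristics concrete, here is a simplified Rust sketch of one possible priority order for choosing an addressing mode from the collected addends. It is an assumption-laden illustration, not the code in this PR: the `Mode`, `Leftover`, and `choose_mode` names are hypothetical, the priority order is only one plausible choice, and `MAX_IMM_OFFSET` is a simplification (real AArch64 immediate ranges depend on the access size and on scaled versus unscaled forms).

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Extend {
    Zero,
    Sign,
}

/// Hypothetical addressing modes, loosely modelled on AArch64 forms.
#[derive(Debug, PartialEq, Eq)]
enum Mode {
    /// 64-bit base plus an extended 32-bit index.
    RegExtended { base: u32, index: u32, extend: Extend },
    /// 64-bit base plus an immediate offset.
    RegOffset { base: u32, offset: i64 },
    /// 64-bit base plus a 64-bit index.
    RegReg { base: u32, index: u32 },
    /// A single 64-bit base register.
    Reg { base: u32 },
    /// No 64-bit addend exists; the base must first be built in a temporary.
    NeedsTemp,
}

/// Addends that did not fit in the mode and need explicit add instructions.
#[derive(Debug, Default, PartialEq, Eq)]
struct Leftover {
    regs64: Vec<u32>,
    regs32: Vec<(u32, Extend)>,
    offset: i64,
}

/// Assumed immediate limit for this sketch only.
const MAX_IMM_OFFSET: i64 = 4095;

fn choose_mode(
    mut regs64: Vec<u32>,
    mut regs32: Vec<(u32, Extend)>,
    mut offset: i64,
) -> (Mode, Leftover) {
    let mode = match regs64.pop() {
        None => Mode::NeedsTemp,
        Some(base) => {
            if let Some((index, extend)) = regs32.pop() {
                // Prefer folding an extend into the memory access.
                Mode::RegExtended { base, index, extend }
            } else if offset > 0 && offset <= MAX_IMM_OFFSET {
                // Otherwise fold a small positive offset.
                let folded = offset;
                offset = 0;
                Mode::RegOffset { base, offset: folded }
            } else if let Some(index) = regs64.pop() {
                Mode::RegReg { base, index }
            } else {
                Mode::Reg { base }
            }
        }
    };
    // Whatever remains would be added into the base beforehand.
    (mode, Leftover { regs64, regs32, offset })
}
```

With one 64-bit addend, one zero-extended 32-bit addend, and no offset (the CLIF example above), this sketch selects `Mode::RegExtended` and leaves nothing over, so no separate add instruction would be needed.
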
Using `perf stat` to measure instruction count (runtime only, on
wasmtime, after populating the cache to avoid measuring compilation),
this impacts `bz2` as follows:
```
pre:

       1006.410425      task-clock (msec)         #    1.000 CPUs utilized
               113      context-switches          #    0.112 K/sec
                 1      cpu-migrations            #    0.001 K/sec
             5,036      page-faults               #    0.005 M/sec
     3,221,547,476      cycles                    #    3.201 GHz
     4,000,670,104      instructions              #    1.24  insn per cycle
   <not supported>      branches
        27,958,613      branch-misses

       1.006071348 seconds time elapsed

post:

        963.499525      task-clock (msec)         #    0.997 CPUs utilized
               117      context-switches          #    0.121 K/sec
                 0      cpu-migrations            #    0.000 K/sec
             5,081      page-faults               #    0.005 M/sec
     3,039,687,673      cycles                    #    3.155 GHz
     3,837,761,690      instructions              #    1.26  insn per cycle
   <not supported>      branches
        28,254,585      branch-misses

       0.966072682 seconds time elapsed
```
In other words, this reduces instruction count by 4.1% on `bz2`.