Currently the "sequential" and "parallel" benchmarks reports somewhat
different timings. For sequential it's time-to-instantiate, but for
parallel it's time-to-instantiate-10k instances. The parallelism in the
parallel benchmark can also theoretically be affected by rayon's
work-stealing. For example if rayon doesn't actually do any work
stealing at all then this ends up being a sequential test again.
Otherwise though it's possible for some threads to finish much earlier
as rayon isn't guaranteed to keep threads busy.
This commit applies a few updates to the benchmark:
* First an `InstancePre<T>` is now used instead of a `Linker<T>` to
front-load type-checking and avoid that on each instantiation (and
this is generally the fastest path to instantiate right now).
* Next the instantiation benchmark is changed to measure one
instantiation-per-iteration to measure per-instance instantiation to
better compare with sequential numbers.
* Finally rayon is removed in favor of manually creating background
threads that infinitely do work until we tell them to stop. These
background threads are guaranteed to be working for the entire time
the benchmark is executing and should theoretically exhibit what the
situation that there's N units of work all happening at once.
I also applied some minor updates here such as having the parallel
instantiation defined conditionally for multiple modules as well as
upping the limits of the pooling allocator to handle a large module
(rustpython.wasm) that I threw at it.