wasmtime

Author	SHA1	Message	Date
Trevor Elliott	9dc4f1a83c	s390x: Move the value out of the casloop_val_reg with mov_preg (#5430 ) The casloop_emit function in the s390x backend was using the fixed non-allocatable register %r0 directly with move instructions, which produced a panic in the regalloc2 checker (#5425). This PR changes the casloop_result function to use mov_preg instead of copy_reg to fetch the result, as it's not viewed by regalloc2 as a move. Fixes #5425	2022-12-14 13:06:35 -08:00
Chris Fallin	8383e4b6bd	egraph opt rules: do `(icmp cc x x) == {0,1}` only for integer types. (#5438 ) We could do these for vectors too, in theory, but for now let's fix the bug by applying the equivalence only for integer types. Fixes #5437.	2022-12-14 19:50:42 +00:00
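A minimal sketch of what this rule computes. It uses a local `IntCC` enum whose variant names mirror Cranelift's `IntCC`; it is not the ISLE rule itself. For a scalar integer `x`, every condition that includes equality folds to 1 and every strict or not-equal condition folds to 0. A plausible reason vectors need separate handling is that a vector `icmp` yields a per-lane mask rather than a scalar 0 or 1.

```rust
/// Local stand-in for Cranelift's integer condition codes (same variant names).
#[derive(Clone, Copy, Debug)]
enum IntCC {
    Equal,
    NotEqual,
    SignedLessThan,
    SignedGreaterThanOrEqual,
    SignedGreaterThan,
    SignedLessThanOrEqual,
    UnsignedLessThan,
    UnsignedGreaterThanOrEqual,
    UnsignedGreaterThan,
    UnsignedLessThanOrEqual,
}

/// Result of `icmp cc x, x` for a scalar integer `x`.
fn icmp_same_operand(cc: IntCC) -> u8 {
    use IntCC::*;
    match cc {
        // Conditions that include equality are always true for x vs. x.
        Equal | SignedGreaterThanOrEqual | SignedLessThanOrEqual
        | UnsignedGreaterThanOrEqual | UnsignedLessThanOrEqual => 1,
        // Strict and not-equal conditions are always false for x vs. x.
        NotEqual | SignedLessThan | SignedGreaterThan
        | UnsignedLessThan | UnsignedGreaterThan => 0,
    }
}
```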
Ulrich Weigand	df923f18ca	Remove MachInst::gen_constant (#5427 ) * aarch64: constant generation cleanup Add support for MOVZ and MOVN generation via ISLE. Handle f32const, f64const, and nop instructions via ISLE. No longer call Inst::gen_constant from lower.rs. * riscv64: constant generation cleanup Handle f32const, f64const, and nop instructions via ISLE. * s390x: constant generation cleanup Fix rule priorities for "imm" term. Only handle 32-bit stack offsets; no longer use load_constant64. * x64: constant generation cleanup No longer call Inst::gen_constant from lower.rs or abi.rs. * Refactor LowerBackend::lower to return InstOutput No longer write to the per-insn output registers; instead, return an InstOutput vector of temp registers holding the outputs. This will allow calling LowerBackend::lower multiple times for the same instruction, e.g. to rematerialize constants. When emitting the primary copy of the instruction during lowering, writing to the per-insn registers is now done in lower_clif_block. As a result, the ISLE lower_common routine is no longer needed. In addition, the InsnOutput type and all code related to it can be removed as well. * Refactor IsleContext to hold a LowerBackend reference Remove the "triple", "flags", and "isa_flags" fields that are copied from LowerBackend to each IsleContext, and instead just hold a reference to LowerBackend in IsleContext. This will allow calling LowerBackend::lower from within callbacks in src/machinst/isle.rs, e.g. to rematerialize constants. To avoid having to pass LowerBackend references through multiple functions, eliminate the lower_insn_to_regs subroutines in those targets that still have them, and just inline into the main lower routine. This also eliminates lower_inst.rs on aarch64 and riscv64. Replace all accesses to the removed IsleContext fields by going through the LowerBackend reference. 
* Remove MachInst::gen_constant This addresses the problem described in issue https://github.com/bytecodealliance/wasmtime/issues/4426 that targets currently have to duplicate code to emit constants between the ISLE logic and the gen_constant callback. After the various cleanups in earlier patches in this series, the only remaining user of gen_constant is put_value_in_regs in Lower. This can now be removed, and instead constant rematerialization can be performed in the put_in_regs ISLE callback by simply directly calling LowerBackend::lower on the instruction defining the constant (using a different output register). Since the check for egraph mode is now no longer performed in put_value_in_regs, the Lower::flags member becomes obsolete. Care needs to be taken that other calls directly to the Lower::put_value_in_regs routine now handle the fact that no more rematerialization is performed. All such calls in target code already historically handle constants themselves. The remaining call site in the ISLE gen_call_common helper can be redirected to the ISLE put_in_regs callback. The existing target implementations of gen_constant are then unused and can be removed. (In some targets there may still be further opportunities to remove duplication between ISLE and some local Rust code; this can be left to future patches.)	2022-12-13 13:00:04 -08:00
Chris Fallin	92ce79366c	riscv64: remove `valueregs_2_reg` extractor. (#5426 ) This extractor had a side-effect of invoking `put_in_regs`, which is not supposed to be invoked until the pattern-matching commits to evaluating a rule right-hand side (i.e., cannot backtrack). In this case the side-effect was mostly benign (in theory it could have caused additional values to be computed needlessly), but in general we should be careful to keep side-effects out of the left-hand side to enable further optimizations and work on islec. The implicit conversion from `Value` to `Reg` turns out to be enough to make the rules in question work, so we can simply remove the use of the extractor in this case.	2022-12-13 11:47:20 -08:00
Trevor Elliott	a5ecb5e647	x64: Share a zero in the ushr translation on x64 to free up a register (#5424 ) Share a zero value in the translation of ushr for i128. This increases the lifetime of the value by a few instructions, and reduces the number of registers used in the translation by one, which seems like an acceptable trade-off.	2022-12-12 18:15:43 -08:00
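To make the register trade-off concrete, here is a plain-Rust model of a 128-bit logical right shift over two 64-bit halves; it is an illustration, not the x64 ISLE lowering. The single `zero` value feeds both selects, the analogue of sharing one zeroed register, and the `if` expressions stand in for conditional moves. `ushr_i128` is a hypothetical name.

```rust
/// Logical right shift of a 128-bit value stored as (lo, hi) 64-bit halves.
fn ushr_i128(lo: u64, hi: u64, amt: u32) -> (u64, u64) {
    let amt = amt & 127; // shift amount is taken modulo 128
    let a = amt & 63; // amount within one 64-bit half
    let zero = 0u64; // computed once, used by both selects below
    // Bits carried from `hi` into `lo`; `hi << 64` would overflow, so select zero when a == 0.
    let carry = if a == 0 { zero } else { hi << (64 - a) };
    let lo_small = (lo >> a) | carry;
    let hi_small = hi >> a;
    if amt >= 64 {
        // Whole-word case: the old high half (shifted by amt - 64) becomes the low half.
        (hi_small, zero)
    } else {
        (lo_small, hi_small)
    }
}
```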
Chris Fallin	9397ea1abe	Cranelift: implement general select_spectre_guard fallbacks. (#5420 ) When adding some optimization rules for `icmp` in the egraph infrastructure, we ended up creating a path to legal CLIF but with patterns unsupported by three of our four backends: specifically, `select_spectre_guard` with a general truthy input, rather than an `icmp`. In #5206 we discussed replacing `select_spectre_guard` with something more specific, and that could still be a long-term solution here, but doing so now would interfere with ongoing refactoring of heap access lowering, so I've opted not to do so. (In that issue I was concerned about complexity and didn't see the need but with this fuzzbug I'm starting to feel a bit differently; maybe we should remove this non-orthogonal op in the long run.) Fixes #5417.	2022-12-12 17:13:34 -08:00
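A minimal sketch of the semantics a general fallback has to provide, assuming "truthy" means any nonzero value: derive an all-ones or all-zeros mask from the condition and blend with bitwise operations instead of branching. This models the data flow only; rustc may compile it differently, so it is not a Spectre-safe primitive in itself. `select_truthy` is a hypothetical name.

```rust
/// Select `if_true` when `cond` is nonzero, else `if_false`, without a branch in the data flow.
fn select_truthy(cond: u64, if_true: u64, if_false: u64) -> u64 {
    // (cond != 0) as u64 is 0 or 1; wrapping negation turns it into 0 or u64::MAX.
    let mask = ((cond != 0) as u64).wrapping_neg();
    (if_true & mask) | (if_false & !mask)
}
```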
Nick Fitzgerald	f2e1eaa847	cranelift-filetest: Add support for Wasm-to-CLIF translation filetests (#5412 ) This adds support for `.wat` tests in `cranelift-filetest`. The test runner translates the WAT to Wasm and then uses `cranelift-wasm` to translate the Wasm to CLIF. These tests are always precise output tests. The test expectations can be updated by running tests with the `CRANELIFT_TEST_BLESS=1` environment variable set, similar to our compile precise output tests. The test's expected output is contained in the last comment in the test file. The tests allow for configuring the kinds of heaps used to implement Wasm linear memory via TOML in a `;;!` comment at the start of the test. To get ISA and Cranelift flags parsing available in the filetests crate, I had to move the `parse_sets_and_triple` helper from the `cranelift-tools` binary crate to the `cranelift-reader` crate, where I think it logically fits. Additionally, I had to make some more bits of `cranelift-wasm`'s dummy environment `pub` so that I could properly wrap and compose it with the environment used for the `.wat` tests. I don't think this is a big deal, but if we eventually want to clean this stuff up, we can probably remove the dummy environments completely, remove `translate_module`, and fold them into these new test environments and test runner (since Wasmtime isn't using those things anyways).	2022-12-12 19:31:29 +00:00
Chris Fallin	244dce93f6	Fix optimization rules for narrow types: wrap i8 results to 8 bits. (#5409 ) * Fix optimization rules for narrow types: wrap i8 results to 8 bits. This fixes #5405. In the egraph mid-end's optimization rules, we were rewriting e.g. imuls of two iconsts to an iconst of the result, but without masking off the high bits (beyond the result type's width). This was producing iconsts with set high bits beyond their types' width, which is not legal. In addition, this PR adds some optimizations to the algebraic rules to recognize e.g. `x == x` (and all other integer comparison operators) and resolve to 1 or 0 as appropriate. * Review feedback. * Review feedback, again.	2022-12-09 22:29:25 +00:00
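The fix amounts to truncating each folded result to its type's width. A sketch for `imul`, using a hypothetical `fold_imul` helper on raw `u64` bit patterns; because the low `bits` bits of a product depend only on the low `bits` bits of its inputs, masking the output alone is sufficient.

```rust
/// Constant-fold `imul` for an integer type `bits` wide (8, 16, 32 or 64),
/// keeping only the bits that fit in the result type.
fn fold_imul(a: u64, b: u64, bits: u32) -> u64 {
    let product = a.wrapping_mul(b);
    if bits >= 64 {
        product
    } else {
        // Clear everything above the type's width so no high bits are set.
        product & ((1u64 << bits) - 1)
    }
}
```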
Ulrich Weigand	e913cf3647	Remove IFLAGS/FFLAGS types (#5406 ) All instructions using the CPU flags types (IFLAGS/FFLAGS) were already removed. This patch completes the cleanup by removing all remaining instructions that define values of CPU flags types, as well as the types themselves. Specifically, the following features are removed: - The IFLAGS and FFLAGS types and the SpecialType category. - Special handling of IFLAGS and FFLAGS in machinst/isle.rs and machinst/lower.rs. - The ifcmp, ifcmp_imm, ffcmp, iadd_ifcin, iadd_ifcout, iadd_ifcarry, isub_ifbin, isub_ifbout, and isub_ifborrow instructions. - The writes_cpu_flags instruction property. - The flags verifier pass. - Flags handling in the interpreter. All of these features are currently unused; no functional change intended by this patch. This addresses https://github.com/bytecodealliance/wasmtime/issues/3249.	2022-12-09 13:42:03 -08:00
Jamey Sharp	8bbd9bb228	aarch64: Test instruction selection for `bmask` (#5396 ) I copied the `bmask` precise-output tests from x64 and used CRANELIFT_TEST_BLESS=1 to generate this test. I don't know aarch64 well enough to decide if this output is correct. However, for I128 it is identical to the previous I128-only precise-output tests, and the existing runtests for bmask pass on aarch64, so I think it's likely correct.	2022-12-08 10:22:23 -08:00
Trevor Elliott	c5379051c4	Enable the ssa verifier in debug builds (#5354 ) Enable regalloc2's SSA verifier in debug builds to check for any outstanding reuse of virtual registers in def constraints. As fuzzing enables debug_assertions, this will enable the SSA verifier when fuzzing as well.	2022-12-07 12:22:51 -08:00
Nick Fitzgerald	f0c4b6f3a1	Cranelift: Implement `iadd_cout` on x64 for 32- and 64-bit integers (#5285 ) * Split the `iadd_cout` runtests by type * Implement `iadd_cout` for 32- and 64-bit values on x64 * Delete trailing whitespace in `riscv/lower.isle`	2022-12-07 19:54:14 +00:00
Chris Fallin	f980defe17	egraph support: rewrite to work in terms of CLIF data structures. (#5382 ) * egraph support: rewrite to work in terms of CLIF data structures. This work rewrites the "egraph"-based optimization framework in Cranelift to operate on aegraphs (acyclic egraphs) represented in the CLIF itself rather than as a separate data structure to which and from which we translate the CLIF. The basic idea is to add a new kind of value, a "union", that is like an alias but refers to two other values rather than one. This allows us to represent an eclass of enodes (values) as a tree. The union node allows for a value to have multiple representations: either constituent value could be used, and (in well-formed CLIF produced by correct optimization rules) they must be equivalent. Like the old egraph infrastructure, we take advantage of acyclicity and eager rule application to do optimization in a single pass. Like before, we integrate GVN (during the optimization pass) and LICM (during elaboration). Unlike the old egraph infrastructure, everything stays in the DataFlowGraph. "Pure" enodes are represented as instructions that have values attached, but that are not placed into the function layout. When entering "egraph" form, we remove them from the layout while optimizing. When leaving "egraph" form, during elaboration, we can place an instruction back into the layout the first time we elaborate the enode; if we elaborate it more than once, we clone the instruction. The implementation performs two passes overall: - One, a forward pass in RPO (to see defs before uses), that (i) removes "pure" instructions from the layout and (ii) optimizes as it goes. As before, we eagerly optimize, so we form the entire union of optimized forms of a value before we see any uses of that value. This lets us rewrite uses to use the most "up-to-date" form of the value and canonicalize and optimize that form. 
The eager rewriting and acyclic representation make each other work (we could not eagerly rewrite if there were cycles; and acyclicity does not miss optimization opportunities only because the first time we introduce a value, we immediately produce its "best" form). This design choice is also what allows us to avoid the "parent pointers" and fixpoint loop of traditional egraphs. This forward optimization pass keeps a scoped hashmap to "intern" nodes (thus performing GVN), and also interleaves on a per-instruction level with alias analysis. The interleaving with alias analysis allows alias analysis to see the most optimized form of each address (so it can see equivalences), and allows the next value to see any equivalences (reuses of loads or stored values) that alias analysis uncovers. - Two, a forward pass in domtree preorder, that "elaborates" pure enodes back into the layout, possibly in multiple places if needed. This tracks the loop nest and hoists nodes as needed, performing LICM as it goes. Note that by doing this in forward order, we avoid the "fixpoint" that traditional LICM needs: we hoist a def before its uses, so when we place a node, we place it in the right place the first time rather than moving later. This PR replaces the old (a)egraph implementation. It removes both the cranelift-egraph crate and the logic in cranelift-codegen that uses it. On `spidermonkey.wasm` running a simple recursive Fibonacci microbenchmark, this work shows 5.5% compile-time reduction and 7.7% runtime improvement (speedup). Most of this implementation was done in (very productive) pair programming sessions with Jamey Sharp, thus: Co-authored-by: Jamey Sharp <jsharp@fastly.com> * Review feedback. * Review feedback. * Review feedback. * Bugfix: cprop rule: `(x + k1) - k2` becomes `x - (k2 - k1)`, not `x - (k1 - k2)`. Co-authored-by: Jamey Sharp <jsharp@fastly.com>	2022-12-06 14:58:57 -08:00
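A toy model of the union idea, with hypothetical `Value` and `eclass_members` names rather than Cranelift's data structures: a union value points at two other values, so an eclass is a binary tree whose leaves are its alternative representations. It assumes, as the aegraph does, that a union only refers to values defined before it, which is what lets plain recursion terminate.

```rust
/// A value is either a concrete node (here just a label) or a union of two earlier values.
enum Value {
    Node(&'static str),
    Union(usize, usize),
}

/// Collect every concrete representation reachable from value `v`.
fn eclass_members(values: &[Value], v: usize, out: &mut Vec<&'static str>) {
    match values[v] {
        Value::Node(label) => out.push(label),
        Value::Union(a, b) => {
            // Either side may be used; both are equivalent by construction.
            eclass_members(values, a, out);
            eclass_members(values, b, out);
        }
    }
}
```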
Trevor Elliott	293bb5b334	riscv64: Only emit jumps at the end of basic blocks (#5381 ) This PR fixes two bugs in the riscv64 backend, where branch instructions were emitted in the middle of a basic block: constant emission, where the constants are inlined into the vcode and are jumped over at runtime; and the BrTableCheck pseudo-instruction, which was always emitted before a BrTable instruction and would handle jumping to the default label. The first bug was resolved by introducing two new pseudo-instructions, LoadConst32 and LoadConst64. Both of these instructions serve to delay the original encoding to emission time, after regalloc2 has run. The second bug was fixed by removing the BrTableCheck instruction. As it was always emitted directly before BrTable, it was easier to remove it and merge the two into a single instruction.	2022-12-06 10:54:10 -08:00
Trevor Elliott	353a681671	Avoid reusing a register during constant loading (#5379 ) Avoid reusing a register when loading a constant, allocating a temporary instead.	2022-12-05 18:37:53 -08:00
Trevor Elliott	7d28d586da	riscv64: Don't reuse registers when loading constants (#5376 ) Rework the constant loading functions in the riscv64 backend to generate fresh temporaries instead of reusing the destination register.	2022-12-05 16:51:52 -08:00
Trevor Elliott	817c2b205c	riscv64: Use a temporary when translating shift amount (#5375 ) Use a temporary when translating the shift amount, instead of reusing the destination register.	2022-12-05 20:54:14 +00:00
Trevor Elliott	b475b9bd19	Terminate blocks with a single branch in riscv64 (#5374 ) Ensure that we're terminating blocks with a single branch instruction, when testing I128 values against zero.	2022-12-05 20:13:28 +00:00
Trevor Elliott	6aea8e0d7e	Don't reuse destination registers when lowering splat on aarch64 (#5370 )	2022-12-05 08:18:49 -08:00
Trevor Elliott	2e9b0802ab	aarch64: Rework amode compilation to produce SSA code (#5369 ) Rework the compilation of amodes in the aarch64 backend to stop reusing registers and instead generate fresh virtual registers for intermediates. This resolves some SSA checker violations with the aarch64 backend, and as a nice side-effect removes some unnecessary movs in the generated code.	2022-12-02 01:23:15 +00:00
Trevor Elliott	d54a27d0ea	Allocate temporary intermediates when loading constants on aarch64 (#5366 ) As loading constants on aarch64 can take up to 4 instructions, we need to plumb through some additional registers. Rather than pass a fixed list of registers in, pass an allocation function.	2022-12-01 22:29:36 +00:00
Nick Fitzgerald	4510a4a805	Cranelift: mark post-legalization trapping blocks as cold (#5367 ) Trapping is a rare event.	2022-12-01 12:46:26 -08:00
Nick Fitzgerald	79f7fa6079	Cranelift: implement `heap_{load,store}` instruction legalization (#5351 ) * Cranelift: implement `heap_{load,store}` instruction legalization This does not remove `heap_addr` yet, but it does factor out the common bounds-check-and-compute-the-native-address functionality that is shared between all of `heap_{addr,load,store}`. Finally, this adds a missing optimization for when we can dedupe explicit bounds checks for static memories and Spectre mitigations. * Cranelift: Enable `heap_load_store_*` run tests on all targets	2022-11-30 19:12:49 +00:00
Alex Crichton	830885383f	Implement inline stack probes for AArch64 (#5353 ) * Turn off probestack by default in Cranelift The probestack feature is not implemented for the aarch64 and s390x backends and currently the on-by-default status requires the aarch64 and s390x implementations to be a stub. Turning off probestack by default allows the s390x and aarch64 backends to panic with an error message to avoid providing a false sense of security. When the probestack option is implemented for all backends, however, it may be reasonable to re-enable. * aarch64: Improve codegen for AMode fallback Currently the final fallback for finalizing an `AMode` will generate both a constant-loading instruction as well as an `add` instruction to the base register into the same temporary. This commit improves the codegen by removing the `add` instruction and folding the final add into the finalized `AMode`. This changes the `extendop` used but both registers are 64-bit so shouldn't be affected by the extending operation. * aarch64: Implement inline stack probes This commit implements inline stack probes for the aarch64 backend in Cranelift. The support here is modeled after the x64 support where unrolled probes are used up to a particular threshold after which a loop is generated. The instructions here are similar in spirit to x64 except that unlike x64 the stack pointer isn't modified during the unrolled loop to avoid needing to re-adjust it back up at the end of the loop. * Enable inline probestack for AArch64 and Riscv64 This commit enables inline probestacks for the AArch64 and Riscv64 architectures in the same manner that x86_64 has it enabled now. Some more testing was additionally added since on Unix platforms we should be guaranteed that Rust's stack overflow message is now printed too. 
* Enable probestack for aarch64 in cranelift-fuzzgen * Address review comments * Remove implicit stack overflow traps from x64 backend This commit removes implicit `StackOverflow` traps inserted by the x64 backend for stack-based operations. This was historically required when stack overflow was detected with page faults but Wasmtime no longer requires that since it's not suitable for wasm modules which call host functions. Additionally no other backend implements this form of implicit trap-code additions so this is intended to synchronize the behavior of all the backends. This fixes a test added prior for aarch64 to properly abort the process instead of accidentally being caught by Wasmtime. * Fix a style issue	2022-11-30 12:30:00 -06:00
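The unrolled-versus-loop decision can be sketched as a planning function. The names, the `guard_size` and `unroll_threshold` parameters, and rounding partial pages down (assuming the guard region is at least `guard_size` bytes, so a trailing partial page still faults) are assumptions for illustration. Offsets are measured below an unchanged stack pointer, matching the description that SP is not adjusted during the unrolled probes.

```rust
/// How a prologue would probe the new frame (hypothetical representation).
#[derive(Debug, PartialEq)]
enum ProbeStrategy {
    /// One probe per page, at these offsets below the stack pointer.
    Unrolled(Vec<u64>),
    /// A loop stepping down one page at a time until `end` bytes below SP.
    Loop { step: u64, end: u64 },
}

/// Plan probes for a frame of `frame_size` bytes; `guard_size` must be nonzero.
fn plan_probes(frame_size: u64, guard_size: u64, unroll_threshold: u64) -> ProbeStrategy {
    let probe_count = frame_size / guard_size;
    if probe_count <= unroll_threshold {
        // Few pages: emit one store per page at SP - guard_size, SP - 2 * guard_size, ...
        ProbeStrategy::Unrolled((1..=probe_count).map(|i| i * guard_size).collect())
    } else {
        // Many pages: a loop keeps code size constant.
        ProbeStrategy::Loop { step: guard_size, end: probe_count * guard_size }
    }
}
```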
Alex Crichton	86acb9a438	Use workspace inheritance for some more dependencies (#5349 ) Deduplicate some dependency directives through `[workspace.dependencies]`	2022-11-29 22:32:56 +00:00
Nick Fitzgerald	2ad3f78624	Cranelift: fix `heap_{load,store}` test generator script (#5348 ) Was missing some '$' characters and so was comparing string literals against string literals instead of variable values against string literals. Regenerated tests to fix them and add missing tests.	2022-11-29 20:53:14 +00:00
Nick Fitzgerald	913a2ec8c8	Cranelift: consider heap's guard pages when legalizing `heap_addr` (#5335 ) * Cranelift: consider heap's guard pages when legalizing `heap_addr` Fixes #5328 * Update comment to align more directly with implementation * Add legalization tests for `heap_addr` and offset guard pages	2022-11-29 19:54:25 +00:00
Afonso Bordado	ec342c20e3	cranelift: Add `iadd_cout` lowerings for aarch64 (#5177 ) * cranelift: Add `iadd_cout`/`isub_bout` i128 tests * aarch64: Add `iadd_cout` lowerings * fuzzgen: Add `iadd_cout`	2022-11-29 10:58:44 -08:00
Trevor Elliott	368004428a	Fix rule shadowing instances in x64 and aarch64 backends (#5334 ) Fix shadowing identified in #5322 for imul and swiden_high/swiden_low/uwiden_high/uwiden_low combinations in the x64 backend, and remove some redundant rules from the aarch64 dynamic neon ruleset. Additionally, add tests to the x64 backend showing that the imul specializations are firing.	2022-11-28 15:48:34 -08:00
Nick Fitzgerald	d0d3245a35	Cranelift: Add `heap_load` and `heap_store` instructions (#5300 ) * Cranelift: Define `heap_load` and `heap_store` instructions * Cranelift: Implement interpreter support for `heap_load` and `heap_store` * Cranelift: Add a suite of runtests for `heap_{load,store}` There are so many knobs we can twist for heaps and I wanted to exhaustively test all of them, so I wrote a script to generate the tests. I've checked in the script in case we want to make any changes in the future, but I don't think it is worth adding this to CI to check that scripts are up to date or anything like that. * Review feedback	2022-11-21 23:00:39 +00:00
Trevor Elliott	54cfa4df34	cranelift: Fix implicit pointer argument register use (#5301 ) * Fix arg handling to write to VRegs instead of physical regs * Make is_included_in_clobbers required, and handle Args on x64 and riscv64	2022-11-18 16:47:03 -08:00
Jun Ryung Ju	e5f93d9ec0	cranelift: Support `bnot`, `band`, `bor`, `bxor` for x86_64. (#5036 ) * Support `bnot`, `band`, `bor`, `bxor` for x86_64. * Fix-up to handle `B{8,16,32,64}` type on bitops * Fix-up conflict.	2022-11-18 07:45:54 -08:00
Trevor Elliott	07bd8bf34a	Remove unnecessary moves in x64 gen_memcpy (#5277 ) Remove some unnecessary moves in the x64 gen_memcpy implementation -- the call instruction that's generated will already constrain the args to those registers.	2022-11-16 10:33:00 -08:00
Afonso Bordado	a793648eb2	cranelift: Fix `fdemote` on the interpreter (#5158 ) * cranelift: Cleanup `fdemote`/`fpromote` tests * cranelift: Fix `fdemote`/`fpromote` instruction docs The verifier fails if the input and output types are the same for these instructions * cranelift: Fix `fdemote`/`fpromote` in the interpreter * fuzzgen: Add `fdemote`/`fpromote`	2022-11-15 22:22:00 +00:00
Nick Fitzgerald	9967782726	Cranelift(Aarch64): Optimize lowering of `icmp`s with immediates (#5252 ) We can encode more constants into 12-bit immediates if we do the following rewrite for comparisons with odd constants: A >= B + 1 ==> A - 1 >= B ==> A > B	2022-11-15 09:18:55 -08:00
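A sketch of why the rewrite helps, assuming AArch64's arithmetic immediate form (12 bits, optionally shifted left by 12) and an unsigned `>=` comparison; `is_imm12` and `choose_uge_immediate` are hypothetical helpers. For example, 4097 cannot be encoded, but `x >= 4097` is equivalent to `x > 4096`, and 4096 is `1 << 12`.

```rust
/// Whether `v` fits the 12-bit (optionally `LSL #12`) arithmetic immediate.
fn is_imm12(v: u64) -> bool {
    v < 0x1000 || ((v & 0xfff) == 0 && v < 0x100_0000)
}

/// For unsigned `x >= k`, return (constant, strict): use `x > k - 1`
/// when only `k - 1` is encodable, otherwise keep `x >= k`.
fn choose_uge_immediate(k: u64) -> (u64, bool) {
    if !is_imm12(k) && k > 0 && is_imm12(k - 1) {
        (k - 1, true) // x > k - 1
    } else {
        (k, false) // x >= k, unchanged
    }
}
```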
Nick Fitzgerald	c2a7ea7e24	Cranelift: de-duplicate bounds checks in legalizations (#5190 ) * Cranelift: Add the `DataFlowGraph::display_value_inst` convenience method * Cranelift: Add some `trace!` logs to some parts of legalization * Cranelift: de-duplicate bounds checks in legalizations When both (1) "dynamic" memories that need explicit bounds checks and (2) spectre mitigations that perform bounds checks are enabled, reuse the same bounds checks between the two legalizations. This reduces the overhead of explicit bounds checks and spectre mitigations over using virtual memory guard pages with spectre mitigations from ~1.9-2.1x overhead to ~1.6-1.8x overhead. That is about a 14-19% speedup when dynamic memories and spectre mitigations are enabled. execution :: instructions-retired :: benchmarks/spidermonkey/benchmark.wasm Δ = 3422455129.47 ± 120159.49 (confidence = 99%) virtual-memory-guards.so is 2.09x to 2.09x faster than bounds-checks.so! [6563931659 6564063496.07 6564301535] bounds-checks.so [3141492675 3141608366.60 3141895249] virtual-memory-guards.so execution :: instructions-retired :: benchmarks/bz2/benchmark.wasm Δ = 338716136.87 ± 1.38 (confidence = 99%) virtual-memory-guards.so is 2.08x to 2.08x faster than bounds-checks.so! [651961494 651961495.47 651961497] bounds-checks.so [313245357 313245358.60 313245362] virtual-memory-guards.so execution :: instructions-retired :: benchmarks/pulldown-cmark/benchmark.wasm Δ = 22742944.07 ± 331.73 (confidence = 99%) virtual-memory-guards.so is 1.87x to 1.87x faster than bounds-checks.so! [48841295 48841567.33 48842139] bounds-checks.so [26098439 26098623.27 26099479] virtual-memory-guards.so execution :: instructions-retired :: benchmarks/spidermonkey/benchmark.wasm Δ = 2465900207.27 ± 146476.61 (confidence = 99%) virtual-memory-guards.so is 1.78x to 1.78x faster than de-duped-bounds-checks.so! 
[5607275431 5607442989.13 5607838342] de-duped-bounds-checks.so [3141445345 3141542781.87 3141711213] virtual-memory-guards.so execution :: instructions-retired :: benchmarks/bz2/benchmark.wasm Δ = 234253620.20 ± 2.33 (confidence = 99%) virtual-memory-guards.so is 1.75x to 1.75x faster than de-duped-bounds-checks.so! [547498977 547498980.93 547498985] de-duped-bounds-checks.so [313245357 313245360.73 313245363] virtual-memory-guards.so execution :: instructions-retired :: benchmarks/pulldown-cmark/benchmark.wasm Δ = 16605659.13 ± 315.78 (confidence = 99%) virtual-memory-guards.so is 1.64x to 1.64x faster than de-duped-bounds-checks.so! [42703971 42704284.40 42704787] de-duped-bounds-checks.so [26098432 26098625.27 26099234] virtual-memory-guards.so execution :: instructions-retired :: benchmarks/bz2/benchmark.wasm Δ = 104462517.13 ± 7.32 (confidence = 99%) de-duped-bounds-checks.so is 1.19x to 1.19x faster than bounds-checks.so! [651961493 651961500.80 651961532] bounds-checks.so [547498981 547498983.67 547498989] de-duped-bounds-checks.so execution :: instructions-retired :: benchmarks/spidermonkey/benchmark.wasm Δ = 956556982.80 ± 103034.59 (confidence = 99%) de-duped-bounds-checks.so is 1.17x to 1.17x faster than bounds-checks.so! [6563930590 6564019842.40 6564243651] bounds-checks.so [5607307146 5607462859.60 5607677763] de-duped-bounds-checks.so execution :: instructions-retired :: benchmarks/pulldown-cmark/benchmark.wasm Δ = 6137307.87 ± 247.75 (confidence = 99%) de-duped-bounds-checks.so is 1.14x to 1.14x faster than bounds-checks.so! [48841303 48841472.93 48842000] bounds-checks.so [42703965 42704165.07 42704718] de-duped-bounds-checks.so * Update test expectations * Add a test for deduplicating bounds checks between dynamic memories and spectre mitigations * Define a struct for the Spectre comparison instead of using a tuple * More trace logging for heap legalization	2022-11-15 08:47:22 -08:00
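The core of the de-duplication is that one out-of-bounds condition feeds both consumers. A minimal sketch with hypothetical names (`checked_address`, `Trap`), modelling data flow rather than emitted code: the same `oob` value drives the Spectre-style select of a null address and the explicit trap.

```rust
/// Marker for an explicit out-of-bounds trap (hypothetical).
#[derive(Debug, PartialEq)]
struct Trap;

/// `offset_plus_size` is the static offset plus the access size.
fn checked_address(base: u64, index: u64, offset_plus_size: u64, bound: u64) -> Result<u64, Trap> {
    // One comparison: does the access end past the heap bound (or overflow)?
    let oob = index.checked_add(offset_plus_size).map_or(true, |end| end > bound);
    // Spectre guard: reuse `oob` to select a null address instead of re-comparing.
    let addr = if oob { 0 } else { base.wrapping_add(index) };
    // Explicit bounds check: trap on the very same condition.
    if oob { Err(Trap) } else { Ok(addr) }
}
```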
Trevor Elliott	dece901d16	Use regalloc constraints for sse blend operations (#5251 ) Instead of using xmm0 explicitly for the mask argument to instructions like blendvpd, use regalloc constraints to constrain it to xmm0 instead.	2022-11-14 16:44:34 -08:00
Afonso Bordado	ff46bbaebf	cranelift: Fix `iadd_carry`/`iadd_cout` in the interpreter (#5176 )	2022-11-14 10:18:28 -08:00
Trevor Elliott	0367fbc2d4	cranelift: Rework pinned register lowering (#5249 ) Rework pinned register lowering to avoid the use of pinned virtual registers, instead using the MovFromPReg and MovToPReg pseudo instructions.	2022-11-10 16:19:25 -08:00
Nick Fitzgerald	fc62d4ad65	Cranelift: Make `heap_addr` return calculated `base + index + offset` (#5231 ) * Cranelift: Make `heap_addr` return calculated `base + index + offset` Rather than return just the `base + index`. (Note: I've chosen to use the nomenclature "index" for the dynamic operand and "offset" for the static immediate.) This moves the addition of the `offset` into `heap_addr`, instead of leaving it for the subsequent memory operation, so that we can Spectre-guard the full address, and not allow speculative execution to read the first 4GiB of memory. Before this commit, we were effectively doing load(spectre_guard(base + index) + offset) Now we are effectively doing load(spectre_guard(base + index + offset)) Finally, this also corrects `heap_addr`'s documented semantics to say that it returns an address that will trap on access if `index + offset + access_size` is out of bounds for the given heap, rather than saying that the `heap_addr` itself will trap. This matches the implemented behavior for static memories, and after https://github.com/bytecodealliance/wasmtime/pull/5190 lands (which is blocked on this commit) will also match the implemented behavior for dynamic memories. * Update heap_addr docs * Factor out `offset + size` to a helper	2022-11-09 19:53:51 +00:00
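A sketch of the new `heap_addr` contract under simplifying assumptions: `heap_bound` stands in for the heap's current length, `None` models an address that traps on access, and `offset_plus_size` is a hypothetical stand-in for the `offset + size` helper this commit mentions (its real signature is not shown here). The guarded result now includes `offset`, so nothing is added after the guard.

```rust
/// Hypothetical stand-in for the factored-out `offset + size` computation.
fn offset_plus_size(offset: u32, access_size: u8) -> u64 {
    u64::from(offset) + u64::from(access_size)
}

/// Returns the full `base + index + offset`, or `None` if the access would be out of bounds.
fn heap_addr(base: u64, index: u64, offset: u32, access_size: u8, heap_bound: u64) -> Option<u64> {
    // Out of bounds if index + offset + access_size exceeds the bound (or overflows).
    let end = index.checked_add(offset_plus_size(offset, access_size))?;
    if end > heap_bound {
        None // the Spectre guard would select a null address here
    } else {
        Some(base.wrapping_add(index).wrapping_add(u64::from(offset)))
    }
}
```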
Trevor Elliott	b077854b57	Generate SSA code from returns (#5172 ) Modify return pseudo-instructions to have pairs of registers: virtual and real. This allows us to constrain the virtual registers to the real ones specified by the abi, instead of directly emitting moves to those real registers.	2022-11-08 16:00:49 -08:00
Ulrich Weigand	3e5938e65a	Support big- and little-endian lane order with bitcast (#5196 ) Add a MemFlags operand to the bitcast instruction, where only the `big` and `little` flags are accepted. These define the lane order to be used when casting between types of different lane counts. Update all users to pass an appropriate MemFlags argument. Implement lane swaps where necessary in the s390x back-end. This is the final part necessary to fix https://github.com/bytecodealliance/wasmtime/issues/4566.	2022-11-07 14:41:10 -08:00
Ulrich Weigand	fba2287c54	Fix mprotect failures by enabling cranelift-jit selinux-fix (#5204 ) The sample program in cranelift/filetests/src/function_runner.rs would abort with an mprotect failure under certain circumstances, see https://github.com/bytecodealliance/wasmtime/pull/4453#issuecomment-1303803222 Root cause was that enabling PROT_EXEC on the main process heap may be prohibited, depending on Linux distro and version. This only shows up in the doc test sample program because the main clif-util is multi-threaded and therefore allocations will happen on glibc's per-thread heap, which is allocated via mmap, and not the main process heap. Work around the problem by enabling the "selinux-fix" feature of the cranelift-jit crate dependency in the filetests. Note that this didn't compile out of the box, so a separate fix is also required and provided as part of this PR. Going forward, it would be preferable to always use mmap to allocate the backing memory for JITted code.	2022-11-04 14:01:37 -07:00
11evan	387426e7f4	cranelift: improve syscall error/oom handling in JIT module (#5173 ) * cranelift: improve syscall error/oom handling in JIT module The JIT module has several places where it `expect`s or `panic`s on syscall or allocator errors. For example, `mmap` and `mprotect` can fail if Linux `vm.max_map_count` is not high enough, and some users may wish to handle this error rather than immediately crashing. This commit plumbs these errors upward as new `ModuleError` types, so that callers of jit module functions like `finalize_definitions` and `define_function` can handle them (or just `unwrap()`, as desired). * cranelift: Remove ModuleError::Syscall variant Syscall errors can just be folded into the generic Backend error, which is an anyhow::Error * cranelift-jit: return io::ErrorKind::OutOfMemory for alloc failure Just using `io::Error::last_os_error()` is not correct as global allocator impls are not required to set errno	2022-11-03 16:59:41 -07:00
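A minimal sketch of the allocation-failure path, not the cranelift-jit code: a null pointer from `std::alloc::alloc` is turned into an `io::Error` of kind `OutOfMemory` rather than relying on errno. `try_alloc` is a hypothetical helper.

```rust
use std::alloc::{alloc, Layout};
use std::io;

/// Allocate `size` bytes aligned to `align`, reporting failure instead of panicking.
fn try_alloc(size: usize, align: usize) -> io::Result<(*mut u8, Layout)> {
    let layout = Layout::from_size_align(size, align)
        .map_err(|e| io::Error::new(io::ErrorKind::InvalidInput, e))?;
    if layout.size() == 0 {
        // `alloc` must not be called with a zero-sized layout.
        return Err(io::Error::new(io::ErrorKind::InvalidInput, "zero-sized allocation"));
    }
    // SAFETY: the layout has a nonzero size.
    let ptr = unsafe { alloc(layout) };
    if ptr.is_null() {
        // The global allocator need not set errno, so report the kind explicitly.
        return Err(io::Error::new(io::ErrorKind::OutOfMemory, "allocation failed"));
    }
    Ok((ptr, layout))
}
```

The caller owns the returned pointer and must release it with `std::alloc::dealloc` using the same `Layout`.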
Ulrich Weigand	137a8b710f	Move bitselect->vselect optimization to x64 back-end (#5191 ) The simplifier was performing an optimization to replace bitselect with vselect if all bytes of the condition mask could be shown to be all ones or all zeros. This optimization only ever made any difference in codegen on the x64 target. Therefore, move this optimization to the x64 back-end and perform it in ISLE instead. Resulting codegen should be unchanged, with slightly improved compile time. This also eliminates a few endian-dependent bitcast operations.	2022-11-03 20:17:36 +00:00
Afonso Bordado	3ef30b5b67	cranelift: Rename `i{min,max}` to `s{min,max}` (#5187 ) This brings these instructions in line with our general naming convention of signed instructions being prefixed with `s`.	2022-11-03 18:20:33 +00:00
Afonso Bordado	2c69b94744	cranelift: Add support for `bswap.i128` (#5186 ) * fuzzgen: Request only one variable for bswap This was included by accident. Bswap only has one input, instead of two. * cranelift: Add `bswap.i128` support Adds support only for x86, AArch64, S390X. RISCV does not yet have bswap.	2022-11-03 18:03:37 +00:00
Trevor Elliott	aeceea28e2	Remove trapif and trapff (#5162 ) This branch removes the trapif and trapff instructions, in favor of using an explicit comparison and trapnz. This moves us closer to removing iflags and fflags, but introduces the need to implement instructions like iadd_cout in the x64 and aarch64 backends.	2022-11-03 09:25:11 -07:00
Ulrich Weigand	961107ec63	Merge raw_bitcast and bitcast (#5175 ) - Allow bitcast for vectors with differing lane widths - Remove raw_bitcast IR instruction - Change all users of raw_bitcast to bitcast - Implement support for no-op bitcast cases across backends This implements the second step of the plan outlined here: https://github.com/bytecodealliance/wasmtime/issues/4566#issuecomment-1234819394	2022-11-02 10:16:27 -07:00
Trevor Elliott	09d8df6fab	Switch to `x64_rbp` to avoid the use of a pinned register (#5168 ) Avoid a use of preg_rbp in the x64 backend, using x64_rbp instead.	2022-11-01 13:23:33 -07:00


1387 Commits