Tweak Ion's Backtracking Allocator spill costs
Categories
(Core :: JavaScript Engine: JIT, enhancement, P1)
People
(Reporter: yulia, Assigned: yulia)
References
(Blocks 1 open bug)
Attachments
(7 files)
Some of the work I've been doing has resulted in this small change to the register allocator. It overlaps with Julian's work in bug 1758274 to introduce spill weighting by loop depth, also referred to as the "frequency". The change appears to have no negative consequences on Perfherder, improves the rust-fannkuch benchmark by about 20%, and brings us closer to the literature and to the LLVM implementation of the same register allocator (see https://llvm.org/doxygen/LiveIntervals_8cpp_source.html#l00857)
Assignee
Comment 1•1 year ago
This moves the calculation of the def spill weight out of computeSpillWeight and caches it as part of the range definition. If a range contains the definition, it sets (while setting hasDefinition) a spill cost that takes the loop depth into account. This is roughly inspired by the corresponding code in LLVM's backtracking register allocator:
https://llvm.org/doxygen/CalcSpillWeights_8cpp_source.html#l00201
The end result is roughly the same as the original algorithm, but this change alone yields a performance improvement of about 8% on rust-fannkuch in the embenchen test suite. rust-fannkuch is one large, long-running loop, which may explain it. The other affected test is the copy test, which slows down by about 3%, but that looks like it may be noise. Otherwise, some performance may be lost because the def weights are out of sync with the use weights: if a def occurs in a loop of any size, it is weighed disproportionately relative to the uses. Another contributing factor is that defs which occur without uses are stores, and stores are cheaper than loads, so the weight is currently incorrect because we have no depth calculation for uses. That is corrected in the next patch.
An interesting observation: simply removing any weight for defs gives the best performance, but that is likely an artifact of the tests we have.
There is no noticeable impact on CI tests (found here:
https://treeherder.mozilla.org/perfherder/compare?originalProject=try&originalRevision=d6bf54b3e00e9fc5e96a7e4c3f226031bf3e245a&newProject=try&newRevision=a95de146f7655404143fa77069f99b285a75f3ae )
Assignee
Comment 2•1 year ago
This is part 2 of using loop depth to calculate spill weight. The work here balances the def loop-depth calculation introduced in the previous patch: introducing an estimated frequency for use-related spill weights accounts for an additional 12% improvement on the rust-fannkuch test.
There is no change in our CI tests:
https://treeherder.mozilla.org/perfherder/compare?originalProject=try&originalRevision=d6bf54b3e00e9fc5e96a7e4c3f226031bf3e245a&newProject=try&newRevision=09a22f83bb140c23cdb0ec7a6a83b8552aa81d54&page=1
Depends on D210269
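The use side then mirrors the def side: each use's base cost is multiplied by an estimated execution frequency derived from its block's loop depth. A hedged sketch with invented names and constants:

```cpp
// Sketch with invented names; only "use weight scaled by estimated
// frequency from loop depth" is taken from the comment above.
#include <cstdint>
#include <vector>

struct UseSiteSketch {
  uint64_t baseCost;   // e.g. cheaper if the use could stay in memory
  uint32_t loopDepth;  // nesting depth of the block containing the use
};

// Assumed frequency model: each loop level multiplies the estimated
// execution count by a fixed factor (8 here, purely for illustration).
static uint64_t estimatedFrequency(uint32_t depth) {
  uint64_t freq = 1;
  for (uint32_t i = 0; i < depth; i++) {
    freq *= 8;
  }
  return freq;
}

// Total use-side contribution to a bundle's spill weight.
uint64_t useSpillWeightSketch(const std::vector<UseSiteSketch>& uses) {
  uint64_t total = 0;
  for (const UseSiteSketch& use : uses) {
    total += use.baseCost * estimatedFrequency(use.loopDepth);
  }
  return total;
}
```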
Assignee
Comment 3•1 year ago
This patch removes some complexity related to spill weights, namely the calculation of spill weight according to the use policy. It looks like this was introduced with the first version of the register allocator.
As far as I can tell, modifying the spill weight by policy does not result in better register allocation, and this strategy is not used by LLVM:
https://llvm.org/doxygen/LiveIntervals_8cpp_source.html#l00857
CI tests can be found here:
https://treeherder.mozilla.org/perfherder/compare?originalProject=try&originalRevision=d6bf54b3e00e9fc5e96a7e4c3f226031bf3e245a&newProject=try&newRevision=a6ad16a841c83f342aa06f4560577278af04aa0a&page=1&framework=13
Depends on D210270
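As an illustration of what dropping policy-dependent scaling can look like in general (the policy names, costs, and before/after shapes below are assumptions for this sketch, not the exact code touched by the patch):

```cpp
// Hypothetical before/after, loosely modelled on per-use register policies;
// the policy names and costs are made up for this sketch.
#include <cstdint>

enum class UsePolicySketch { Any, Register, Fixed };

// Before: the use's contribution to the spill weight depended on its policy.
uint64_t useCostByPolicySketch(UsePolicySketch policy) {
  switch (policy) {
    case UsePolicySketch::Any:      return 1000;  // could stay in memory
    case UsePolicySketch::Register: return 2000;  // must be in a register
    case UsePolicySketch::Fixed:    return 2000;  // pinned to one register
  }
  return 0;
}

// After: every use contributes a flat cost, and estimated frequency (loop
// depth) alone distinguishes hot uses from cold ones.
uint64_t useCostSimplifiedSketch() {
  return 2000;  // illustrative constant
}
```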
Assignee
Comment 4•1 year ago
This patch takes, wholesale, Julian's work from https://bugzilla.mozilla.org/show_bug.cgi?id=1758274.
The cache allows us to avoid recalculating spill weights during register allocation unless we need to (such as when there has been a spill).
In my testing on embenchen, we see a very modest improvement of ~1% across the tests. This patch is optional.
Depends on D210271
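A rough sketch of such a cache, with invented names; only the lazy recompute-on-invalidation idea is taken from the description above:

```cpp
// Invented names; only the "recompute lazily, invalidate on change" idea is
// taken from the description above.
#include <cstdint>

class BundleSketch {
 public:
  uint64_t spillWeight() {
    if (!cacheValid_) {
      cachedWeight_ = recomputeSpillWeight();
      cacheValid_ = true;
    }
    return cachedWeight_;
  }

  // Call whenever the bundle's ranges change (e.g. after a spill or split).
  void invalidateSpillWeight() { cacheValid_ = false; }

 private:
  uint64_t recomputeSpillWeight() const {
    // Placeholder for the real frequency-weighted def+use sum.
    return 0;
  }

  uint64_t cachedWeight_ = 0;
  bool cacheValid_ = false;
};
```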
Assignee
Comment 5•1 year ago
Concretely, embenchen reports the following results (comparison is of the average of 9 runs):
test            new    old    diff
-----------------------------------
box2d            589    587   1.003
bullet           703    703   1.0
conditionals     156    156   1.0
copy             486    486   1.0
corrections      388    388   1.0
fannkuch        2016   2020   0.998
fasta            804    806   0.998
fib              942    941   1.001
ifs              171    171   1.0
binarytrees     2467   2469   0.999
memops           786    786   1.0
primes           219    217   1.009
raybench         803    809   0.993
rust-fannkuch   1557   1897   0.821  <-- main improvement of this patch stack
skinning         414    414   1.0
zlib             822    821   1.001
Assignee
Comment 6•1 year ago
There is also a slight improvement on one of the files in https://bugzilla.mozilla.org/show_bug.cgi?id=1884572. It doesn't solve the problem in that bug, but it is an unexpected benefit.
Before:
Benchmarking, the higher ops/sec the better.
Firefox 127.0 on OS X 10.15.
Compress:
- zhipeng-jia/snappyjs x 6.89 ops/sec ±0.33% (22 runs sampled): 558 MB/s
- pierrec/node-lz4 x 6.23 ops/sec ±2.37% (19 runs sampled): 504 MB/s
- gorhill/lz4-block.js x 14.21 ops/sec ±1.15% (28 runs sampled): 1150 MB/s
- gorhill/lz4-block.wasm x 12.62 ops/sec ±0.60% (36 runs sampled): 1022 MB/s
Done.
Uncompress:
- zhipeng-jia/snappyjs x 9.34 ops/sec ±0.46% (27 runs sampled): 756 MB/s
- pierrec/node-lz4 x 8.58 ops/sec ±1.65% (26 runs sampled): 694 MB/s
- gorhill/lz4-block.js x 15.82 ops/sec ±0.82% (31 runs sampled): 1281 MB/s
- gorhill/lz4-block.wasm x 49.05 ops/sec ±1.79% (53 runs sampled): 3972 MB/s
Done.
After:
Benchmarking, the higher ops/sec the better.
Firefox 127.0 on OS X 10.15.
Compress:
- zhipeng-jia/snappyjs x 6.92 ops/sec ±0.39% (22 runs sampled): 560 MB/s
- pierrec/node-lz4 x 6.19 ops/sec ±0.48% (20 runs sampled): 501 MB/s
- gorhill/lz4-block.js x 14.15 ops/sec ±0.84% (28 runs sampled): 1146 MB/s
- gorhill/lz4-block.wasm x 12.93 ops/sec ±0.54% (36 runs sampled): 1047 MB/s
Done.
Uncompress:
- zhipeng-jia/snappyjs x 9.39 ops/sec ±0.48% (27 runs sampled): 760 MB/s
- pierrec/node-lz4 x 8.81 ops/sec ±1.02% (26 runs sampled): 713 MB/s
- gorhill/lz4-block.js x 16.19 ops/sec ±0.46% (31 runs sampled): 1311 MB/s
- gorhill/lz4-block.wasm x 49.46 ops/sec ±0.32% (53 runs sampled): 4006 MB/s
Done.
Comment 7•1 month ago
Here are three more patches, derived from the original set of 4 (a sketch of the scaled/unscaled weight types follows the list):
- bug1896581-1-UnscaledSW.diff-2: changes the spill weight from size_t to uint64_t so as to avoid dynamic-range concerns on 32-bit targets, and wraps it in a new type, UnscaledSW, so it can't be confused with any other numeric type.
- bug1896581-2-scale-by-depth.diff-2: adds scaling of spill costs by loop depth, by a factor of 8, and adds a new type, ScaledSW, so the compiler can enforce separation of scaled vs unscaled spill weights.
- bug1896581-3-simplify-calc.diff-2: is (I think) equivalent to the original "simplify spill weights calculation" patch.
I didn't work with the spill-weight-cache stuff since that doesn't affect the
resulting allocations. I also skipped the spill-weights-as-float thing, as
that doesn't seem like such a good idea in hindsight. (What if a spill weight
goes below zero as a result of rounding effects?)
Anyway, at least for 11 of the wasm tests in JetStream3, I failed to get any perf improvement out of this. Numbers -- larger are better -- are taken from the "Score:" values; Intel Tiger Lake, 3 GHz, 3rd best of 10 runs. Note that "simplified" is only the simplified-calculation change; it doesn't include depth-scaling.
                   basis    scaled   simplified
HashSet-wasm     125.728   125.574      125.251
tsf-wasm         159.627   158.006      158.002
quicksort-wasm   133.530   133.262      132.553
gcc-loops-wasm   134.870   134.701      134.884
richards-wasm    192.936   192.605      193.011
sqlite3-wasm      64.049    64.045       63.991
tfjs-wasm          1.166     1.168        1.166
tfjs-wasm-simd     2.802     2.827        2.813
argon2-wasm       97.385    97.022       97.212
8bitbench-wasm    16.906    16.875       17.033
zlib-wasm         76.796    76.749       76.549