Open Bug 1896581 Opened 1 year ago Updated 1 month ago

Tweak Ion's Backtracking Allocator spill costs

Categories

(Core :: JavaScript Engine: JIT, enhancement, P1)

People

(Reporter: yulia, Assigned: yulia)

References

(Blocks 1 open bug)

Details

Attachments

(7 files)

Some of the work I've been doing has resulted in this small change to the register allocator. It overlaps with Julian's work in bug 1758274 to introduce spill weighting by loop depth, also referred to as the "frequency". The change appears to have no negative consequences on Perfherder, improves the rust-fannkuch benchmark by about 20%, and brings us closer to the literature and to LLVM's implementation of the same register allocator (see https://llvm.org/doxygen/LiveIntervals_8cpp_source.html#l00857)

This moves the calculation of the def spill weight out of computeSpillWeight and
caches it as part of the range definition. If a range contains the definition, it
sets (while setting hasDefinition) a spill cost that takes the loop depth into account.

This is roughly inspired by the same code for LLVM's backtracking register allocator:
https://llvm.org/doxygen/CalcSpillWeights_8cpp_source.html#l00201
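
As a rough illustration of the idea (a sketch only, not the actual patch; the type,
field, and constant names below are invented for the example), the def's contribution
can be computed once, when the definition is attached to the range, scaled by the loop
depth of the defining block:

  // Hypothetical sketch: none of these names exist in the SpiderMonkey code.
  #include <cstdint>

  static const uint64_t kBaseDefWeight = 3000;  // made-up base cost

  struct LiveRangeSketch {
    bool hasDefinition = false;
    uint64_t defSpillWeight = 0;

    // Cache a depth-scaled spill weight when the range learns that it
    // contains the definition.  Deeper loops are assumed to run more often,
    // so spilling a value defined there is assumed to be more expensive.
    void setHasDefinition(uint32_t loopDepth) {
      hasDefinition = true;
      uint64_t weight = kBaseDefWeight;
      for (uint32_t i = 0; i < loopDepth; i++) {
        weight *= 8;  // assumed per-level factor (8 is used in later patches)
      }
      defSpillWeight = weight;
    }
  };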

The end result is roughly the same as the original algorithm, but this change alone
gives a performance improvement of about 8% on rust-fannkuch in the embenchen test
suite. rust-fannkuch is one large, long-running loop, which may explain why.

The other affected test is the copy test, which slows down by about 3%, though this
looks like it may be noise. Otherwise, performance could be lost because the def
weights are out of sync with the use weights: if a def occurs in a loop of any depth,
it is weighted disproportionately relative to its uses. Another contributing factor is
that defs without uses correspond to stores, which are cheaper than loads, so right
now the weight is skewed because we have no depth calculation for uses. That is
corrected in the next patch.

An interesting observation: simply removing any weight for defs gives the best
performance, but that is likely an artifact of the tests we have.

There is no noticeable impact on CI tests (found here:
https://treeherder.mozilla.org/perfherder/compare?originalProject=try&originalRevision=d6bf54b3e00e9fc5e96a7e4c3f226031bf3e245a&newProject=try&newRevision=a95de146f7655404143fa77069f99b285a75f3ae)

This is part 2 of using loop depth to calculate spill weight. The work here balances the def loop-depth
calculation introduced in the previous patch. Introducing an estimated frequency for use-related spill
weights accounts for an additional 12% improvement on the rust-fannkuch test.
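
A minimal sketch of how the two halves are meant to balance once both sides are
frequency-scaled (the names and cost constants are illustrative, not the real
computeSpillWeight): the def contributes a store-like cost and each use a reload-like
cost, each multiplied by an estimated execution frequency such as 8^loopDepth.

  // Illustrative only: invented names and costs.  'freq' stands for the
  // estimated execution frequency (e.g. 8^loopDepth) of the relevant block.
  #include <cstdint>
  #include <vector>

  struct UseSketch {
    uint64_t freq;  // estimated frequency of the block containing the use
  };

  static uint64_t spillWeightSketch(bool hasDef, uint64_t defFreq,
                                    const std::vector<UseSketch>& uses) {
    // Spilling across a def costs a store; spilling across a use costs a
    // reload.  Reloads are assumed to be more expensive than stores, which
    // is why scaling only the def side (part 1) left the weights skewed.
    const uint64_t kStoreCost = 1000;  // made-up
    const uint64_t kLoadCost = 2000;   // made-up
    uint64_t weight = 0;
    if (hasDef) {
      weight += kStoreCost * defFreq;
    }
    for (const UseSketch& use : uses) {
      weight += kLoadCost * use.freq;
    }
    return weight;
  }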

There is no change in our CI tests:
https://treeherder.mozilla.org/perfherder/compare?originalProject=try&originalRevision=d6bf54b3e00e9fc5e96a7e4c3f226031bf3e245a&newProject=try&newRevision=09a22f83bb140c23cdb0ec7a6a83b8552aa81d54&page=1

Depends on D210269

This patch removes some complexity related to spill weights: namely the calculation of
spill weight according to policy. It looks like this was introduced with the first
version of the register allocator.

As far as I can tell, modifying the spill weight by policy does not result in
better register allocation, and this strategy is not used by LLVM:
https://llvm.org/doxygen/LiveIntervals_8cpp_source.html#l00857
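
For illustration, a hedged before/after of the shape of the change (the policy names
and numbers here are stand-ins invented for the example, not SpiderMonkey's actual
policies or values):

  #include <cstdint>

  enum class UsePolicySketch { Any, Register, Fixed };

  // Before: the weight contributed by a use depended on its register policy.
  static uint64_t useWeightByPolicy(UsePolicySketch policy) {
    switch (policy) {
      case UsePolicySketch::Fixed:    return 2000;  // stand-in value
      case UsePolicySketch::Register: return 1500;  // stand-in value
      case UsePolicySketch::Any:      return 1000;  // stand-in value
    }
    return 0;
  }

  // After: every use contributes the same base cost; only the estimated
  // execution frequency of its block differentiates uses.
  static uint64_t useWeightUniform() { return 1000; }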

CI tests can be found here:
https://treeherder.mozilla.org/perfherder/compare?originalProject=try&originalRevision=d6bf54b3e00e9fc5e96a7e4c3f226031bf3e245a&newProject=try&newRevision=a6ad16a841c83f342aa06f4560577278af04aa0a&page=1&framework=13

Depends on D210270

This patch takes, wholesale, Julian's work from
https://bugzilla.mozilla.org/show_bug.cgi?id=1758274

The cache allows us to avoid recalculating spill weights during register allocation
unless we need to (such as after a spill).
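
A minimal sketch of the caching idea (the field and function names are invented for
the example, not taken from the patch): compute the weight once, keep it with the
bundle, and drop it only when something that affects it changes, such as the bundle
being split or spilled.

  #include <cstdint>
  #include <optional>

  struct BundleSketch {
    std::optional<uint64_t> cachedSpillWeight;

    // Return the cached weight, recomputing it only if the cache is empty.
    uint64_t spillWeight() {
      if (!cachedSpillWeight) {
        cachedSpillWeight = recomputeSpillWeight();
      }
      return *cachedSpillWeight;
    }

    // Called when the bundle changes in a way that affects its weight,
    // e.g. after it has been split or spilled.
    void invalidateSpillWeight() { cachedSpillWeight.reset(); }

   private:
    uint64_t recomputeSpillWeight() {
      // Expensive walk over the bundle's ranges and uses; omitted here.
      return 0;
    }
  };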

In my testing on embenchen, we see a very modest improvement of ~1% across the tests.
This patch is optional.

CI results:
https://treeherder.mozilla.org/perfherder/compare?originalProject=try&originalRevision=d6bf54b3e00e9fc5e96a7e4c3f226031bf3e245a&newProject=try&newRevision=c806b09d6d6e5d7d14cd4ed0b252056f4881f7a5&framework=13&page=3

Depends on D210271

Attachment #9401595 - Attachment description: Bug 1896581 - Cache def spill weight by depth; r=jseward → Bug 1896581 - Cache def spill weight with depth; r=jseward
Attachment #9401595 - Attachment description: Bug 1896581 - Cache def spill weight with depth; r=jseward → Bug 1896581 - Cache def spill weight by depth; r=jseward

Concretely, embenchen reports the following results (comparison is of the average of 9 runs):

test            new     old     diff (new/old)
-----------------------------------------------
box2d            589     587    1.003
bullet           703     703    1.0
conditionals     156     156    1.0
copy             486     486    1.0
corrections      388     388    1.0
fannkuch        2016    2020    0.998
fasta            804     806    0.998
fib              942     941    1.001
ifs              171     171    1.0
binarytrees     2467    2469    0.999
memops           786     786    1.0
primes           219     217    1.009
raybench         803     809    0.993
rust-fannkuch   1557    1897    0.821   <-- main improvement of this patch stack
skinning         414     414    1.0
zlib             822     821    1.001

There is also a slight improvement on one of the files in https://bugzilla.mozilla.org/show_bug.cgi?id=1884572. It doesn't solve the problem of that bug, but it is an unexpected benefit.

before

Benchmarking, the higher ops/sec the better.
Firefox 127.0 on OS X 10.15.

Compress:
  -   zhipeng-jia/snappyjs x 6.89 ops/sec ±0.33% (22 runs sampled): 558 MB/s
  -       pierrec/node-lz4 x 6.23 ops/sec ±2.37% (19 runs sampled): 504 MB/s
  -   gorhill/lz4-block.js x 14.21 ops/sec ±1.15% (28 runs sampled): 1150 MB/s
  - gorhill/lz4-block.wasm x 12.62 ops/sec ±0.60% (36 runs sampled): 1022 MB/s
  Done.

Uncompress:
  -   zhipeng-jia/snappyjs x 9.34 ops/sec ±0.46% (27 runs sampled): 756 MB/s
  -       pierrec/node-lz4 x 8.58 ops/sec ±1.65% (26 runs sampled): 694 MB/s
  -   gorhill/lz4-block.js x 15.82 ops/sec ±0.82% (31 runs sampled): 1281 MB/s
  - gorhill/lz4-block.wasm x 49.05 ops/sec ±1.79% (53 runs sampled): 3972 MB/s
  Done.

after:

Benchmarking, the higher ops/sec the better.
Firefox 127.0 on OS X 10.15.

Compress:
  -   zhipeng-jia/snappyjs x 6.92 ops/sec ±0.39% (22 runs sampled): 560 MB/s
  -       pierrec/node-lz4 x 6.19 ops/sec ±0.48% (20 runs sampled): 501 MB/s
  -   gorhill/lz4-block.js x 14.15 ops/sec ±0.84% (28 runs sampled): 1146 MB/s
  - gorhill/lz4-block.wasm x 12.93 ops/sec ±0.54% (36 runs sampled): 1047 MB/s
  Done.

Uncompress:
  -   zhipeng-jia/snappyjs x 9.39 ops/sec ±0.48% (27 runs sampled): 760 MB/s
  -       pierrec/node-lz4 x 8.81 ops/sec ±1.02% (26 runs sampled): 713 MB/s   
  -   gorhill/lz4-block.js x 16.19 ops/sec ±0.46% (31 runs sampled): 1311 MB/s
  - gorhill/lz4-block.wasm x 49.46 ops/sec ±0.32% (53 runs sampled): 4006 MB/s
  Done.
See Also: → 1884572
Blocks: sm-opt-jits
Severity: -- → N/A
Priority: -- → P1

Here are three more patches derived from the original set of 4.

  • bug1896581-1-UnscaledSW.diff-2: changes spill weight from size_t to uint64_t
    so as to avoid dynamic range concerns on 32-bit targets, and wraps it up in a
    new type UnscaledSW so it can't be confused with any other numeric type.

  • bug1896581-2-scale-by-depth.diff-2: adds scaling of spill costs by loop
    depth, by a factor of 8. Adds a new type ScaledSW so the compiler can
    enforce separation of scaled vs unscaled spill weights (a sketch of the
    idea follows this list).

  • bug1896581-3-simplify-calc.diff-2: is (I think) equivalent to the original
    "simplify spill weights calculation" patch.

I didn't work with the spill-weight-cache stuff since that doesn't affect the
resulting allocations. I also skipped the spill-weights-as-float thing, as
that doesn't seem like such a good idea in hindsight. (What if a spill weight
goes below zero as a result of rounding effects?)

Anyway, at least for 11 of the wasm tests in JetStream3, I failed to get any
perf improvement out of this. Numbers -- larger are better -- are taken from the
"Score:" values, Intel Tiger Lake, 3 GHz, 3rd best of 10 runs. Note:
"simplified" is only the simplified-calculation change; it doesn't include
depth-scaling.

                  basis     scaled    simplified

HashSet-wasm     125.728    125.574    125.251   
tsf-wasm         159.627    158.006    158.002   
quicksort-wasm   133.530    133.262    132.553   
gcc-loops-wasm   134.870    134.701    134.884   
richards-wasm    192.936    192.605    193.011   
sqlite3-wasm      64.049     64.045     63.991    
tfjs-wasm          1.166      1.168      1.166     
tfjs-wasm-simd     2.802      2.827      2.813     
argon2-wasm       97.385     97.022     97.212    
8bitbench-wasm    16.906     16.875     17.033    
zlib-wasm         76.796     76.749     76.549    