A Curious Benchmark

Hi hackers! I recently saw a curious benchmark passed around on the #moarvm channel. (I understand that it originates from Ovid's Future of Perl 5 and 6 presentation, but I haven't seen that myself, so don't hesitate to correct me if I'm wrong). It's curious because it runs on both perl 5 and perl 6, and because the difference is… significant:

# reciprocal.pl
my $x = 0;
$x += 1/$_ for 1..50_000_000;
print "$x\n";

Perl 5 (5.24.1), on my laptop (Intel i7-4700HQ CPU @ 2.40GHz, 16GB RAM), takes approximately 2.7s to print the value 18.3047492382933. Perl 6, running the same script, takes just shy of 2m18s, or nearly 140s: about 51 times as long. Perl 5 is not known as a particularly fast language, but in this case perl 6 is really very slow indeed.

Let's try and do a little better. As pointed out by timotimo, the benchmark above is a pessimal case for rational numbers, which perl 6 uses by default; perl 5 uses floating point throughout. So we can do better by explicitly using floating point calculations in perl 6:

# reciprocal.pl6
my $x = 0e0;
$x += 1e0/$_.Num for 1..50_000_000;
say $x;

This takes approximately 30s on my machine: roughly 5 times faster, but still over 11 times slower than perl 5. (NB: I didn't run exhaustive tests or calculate statistics for any of the numbers posted here, but I feel they are reliable).
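
As an aside, here is a tiny illustration (mine, not part of the benchmark) of why the rational version is such a pessimal case: the denominator of the running sum grows very quickly, so every addition has to normalize ever larger integers, until the denominator no longer fits in 64 bits and the result quietly degrades to a Num anyway.

# rat-growth.pl6 (illustration only)
my $x = 0;
$x += 1/$_ for 1..10;
say $x.nude;         # (7381 2520): numerator and denominator after just 10 terms
say $x.denominator;  # 2520, the least common multiple of 1..10, and it keeps growing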

We can typically shave off some overhead in Perl 6 by using binding rather than assignment, which avoids scalar containers. We can avoid the dynamic lookup of $_ by replacing the for with a while loop. And we can skip the cast from Int to Num by using a Num loop variable. That gives us the following code:

# reciprocal-while.pl6
my $x := 0e0;
my $i := 0e0;
while (($i := $i + 1e0) <= 5e7) {  # <= so the final 1/5e7 term is included, as in the original
    $x := $x + 1e0/$i;
}
say $x;

This reduces the run time to approximately 26.5s. So instead of well over 11 times slower than perl 5, perl 6 is now a little less than 10 times slower.
I also tried using native types (a sketch of what that looks like follows below), but that actually increased the run time, to about 36s. Native type performance (except for natively typed arrays) has so far not met expectations for perl 6. (I understand that this is due to excessive boxing and unboxing, unfortunately). So it seems that I failed to make a perl 6 program that performs comparably to perl 5.
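
To be concrete, a native-typed variant would look something like the following sketch. The file name is mine and the exact code I timed may have differed slightly; note that natively typed variables must be assigned rather than bound:

# reciprocal-native.pl6 (file name mine, for illustration)
my num $x = 0e0;
my num $i = 0e0;
while $i < 5e7 {
    $i = $i + 1e0;       # native num arithmetic, no Int-to-Num cast needed
    $x = $x + 1e0 / $i;
}
say $x;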

And yet…

With all due respect to the perl 5 (and perl 6) developers, I think MoarVM ought to be able to do better. MoarVM has support for natively typed values and operators; the perl interpreter does not. It has a dynamic specializer and a JIT that compiles to native code; the perl interpreter does not. I would expect MoarVM to do better on this benchmark. (Otherwise, what have I been doing with my life these last few years?)

Let's try the secret weapon: NQP. NQP stands for Not Quite Perl and it is the language the rakudo compiler is built in. It acts as a 'bootstrap' language and compatibility layer between the various backends of rakudo perl 6, such as MoarVM, the JVM and JavaScript, and in the near future Truffle. I like to think that it relates to MoarVM in the same way that C relates to contemporary CPUs: it is a sort of low-level high-level language. Although NQP fully supports perl 6 classes, regular expressions and grammars, it has no support for ranges or C-style for loops, so it does tend to look a bit primitive:

# reciprocal-boxed.nqp
my $x := 0e0;
my $i := 1e0;
while ($i < 5e7) {
    $x := $x + 1e0/$i;
    $i := $i + 1e0;
}
nqp::say("$x");

This version uses boxed objects and takes (on my machine) approximately 12.5s. If you recall, perl 5 took 2.7s, so this is just a bit less than 5 times slower than perl 5. Still not very satisfying though.

Let's improve this version a little and add native types. The only difference between this code and the code above is that we have explicitly opted to use num values for $x and $i.

# reciprocal-native.nqp
my num $x := 0e0;
my num $i := 1e0;
while ($i < 5e7) {
    $x := $x + 1e0/$i;
    $i := $i + 1e0;
}
nqp::say("$x");

This code, with the JIT enabled, consistently runs in approximately 0.28s on my machine. That is not a typing error. It prints the correct result. I emphatically want you to try this at home: simply save it as reciprocal-native.nqp and run time nqp reciprocal-native.nqp. (With the JIT disabled, it runs in 2.4s, so even the MoarVM interpreter is, finally, a bit faster than perl 5 here).

Just out of curiosity, I tried comparing this result with the following C code:
 
#include <stdio.h>

int main(int argc, char **argv) {
    double x = 0.0;
    double i;
    for (i = 1.0;  i < 5e7; i += 1.0)
        x += 1.0/i;
    printf("%f\n", x);
}

On my machine, this takes approximately 0.22s per run, which means that NQP on MoarVM, using native types, takes roughly 1.3 times as long as compiled C. That excludes the compilation time of the C program; the NQP timing does include JIT compilation.

For the record, the compiled C code is much simpler and faster than that generated by the MoarVM JIT - 75 bytes for the C 'main' vs 1139 bytes for the NQP 'mainline' (149 bytes for the hot loop). So there is still plenty left to improve, for sure. (EDIT: I originally wrote this in a rather confusing way, but the C code was always the shorter version).

So what does that tell us? Is perl 6 50 times slower than perl 5, or is NQP just 30% slower than C? There is actually much more to learn from the relative performance of these programs:
  • Regardless of the mathematical merits of rational numbers, they do not win benchmarks. It is a good thing perl 6 can fall back to floating point.
  • The cost of boxing and unboxing in MoarVM appears to be fairly large. Compare the performance of the boxed floating point and the native floating point calculations in NQP: the cost of boxing absolutely dwarfs the actual cost of doing the computation. I think this deserves further investigation.
  • The overhead introduced by perl 6's flexibility, even after specialization, accounts for a factor-of-two slowdown compared to NQP. If I had to hazard a guess, I would attribute that to (multi) subroutine call overhead and indirection from scalar containers. I would speculate that (deep) inlining or trace specialization would help significantly.
  • The overhead of specialization and JIT compilation is insignificant even in a short-running benchmark like this. This is pretty encouraging I think.
  • This benchmark, when reduced to native operations, appears to be dominated by the cost of floating point operations, and I expect floating point division to be the most expensive of these. So the value as a benchmark for the JIT is relatively limited - both C and NQP code appear to spend most of the time waiting for the FPU to complete.
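
To put a rough number on that last point (a back-of-the-envelope estimate, not a measurement from this post): 0.22s for 5e7 iterations comes down to about 4.4ns per iteration, or some 10-15 cycles on this 2.4GHz CPU, which is of the same order as the latency of a single double-precision division. Both the C and the NQP loop, then, run about as fast as the divider allows.
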
After I wrote this (but before publishing), I reran the benchmarks with the new postrelease-opts branch. The results were rather different and encouraging (units in seconds; all programs running on MoarVM):

test                    master   postrelease-opts
reciprocal.pl           140      51
reciprocal.pl6          31       6.6
reciprocal-while.pl6    26       4.6
reciprocal-boxed.nqp    12       3.6
reciprocal-native.nqp   0.27     0.25

Congratulations to jnthn++ and timotimo++ for this fantastic result. As I understand it, we expect to merge this branch after the release of MoarVM 2018.08.

As always, benchmarks should be taken with a grain of salt - results do not always generalize cleanly to production code. But I hope this has given some insights into the performance characteristics of perl 6, and I think it highlights some areas of interest.

PS. The postrelease-opts branch has taken the place of the master branch as the mainline for development, as we aim to get master fit for a release. Personally, I'd rather have master open for development and a release branch for cutting a stable release. I expect most users get MoarVM from a specific revision via MOAR_REVISION anyway, so having master be the 'development' branch should cause little harm.

Comments

  1. The original would, I think, have been far worse if it had actually run to completion with rationals. But ...

    # reciprocal-rat.pl6
    my Rat $x = 0.Rat;
    $x += 1/$_ for 1..50_000_000;
    print "$x\n";

    perl6 reciprocal-rat.pl6
    Type check failed in assignment to $x; expected Rat but got Num (4.437963841730785e0)
    in block <unit> at reciprocal-rat.pl6 line 3

    A version with FatRat has already consumed more than 15min of CPU time and is still running as I write this.

  2. https://gist.github.com/donaldh/87441392c21f1190a3d82b385d95cc36#file-rational-twist-org


