

Performance

On RISCs the Gforth engine is very close to optimal; i.e., it is usually impossible to write a significantly faster engine.

On register-starved machines such as the 386 architecture processors, improvements are possible, because gcc does not use the registers as well as a human does, even with explicit register declarations. For example, Bernd Beuster wrote a Forth system fragment in assembly language and hand-tuned it for the 486; on a 486DX2/66, this system is 1.19 times faster on the Sieve benchmark than Gforth compiled with gcc-2.6.3 and -DFORCE_REG.

However, this potential advantage of assembly language implementations is not necessarily realized in complete Forth systems. We compared Gforth (direct threaded, compiled with gcc-2.6.3 and -DFORCE_REG) with Win32Forth 1.2093, LMI's NT Forth (Beta, May 1994) and Eforth (with and without peephole (aka pinhole) optimization of the threaded code); all these systems were written in assembly language.

We also compared Gforth with three systems written in C: PFE-0.9.14 (compiled with gcc-2.6.3 with the default configuration for Linux: -O2 -fomit-frame-pointer -DUSE_REGS -DUNROLL_NEXT), ThisForth Beta (compiled with gcc-2.6.3 -O3 -fomit-frame-pointer; ThisForth employs peephole optimization of the threaded code) and TILE (compiled with make opt).

We benchmarked Gforth, PFE, ThisForth and TILE on a 486DX2/66 under Linux. Kenneth O'Heskin kindly provided the results for Win32Forth and NT Forth on a 486DX2/66 with similar memory performance under Windows NT. Marcel Hendrix ported Eforth to Linux, then extended it to run the benchmarks, added the peephole optimizer, ran the benchmarks and reported the results.

We used four small benchmarks: the ubiquitous Sieve; bubble-sorting and matrix multiplication, which come from the Stanford integer benchmarks and were translated into Forth by Martin Fraeman (we used the versions included in the TILE Forth package, but with bigger data set sizes); and a recursive Fibonacci number computation for benchmarking calling performance.

The following table shows the time taken for the benchmarks scaled by the time taken by Gforth (in other words, it shows the speedup factor that Gforth achieved over the other systems). A value above 1.00 means that the system took longer than Gforth; a value below 1.00 means that it was faster.

relative time  Gforth  Win32Forth  NT Forth  eforth  eforth+opt   PFE  ThisForth  TILE
sieve            1.00        1.39      1.14    1.39        0.85  1.58       3.18  8.58
bubble           1.00        1.31      1.41    1.48        0.88  1.50             3.88
matmul           1.00        1.47      1.35    1.46        0.74  1.58             4.09
fib              1.00        1.52      1.34    1.22        0.86  1.74       2.99  4.30
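
As a reminder of how the entries in the table above are derived, here is a minimal sketch in C. The function names are made up for illustration, and the table itself gives no raw run times, so any timings passed to these functions are hypothetical.

```c
/* Relative time as used in the table: a system's run time divided by the
   baseline's (here Gforth's) run time for the same benchmark. */
double relative_time(double system_seconds, double baseline_seconds)
{
    return system_seconds / baseline_seconds;
}

/* Extra run time over the baseline in percent; the result is negative
   if the system is faster than the baseline. */
double percent_slower(double relative)
{
    return (relative - 1.0) * 100.0;
}
```

For example, with hypothetical timings of 13.9 s for some system and 10.0 s for Gforth, relative_time yields about 1.39, and percent_slower(0.85) yields about -15, i.e., a system with a relative time of 0.85 needs about 15% less time than Gforth.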

You may find the good performance of Gforth compared with the systems written in assembly language quite surprising. One important reason for the disappointing performance of these systems is probably that they are not written optimally for the 486 (e.g., they use the lods instruction). In addition, Win32Forth uses a comfortable, but costly method for relocating the Forth image: like cforth, it computes the actual addresses at run time, resulting in two address computations per NEXT (see section Image File Background).
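
The cost of run-time relocation can be sketched as follows. This is not code from Win32Forth or cforth; it is a small indirect threaded toy interpreter in standard C, and all names in it (prims, rel_image, abs_thread, run_relative, run_absolute) are invented for this example. In the relocatable variant every cell holds an offset, so each NEXT has to add a base address twice; in the absolute variant the cells already hold real addresses.

```c
#include <stddef.h>
#include <stdint.h>

typedef void (*prim_fn)(void);

static intptr_t stack[16];
static int sp;

static void lit5(void) { stack[sp++] = 5; }
static void dup_(void) { stack[sp] = stack[sp - 1]; sp++; }
static void add(void)  { sp--; stack[sp - 1] += stack[sp]; }

/* Table of code addresses; it stands in for the primitives' machine code. */
static const prim_fn prims[] = { lit5, dup_, add };

/* Relocatable image (hypothetical layout): cells 0..2 are code fields that
   hold offsets into prims; cells 3..6 are threaded code that holds image
   offsets of code fields, terminated by -1. */
const intptr_t rel_image[] = { 0, 1, 2,   0, 1, 2, -1 };

/* Absolute threaded code: each cell is the real address of a code field. */
const prim_fn *const abs_thread[] = { &prims[0], &prims[1], &prims[2], NULL };

/* NEXT with run-time relocation: two base-plus-offset computations. */
intptr_t run_relative(const intptr_t *image_base, size_t start)
{
    const intptr_t *ip = image_base + start;
    sp = 0;
    while (*ip >= 0) {
        const intptr_t *cfa = image_base + *ip++;  /* address computation 1 */
        const prim_fn *code = prims + *cfa;        /* address computation 2 */
        (*code)();
    }
    return stack[sp - 1];
}

/* NEXT without relocation: the cells are used as addresses directly. */
intptr_t run_absolute(const prim_fn *const *ip)
{
    sp = 0;
    while (*ip != NULL) {
        const prim_fn *cfa = *ip++;
        (*cfa)();
    }
    return stack[sp - 1];
}
```

Both run_relative(rel_image, 3) and run_absolute(abs_thread) execute the same three primitives and leave 10 on the stack; the difference lies only in the extra additions per executed word.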

Only Eforth with the peephole optimizer performs comparably to Gforth (in the table above it takes 0.74 to 0.88 times as long as Gforth). The speedups achieved with peephole optimization of threaded code are quite remarkable. Adding a peephole optimizer to Gforth should cause similar speedups.
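
To illustrate the general idea of peephole optimization of threaded code, here is a minimal sketch in C. It does not reproduce the Eforth or ThisForth optimizers, whose rewrite rules are not described here; the instruction set, the fused instruction OP_LIT_ADD and the function names are all hypothetical. The optimizer replaces the sequence "LIT n ADD" by a single combined instruction, so the interpreter performs fewer dispatches for the same result. A switch-based interpreter is used only to keep the example portable.

```c
#include <stddef.h>

enum op { OP_LIT, OP_ADD, OP_LIT_ADD, OP_DUP, OP_HALT };

/* Rewrite code[0..n-1] in place and return the new length.
   OP_LIT and OP_LIT_ADD are followed by one inline operand. */
size_t peephole(int *code, size_t n)
{
    size_t in = 0, out = 0;
    while (in < n) {
        if (code[in] == OP_LIT && in + 2 < n && code[in + 2] == OP_ADD) {
            code[out++] = OP_LIT_ADD;        /* fuse LIT n ADD */
            code[out++] = code[in + 1];
            in += 3;
        } else if (code[in] == OP_LIT || code[in] == OP_LIT_ADD) {
            code[out++] = code[in++];        /* copy the instruction ... */
            if (in < n)
                code[out++] = code[in++];    /* ... and its operand */
        } else {
            code[out++] = code[in++];
        }
    }
    return out;
}

/* Run the code, return the top of stack and count the dispatches.
   Assumes well-formed code that ends with OP_HALT. */
long run(const int *code, size_t *dispatches)
{
    long stack[32];
    size_t sp = 0, ip = 0;
    *dispatches = 0;
    for (;;) {
        ++*dispatches;
        switch (code[ip++]) {
        case OP_LIT:     stack[sp++] = code[ip++]; break;
        case OP_ADD:     sp--; stack[sp - 1] += stack[sp]; break;
        case OP_LIT_ADD: stack[sp - 1] += code[ip++]; break;
        case OP_DUP:     stack[sp] = stack[sp - 1]; sp++; break;
        default:         return sp ? stack[sp - 1] : 0;   /* OP_HALT */
        }
    }
}
```

For the ten-cell program LIT 2 LIT 3 ADD DUP LIT 1 ADD HALT, the unoptimized code needs 7 dispatches; after peephole it is 8 cells long and needs 5 dispatches, and both versions leave 6 on top of the stack.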

The speedup of Gforth over PFE, ThisForth and TILE is easily explained by the self-imposed restriction of the latter systems to standard C, which makes efficient threading impossible (however, the measured implementation of PFE uses a GNU C extension: see section `Defining Global Register Variables' in GNU C Manual). Moreover, current C compilers have a hard time optimizing other aspects of the ThisForth and the TILE source.
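
The difference can be sketched with a toy instruction set (the opcodes and function names below are invented for this example and are not taken from any of the measured systems). In standard C the interpreter typically goes through one central switch for every instruction. With the GNU C labels-as-values extension, which is one way to implement threaded code in C, each instruction can jump directly to the code of the next one. The sketch assumes valid opcodes and programs of at most MAX_CODE cells.

```c
#include <stddef.h>

enum { OP_INC, OP_DOUBLE, OP_HALT };
#define MAX_CODE 16

/* Standard C: every instruction returns to the central switch. */
long run_switch(const unsigned char *ip, long acc)
{
    for (;;) {
        switch (*ip++) {
        case OP_INC:    acc += 1; break;
        case OP_DOUBLE: acc *= 2; break;
        default:        return acc;      /* OP_HALT */
        }
    }
}

#ifdef __GNUC__
/* GNU C: the threaded code consists of label addresses, and NEXT is an
   indirect jump straight to the code of the next instruction. */
long run_threaded(const unsigned char *prog, long acc)
{
    static void *const labels[] = { &&do_inc, &&do_double, &&do_halt };
    void *thread[MAX_CODE];
    void **ip = thread;
    size_t n = 0;

    /* Translate opcodes into code addresses once, before execution. */
    for (;;) {
        thread[n] = labels[prog[n]];
        if (prog[n] == OP_HALT || n == MAX_CODE - 1)
            break;
        n++;
    }
    thread[n] = &&do_halt;               /* make sure the thread terminates */

#define NEXT goto **ip++
    NEXT;
do_inc:    acc += 1; NEXT;
do_double: acc *= 2; NEXT;
do_halt:   return acc;
#undef NEXT
}
#endif
```

For the program INC DOUBLE INC HALT and a start value of 3, both functions return 9; only the dispatch mechanism differs.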

Note that the performance of Gforth on 386 architecture processors varies widely with the version of gcc used. E.g., gcc-2.5.8 failed to allocate any of the virtual machine registers into real machine registers by itself and would not work correctly with explicit register declarations, giving a 1.3 times slower engine (on a 486DX2/66 running the Sieve) than the one measured above.

Note also that there have been several releases of Win32Forth since the release presented here, so the results presented here may have little predictive value for the performance of Win32Forth today.

In `Translating Forth to Efficient C' by M. Anton Ertl and Martin Maierhofer (presented at EuroForth '95), an indirect threaded version of Gforth is compared with Win32Forth, NT Forth, PFE, and ThisForth; that version of Gforth is 2%-8% slower on a 486 than the direct threaded version used here. The paper is available at http://www.complang.tuwien.ac.at/papers/ertl&maierhofer95.ps.gz; it also contains numbers for some native code systems. You can find a newer version of these measurements at http://www.complang.tuwien.ac.at/forth/performance.html. You can find numbers for Gforth on various machines in `Benchres'.

