- Classes and Structs explains the rationale for the test
- Struct Test Programs with documentation and download
- Sample Test Results measured on my system
- Test Conclusions drawn from these results
This page focuses on method inlining in the presence of user-defined value types, a test case that is notoriously problematic for the .NET CLR. The article Head-to-head benchmark: C++ vs .NET by “Qwertie” on Code Project offers a much broader comparison of computational performance on both platforms, including cases that are more favorable for the CLR and sometimes allow C# to approximate Visual C++ speed.
Is C++ worth it? by Daniel Lemire compares a simple numerical loop in Oracle Java 7 and various C++ compilers, with embarrassing results for the latter. Running his test on my own system, I found that Java outperformed both Visual C# and Visual C++ by a factor of three!
Ternary operator is twice as slow as an if-else block? reveals an amazing CLR optimization failure: C# ternary operators are 10% (x64) or up to 170% (x86) slower than equivalent if-else statements!
Classes and Structs
The .NET Common Language Runtime (CLR) offers two kinds of user-defined objects: reference types (declared as class in C#) and value types (declared as struct in C#). This is not equivalent to the C++ distinction between class and struct (which only affects default visibility), but rather to the C++ distinction of allocating an object with or without the new keyword to place it on the heap.
Structs are allocated on the stack when not embedded within other objects, and so do not increase the garbage collector’s workload. Struct allocation within other objects (e.g. array elements) has the same benefit and also avoids storing an extra reference per element, potentially saving large amounts of memory.
Another consequence is that struct contents are copied wholesale on variable assignments, whereas each assignment of a class variable only copies a single reference. These “copying assignments” occur more often than you might expect, e.g. when an object is passed to a method or retrieved from a collection. From a performance viewpoint, the extra time spent on copying large structs eventually erases the benefits of embedded allocation – hence the general recommendation to use structs only for small amounts of data.
Copying contents versus references also constitutes an important semantic distinction, but here we’ll focus on runtime performance. Structs should perform better than classes when objects are frequently created and accessed, provided content copying is inexpensive or can be optimized away. The applications that should benefit most are numerical algorithms and computational geometry, as they require efficient types for small tuples of floating-point values: complex numbers, two- or three-dimensional coordinates, etc.
Primitive Types and Structs
The following benchmark does not compare structs to classes, but rather user-defined structs (where available) or classes (where not) to equivalent tuples of built-in primitive types. Passing a struct to a method (by value) is semantically equivalent to passing its individual fields, and accessing its fields is equivalent to accessing individual variables of the same type within the same storage context.
A good optimizer should be able to exploit this equivalence and produce struct handling code that is indistinguishable from using the “naked” field-equivalent variables directly. This is what we’re going to examine, for the specific case of methods that don’t change their parameters and are small enough to be inlined.
Struct Test Programs
All results shown below were obtained with a suite of small test programs. The download package StructTest.zip (118 KB, ZIP archive) comprises the precompiled executables and their complete source code. Please refer to the enclosed ReadMe.txt file and the various batch files for the required development tools and expected file paths.
All tests perform 1,000,000,000 loop iterations over two pairs of double-precision values, representing a point’s x- and y-coordinate. We initialize all coordinates to 1, then in each iteration assign the cross-wise sum of all coordinates to the first pair: a := (ax + by, ay + bx). The final coordinates are printed before each result to ensure the calculations were performed correctly (and not optimized away entirely, which C++ can actually do!).
AddByVal — The simplest variant. Two Point arguments are supplied by value (i.e. their contents are copied), and a new
AddByRef — Two Point arguments are supplied by reference, and a new
AddByOut — Two Point arguments are supplied by reference, and a new
AddNaked — This variant uses no Point objects at all. All coordinates are defined, supplied, and returned as “naked” double values.
Sample Test Results
The present test results were obtained in February 2014. The Struct Performance tests were first published in June 2011 and updated in July 2012 and February 2013, so I have test results for various older versions of the tested compilers and runtimes. Some of the old results were originally obtained on Windows 7 SP1 (64 bit) and an Intel DX58SO motherboard, but the MSI board was a drop-in replacement with identical specifications and components. Those tests that I remeasured with old versions showed very little performance change, so the results should be comparable.
GCC & Visual C++
Table 1 shows C++ test results for gcc 4.8.2 (Windows port by MinGW-w64), Microsoft Visual C++ 2010 SP1, and Visual C++ 2013. Previously tested versions with identical results are not shown, including gcc 4.5.2 (32-bit only) and 4.7.0.
The VC++ 2010 results were obtained with the keyword inline preceding all measured functions. Retesting with current compilers I found this keyword made no difference, so I removed it.
|Table 1||gcc||VC++ 2010 SP1||VC++ 2013|
|32 bit||64 bit||32 bit||64 bit||32 bit||64 bit|
gcc — The only compiler in this comparison that correctly optimizes all test cases and delivers the same excellent performance in each of them.
Visual C++ — Microsoft had one optimizer that could match gcc, and that was the 32-bit version of VC++ 2010. The 64-bit version of that compiler and both versions of VC++ 2013 exhibit embarrassing optimization failures with simple user-defined types, falling behind both CLR and JVM. This should caution you against using VC++ as representative for “C++ performance,” by the way.
.NET CLR & Mono
Table 2 shows C# test results for Microsoft Visual C# 2013 (.NET Framework 4.5.1) and Mono 3.2.3. Previously tested versions with nearly identical results are not shown, including VC# 2010 (.NET 4), VC# 2012 (.NET 4.5), Mono 2.10.9 (32-bit only), and Mono 3.0.3.
.NET 4.5 introduced the method attribute option AggressiveInlining (MethodImplOptions.AggressiveInlining), which does have a noticeable effect – but not always a good one. Decorating all measured methods with this attribute yielded speedups of 10–25% but also slowdowns of 25–70%, depending on the test case. I decided to omit the attribute.
|Table 2||Visual C#||Mono|
|32 bit||64 bit||32 bit||64 bit|
Visual C# — Counterintuitively, the optimizer of the 32-bit CLR works correctly only when structs are passed by reference rather than by value – not a great alternative due to the changed semantics. While the 64-bit CLR also profits from this trick, it is slower to begin with, and struct handling never reaches the speed of naked double values. The latter are over 3× slower than gcc for either CLR, which is also rather unimpressive.
The likely cause for the call-by-reference speedup is the fact that our small test methods can be inlined. My guess:
1. The optimizer identifies call-by-reference structs with caller’s objects, and so wastes no time creating references.
2. The optimizer realizes that naked double values are not changed in the test method except on return, and so wastes no time copying them.
3. However, the optimizer fails to correctly analyze the use of call-by-value structs, and so always wastes time copying them.
Mono — The major third-party CLR is slower than Microsoft’s implementation by a factor of 1.7–4.6, and we again note the counterintuitive result that passing a small struct by reference is faster than passing it by value. This result is not a scathing criticism of Mono – merely keeping up with new .NET features while porting the CLR to many more platforms is quite an achievement! However, it does demonstrate that Mono is not an option if you’re looking for better performance.
Table 3 shows Java test results for Oracle Java Development Kit 7u13 and 8, using all available flavors of Client and Server VM. Other tested versions with nearly identical results are not shown, including JDK 7u3 (same as 7u13) and JDK 7u51 (same as 8).
|Table 3||Oracle JDK 7u13||Oracle JDK 8|
Java Client VM — This obsolete 32-bit VM performs as expected, i.e. somewhat slower than the CLR.
Java Server VM — On the other hand, the Server VM’s optimizer is so excellent that user-defined types roughly match pass-by-value structs on the 32-bit CLR, and all struct tests on the 64-bit CLR. As of 7u13, naked double values even came within 60% of the gcc baseline!
Sadly, this amazing result was lost in some optimizer tweak between 7u13 and 7u51. Running the Client vs Server benchmarks, I found that 7u51 is just as fast as 7u13 in the Fibonacci tests and actually 10-20% faster in SciMark, so this does not represent an overall performance regression.
|Table 4||Chrome||Firefox||Internet Explorer|
Chrome & Firefox — Both browsers come within 50% of the best Mono performance for user-defined types, making them perfectly suitable for general application development.
Shockingly, Firefox is second only to C++ for naked double values! I suspect it cheats a bit, though: my test does not require fractional values, so these double values may be internally represented as integers. Also note that browsers are prone to regressions in this benchmark (FF 10-19, IE 10-11).
Internet Explorer — While remaining proudly a year or two behind the competition in terms of performance, even IE11 finally matches the worst case of the Mono runtime for user-defined types. Only primitive operations remain problematic.