`Probably` the fastest Black-Scholes pricer in the world
This is a small investigative piece into the world of SIMD and quantitative pricing. As the famous beer commercial has it, this is probably the fastest Black-Scholes pricer in the world (but PRs are welcome to make it faster still!)
Black Scholes
I’m generally going to assume you are familiar with Black-Scholes option pricing, but here’s a brief introduction.
By making some fairly well-known assumptions, Fischer Black and Myron Scholes derived a closed-form mathematical solution that is used for vanilla option pricing.
Key to those assumptions are:
- The option is European and can only be exercised at expiration.
- Markets are efficient (i.e., market movements cannot be predicted).
- There are no transaction costs in buying the option.
- The risk-free rate and volatility of the underlying are known and constant.
- The log-returns of the underlying asset are normally distributed.
In the real world things are very different, but this model is a standard that is used and understood worldwide.
Why make a fast implementation?
Apart from the classic itch, “because I can”, it’s used in any options trading system for volatility-surface calibration. The faster you can knock these prices out, the faster your portfolio pricing can run. When you have 100,000+ options in a single portfolio (both vanilla and complex), a fast implementation is crucial to performance.
The formula
From Wikipedia the formula is typically written out as
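For a European call, the closed-form price is:

```latex
C = N(d_1)\,S - N(d_2)\,K e^{-rT}
\qquad
d_1 = \frac{\ln(S/K) + \left(r + \tfrac{\sigma^2}{2}\right)T}{\sigma\sqrt{T}},
\qquad
d_2 = d_1 - \sigma\sqrt{T}
```

where S is the spot price, K the strike, r the risk-free rate, σ the volatility, T the time to expiry, and N the standard normal cumulative distribution function.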
The main work is done in computing d1 and d2, and there’s a mysterious N() function.
Let’s implement this in Rust
A simple implementation that is verbatim from the formula above.
I’m not keen on the N() function, so let’s expand it to call it ncd() and fill that in below.
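A sketch of what that scalar implementation can look like (the function names `black_scholes_call` and `ncd` are my own, not necessarily the repo’s; for `ncd` I use the standard Abramowitz & Stegun polynomial approximation of the normal CDF, a common choice since Rust’s standard library has no `erf`):

```rust
/// Standard normal cumulative distribution function N(x), via the
/// Abramowitz & Stegun polynomial approximation (abs. error < 7.5e-8).
fn ncd(x: f64) -> f64 {
    let t = 1.0 / (1.0 + 0.2316419 * x.abs());
    let poly = t
        * (0.319381530
            + t * (-0.356563782
                + t * (1.781477937 + t * (-1.821255978 + t * 1.330274429))));
    let pdf = (-0.5 * x * x).exp() / (2.0 * std::f64::consts::PI).sqrt();
    let cdf = 1.0 - pdf * poly;
    if x >= 0.0 { cdf } else { 1.0 - cdf }
}

/// Black-Scholes price of a European call.
/// s: spot, k: strike, r: risk-free rate, sigma: volatility, t: years to expiry.
fn black_scholes_call(s: f64, k: f64, r: f64, sigma: f64, t: f64) -> f64 {
    let sqrt_t = t.sqrt();
    let d1 = ((s / k).ln() + (r + 0.5 * sigma * sigma) * t) / (sigma * sqrt_t);
    let d2 = d1 - sigma * sqrt_t;
    ncd(d1) * s - ncd(d2) * k * (-r * t).exp()
}

fn main() {
    // At-the-money call: S = K = 100, r = 5%, sigma = 20%, 1 year.
    println!("{:.4}", black_scholes_call(100.0, 100.0, 0.05, 0.2, 1.0));
}
```

For the at-the-money example above the price comes out at roughly 10.45.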
Initial performance test for 10,000,000 calculations
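The benchmark itself is straightforward; a minimal harness along these lines (my own sketch using `std::time::Instant`, repeating the scalar pricer so the snippet is self-contained; the 10,000,000 count is from the article, trimmed here so the sketch runs quickly even unoptimized):

```rust
use std::time::Instant;

fn ncd(x: f64) -> f64 {
    // Abramowitz & Stegun polynomial approximation of the normal CDF.
    let t = 1.0 / (1.0 + 0.2316419 * x.abs());
    let poly = t
        * (0.319381530
            + t * (-0.356563782
                + t * (1.781477937 + t * (-1.821255978 + t * 1.330274429))));
    let pdf = (-0.5 * x * x).exp() / (2.0 * std::f64::consts::PI).sqrt();
    let cdf = 1.0 - pdf * poly;
    if x >= 0.0 { cdf } else { 1.0 - cdf }
}

fn black_scholes_call(s: f64, k: f64, r: f64, sigma: f64, t: f64) -> f64 {
    let sqrt_t = t.sqrt();
    let d1 = ((s / k).ln() + (r + 0.5 * sigma * sigma) * t) / (sigma * sqrt_t);
    let d2 = d1 - sigma * sqrt_t;
    ncd(d1) * s - ncd(d2) * k * (-r * t).exp()
}

fn price_all(spots: &[f64]) -> Vec<f64> {
    spots
        .iter()
        .map(|&s| black_scholes_call(s, 100.0, 0.05, 0.2, 1.0))
        .collect()
}

fn main() {
    // The article benchmarks 10,000,000 calculations; 1,000,000 here.
    let n = 1_000_000;
    let spots: Vec<f64> = (0..n).map(|i| 50.0 + (i % 100) as f64).collect();
    let start = Instant::now();
    let prices = price_all(&spots);
    println!("priced {} options in {:?}", prices.len(), start.elapsed());
}
```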
Usually this would be an excellent result and we could leave it at that. But I wanted to go further.
Single Instruction Multiple Data (SIMD)
All modern CPUs have these types of instructions. But what does that mean?
If you have a few lines of code
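The snippet in question was presumably something along these lines:

```rust
fn main() {
    let a = 1;
    let b = 2;
    let c = a + b;
    println!("{c}"); // prints 3
}
```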
this adds 1 and 2 together to make 3. That would translate to three instructions:
- load the number 1 into register a
- load number 2 into register b
- add a and b and store result in register c
(In reality this example would probably be condensed into a single constant but let’s go with this)
With SIMD, we can instead act on multiple numbers at once.
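For example, adding two slices element by element in a plain loop, which the optimizer is free to auto-vectorize:

```rust
/// Scalar element-wise addition. With optimizations on, LLVM will
/// typically auto-vectorize this loop into SIMD additions.
fn add_slices(a: &[f32], b: &[f32], out: &mut [f32]) {
    let n = a.len().min(b.len()).min(out.len());
    for i in 0..n {
        out[i] = a[i] + b[i];
    }
}

fn main() {
    let a = [1.0f32; 8];
    let b = [2.0f32; 8];
    let mut out = [0.0f32; 8];
    add_slices(&a, &b, &mut out);
    println!("{:?}", out);
}
```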
From the above we are iterating over each value and adding them one by one. But you would hope the compiler is clever enough to see that there’s an instruction, _mm256_add_ps(), that can do 8 additions in a single instruction. That’s potentially 8x faster.
Let’s check the results
The place to try these experiments out is Godbolt (Compiler Explorer), which is an excellent resource. What we see, though, is that the compiler has cleverly detected that we are going to add 8 numbers together and uses SIMD instructions for the addition, but only on 4 values at a time (the 128-bit xmm registers). Even though even a lowly CPU can do more work per instruction, the compiler is conservative.
Explicit SIMD Instructions
We need to tell the compiler to do better, and for that we need to tell it more explicitly what we want. There are functions known as SIMD intrinsics, but they are accessed through arcane incantations that most developers are unfamiliar with.
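To make that concrete, here is what the raw intrinsics route looks like (a sketch using Rust’s `std::arch`; the helper name `simd_add8` and the scalar fallback for non-AVX targets are my own):

```rust
/// Add two blocks of 8 floats, using the AVX intrinsic when available.
fn simd_add8(a: [f32; 8], b: [f32; 8]) -> [f32; 8] {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx") {
            // Safe to call the intrinsics: we just checked AVX is present.
            unsafe {
                use std::arch::x86_64::*;
                let va = _mm256_loadu_ps(a.as_ptr());
                let vb = _mm256_loadu_ps(b.as_ptr());
                let vc = _mm256_add_ps(va, vb); // 8 additions, 1 instruction
                let mut out = [0.0f32; 8];
                _mm256_storeu_ps(out.as_mut_ptr(), vc);
                return out;
            }
        }
    }
    // Scalar fallback for other architectures.
    let mut out = [0.0f32; 8];
    for i in 0..8 {
        out[i] = a[i] + b[i];
    }
    out
}

fn main() {
    println!("{:?}", simd_add8([1.0; 8], [2.0; 8]));
}
```

The `unsafe` block, the pointer loads and stores, and the per-architecture feature detection are exactly the incantations most of us would rather avoid.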
Enter Wide, an abstraction over the SIMD instructions that makes it easy to perform the basic maths operations of add, subtract, multiply, and divide. This should get us quite far.
Rewriting this with the Wide library
We can see the compiler becomes too clever and just encodes the result statically as a constant.
Oh well! The point, however, is that we are now using the wider 256-bit ymm registers that can operate on 8 values at a time.
Rewriting Black Scholes using the Wide crate
Let’s do it using the standard math operators
Wow, this is 6x faster, but can we do better?
It turns out we can.
Wide crate additions
The code above is a lot messier than the original scalar version, and it’s due to the maths functions not being implemented for the SIMD types. Surprisingly, outside the commercial libraries there are very few SIMD libraries (in any language) that implement the full set of maths functions on the wide float/double types.
So I took it upon myself to implement them. It’s a fascinating, arcane world of Taylor-series expansions and floating-point bit-mask exploration.
Implementing the exp(), ln(), and pow() functions was fun, and pushing them back into the code base makes the black_scholes pricer look like this:
How fast does this run?
Compared to the original 820ms, we are nearly 8x faster. And with the maths functions in there, it looks like scalar code, making it easy for a non-expert to understand.
Caveats
It’s not all fun and games though.
The ncd() function is the Gaussian cumulative distribution function, and it is written in a scalar fashion as shown near the top of this article. Let’s test it independently.
We are processing 8 values at a time, but in this case we don’t get the 8x speed-up we wanted; still, it is a 2.5x improvement over the original. And we can see why from the Godbolt output of the scalar version
versus the SIMD output, which uses the wider ymm (8x) registers. The 2x speed-up from register width alone is expected, but we also benefit from fused multiply-add, which probably accounts for the slightly better-than-expected improvement.
Conclusion
By looking at your CPU-bound code, it’s often possible to see gains of 8x or more. If you are running on server-class Xeon or EPYC processors, which have AVX-512 instructions, you can get even better throughput (16 lanes at a time).
Either way, is this the fastest Black-Scholes pricer in the world? Probably!
Code
The code is licensed under the Affero GPL license, allowing commercial use but with attribution (including for any derivative works) to myself. Please contact me if you want to incorporate this into your commercial code base.
Code is available on GitHub: https://github.com/ronniec95/black_scholes