`Probably` the fastest Black Scholes Pricer in world

Mushfaque Ronnie Chowdhury
6 min readSep 1, 2020

This is a small investigative piece into the world of SIMD and quantitative pricing. As the famous beer commercial alludes to, this is probably the fastest Black-Scholes pricer in the world (but PR’s are welcome to make it faster still!)

Black Scholes

I’m generally going to assume you are familiar with Black-Scholes option pricing, but for a brief introduction.

By making some fairly well known assumptions Fischer Black and Myron Scholes derived a closed form mathematical solution that is used for vanilla option pricing.

Key to those assumptions are:

  • The option is European and can only be exercised at expiration.
  • Markets are efficient (i.e., market movements cannot be predicted).
  • There are no transaction costs in buying the option.
  • The risk-free rate and volatility of the underlying are known and constant.
  • The returns on the underlying asset are normally distributed.

In the real world, things are very different but this is a standard that is used and understood world wide.

Why make a fast implementation?

Apart from the classic itch, “because I can”, it’s used in any options trading system for volatility surface calibration. The faster you can knock these out the faster your portfolio pricing can run. When you have 100,000+ options in a single portfolio (both vanilla and complex) having a fast implementation is crucial to performance.

The formula

From Wikipedia the formula is typically written out as

Black Scholes formula

The main work is being done in d1, d2, and there’s a mysterious N() function.

Let’s implement this in Rust

Initial implementation

A simple implementation that is verbatim from the formula above.

I’m not keen on the N() function, so let’s expand it to call it ncd() and fill that in below.

The ncd and erf functions

Initial performance test for 10,000,000 calculations

820ms to run 10m black scholes pricings

Usually this would be an excellent result and we could leave it at that. But I wanted to go further.

Single Instruction Multiple Data (SIMD)

All CPU’s nowadays have these types of instructions. But what does it mean?

If you have a few lines of code

Simple addition

this adds 1 and 2 together to make 3. This would translate to 3 instructions,

  • load the number 1 into register a
  • load number 2 into register b
  • add a and b and store result in register c

(In reality this example would probably be condensed into a single constant but let’s go with this)

With SIMD, instead we could instead act on multiple numbers together.

Adding multiple numbers togther

From the above we are iterating over each value and adding them one by one. But you would hope the compiler is clever enough to see that actually there’s an instruction _mm_add_ps() that could do 8 additions in 1 instruction. That’s 8x faster.

Let’s check the results

Checking the results

The place to try these experiments out would be on Godbolt which is an excellent resource. What we see though is

Compile options
xmm (128bit) registers used

The compiler has cleverly detected that we are going to add 8 numbers together and the SIMD instructions are used for the addition but only 4 values at a time. Even with the ability of a lowly cpu to do more work, the compiler is conservative.

Explicit SIMD Instructions

We need to tell the compiler to be better, but to do that we need to tell it more explicitly what to do. There are instructions known as SIMD Intrinsics but they are accessed through arcane incantations which most developers are unfamiliar with.

Enter Wide, an abstraction on the SIMD instructions that makes it easy to perform the basic maths operations of add,subtract,multiple,divide. This should get us quite far.

Rewriting this with the Wide library

Wide library integration

We can see the compiler becomes too clever and just encodes the results in statically

Oh well! However the point is that we now are using the wider 256bit ymm registers that can do 8 operations at a time.

Rewriting Black Scholes using the Wide crate

Let’s do it using the standard math operators

Rewritten using wide
Time taken is 133ms

Wow, this is a 6x faster, but can we do better?

It turns out we can.

Wide crate additions

The code above is a lot messier than the original scalar one and it’s due to the maths functions not being implemented. It’s surprising but outside the commercial libraries there are very few SIMD libraries (in any language) that implement the full set of math functions on the wide float/double types.

So I took it among myself to implement them. It’s a fascinating arcane world of Taylor series expansions and float point bit mask exploration.

Implementing the exp(), ln(), pow() functions was fun and pushing them back into the code base makes the black_scholes pricer look like

A much clearer version of a wide BS call pricer

How fast does this run?

107 ms for 10m BS prices

Compared to the original 820ms we are nearly 8x faster. And with the math functions in there, it looks like scalar code; making it easy for a non expert to understand.

Caveats

It’s not all fun and games though.

The ncd() function is the Gaussian cumulative distribution function and is written in a scalar fashion as shown near the top of this article. Let’s test that independently

Scalar CDF function
CDF function written using the Wide Crate

We are processing 8x values at a time but here it takes

CDF rewritten as a wide function takes just 6ms

In this case not the 8x speed up we wanted but still 2.5x improvement compared to the original. And we can see this from the Godbolt output

verses the SIMD output which uses the wider ymm (8x) registers. The 2x speed up is expected. But we also benefit from the fused-multiply-add which probably contributes to the slightly better than expected improvement.

Conclusion

By looking at your CPU bound code, it’s often possible to see gains of 8x or more. If you are running on server class Xeon or Epyc processors which have AVX-512 instructions you get even better throughput (16x).

Either way, is this the fastest black scholes pricer in the world… probably!

Code

The code is licensed under Affero GPL license allowing commercial use but with accreditation (including any derivative works) to myself. Please contact me if you want to incorporate this into your commercial code base.

Code is available on Github https://github.com/ronniec95/black_scholes

--

--