QUIC in .NET - Benchmarks and Road to Stability

Written on April 30, 2020

It has been nearly two weeks since my last update, so I decided it's about time I summarized what I have been up to. I thought it would be nice to create performance benchmarks early in the development, so that I can track performance improvements as I implement more features and optimizations.

I also wanted to see how my implementation compares to the existing options. A comparison with the raw TCP stack would be unfair, but SslStream, which works over TCP, seemed like a good choice, since it also needs to perform SSL negotiation and encryption of the data.

So I put together a benchmark using the BenchmarkDotNet library, which would measure how long it takes to receive a certain amount of data from a server running in the same process. It turned out that my implementation was so unstable that the benchmarks almost never finished all the iterations.
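
For illustration, a rough skeleton of such a benchmark might look like the following. This is a simplified stand-in rather than my actual benchmark code: the transport setup is elided and a MemoryStream plays the role of the stream connected to the in-process server.

```csharp
using System.IO;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

[InvocationCount(1)]   // corresponds to the InvocationCount=1 setting shown below
[MemoryDiagnoser]      // produces the Gen 0/1/2 and Allocated columns
public class StreamingBenchmark
{
    [Params(64 * 1024, 1024 * 1024, 32 * 1024 * 1024)]
    public int DataLength;

    private Stream _stream;                                  // stream under test
    private readonly byte[] _buffer = new byte[16 * 1024];   // receive buffer

    [GlobalSetup]
    public void Setup()
    {
        // The real benchmark establishes a QuicStream or SslStream against a server
        // running in the same process; a MemoryStream stands in for it here.
        _stream = new MemoryStream(new byte[DataLength]);
    }

    [IterationSetup]
    public void Reset() => _stream.Position = 0;

    [Benchmark]
    public int ReceiveAll()
    {
        // Read until the expected amount of data has been received.
        int total = 0, read;
        while (total < DataLength && (read = _stream.Read(_buffer, 0, _buffer.Length)) > 0)
            total += read;
        return total;
    }
}

public static class Program
{
    public static void Main() => BenchmarkRunner.Run<StreamingBenchmark>();
}
```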

It seems that features like loss detection and data retransmission are vital even if there is no packet loss on a local connection. Since I had not implemented flow control yet, the OS socket buffer was instantly filled to the brim and started discarding incoming data. Lesson learned.
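
QUIC runs over UDP, and the kernel-side receive buffer of a UDP socket is fairly small by default, so a sender on loopback can overrun it almost instantly when the receiver is not keeping up. The snippet below merely inspects and enlarges that buffer (the 1 MB figure is an arbitrary choice for illustration); it is not part of my implementation, and a bigger buffer only postpones the problem that flow control is meant to solve.

```csharp
using System;
using System.Net.Sockets;

public static class ReceiveBufferSketch
{
    public static void Main()
    {
        // Once this kernel-side buffer fills up, further datagrams are silently dropped,
        // and only loss detection and retransmission can get the data across.
        using var socket = new Socket(AddressFamily.InterNetwork, SocketType.Dgram, ProtocolType.Udp);
        Console.WriteLine($"Default receive buffer: {socket.ReceiveBufferSize} bytes");

        // Enlarging the buffer merely delays the overflow; flow control is the real fix.
        socket.ReceiveBufferSize = 1024 * 1024;
        Console.WriteLine($"Enlarged receive buffer: {socket.ReceiveBufferSize} bytes");
    }
}
```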

To make a long story short, I implemented packet loss detection and recovery in a day or so, and spent the rest of the last two weeks hunting bugs: infinite loops, deadlocks, livelocks, data races, min/max confusion, off-by-one errors, and many others. One by one I managed to fix them, and the implementation seems quite stable now, meaning that all benchmarks run successfully to completion.

Initial Benchmark Results

I prepared a set of three benchmarks, each targeting a particular part of the expected usage:

  • Connection establishment
  • Streaming data
  • Connection termination

All benchmarks were run with the following configuration:

BenchmarkDotNet=v0.12.0, OS=manjaro 
Intel Core i5-6300HQ CPU 2.30GHz (Skylake), 1 CPU, 4 logical and 4 physical cores
.NET Core SDK=3.1.103
  [Host]     : .NET Core 3.1.3 (CoreCLR 4.700.20.11803, CoreFX 4.700.20.12001), X64 RyuJIT
  Job-AWMUQU : .NET Core 3.1.3 (CoreCLR 4.700.20.11803, CoreFX 4.700.20.12001), X64 RyuJIT

InvocationCount=1  UnrollFactor=1  

Connection Establishment Benchmark

The first benchmark measures the connection establishment handshake. Even though QUIC allows establishing a connection by sending a single packet, I have not implemented that feature yet, so a QUIC connection currently requires two full round trips.

| Method         |     Mean |     Error |   StdDev | Ratio | RatioSD |     Gen 0 | Gen 1 | Gen 2 |  Allocated |
|--------------- |---------:|----------:|---------:|------:|--------:|----------:|------:|------:|-----------:|
| SslStream      | 6.441 ms | 0.4034 ms | 1.164 ms |  1.00 |    0.00 |         - |     - |     - |   18.66 KB |
| QuicConnection | 6.547 ms | 1.4783 ms | 4.289 ms |  1.06 |    0.72 | 1000.0000 |     - |     - | 4172.26 KB |

The results are surprisingly good, although the jitter of QuicConnection is still quite large, with extreme values at 2 ms and 18 ms.
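
For reference, the SslStream baseline essentially measures a TCP connect plus a TLS handshake over loopback. A minimal sketch of that, using a throwaway self-signed certificate, might look as follows; this is illustration only, not the benchmark code itself, and the validation callback that blindly accepts the certificate is a shortcut no real client should take.

```csharp
using System;
using System.Net;
using System.Net.Security;
using System.Net.Sockets;
using System.Security.Cryptography;
using System.Security.Cryptography.X509Certificates;
using System.Threading.Tasks;

public static class HandshakeSketch
{
    public static async Task Main()
    {
        using var cert = CreateSelfSignedCertificate();

        var listener = new TcpListener(IPAddress.Loopback, 0);
        listener.Start();
        int port = ((IPEndPoint)listener.LocalEndpoint).Port;

        // Server side: accept the TCP connection and run the TLS handshake.
        var serverTask = Task.Run(async () =>
        {
            using var serverClient = await listener.AcceptTcpClientAsync();
            using var serverSsl = new SslStream(serverClient.GetStream());
            await serverSsl.AuthenticateAsServerAsync(cert);
        });

        // Client side: this connect + handshake is what the benchmark times.
        using var client = new TcpClient();
        await client.ConnectAsync(IPAddress.Loopback, port);
        using var clientSsl = new SslStream(client.GetStream(), false,
            (sender, certificate, chain, errors) => true); // accept the test certificate
        await clientSsl.AuthenticateAsClientAsync("localhost");

        await serverTask;
        listener.Stop();
    }

    private static X509Certificate2 CreateSelfSignedCertificate()
    {
        using var rsa = RSA.Create(2048);
        var request = new CertificateRequest("CN=localhost", rsa,
            HashAlgorithmName.SHA256, RSASignaturePadding.Pkcs1);
        using var cert = request.CreateSelfSigned(DateTimeOffset.UtcNow.AddDays(-1),
                                                  DateTimeOffset.UtcNow.AddDays(1));
        // Re-import from PFX so the private key is usable by SslStream on all platforms.
        return new X509Certificate2(cert.Export(X509ContentType.Pfx));
    }
}
```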

Streaming Data Benchmark

The next benchmark measures how long it takes to transfer a chunk of data of size 64 kB, 1 MB, and 32 MB, respectively.

| Method     | DataLength |         Mean |         Error |        StdDev |       Median |  Ratio | RatioSD |      Gen 0 |     Gen 1 | Gen 2 |   Allocated |
|----------- |-----------:|-------------:|--------------:|--------------:|-------------:|-------:|--------:|-----------:|----------:|------:|------------:|
| QuicStream |      65536 |  30,914.4 us |  16,044.16 us |  47,306.58 us |   1,202.8 us | 206.28 |  323.39 |          - |         - |     - |     10920 B |
| SslStream  |      65536 |     153.5 us |      12.55 us |      34.79 us |     136.9 us |   1.00 |    0.00 |          - |         - |     - |     33776 B |
| QuicStream |    1048576 |  17,649.4 us |   2,103.88 us |   6,137.11 us |  17,974.3 us |  16.55 |    5.72 |          - |         - |     - |   2674496 B |
| SslStream  |    1048576 |   1,101.1 us |      41.80 us |     116.52 us |   1,067.3 us |   1.00 |    0.00 |          - |         - |     - |     33296 B |
| QuicStream |   33554432 | 569,656.4 us | 124,821.42 us | 368,038.77 us | 412,800.7 us |  17.80 |   11.02 | 60000.0000 | 5000.0000 |     - | 192544992 B |
| SslStream  |   33554432 |  32,077.4 us |     639.54 us |   1,482.22 us |  32,049.2 us |   1.00 |    0.00 |          - |         - |     - |       576 B |

Although the results are less impressive than in the previous benchmark, I am actually quite satisfied. There is, of course, huge room for improvement, mainly in the amount of memory allocated. I have not done any detailed profiling yet, but my guess is that even though I try to avoid excess allocations by using ArrayPool, the sizes of the arrays I request vary too much for the pooling to work efficiently. The buffering needs more work anyway, and if the decreased performance is caused by congestion at the receiver side, then implementing flow control should improve it.
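
As a small illustration of the pooling concern (this is not my actual buffer management code): ArrayPool groups arrays into size buckets, so renting an exact size gives back an array from the nearest bucket, and requests of wildly varying sizes end up scattered across many buckets. Renting a fixed, rounded-up size and slicing off what is needed keeps the requests in one bucket and gives the pool a better chance to reuse arrays.

```csharp
using System;
using System.Buffers;

public static class PoolingSketch
{
    private const int ChunkSize = 16 * 1024;   // a fixed chunk size chosen for illustration

    public static void Main()
    {
        // Renting an exact, odd size: the shared pool returns an array that is at least
        // this long, taken from whichever size bucket the request falls into.
        byte[] exact = ArrayPool<byte>.Shared.Rent(1371);

        // Renting a fixed size and slicing keeps all requests in the same bucket.
        byte[] chunk = ArrayPool<byte>.Shared.Rent(ChunkSize);
        Span<byte> payload = chunk.AsSpan(0, 1371);
        payload.Fill(0);

        Console.WriteLine($"Requested 1371, got {exact.Length}; requested {ChunkSize}, got {chunk.Length}");

        ArrayPool<byte>.Shared.Return(exact);
        ArrayPool<byte>.Shared.Return(chunk);
    }
}
```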

Connection Termination Benchmark

I originally did not intend to measure how long it takes to terminate the connection, but I noticed that the cleanup stage of the previous benchmark takes quite a lot of time for QUIC. This benchmark measures solely the time needed to close a connection via a call to the Dispose method.

| Method         |       Mean |      Error |     StdDev |     Median | Ratio | RatioSD |     Gen 0 |     Gen 1 | Gen 2 |  Allocated |
|--------------- |-----------:|-----------:|-----------:|-----------:|------:|--------:|----------:|----------:|------:|-----------:|
| SslStream      |   5.529 ms |  0.1972 ms |  0.5497 ms |   5.440 ms |  1.00 |    0.00 |         - |         - |     - |   18.91 KB |
| QuicConnection | 106.344 ms | 21.5032 ms | 63.4028 ms | 126.804 ms | 19.36 |   11.99 | 1000.0000 | 1000.0000 |     - | 3841.66 KB |

I have mixed feelings about the results and can't explain them yet. It is possible that Dispose in my QUIC implementation blocks for an unnecessarily long time and could return sooner. Another possible explanation is a bug in my implementation causing the connection to close via the slower timeout-based path most of the time.

What comes next?

There are tons of items on the current backlog, so I will list only the things I plan to actually focus on in the following two weeks:

  • Flow control; this will hopefully improve performance by avoiding packet discarding at the receiver.
  • Using the EventCounter API to expose events such as packet loss to aid further profiling; a rough sketch of this follows below.
  • Reducing memory allocations.
  • Maybe performance comparisons with an msquic-based implementation, since it has now been open-sourced.
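
The EventCounter item could look roughly like the sketch below. The event source and counter names are made up for illustration, but the pattern itself is the usual one: an EventSource that creates IncrementingEventCounter instances once a listener attaches.

```csharp
using System.Diagnostics.Tracing;

[EventSource(Name = "Demo-Quic")]   // hypothetical name, not the one I will actually use
public sealed class QuicEventSource : EventSource
{
    public static readonly QuicEventSource Log = new QuicEventSource();

    private IncrementingEventCounter _packetsLost;
    private IncrementingEventCounter _packetsRetransmitted;

    protected override void OnEventCommand(EventCommandEventArgs command)
    {
        if (command.Command == EventCommand.Enable)
        {
            // Counters are created lazily, only when a listener attaches.
            _packetsLost ??= new IncrementingEventCounter("packets-lost", this)
            {
                DisplayName = "Packets Lost"
            };
            _packetsRetransmitted ??= new IncrementingEventCounter("packets-retransmitted", this)
            {
                DisplayName = "Packets Retransmitted"
            };
        }
    }

    // The loss recovery code would call these at the appropriate places.
    public void PacketLost() => _packetsLost?.Increment();
    public void PacketRetransmitted() => _packetsRetransmitted?.Increment();
}
```

Counters exposed this way can then be watched live with the dotnet-counters tool while the benchmarks are running.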