Asynchronous Functional Web Server in Kotlin

Mandatory meme

Using Http4k and Loom

If you read my blog or listened to a talk of mine, you may know that my favorite library to serve HTTP requests in Kotlin is Http4k. It’s easy to understand and lets me map each request as a function that transforms a Request into a Response. Powerful and straightforward: I love it.

Its most complained-about drawback is that it doesn’t support asynchronous handling of requests. In Kotlin, there is an excellent way to handle asynchronous calls: coroutines.

Http4k doesn’t support them. It’s requested so often that the authors are tired of hearing people ask about it. Why not add it?

Well, long story short, adding it would make everything more fragile and complicated. Since it’s an open-source project, the authors decided (rightly, in my opinion) to keep Http4k simple and easy to use and leave coroutine support out.

But… but… surely not having coroutines would greatly impact performance? Well, not necessarily. In the vast majority of use cases, asynchronously handling web requests doesn’t bring any advantage.

This may require an explanation, since all top web server benchmarks show that handling calls asynchronously is faster (and they are not wrong). 

Concurrency and Multi-Threading

Let’s look at how the asynchronous model can improve performance, particularly for web servers. For example, my laptop’s CPU has eight cores, which is pretty typical in 2022, so it can do (at most) 8 things in parallel. 

A web server is an application that listens on a TCP socket, creates an HTTP Request, processes it somehow, and transmits the HTTP Response back to the sender.

So, no matter how many requests arrive at your web server, my laptop cannot process more than eight in parallel.

Now, let’s assume we have 32 requests coming in. If we only care about the total processing time, or throughput, the fastest way is to process 8 requests first, then the next 8, and so on:

Each core processes one request after the other, with no wasted CPU

But this is a bit unfair: only 8 users get their request served immediately, the rest must wait before their connection even starts, and since the allocation is somewhat random, some users risk waiting so long that the connection times out.

A more polite web server would process all 32 requests simultaneously while taking the same overall time, so about 4x the time of a single request. For simplicity, let’s ignore the latency and bandwidth limitations of real networks.

How can a CPU with 8 cores handle 32 requests in parallel? It’s possible if each core switches context between four calls, dedicating a quarter of its time to each. So, we can assign each call to its own processing context that keeps all the variables, the stack, and everything else the CPU needs to work on it.

Using threads, 8 cores can process more than 8 requests at the same time, but with no performance improvement

The total time is close to the previous case, but all clients will be accepted immediately, reducing the chance of connection timeouts. 

I wrote close and not equal because it will take a little more time. After all, switching the context takes some time (and some memory). This “processing context” is what we call a thread.

The CPU can handle threads very efficiently, but they still slow down the system (CPU cache misses) and use more memory (thread stack memory and GC roots).

Roughly speaking, on a modern CPU the overhead is negligible with tens of threads; it becomes noticeable around 100 threads, and it is problematic when the number of threads approaches 1000.

A common strategy to limit the resources utilized is to use a ThreadPool to reuse threads once they have finished their task, instead of creating new ones every time.
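
As an illustration of the idea, here is a minimal sketch using the JDK executors (this is just an illustration, not what Jetty does internally): a fixed pool of 8 worker threads serves many tasks by reusing the same threads.

import java.util.concurrent.Executors

fun main() {
    // a pool of 8 reusable threads, one per CPU core
    val pool = Executors.newFixedThreadPool(8)

    repeat(32) { requestId ->
        pool.submit {
            // each task runs on whichever pooled thread is free
            println("request $requestId handled on ${Thread.currentThread().name}")
        }
    }

    pool.shutdown() // stop accepting new tasks; the queued ones still complete
}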

So far, so good. You cannot do better than this if you need the CPU working full-time to process your requests. In other words, increasing the number of threads to more than the CPU cores won’t speed up the whole operation time (but it can improve fairness).

Don’t Waste CPU Time, The Problem of Waiting

What about those benchmarks, then? Well, probably, during the request processing, there is some time when the CPU is just sitting there waiting without doing any work.

Why should the CPU wait when we want to respond as fast as possible? Usually, it’s because it’s blocked waiting for data from a much slower IO channel, like a network socket or the file system. It can also be blocked by a lock required for synchronization or some other reason. In any case, when a thread is waiting, it doesn’t consume CPU, and having a bunch of other threads ready to run can improve the overall performance:

Requests with long waiting time waste CPU resources

If this is the case, increasing the number of threads keeps speeding things up beyond the number of physical CPU cores, until we reach a situation where the CPU is fully utilized again. At that point, adding new threads will only slow down the total time.

Requests that put a thread in a waiting state can be handled more efficiently by increasing the number of threads

For example, if all requests spend 50% of their time waiting, the optimal number of threads is about double the number of CPU cores. If 90% of the time is spent waiting, we should multiply the number of CPU cores by 10.

Let’s say we are writing a typical RESTful backend, and each thread handles a request with these timings (invented but realistic):

  • 1 ms reading the request data
  • 9 ms calling an external service to retrieve the data
  • 180 ms waiting for the response from the external service
  • 9 ms analyzing the data and rendering it in JSON
  • 1 ms sending the response to the caller

The total time is 200 ms, and we spend roughly 20 ms using the CPU and 180 ms waiting for the network. So 90% of our time is spent waiting, and indicatively, using 80 threads should maximize throughput if the CPU has eight cores. If the external service instead had taken 1980 ms to reply, our ratio would have been 99% waiting and 1% processing, and we would need 800 threads to utilize the CPU fully, but probably fewer than that, because we also have to add the CPU load needed to handle all those threads.
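
To make the rule of thumb explicit, here is a small sketch of the calculation, using the invented timings above (threads ≈ cores × total time / CPU time):

fun estimateThreads(cores: Int, cpuMillis: Double, waitMillis: Double): Int =
    (cores * (cpuMillis + waitMillis) / cpuMillis).toInt()

fun main() {
    println(estimateThreads(cores = 8, cpuMillis = 20.0, waitMillis = 180.0))  // ~80 threads
    println(estimateThreads(cores = 8, cpuMillis = 20.0, waitMillis = 1980.0)) // ~800 threads
}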

Let’s verify it with some code. The full sources are on GitHub (https://github.com/uberto/Http4kLoom)

Using Http4k as the web server, I created a trivial application that returns a welcome message:

fun testApp(request: Request): Response =
    Response(OK).body("Hello, ${request.query("name")}!")

In Http4k, any function of type Request -> Response can be used as a web server with a single line of code:

val server = ::testApp.asServer(Jetty(port = 9000)).start() //normal jetty
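
To quickly check that it works, we can call it with any HTTP client; for example, a sketch using the JavaHttpClient bundled with http4k (assuming the org.http4k.client package is on the classpath):

import org.http4k.client.JavaHttpClient
import org.http4k.core.Method.GET
import org.http4k.core.Request

fun main() {
    val client = JavaHttpClient()
    // call the server started above on port 9000
    val response = client(Request(GET, "http://localhost:9000").query("name", "Uberto"))
    println(response.status)       // expected: 200 OK
    println(response.bodyString()) // expected: Hello, Uberto!
}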

Let’s first find out how many requests my laptop can handle without any sleep calls.

I used autocannon to stress the server. Note that calling the server from several other computers on a fast network would be better for obtaining a meaningful measurement, but since I just want to show the difference, not measure absolute performance, my laptop will do fine.

From the measurements, my server handles at most 94,000 requests per second, regardless of the number of threads (see all the measurements at the end of the post). This means less than 0.1 milliseconds per core to handle a request. Not bad, Jetty!

This is our baseline. We could probably do better with specific optimizations, but it’s still a ridiculously high number, considering that the whole of Twitter handles about 3000 requests per second on average (7B visits per month).

To simulate a realistic web application, we can add 50 ms of sleep, which is similar to having to do a few interactions with a database.

fun testApp(request: Request) =
    Response(OK).body("Hello, ${request.query("name")}!")
        .also {
            Thread.sleep(50) // simulating a blocking operation (e.g. database calls) without CPU load
        }

For this test, I created a specialized JettyLoom class that uses the thread pool passed in the constructor. In this way, I can easily swap the thread pool implementation. For example, let’s use 500 threads:

val server = ::testApp.asServer(
    JettyLoom(port = 9000, threadPool = ExecutorThreadPool(500))
).start() //fixed threads

Note that Jetty uses the “eat what you kill” execution strategy, which is not only very fast but also allows us to play with different thread pools.

On my laptop, I couldn’t measure any improvement using more than 500 threads, and having more than 1000 will actually slow things down. 

We should take these numbers with more than a grain of salt: a high number of threads can impact your application in many other ways, such as memory consumption and longer GC pauses. I cannot say this enough: base all your decisions on your own measurements, not on some blog on the internet (including mine).

To continue our analysis: since my laptop has 8 cores, and around 800 threads (100 per core) is already pushing the limit, requests that wait about 99% of the time are the most you can handle efficiently with plain threads.

Let’s Cooperate, a.k.a. The Asynchronous Model

Things start becoming interesting if our requests wait for 99.9% of the time or more. Unfortunately, as we saw, we cannot keep increasing the number of threads much beyond 500, so what can we do?

Well, one possibility we won’t explore in this post is to change your API to avoid such long waits, for example, by making things event-based or using callbacks instead of long waits. If it’s possible to reduce the waiting time to 99% of the total or less, this is the best solution.

Alas, for a series of reasons, this is not always possible, so for the rest of this post, we assume that we cannot avoid keeping the CPU idle most of the time, and we need a way to wait without blocking the whole thread.

Roughly speaking, the idea is to pause or park the waiting task and use the thread to work on something else instead of waiting. Once done, we check if the waiting task is ready to continue.

In this way, a single thread can handle many asynchronous tasks at the same time.

But here is the catch: asynchronous APIs are more complicated to use. After all, it’s like when a cook tries to cook several dishes at the same time: it can be done, but it’s challenging. It’s easy to forget something on the gas and ruin the dinner.

Java Futures, RxJava, and Kotlin coroutines are all brilliant solutions that somehow simplify asynchronicity. They help, but they still add complexity compared to the straightforward synchronous model.
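
For instance, here is a minimal sketch with Kotlin coroutines (assuming the kotlinx-coroutines library is on the classpath): thousands of tasks “wait” concurrently while being multiplexed over a small pool of threads, because delay() suspends the coroutine and frees its thread for other work.

import kotlinx.coroutines.Dispatchers
import kotlinx.coroutines.delay
import kotlinx.coroutines.launch
import kotlinx.coroutines.runBlocking

fun main() = runBlocking {
    // 10,000 "requests", each waiting 50 ms without blocking a thread
    repeat(10_000) {
        launch(Dispatchers.Default) {
            delay(50) // non-blocking wait: the thread is reused by other coroutines
            // ...produce the response here...
        }
    }
}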

On the other hand, they shine in benchmarks: it’s relatively easy to tune them for simple test cases and obtain amazing results. Real production, where some requests take seconds and others milliseconds, is a different scenario. But in this post, we will continue pretending that benchmarks are meaningful, because we are super optimistic!

Loom and Virtual Threads

The best approach would be to handle the thread overhead and the asynchronous scheduling at the JVM level, closer to the operating system, and let the high-level programming language be blissfully ignorant of what is happening at the low level. This would mean having your cake (performance) and eating it too (synchronous code).

Given the title of this post, you may have guessed that this is the goal of the JDK project Loom. Its virtual threads became available for the first time, as a preview feature, in Java 19, released in September 2022.
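
From the programmer’s point of view, a virtual thread looks and behaves like an ordinary thread, but it is cheap enough to create one per task. A minimal sketch (it needs Java 19 with --enable-preview, or Java 21+ where the API is final):

fun main() {
    val threads = (1..10_000).map {
        Thread.startVirtualThread {
            Thread.sleep(50) // parks the virtual thread; the underlying carrier thread is reused
        }
    }
    threads.forEach { it.join() }
}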

Without further ado, let’s inject Loom’s virtual threads into our previous example and measure the difference in performance for long waiting calls.

val server = ::testApp.asServer(
    JettyLoom(port = 9000, threadPool = LoomThreadPool())
).start() //loom

And now the results:

normal jetty with defaults
average 80k req/s with no sleep and 100 connections (top 200% CPU)
average 94k req/s with no sleep and 10000 connections (top 200% CPU)
average 1970 req/s with 50ms sleep and 100 connections (top 25% CPU)
average 3600 req/s with 50ms sleep and 10000 connections (top 80% CPU)

threadpool with 500 threads
average 80k req/s with no sleep and 100 connections (top 200% CPU)
average 94k req/s with no sleep and 10000 connections (top 200% CPU)
average 1970 req/s with 50ms sleep and 100 connections (top 35% CPU)
average 9500 req/s with 50ms sleep and 10000 connections (top 130% CPU)

loom
average 77k req/s with no sleep and 100 connections (top 250% CPU)
average 94k req/s with no sleep and 10000 connections (top 250% CPU)
average 1970 req/s with 50ms sleep and 100 connections (top 40% CPU)
average 94k req/s with 50ms sleep and 10000 connections (top 400% CPU)

The CPU utilization values are quite imprecise but still indicative. I’ve used the Linux tool top to collect them.

A screenshot from the best run:

And now the mandatory graph (higher is better). The green bar is the important one, and Loom is really doing well!

Finally, some comments on the code: 

import org.eclipse.jetty.util.thread.ThreadPool
import java.util.concurrent.ExecutorService
import java.util.concurrent.Executors
import java.util.concurrent.TimeUnit

class LoomThreadPool : ThreadPool {
    // a Loom executor that creates a new virtual thread for every submitted task
    private val executorService: ExecutorService = Executors.newVirtualThreadPerTaskExecutor()

    @Throws(InterruptedException::class)
    override fun join() {
        executorService.awaitTermination(Long.MAX_VALUE, TimeUnit.NANOSECONDS)
    }

    // virtual threads are created on demand, so these pool metrics are meaningless;
    // we return constant values just to keep Jetty happy
    override fun getThreads(): Int = 1

    override fun getIdleThreads(): Int = 1

    override fun isLowOnThreads(): Boolean = false

    override fun execute(command: Runnable) {
        executorService.submit(command)
    }
}

This is how I’ve implemented a factory of virtual threads disguised as a Jetty thread pool, using the new Loom API.

class JettyLoom(
    private val port: Int,
    override val stopMode: ServerConfig.StopMode,
    private val server: Server
) : PolyServerConfig {
    constructor(port: Int, threadPool: ThreadPool) : this(
        port,
        ServerConfig.StopMode.Graceful(Duration.ofSeconds(5)),
        Server(threadPool).apply { addConnector(http(port)(this)) }
    )
    // ... the rest of the PolyServerConfig implementation is omitted here,
    // see the full code in the GitHub repository
}

This is the necessary boilerplate code to allow Http4k to use Jetty with an external Threadpool. The full code is on the GitHub repository.

Conclusions

The new virtual threads have radically changed the landscape of the JVM concurrency model. Future versions of the JDK will introduce better constructs for structured concurrency, and the APIs will be finalized, so things can only get better from now on.

It’s fascinating how virtual threads can provide many benefits in terms of performance and flexibility without needing changes to our code style or hard-to-tune custom optimizations. 

I hope this post can help other people to experiment with them!

If you liked this post, you may consider following me on Twitter (@ramtop)

PS. In case you want to know more about Http4k and how to quickly build back-ends in Kotlin using functional programming, I’ve written a book about it!

https://pragprog.com/titles/uboop/from-objects-to-functions/


Author: Uberto Barbini

Uberto is a very passionate and opinionated programmer who helps big and small organizations deliver value. Trainer, public speaker, and blogger. Currently working on a book on Kotlin and functional programming. Google Developer Expert.
