## How (NOT TO) measure latency

Latency is defined as

time interval between the stimulation and response

and is a value which is of importance in many computer systems (financial systems, games, websites, etc). Hence we – as computer engineers – want to specify some upper bounds / worst case scenarios for the systems we build. How can we do this? The days of counting cycles for assembly instructions are long gone (unless you work on embedded systems) – there are just too many additional factors to consider (the operating system – mainly the task scheduler, other running processes, the JIT, the GC, etc). The remaining alternative is doing empirical (hands on) testing.

### Use percentiles

So we whip out JMeter, configure a load test, take the mean (average) value 土 3 x standard deviation and proudly declare that 99.73% of the users will experience latency which is in this interval. We are especially proud because (a) we considered a realistic set of calls (URLs if we are testing a website) and (b) we allowed for JIT warmup.

But we are still very wrong! (which can be sad if our company writes SLAs based on our numbers – we can bankrupt the company single-handedly!)

Lets see where the problem is and how we can fix it before we cause damage. Consider the dataset depicted below (you can get the actual values here to do your own calculations).

For simplicity there are exactly 100 values used in this example. Lets say that they represent the latency of fetching a particular URL. You can immediately tell that the values can be grouped in three distinct categories: very small (perhaps the data was already in the cache?), medium (this is what most users will see) and poor (probably there are some corner-cases). This is typical for medium-to-large complexity (ie. “real life”) composed of many moving parts and is called a multimodal distributions. More on this shortly.

If we quickly drop these values into LibreOffice Calc and do the number crunching, we’ll come to the conclusion that the average (mean) of the values is 40 and according to the six sigma rule 99.73% of the users should experience latencies less than 137. If you look at the chart carefully you’ll see that the average (marked with red) is slightly left of the middle. You can also do a simple calculation (because there are exactly 100 values represented) and see that the maximum value in the 99th percentile is 148 not 137. Now this might not seem like a big difference, but it can be the difference between profit and bankrupcy (if you’ve written a SLA based on this value for example).

Where did we go wrong? Let’s look again carefully at the three sigma rule (emphasis added):

nearly all values lie within three standard deviations of the mean in a normal distribution.

Our problem is that we don’t have a normal distribution. We probably have a multimodal distribution (as mentioned earlier), but to be safe we should use ways of interpreting the results which are independent of the nature of the distribution.

From this example we can derive a couple of recommendations:

1. Make sure that your test framework / load generator / benchmark isn’t the bottleneck – run it against a “null endpoint” (one which doesn’t do anything) and ensure that you can get an order of magnitude better numbers
2. Take into account things like JITing (warmup periods) and GC if you’re testing a JVM based system (or other systems which are based on the same principles – .NET, luajit, etc).
3. Use percentiles. Saying things like “the median (50th percentile) response time of our system is…”, “the 99.99th percentile latency is…”, “the maximum (100th percentile) latency is…” is ok
4. Don’t calculate the average (mean). Don’t use standard deviation. In fact if you see that value in a test report you can assume that the people who put together the report (a) don’t know what they’re talking about or (b) are intentionally trying to mislead you (I would bet on the first, but that’s just my optimism speaking).

### Look out for coordinated omission

Coordinate omission (a phrase coined by Gil Tene of Azul fame) is a problem which can occur if the test loop looks something like:

``` start: t = time() do_request() record_time(time() - t) wait_until_next_second() jump start ```

That is, we’re trying to do one request every second (perhaps every 100ms would be more realistic, but the point stands). Many test systems (including JMeter and YCSB) have inner loops like this.

We run the test and (learning from the previous discussion) report: the 85% of the request will be served under 0.5 seconds if there are 1 requests per second. And we still can be wrong! Let us look at the diagram below to see why:

On the first line we have our test run (horizontal axis being time). Lets say that between second 3 and 6 the system (and hence all requests to it) are blocked (maybe we have a long GC pause). If you calculate the 85th percentile, you’ll 0.5 (hence the claim in the previous paragraph). However, you can see 10 independent clients below, each doing the request in a different second (so we have our criteria of one request per second fulfilled). But if we crunch the numbers, we’ll see that the actual 85th percentile in this case is 1.5 (three times worse than the original calculation).

Where did we go wrong? The problem is that the test loop and the system under test worked together (“coordinated” – hence the name) to hide (omit) the additional requests which happen during the time the server is blocked. This leads to underestimating the delays (as shown in the example).

1. Make sure every request less than the sampling interval or use a better benchmarking tool (I don’t know of any which correct for this) or post-process the data with Gil’s HdrHistogram library which contains built-in facilities to account for coordinated omission

## Performance tuning – measure don’t guess

In my performance tuning career I have given the advice to measure and not guess more often than I can recall. And in many cases the target of this advice has given up after looking at the monolithic 500,000 LOC legacy application they are working.

In the post we are about to share some simple tools and concepts how to setup the initial measurement, so you can have the baseline to start with more fine-grained performance tuning.

# Setting the goal

Pretty much any non-trivial application can be optimized forever. Both in bad and in good way. Bad examples include tweaking random parts of the applications and then praying for the best. Whatever might be the reason – I have seen tens and tens of developers “being pretty sure that exactly this line of code needs attention” without basing their gut feeling to any measurements.

The second category is almost as dangerous, as you can spend hundreds of man-years tuning your application “to be ready for the prime time”. The definition for prime time might vary, but are you really sure you are going to accommodate terabytes of media, tens of millions of records in your database and have to serve hundreds of thousands of concurrent users with less than 100ms latency? Unless you are aiming to put Google out of business, you most likely will not. Worse yet, spending so much time preparing for the primetime is a darn good way to assure the prime time never arrives. Instead of spending this time for squeezing out the extra 1ms in latency nobody is likely to notice, maybe the same time should have spent in ironing out this layout bug annoying the end users on Safari instead?

So how to set a meaningful target? For this you need to understand the business you are in. Real-time strategy games and First Person Shooters tend to have different requirements than online shopping carts. As last time I checked, Java was not going strong in the gaming industry, lets expect you are dealing with a typical Java EE application with web based front-end.

In this case, you should start now segmenting your application into different categories, which, based on the research are similar to the following:

• 0.1 seconds gives the feeling of instantaneous response — that is, the outcome feels like it was caused by the user, not the computer.
• 1 second keeps the user’s flow of thought seamless. Users can sense a delay, and thus know the computer is generating the outcome, but they still feel in control of the overall experience and that they’re moving freely rather than waiting on the computer. This degree of responsiveness is needed for good navigation.
• 10 seconds keeps the user’s attention. From 1–10 seconds, users definitely feel at the mercy of the computer and wish it was faster, but they can handle it. After 10 seconds, they start thinking about other things, making it harder to get their brains back on track once the computer finally does respond.

So you might have functionality such as adding products to the shopping carts or browsing through the recommended items which you need to squeeze into the “instant” bucket to provide the best overall user experience resulting in better conversion rates. Initially I am pretty sure your application is nowhere near this criteria, so feel free to replace the 100ms and 1,000ms threshold with something you actually can achieve, such as 300 and 1,500ms for example.

On the other hand, you most likely have some operations which are expensive enough, such as the products search or user account registration which might fall into the second bucket.

And last, you have the functionality which you can toss into the last bucket, such as generating PDF from the invoice. Note that you also might end up with fourth category – for example if you are generating the bills of delivery for your warehouse, your business might be completely OK if batch processing the daily bills takes 20 minutes.

Pay attention that business persons have a tendency of tossing everything into the “instant” bucket. Now it is your task to explain the effort required and ask “if you had only three operations which are completing under our threshold, what should those be”.

Now you should have your application functionality categorized into different categories, such as the following:

 Requirement Category Browse products instant Add product to a shopping cart instant Search product seamless Register new account attention Generate PDF invoice attention Generate daily bills of delivery slow

# Understanding the current situation

Your next goal is to understand where you will perform your measurements. For end users, the experience is complete when the page has rendered the result of the operation in their browser, which takes into account the network latency and browser DOM operations for example.
As this will be more complex to measure, let us assume your site has been well optimized against the Google Page Speed or YSlow recommendations and for simplicity’s sake lets focus on elements directly under your control.

On most cases, the last piece of infrastructure still under your control will be your web server, such as the Apache HTTPD or nginx. If you have logging enabled, you will have access to something similar to the following in your nginx access.log:

`82.192.41.11 - - [04/Dec/2013:12:16:11 +0200]  "POST /register HTTP/1.1" 301 184 "https://myshop.com/home" "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:25.0) Gecko/20100101 Firefox/25.0" 0.42882.192.41.11 - - [04/Dec/2013:12:16:12 +0200]  "POST /searchProduct HTTP/1.1" 200 35 "https://myshop.com/accountCreated" "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:25.0) Gecko/20100101 Firefox/25.0" 3.008 82.192.41.11 - - [04/Dec/2013:12:16:12 +0200]  "GET /product HTTP/1.1" 302 0 "https://myshop.com/products/searchProducts" "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:25.0) Gecko/20100101 Firefox/25.0" 0.62382.192.41.11 - - [04/Dec/2013:12:16:13 +0200]  "GET /product HTTP/1.1" 200 35 "https://myshop.com/product/123221" "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:25.0) Gecko/20100101 Firefox/25.0" 0.82882.192.41.11 - - [04/Dec/2013:12:16:13 +0200]  "GET /product HTTP/1.1" 200 35 "https://myshop.com/product/128759" "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:25.0) Gecko/20100101 Firefox/25.0" 1.03882.192.41.11 - - [04/Dec/2013:12:16:13 +0200]  "GET /product HTTP/1.1" 200 35 "https://myshop.com/product/128773" "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:25.0) Gecko/20100101 Firefox/25.0" 0.62782.192.41.11 - - [04/Dec/2013:12:16:14 +0200]  "GET /addToShoppingCart HTTP/1.1" 200 35 "https://myshop.com/product/128773" "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:25.0) Gecko/20100101 Firefox/25.0" 2.80882.192.41.11 - - [04/Dec/2013:12:16:14 +0200]  "GET /purchase HTTP/1.1" 302 0 "https://myshop.com/addToShoppingCart" "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:25.0) Gecko/20100101 Firefox/25.0" 3.20482.192.41.11 - - [04/Dec/2013:12:16:16 +0200]  "GET /viewPDFInvoice HTTP/1.1" 200 11562 "https://myshop.com/purchase" "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:25.0) Gecko/20100101 Firefox/25.0" 3.018`

Lets investigate now what we have found in the log. The snippet we have extracted contains actions of one user completing a full transaction in our web site. The user has created an account, searched for products, browsed through the results, found a product he has liked, added it to a shopping cart, completed the purchase and generated a PDF of the invoice. Those actions can be detected in the “POST /searchProduct HTTP/1.1” column. Next important part is the last column containing the total request time it took for a particular request to complete. In the case of /searchProduct it took 3.008 seconds to complete the search.

Note that by default nginx does not have the request time logging enabled, so you might need to to modify your `log_format` by adding `\$request_time` to the pattern.

Now, with a little bit of grep/sed/excel magic, you will have something similar to the following at your fingertips:

 Requirement Category Mean 90% Browse products instant 0.734 0.902 Add product to a shopping cart instant 2.422 3.490 Search product seamless 2.800 3.211 Register new account attention 0.428 0.480 Generate PDF invoice attention 3.441 4.595 Generate daily bills of delivery slow – –

# Picking the target for optimization

From here you will get your list of optimization targets. The suspects are obvious – they fall in the categories where you violate the agreement, in the above case you have problems with three requirements – Browse products, Add product to a shopping cart and Search product all violate your performance requirements.

This article does not focus on actual optimisation tasks, but in order to give food for thought – what should you do next, knowing that adding products to the shopping cart takes way too much time?

Next steps are definitely more application specific than the prior tasks. Also, the tooling to be used can vary – you might now decide to go with APM or monitoring products. But if you do not have the budget or just do not wish to fight with procurement office, a simple solution would include adding logging trail in the component boundaries by using AOP means.

For example, you can add something like the following in your application code to monitor the time spent in the service layer.

``@Aspect@Componentpublic class TimingAspect {  @Around("execution(* com.myshop..*Service+.*(..))")  public Object time(ProceedingJoinPoint joinPoint) throws Throwable {    StopWatch.WatchStatus watch = StopWatch.getInstance().start(joinPoint.getSignature().getName());    final Object result = joinPoint.proceed();    watch.stop();    return result;  }}``

Decomposing the request time further gives you the required insight to understand where the bottlenecks are hiding. More often than not you will quickly discover a violating query, faulty cache configuration or poorly decomposed functionality enabling you to net your first quick wins in a matter of days or even hours.

## Using Intel Performance Counters To Tune Garbage Collection

<

div dir=”ltr” style=”text-align: left;”>

# Introduction

• Allocation: Based on hardware rules, memory addresses are matched to cache lines. If pieces of memory share a cache line but are actually accessed from different threads we get an effect called false sharing. However, if little bits of data are spread out but accessed from the same thread we get poor cache utilisation.
• Moving: Objects are not left in one place throughout their life times. The garbage collector avoids memory fragmentation by moving objects around. This has the interesting effect of guaranteeing that the cache lines associated with the object will no longer be associated with it after a move.
• Collecting: The funny thing is that collecting is the easy bit. It can be as simple as just marking the memory as available for reuse. It is the traversal of the object graphs (multiple roots) to find out what can be collected which is going to cause data cache line loads and thus evict lines from the cache which were being read from or written to by user code.

So we can now see that the design of the garbage collector is critical to the operation of the data cache. Swapping which collector we use will not only have a impact on GC pauses and other obvious issues, it will also effect, at a low level and in a fundamental way, all of user code.

# A Worked Example

pcm is just a command line tool. We pass the command line to run Java to it in quotes and it does its measurements. With other tooling, the performance counters can be used to get all sorts of other detail about an application. The benefit of the pcm command line tool is its simplicity and lack of intrusion into the over all application run. The disadvantage is that it will measure the JVM and application warm up phases. However, for server style applications or batch processors (like Sonic Field) these overheads are usually trivial compared to the actual application run. I ran my patch on my personal Macbook Pro Retina (2012) with 16Gig of RAM. The JVM was:

`java version "1.8.0-ea"Java(TM) SE Runtime Environment (build 1.8.0-ea-b61)Java HotSpot(TM) 64-Bit Server VM (build 25.0-b05, mixed mode)`

Readings from pcm are simply written to standard out when the application exits. I compared runs with no settings for the garbage collector (the default therefore) and with my currently preferred set of tweaks. To be honest, I am not sure if the tweaks are optimal; I kind of lifted them from a bunch of online articles… Here is the launch script for Java:

``/Users/alexanderturner/x/IntelPerformanceCounterMonitorV2.5.1 2/pcm.x "java -Xmx12G -Xms12G  -DsonicFieldTemp=/Users/alexanderturner/temp -DsonicFieldThreads=12 -DsonicFieldSwapLimit=4.0  -XX:+UseConcMarkSweepGC -XX:+UseCompressedOops -XX:ParallelGCThreads=8 -XX:+CMSParallelRemarkEnabled -XX:CMSInitiatingOccupancyFraction=60 -XX:+UseCMSInitiatingOccupancyOnly -classpath bin:spring-framework-3.1.2.RELEASE/dist/org.springframework.asm-3.1.2.RELEASE.jar:spring/spring-framework-3.1.2.RELEASE/dist/org.springframework.beans-3.1.2.RELEASE.jar:spring-framework-3.1.2.RELEASE/dist/org.springframework.core-3.1.2.RELEASE.jar:spring/spring/spring-framework-3.1.2.RELEASE/dist/org.springframework.context-3.1.2.RELEASE.jar:spring/spring-framework-3.1.2.RELEASE/dist/org.springframework.context-support-3.1.2.RELEASE.jar:spring/spring-framework-3.1.2.RELEASE/dist/org.springframework.expression-3.1.2.RELEASE.jar:spring/spring-framework-3.1.2.RELEASE/dist/org.springframework.test-3.1.2.RELEASE.jar:spring/otherJars/commons-logging-1.1.1.jar com.nerdscentral.sfpl.RenderRunner \$1"``

Hopefully it is clear just how simple running Java under Intel Performance Counter Monitor v2 really is. So, here is the output:

## Standard GC

Core (SKT) EXEC IPC FREQ AFREQ L3MISS L2MISS L3HIT L2HIT L3CLK L2CLK READ WRITE TEMP
0 0.53 0.75 0.70 1.31 422 M 621 M 0.32 0.32 0.14 0.01 N/A N/A 32
2 0.56 0.77 0.73 1.31 346 M 466 M 0.26 0.31 0.11 0.01 N/A N/A 28
1 0.22 0.69 0.32 1.31 144 M 192 M 0.25 0.28 0.11 0.01 N/A N/A 32
3 0.21 0.68 0.31 1.31 135 M 171 M 0.21 0.28 0.10 0.01 N/A N/A 28
4 0.55 0.77 0.71 1.31 332 M 410 M 0.19 0.38 0.11 0.01 N/A N/A 22
7 0.18 0.68 0.26 1.30 124 M 134 M 0.08 0.30 0.11 0.00 N/A N/A 27
5 0.19 0.68 0.29 1.31 133 M 155 M 0.14 0.30 0.11 0.00 N/A N/A 22
6 0.61 0.79 0.78 1.32 343 M 382 M 0.10 0.35 0.10 0.00 N/A N/A 27

—————————————————————————————————-

 SKT 0 0.38 0.75 0.51 1.31 1982 M 2533 M 0.22 0.33 0.11 0.01 N/A N/A 22

Instructions retired: 2366 G ; Active cycles: 3166 G ; Time (TSC): 773 Gticks ; C0 (active,non-halted) core residency: 39.04 % —————————————————————————————————

 TOTAL * 0.38 0.75 0.51 1.31 1982 M 2533 M 0.22 0.33 0.11 0.01 N/A N/A N/A

C: 1.49 => corresponds to 37.36 % utilization for cores in active state. Instructions per no C1 core residency: 23.92 %; C3 core residency: 0.01 %; C6 core residency: 0.00 %; C7 core residency: 37.02 % C2 package residency: 0.00 %; C3 package residency: 0.00 %; C6 package residency: 0.00 %; C7 package residency: 0.00 % PHYSICAL CORE I Pminal CPU cycle: 0.76 => corresponds to 19.12 % core utilization over time interval

## Concurrent Mark Sweep

Rather than
`-XX:+UseConcMarkSweepGC-XX:+UseCompressedOops-XX:+CMSParallelRemark-XX:ParallelGCThreads=Enabled-XX:CMSInitiatingOccupancyFraction=60-XX:+UseCMSInitiatingOccupancyOnly`
Core (SKT) EXEC IPC FREQ AFREQ L3MISS L2MISS L3HIT L2HIT L3CLK L2CLK READ WRITE TEMP
0 0.53 0.69 0.76 1.31 511 M 781 M 0.35 0.35 0.17 0.02 N/A N/A 26
2 0.54 0.71 0.75 1.31 418 M 586 M 0.29 0.40 0.14 0.01 N/A N/A 29
1 0.31 0.66 0.47 1.30 207 M 285 M 0.27 0.26 0.11 0.01 N/A N/A 26
3 0.21 0.68 0.31 1.31 135 M 171 M 0.21 0.28 0.10 0.01 N/A N/A 28
4 0.55 0.77 0.71 1.31 332 M 410 M 0.19 0.38 0.11 0.01 N/A N/A 22
7 0.18 0.68 0.26 1.30 124 M 134 M 0.08 0.30 0.11 0.00 N/A N/A 27
5 0.19 0.68 0.29 1.31 133 M 155 M 0.14 0.30 0.11 0.00 N/A N/A 22
6 0.61 0.79 0.78 1.32 343 M 382 M 0.10 0.35 0.10 0.00 N/A N/A 27
3 0.30 0.66 0.46 1.30 198 M 258 M 0.23 0.27 0.11 0.01 N/A N/A 29
4 0.59 0.73 0.81 1.31 397 M 504 M 0.21 0.46 0.12 0.01 N/A N/A 29
7 0.30 0.66 0.45 1.30 186 M 204 M 0.09 0.29 0.11 0.00 N/A N/A 30
7 0.30 0.66 0.45 1.30 186 M 204 M 0.09 0.29 0.11 0.00 N/A N/A 30
5 0.30 0.66 0.45 1.30 188 M 225 M 0.16 0.28 0.11 0.01 N/A N/A 29
6 0.58 0.73 0.79 1.31 414 M 466 M 0.11 0.49 0.13 0.00 N/A N/A 30

—————————————————————————————————-

SKT 0 0.43 0.70 0.62 1.31 2523 M 3313 M 0.24 0.38 0.13 0.01 N/A N/A 25
Instructions retired: 2438 G ; Active cycles: 3501 G ; Time (TSC): 708 Gticks ; C0 (active,non-halted) core residency: 47.22 %

—————————————————————————————————- TOTAL * 0.43 0.70 0.62 1.31 2523 M 3313 M 0.24 0.38 0.13 0.01 N/A N/A N/A

C: 1.39 => corresponds to 34.83 % utilization for cores in active state. Instructions per no C1 core residency: 17.84 %; C3 core residency: 0.01 %; C6 core residency: 0.01 %; C7 core residency: 34.92 % C2 package residency: 0.00 %; C3 package residency: 0.00 %; C6 package residency: 0.00 %; C7 package residency: 0.00 %

PHYSICAL CORE I
Pminal CPU cycle: 0.86 => corresponds to 21.51 % core utilization over time interval. All the information given here is of interest, however, there is so much of it I figure the best thing to do is cut the the case and test my assertion about the CMS collector. To do that we can look at just two lines form the output for each run:

Default:
SKT 0 0.38 0.75 0.51 1.31 1982 M 2533 M 0.22 0.33 0.11 0.01 N/A N/A 22
Instructions retired: 2366 G ; Active cycles: 3166 G ; Time (TSC): 773 Gticks ; C0 (active,non-halted) core residency: 39.04 %
CMS:
0 0.43 0.70 0.62 1.31 2523 M 3313 M 0.24 0.38 0.13 0.01 N/A N/A 25
SKT:
Instructions retired: 2438 G ; Active cycles: 3501 G ; Time (TSC): 708 Gticks ; C0 (active,non-halted) core residency: 47.22 %

## Discussion

We can see that under the CMS collector there were substantially more cache misses. The L2 misses were 30% greater and L2 were up 27% over the default collector. Nevertheless, the total time taken in giga ticks (708CMS/773Default) shows that all this extra data missing has not negatively impacted over all performance at all. I guess this means that a lot more research could and should be done before drawing any conclusions as to the correct approach for this application!
If you leave this post thinking that I did not fully discuss the subject, you are correct. My intention here has been to get the reader interested in thinking about this aspect of Java performance and opening the door to a new approach.