## How (NOT TO) measure latency

Latency is defined as

time interval between the stimulation and response

and is a value which is of importance in many computer systems (financial systems, games, websites, etc). Hence we – as computer engineers – want to specify some upper bounds / worst case scenarios for the systems we build. How can we do this? The days of counting cycles for assembly instructions are long gone (unless you work on embedded systems) – there are just too many additional factors to consider (the operating system – mainly the task scheduler, other running processes, the JIT, the GC, etc). The remaining alternative is doing empirical (hands on) testing.

### Use percentiles

So we whip out JMeter, configure a load test, take the mean (average) value 土 3 x standard deviation and proudly declare that 99.73% of the users will experience latency which is in this interval. We are especially proud because (a) we considered a realistic set of calls (URLs if we are testing a website) and (b) we allowed for JIT warmup.

But we are still very wrong! (which can be sad if our company writes SLAs based on our numbers – we can bankrupt the company single-handedly!)

Lets see where the problem is and how we can fix it before we cause damage. Consider the dataset depicted below (you can get the actual values here to do your own calculations).

For simplicity there are exactly 100 values used in this example. Lets say that they represent the latency of fetching a particular URL. You can immediately tell that the values can be grouped in three distinct categories: very small (perhaps the data was already in the cache?), medium (this is what most users will see) and poor (probably there are some corner-cases). This is typical for medium-to-large complexity (ie. “real life”) composed of many moving parts and is called a multimodal distributions. More on this shortly.

If we quickly drop these values into LibreOffice Calc and do the number crunching, we’ll come to the conclusion that the average (mean) of the values is 40 and according to the six sigma rule 99.73% of the users should experience latencies less than 137. If you look at the chart carefully you’ll see that the average (marked with red) is slightly left of the middle. You can also do a simple calculation (because there are exactly 100 values represented) and see that the maximum value in the 99th percentile is 148 not 137. Now this might not seem like a big difference, but it can be the difference between profit and bankrupcy (if you’ve written a SLA based on this value for example).

Where did we go wrong? Let’s look again carefully at the three sigma rule (emphasis added):

nearly all values lie within three standard deviations of the mean in a normal distribution.

Our problem is that we don’t have a normal distribution. We probably have a multimodal distribution (as mentioned earlier), but to be safe we should use ways of interpreting the results which are independent of the nature of the distribution.

From this example we can derive a couple of recommendations:

1. Make sure that your test framework / load generator / benchmark isn’t the bottleneck – run it against a “null endpoint” (one which doesn’t do anything) and ensure that you can get an order of magnitude better numbers
2. Take into account things like JITing (warmup periods) and GC if you’re testing a JVM based system (or other systems which are based on the same principles – .NET, luajit, etc).
3. Use percentiles. Saying things like “the median (50th percentile) response time of our system is…”, “the 99.99th percentile latency is…”, “the maximum (100th percentile) latency is…” is ok
4. Don’t calculate the average (mean). Don’t use standard deviation. In fact if you see that value in a test report you can assume that the people who put together the report (a) don’t know what they’re talking about or (b) are intentionally trying to mislead you (I would bet on the first, but that’s just my optimism speaking).

### Look out for coordinated omission

Coordinate omission (a phrase coined by Gil Tene of Azul fame) is a problem which can occur if the test loop looks something like:

``` start: t = time() do_request() record_time(time() - t) wait_until_next_second() jump start ```

That is, we’re trying to do one request every second (perhaps every 100ms would be more realistic, but the point stands). Many test systems (including JMeter and YCSB) have inner loops like this.

We run the test and (learning from the previous discussion) report: the 85% of the request will be served under 0.5 seconds if there are 1 requests per second. And we still can be wrong! Let us look at the diagram below to see why:

On the first line we have our test run (horizontal axis being time). Lets say that between second 3 and 6 the system (and hence all requests to it) are blocked (maybe we have a long GC pause). If you calculate the 85th percentile, you’ll 0.5 (hence the claim in the previous paragraph). However, you can see 10 independent clients below, each doing the request in a different second (so we have our criteria of one request per second fulfilled). But if we crunch the numbers, we’ll see that the actual 85th percentile in this case is 1.5 (three times worse than the original calculation).

Where did we go wrong? The problem is that the test loop and the system under test worked together (“coordinated” – hence the name) to hide (omit) the additional requests which happen during the time the server is blocked. This leads to underestimating the delays (as shown in the example).

1. Make sure every request less than the sampling interval or use a better benchmarking tool (I don’t know of any which correct for this) or post-process the data with Gil’s HdrHistogram library which contains built-in facilities to account for coordinated omission

## 2013 – The Java/JVM Year in Review and a Look to the Future!

Hi all, It’s been an amazing year for Java/JVM developers!

For me its been a year of fine tuning the Adopt OpenJDK and Adopt a JSR programmes with the LJC and working hard at jClarity – The Java/JVM performance analysis company that I’ve now been demoted to CEO of ;-).

In terms of the overall ecosystem there’s been some great steps forward:

1. The language and platform continues to evolve (e.g. Java 8 is looking in great shape!).
2. The standards body continues to become more inclusive and open (e.g. More Java User Groups have joined the Executive Committee of the JCP).
3. The community continues to thrive (record numbers of conferences, Java User Groups and GitHub commits).

As is traditional, I’ll try and take you through some of the trends of the past year that was and a small peek into the future as well.

# 2013 Trends

There were some major trends in the Java/JVM and software development arena in 2013. Mobile Apps have become the norm, virtualised infrastructure is everywhere and of course Java is still growing faster (in absolute terms) than any other language out there except for Javascript. Here are some other trends worth noting.

### Functional Programming

Mainstream Java developers have adopted Scala, Groovy, Clojure and other JVM languages to write safer, more concise code for call back handlers, filters, reduces and a host of other operations on collections and event handlers. A massive wave of developers will of course embrace Java 8 and Lambdas when it comes out in early 2014.

### Java moving beyond traditional web apps

In 2013 Java firmly entrenched itself into areas of software development outside of the traditional 3-tier web application. It is used realm of NoSQL (e.g. Hadoop), High performance Messaging systems (e.g. LMAX Disruptor), Polyglot Actor-Like Application Frameworks (e.g. Vert.x) and much much more.

### HTML 5

2013 has started to see the slow decline of Java/JVM serverside templating systems as the rise of Javascript libraries and micro frameworks (e.g. AngularJS, Twitter Bootstrap) means that developers can finally have a client side tempalting framework that doesn’t suck and has real data binding to JSON/XML messages going to and from the browser. Websockets and asynchronous messaging are also nicely taken care of by the likes of SockJS.

# 2014+ – The Future

### Java 8 – Lambdas

More than enough has been written about these, so I’ll offer up a practical tutorial and the pragmatic Java 8 Lambdas Book for 2014 that you should be buying!

### Java 9/10 – VM improvements

Java 9 and 10 are unlikely to bring many changes to the language itself, apart from applying Java 7 and 8 features to the internal APIs, making the overall Java API a much more pleasant experience to use.

So what is coming? Well the first likely major feature will be some sort of packed object or value type representation. This is where you will be able to defined a Java ‘object’ as being purely a data type. What do I mean by this? Well it means that the ‘object’ will be able to be laid out in memory very efficiently and not have the overhead of `hashCode`, `equals` or other regalia that comes with being a Java `Object`. Think of it a little bit like a C/C++ `struct`. Here’s an example of what it might look like in source code:

``` @Packedfinal class PackedPoint extends PackedObject { int x, y;} ```

And the native representation:

``` // Native representationstruct Point { int x, y;} ```

There are loads of performance advantages to this, especially for immutable, well defined data structures such as points on a graph or within a 3D model. The JVM will be able to use far more efficient housekeeping mechanisms with respects to Memory Layout, Lookups, JIT and Garbage Collection.

Some sort of Reified generics will also come into the picture, but it’s more likely to be an extension of the work done of the ‘packed object’ idea and removing the need to unnecessary overhead in maintaining and the manipulation of collections that use the Object representation of primitives (e.g. `List`).

### Java 9/10 – ME/SE convergence

It was very clear at JavaOne this year that Oracle are going to try and make Java a player in the IoT space. The opportunuites are endless, but the JVM has some way to go to find the right balance bewtwen being able to work on really small decvices (which is where Java ME and CDLC sit today) and providing a rich language features and a full stack for Java developers (which is where Java SE and EE sit today).

The merger between ME and SE begins in earnest for Java 8 and will hopefully be completed by Java 9. I encourage all of oyu to try out early versions of this on your Raspberry Pi!

### But wait there’s more!

The way Java itself is being build and maintained is changing itself with a move towards more open development at OpenJDK. Features such as Tail Call Recursion and Co-routines are being discussed and there is a port on the way to have Java natively work on graphics cards.

There’s plenty to challenge yourself with and also have a lot of fun in 2014! Me? I’m going to try and get Java running on my Pi and maybe build that AngularJS front end for controlling hardware in my house…..

Happy Holidays everyone!

Martijn (@karianna) Java Champion, Speaker, Author, Cat Herder etc

## (Part 3 of 3): Synopsis of articles & videos on Performance tuning, JVM, GC in Java, Mechanical Sympathy, et al

This is a continuation of the previous post titled (Part 2 of 3): Synopsis of articles & videos on Performance tuning, JVM, GC in Java, Mechanical Sympathy, et al.

In our first review, The Atlassian guide to GC tuning is an extensive post covering the methodology and things to keep in mind when tuning GC, practical examples are given and references to important resources are also made in the process. The next one How NOT to measure latency by Gil Tene, he discusses some common pitfalls encountered in measuring and characterizing latency, demonstrating and discussing some false assumptions and measurement techniques that lead to dramatically incorrect reporting results, and covers simple ways to sanity check and correct these situations.  Finally Kirk Pepperdine in his post Poorly chosen Java HotSpot Garbage Collection Flags and how to fix them! throws light on some JVM flags – he starts with some 700 flags and boils it down to merely 7 flags. Also cautions you to not just draw conclusions or to take action in a whim but consult and examine – i.e. measure don’t guess!

Background

By default JVM tunings attempt to provide acceptable performance in majority of the cases, but it may not always give the desired results depending on application behaviour. Benchmark your application before applying any GC or JVM related settings. There are a number of environment, OS, hardware related factors to taken into consideration and many a times any tuning to JVM may be linked to these factors and if they chance the tuning may need to be revisited. Beware and accept the fact that any tuning has its upper limit in a given environment, if your goals are still not met, then you need to improve your environment. Always monitor your application to see if your tuning goals are still in scope.

Choosing Performance Goals – GC for the Oracle JVM can be reduced by targeting three goals ie. latency (mean & maximum), throughput, and footprint. To reach the tuning goals, there are principles that guide you to tuning the GC i.e. Minor GC reclamation, GC maximise memory, two of three (pick 2 of 3 of these goals, and sacrifice the other).

Preparing your environment for GC –

Some of the items of interest are load the application with work (apply load to the application as it would be in production), turn on GC logging, set data sampling windows, determine memory footprint, rules of thumb for generation sizes, determine systemic requirements.

Iteratively change the JVM parameters keeping the environment its setting in tact. An example of turning on GC logging is:

`java -XX:+PrintGC -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Xloggc:<file> …`
or
`java -XX:+PrintGC -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:<file> …`

Sampling windows are important to help specifically diagnose an application’s runtime rather than the load-time or time it or the JVM takes to stabilise before it can execute anything. Set the heap sizes either by guessing (give it maximum and then pull back) or by using results accumulated in the GC logs. Also check for OOME which can be an indicator if heap size or standard memory needs increasing. There are a number of rules of thumb for generation sizes (a number of them depend on sizes of other pools).

Understanding the Throughput collector and Behaviour based Tuning
Hotspot uses behaviour based tuning with parallel collectors including the MarkAndSweep collector, they are designed to keep three goals in mind i.e. maximum pause time, throughput and footprint. The-XX:+PrintAdaptiveSizePolicy JVM flag prints details of this behaviour for analysis. Even though there are options to set the maximum pause time for any GC event there is no guarantee it will be performed. It also uses the best of 2 (from 3) goals – i.e. only 2 goals are focussed at a time, and there is interaction between goals, and sometimes there is oscillation between them.

Time to tune
There is a cycle to follow when doing this, i.e.
1. Determine desired behaviour.
2. Measure behaviour before change.
3. Determine indicated change.
4. Make change.
5. Measure behaviour after change.
6. If behaviour not met, try again.
Repeat the above till the results are met (or nearly there).

Tools
Java VisualVM with Visual GC plugin for live telemetry (note there’s an observer effect when using VisualVM, which can have an impact on your readings). GCViewer by Tagtruam industries, useful for post-hoc review of GC performance based on the GC log.

Conclusions: a lot of factors need to be taken into consideration when tuning GC. A meticulous plan needs to be considered, keeping an eye on your goal throughout the process and once achieved, you continue to monitor your application to see if your performance goals are still intact. Adjust your requirements after studying the results of your actions such that they both meet at some point.

— The authors cover the methodology in good detail and give practical advise and steps to follow in order to tune your application – the hands-on and practical aspects of the blog should definitely be read. —

The author starts with saying  “some of the mistakes others including the author has made” – plenty of wrong ways to do things and what to avoid doing.

WHY and HOW – you measure latency, HOW only makes sense if you understand the WHY. There is public domain code / tools available that can be used to measure latency.

Don’t use statistics to measure latency!

Classic way of looking at latency – response time (latency) is a function of load! Its important to know what this response time – is it average, worst case, 99 percentile, etc… based on different types of behaviours?

Hiccups – some sort of accumulated work that gets performed in bursts (spikes). Important to look at these behaviours when looking at response time and average response time. They are not evenly spread around, but may look like periodic freezes, shifts from one mode/behaviour to another (no sequence).

Common fallacies:
– computers run application code continously
– response time can be measure as a work units over time
– response time exhibits a normal distribution
– “glitches” or “semi-random omissions” in measurement don’t have a big effect

Hiccups are hidden knowingly or unknowingly – i.e. averages and standard deviations can help!
Many compute the 99 percentile or another figure from averages and distributions. A better way to deal with it is to measure it using data – for e.g. plot a histogram. You need to have a SLA for the entire spread and plot the distribution across 0 to 100 percentile.

Latency tells us how long something should take! True real-time is being slow and steady!
Author gives examples of types of applications with different latency needs i.e. The Olympics, Pacemaker, Low latency trading, Interactive applications.

A very important way to establishing requirements – an extensive interview process, to help the client come up with a realistic figures for the SLA requirements. We need to find out what is the worst the client is willing to accept and for how long or how often is that acceptable, to help construct a percentile table.

Question is how fast can the system run, smoothly without hitting an obstacles or crashes! The author further makes a comparison of performance between the HotSpot JVM and C4.

Coordinated omission problem – what is it? data that it is not emitted in a random way but in a coordinated way that skews the actual data. Delays in request times does not get measured due to the manner in which response time is recorded from within the system – which omits random uncoordinated data. The maximum value (time) in the data usually is the key value to start with to compute and compare the other percentiles and check if it matches with that computed with coordinated emission in it. Percentiles are useful, compute them.

Graphs with sudden bumps and more smooth lines tend to have coordinated emission in it.

HdrHistogram helps plot histograms – a configurable precision system, telling it the range and precision to cover. It has built in compensation for Coordinated Omission – its OpenSource and available on Github under the CCO license.

jHiccup also plots histogram to show how your computer system has been behaving (run-time and freeze time), also records various percentiles, maximum and minimum values, distribution, good modes, bad modes, etc…. Also opensource under COO license.

Measuring throughput without a latency requirement, the information is meaningless.  Mistakes in measurement/analysis can lead to orders-of-magnitude errors and lead to bad business decisions.

— The author has shown some cool graphs and illustrations of how we measure things wrong and what to look out for when assessing data or even graphs. Gives a lot of tips, on how NOT to measure latency. —

Poorly chosen Java HotSpot Garbage Collection Flags and how to fix them! by Kirk Pepperdine

There are more than 700 product flags defined in the HotSpot JVM and not many have a good idea of what effect it might have when used in runtime, let be the combination of two or more together.

Whenever you think you know a flag, it often pays if you check out the source code underlying that flag to get a more accurate idea.

Identifying redundant flags
The author enlists a huge list of flags for us to determine if we need them or not. It see a list of flags available with your version of java, try this command, to get a long list of options:

\$ java -XX:+PrintFlagsFinal

You can narrow down the list by eliminating deprecated flags and ones with a default value, leaving us with a much shorter list.

DisableExplicitGC, ExplicitGCInvokesConcurrent, and RMI Properties
Calling System.gc() isn’t a great idea and one way to disable that functionality from code that uses it, is to use the DisableExplicitGC flag as one of the arguments of your VM options flags. Or instead use the ExplicitGCInvokesConcurrent flag to invoke the Concurrent Mark and Sweep cycle instead of the Full GC cycle (invoked by System.gc()). You can also set the interval period between complete GC cycles using the sun.rmi.dgc.client.gcInterval and sun.rmi.dgc.server.gcInterval properties.

Xmx, Xms, NewRatio, SurvivorRatio and TargetSurvivorRatio
The -Xmx flag is used to set the maximum heap size, and all JVMs are meant to support this flag. By adjusting the minimum and maximum heap sizes such that the JVM can’t resize the heap size at runtime disables the properties of the adaptive sizing properties.

Use GC logs to help make memory pool size decisions. There is a ratio to maintain between Young Gen and Old Gen and within Young Gen between Eden and the Survivor spaces. The SurvivorRatio flag is one such flag that helps decide how much space will be used by the survivor spaces in Young gen and the rest will be left for Eden. These ratios are key towards determining the frequency and efficiency of the collections in Young Gen. TargetSurvivorRatio is another ratio to examine and cross-check before using.

PermSize, MaxPermSize

These options are removed starting JDK 8 in favour of Meta space, but in previous versions it helped in setting PermGen’s resizing abilities. Restricting its adaptive nature might lead to longer pauses of old generational spaces.

UseConcMarkSweepGC, CMSParallelRemarkEnabled, UseCMSInitiatingOccupancyOnly CMSInitiatingOccupancyFraction

UseConcMarkSweepGC makes the JVM select the CMS collector, CMS works with ParNew   instead of the PSYoung and PSOld collectors for the respective generational spaces.

CMSInitiatingOccupancyFraction (IOF) is flag that helps the JVM determine how to maintain the occupancy of data (by examining the current rate of promotion) in the Old Gen space, sometimes leading to STW, FullGCs or CMF (concurrent mode failure).

UseCMSInitiatingOccupancyOnly tells the JVM to use the IOF without taking the rate of promotion into account.

CMSParallelRemarkEnabled helps parallelise the fourth phase of the CMS (single threaded by default), but use this option only after benchmarking your current production systems.

CMSClassUnloadingEnabled ensures that classes loaded in the PermGen space is regularly unloaded reducing situations of out-of-PermGen-space. Although this could lead to longer concurrent and remark (pause) collection times.

AggressiveOpts

AggressiveOpts as the name suggests helps improve performance by switching on and off certain flags but this strategy hasn’t change across various builds of Java and one must examine benchmarks of their system before using this flag in production.

Next Steps
Of the number of flags examined the below are more interesting flags to look at:
`-ea -mx4G -XX:+MaxPermSize=300M -XX:+UseConcMarkSweepGC -XX:+CMSClassUnloadingEnabled -XX:SurvivorRatio=4 -XX:+DisableExplicitGC`
`Again examine the GC logs to see if any of the above should be used or not, if used, what values are appropriate to set. Quite often the JVM does get it right, right out of the box, hence getting it wrong can be detrimental to your application performance. Its easy to mess up then to get it right, when you have so many flags with dependent and overlapping functionalities.`

`Conclusion:  Refer to benchmarks and GC logs for making decisions on enabling and disabling of flags and the values to set when enabled. Lots of flags, easy to get them wrong, only use them if they are absolutely needed, remember the Hotspot JVM has built in machinery to adapt (adaptive policy), consult an expert if you still want to.`

— The author has done justice by giving us a gist of what the large number of JVM flags and recommend reading the article to learn about specifics of certain flags.. —

As it is not practical to review all such videos and articles, a number of them have been provided in the links below for further study. In many cases I have paraphrased or directly quoted what the authors have to say to preserve the message and meaning they wished to convey.

A few other authors have written articles related to these subjects, for the Java Advent Calendar, see below,

## Reactive Cassandra …

Overview
Let’s first try to define what reactive means from programming point of view.

Functional reactive programming is programming paradigm for reactive programming using the building blocks of functional programming.

Functional programming is a programming paradigm, a style of building the structure and the elements of computer programs, that treats computation, as the evaluation of mathematical functions thats avoids state and mutable data. Functional programming emphasises functions that produce results that depend only on  their inputs and not on program state.

How can we do functional programming in Java? Java is Object Oriented Programming language where mutable state is present everywhere.

Any java developer around the world have used any of the interfaces: java.lang.Runnable, java.util.Comparator, java.util.concurrent.Callable or java.awt.event.ActionListener. All of this interfaces are having only single method declared. These interfaces are known as Single Abstract Methods or SAM. A popular way as these are used is by creating Anonymous inner classes.

`public class RunnableTest {  public static void main(Sting[] args){    new Thread(new Runnable(){      @Override      public void run(){        System.out.println("A new thread is running ...");      }    }).start();  }}`

Functional Programming in Java is hard since function is not included in the language specification. It will become simpler in Java 8 with introduction of “lambda’s”. But how can we do functional programming in Java?
Let’s see a simple example.

``@FunctionalInterfacepublic interface Worker {   public void doWork();}public class FunctionalWorker {  public static void main(String[] args){    // anonymous inner class way    execute( new Worker(){      @Override      public void doWork() {        System.out.println ("working ...");      }    });    // lambda's way    execute(() -> System.out.println("working in lambda's way ..."));  }    public static void execute(Worker worker){    worker.doWork();  }}``

Reactive programming is a programming paradigm oriented around data flows and the propagation of changes. For example, in imperative programming setting, a := b+c, would mean that a is being assigned the result of  b +c in the instant the expression is evaluated. Later values of b or c can be changed without effect on a. In reactive programming, the value of a will be automatically updated based on the new values.

So, we should have a good understanding of what Functional Reactive Programming is, so let’s go and build a prototype…

Cassandra is one of the NoSql storage out there is quite popular.
Let’s imagine that we have to build an Avatar service. This service will store avatars meta information and the content directly in cassandra.

The java driver that we are using is providing us support to query cassandra asynchronous, through the executeAsync() method. The invocation of this method will return a Future. As we all know java Futures are block-able and could not be composed.

Ok, so we have async support but we still need a way to be able to read it reactively…

Netflix built and later open sourced the RxJava library that is providing Functional Reactive Programming for Java (plus other JVM languages).

Functional reactive offers efficient execution and composition by providing a collection of operators capable of filtering, selecting, transforming, combining and composing Observable’s.
The Observable data type can be thought of as a “push” equivalent to Iterable which is “pull”. With an Iterable, the consumer pulls values from the producer and the thread blocks until those values arrive. By contrast with the Observable type, the producer pushes values to the consumer whenever values are available. This approach is more flexible, because values can arrive synchronously or asynchronously.
The Observable type adds two missing semantics to the Gang of Four’s Observer pattern, which are available in the Iterable type:
1. The ability for the producer to signal to the consumer that there is no more data available.
2. The ability for the producer to signal to the consumer that an error has occurred.

With these two simple additions, we have unified the Iterable and Observable types. The only difference between them is the direction in which the data flows. This is very important because now any operation we perform on an Iterable, can also be performed on an Observable.

Let’s see how we can combine the RxJava and Cassandra async query execution to build an Observable.

``` package net.devsprint.reactive.cassandra; import java.util.concurrent.ExecutorService; import rx.Observable;import rx.Observer;import rx.Subscription;import rx.subscriptions.Subscriptions; import com.datastax.driver.core.ResultSet;import com.datastax.driver.core.Session;import com.google.common.util.concurrent.FutureCallback;import com.google.common.util.concurrent.Futures; /** * Wraps an async execution of Datastax Java driver into an observable. * */public class ObservableCassandra { public static Observable executeAsync(final Session execution, final String query, final ExecutorService executorService) { return Observable.create(new Observable.OnSubscribeFunc() { @Override public Subscription onSubscribe(final Observer observer) { try { Futures.addCallback(execution.executeAsync(query), new FutureCallback() { @Override public void onSuccess(ResultSet result) { observer.onNext(result); observer.onCompleted(); } @Override public void onFailure(Throwable t) { observer.onError(t); } }, executorService); } catch (Throwable e) { // If any Throwable can be thrown from // executeAsync observer.onError(e); } return Subscriptions.empty(); } }); } } ```

executeAsync() method is returning a Guava Listenable Future. Adding a callback on this future allows us to properly implement the Observer interface.

A simple service could be implemented as following:

``` package net.devsprint.reactive.cassandra; import java.util.concurrent.ExecutorService;import java.util.concurrent.Executors; import rx.Observable; import com.datastax.driver.core.ResultSet;import com.datastax.driver.core.Session; public class AvatarService { private static final String QUERY = "select * avatars"; private static final ExecutorService executorService = Executors .newFixedThreadPool(Runtime.getRuntime().availableProcessors()); private final Session session; public AvatarService(Session session) { this.session = session; } Observable getAvatars() { return ObservableCassandra .executeAsync(session, QUERY, executorService); } } ```

Assuming that the query is heavy, we are providing a separate execution context where the callback will be executed.

With these two classes we have an Avatar service that could be started but it will not do any thing. It will start fetching the data from Cassandra only when there will be at least one subscriber.

A complete example could be found at Reactive Cassandra.

## Big Data the ‘reactive’ way

A metatrend going on in the IT industry is a shift from query-based, batch oriented systems to (soft) realtime updated systems. While this is associated with financial trading only, there are many other examples such as “Just-In-Time”-logistic systems, flight companies doing realtime pricing of passenger seats based on demand and load, C2C auction system like EBay, real time traffic control and many more.

It is likely this trend will continue, as the (commercial) value of information is time dependent, value decreases with age of information.

Automated trading in the finance sector is just a forerunner in this area, because some microseconds time advantage can be worth millions of dollars. Its natural real time processing systems evolve in this domain faster.

However big parts of traditional IT infrastructure is not designed for reactive, event based systems. From query based databases to request-response based Http protcol, the common paradigm is to store and query data “when needed”.

#### Current Databases are static and query-oriented

Current approaches to data management such as SQL and NOSQL databases focus on data transactions and static query of data. Databases provide convenience in slicing and dicing data but they do not support update of complex queries in real time. Uprising NOSQL databases still focus on computing a static result.
Databases are clearly not “reactive”.

#### Current Messaging Products provide poor query/filtering options

Current messaging products are weak at filtering. Messages are separated into different streams (or topics), so clients can do a raw preselection on the data received. However this frequently means a client application receives like 10 times more data than needed, doing fine grained filtering ‘on-top’.
A big disadvantage is, that the topic approach builts filter capabilities “into” the system’s data design.
E.g. if a stock exchange system splits streams on a per-stock base, a client application still needs to subscribe to all streams in order to provide a dynamically updated list of “most active” stocks. Querying usually means “replay+search the complete message history”.

### A scalable, “continuous query” distributed Datagrid.

I had the enjoyment to do conceptional & technical design for a large scale realtime system, so I’d like to share a generic scalable solution for continuous query processing at high volume and large scale.

It is common, that real-time processing systems are designed “event sourced”. This means, persistence is replaced by journaling transactions. System state is kept in memory, the transaction journal is required for historic analysis and crash recovery only.
Client applications do not query, but listen to event streams instead. A common issue with event sourced systems is the problem of “late joining client”. A late client would have to replay the whole system event journal in order to get an up-to-date snapshot of the system state.
In order to support late joining clients, a kind of “Last Value Cache” (LVC) component is required. The LVC holds current system state and allows late joiners to bootstrap by querying.
In a high performance, large data system, the LVC component becomes a bottleneck as the number of clients rises.

#### Generalizing the Last Value Cache: Continuous Queries

In a continuous query data cache, a query result is kept up to date automatically. Queries are replaced by subscriptions.

subscribe * from Orders where
symbol in [‘ALV’, ‘BMW’] and
volume > 1000 and
owner=’MyCompany’

creates a message stream, which initially performs a query operation, after that updates the result set whenever a data change affecting the query result happened (transparent to the client application). The system ensures each subscriber receives exactly the change notifications necessary to keep its “live” query results up-to-date.

 A distributed continous query system: The LVC Nodes hold data. Transactions are sent to them on a message bus (red). The LVC nodes compute the actual difference caused by a transaction and send change notifications on a message bus (blue).  This enables “processing nodes” to keep a mirror of their relevant data partition up-to-date. External clients connected via TCP/Http do not listen to the message bus (because multicast is not an option in WAN). “Subscription processors” keep the client’s continuous queries up-to-date by listening to the (blue) message bus and dispatching required change notifications only to client’s point2point connection.

Difference of data access patterns compared to static data management:

• High write volume
Real time systems create a high volume of write access/change in data.
• Fewer full table scans.
Only late-joining clients or changes of a query’s condition require a full data scan. Because continuous queries make “refreshing” a query result obsolete, Read/Write ratio is ~ 1:1 (if one counts the change notification resulting from a transaction as “Read Access”).
• The majority of load is generated, when evaluating queries of active continuous subscriptions with each change of data. Consider a transaction load of 100.000 changes per second with 10.000 active continuous queries: this requires 100.000*10.000 = 1 Billion evaluations of query conditions per second. That’s still an underestimation: When a record gets updated, it must be tested whether the record has matched a query condition before the update and whether it matches after the update. A record’s update may result in an add (because it matches after the change) or a remove transaction (because the record does not match anymore after a change) to a query subscription (or ‘update’, or ‘skip’ ofc).

Data Cluster Nodes (“LastValueCache Nodes”)

Data is organized in tables, column oriented. Each table’s data is evenly partitioned amongst all data grid nodes (=last value cache node=”LVC node”). By adding data nodes to the cluster, capacity is increased and snapshot queries (initializing a subscription) are sped up by increased concurrency.

There are three basic transactions/messages processed by the data grid nodes:

• RemoveRow(table,rowId),
• UpdateRow(table, rowId, diff).

The data grid nodes provide a lambda-alike (row iterator) interface supporting the iteration of a table’s rows  using plain java code. This can be used to perform map-reduce jobs and as a specialization, the initial query required by newly subscribing clients. Since ongoing computation of continuous queries is done in the “Gateway” nodes, the load of data nodes and the number of clients correlate weakly only.

All transactions processed by a data grid node are (re-)broadcasted using multicast “Change Notification” messages.

#### Gateway Nodes

Gateway nodes track subscriptions/connections to client applications. They listen to the global stream of change notifications and check whether a change influences the result of a continuous query (=subscription). This is very CPU intensive.

Two things make this work:

1. by using plain java to define a query, query conditions profit from JIT compilation, no need to parse and interpret a query language. HotSpot is one of the best optimizing JIT compilers on the planet.
2. Since multicast is used for the stream of global changes, one can add additional Gateway nodes with ~no impact on throughput of the cluster.

#### Processor (or Mutator) Nodes

These nodes implement logic on-top of the cluster data. E.g. a statistics processor does a continuous query for each table, incrementally counts the number of rows of each table and writes the results back to a “statistics” table, so a monitoring client application can subscribe to realtime data of current table sizes. Another example would be a “Matcher processor” in a stock exchange, listening to orders for a stock, if orders match, it removes them and adds a Trade to the “trades” table.
If one sees the whole cluster as kind of a “giant spreadsheet”, processors implement the formulas of this spreadsheet.

#### Scaling Up

• with data size:
increase number of LVC nodes
• Number of Clients
increase subscription processor nodes.
• TP/S
scale up processor nodes and LVC nodes

Of cause the system relies heavily on availability of a “real” multicast messaging bus system. Any point to point oriented or broker-oriented networking/messaging will be a massive bottleneck.

### Conclusion

Building real time processing software backed by a continuous query system simplifies application development a lot.
• Its model-view-controller at large scale.
Astonishing: patterns used in GUI applications for decades have not been extended regulary to the backing data storage systems.
• Any server side processing can be partitioned in a natural way. A processor node creates an in-memory mirror of its data partition using continuous queries. Processing results are streamed back to the data grid. Computing intensive jobs, e.g. risk computation of derivatives can be scaled by adding processor instances subscribing to distinct partitions of the data (“sharding”).
• The size of the Code Base reduces significantly (both business logic and Front-End).
A lot of code in handcrafted systems deals with keeping data up to date.

I am a technical architect/senior developer consultant at an european company involved heavily in stock & derivative trading systems.
Other blog posts of me can be found here.

## Java, the Steam controller and me

Do you ever wondered if you could use existing stuff for something new? I saw some footage of the so-called ‘Steam controller’ (SC from now on) and looked at my game pad. Asking myself if it would be possible to use it in a steamy-like way, I’ve found some Java libraries and created a project that I’d like to share with you today.

Of course, there have been a lot of input devices (and especially game controllers) long before the SC’s release, but it has one new property which makes it special.

It has two touchpads, which can emulate a mouse’s or keyboard’s input in order to be able to play (practically nearly) every game. As some early videos show, even a mouse-intensive game like the puzzle game ‘Portal’ seems to be playable by using this compatibility mode.

As a gamer enthusiast and Java programmer, what could I do with something like this (a XBOX controller which I already got) in order to come close to that?

A small tool named ‘StrangeCtrl’ saw the world’s bright light. Talking to the controller needs some JNI (because there is no USB subsystem in the JVM for example), but the rest is written in pure Java. It sits in the system tray and is configured manually per configuration file, although one could build a GUI, too.

Its dependencies are ‘net.java.jinput.JInput’ in version 2.0.5 (still working for Windows 8.1) and a little helper I wrote (‘com.xafero.SuperLoader’ v0.1). Now I’ll explain the steps taken on the way.

First step: How do we get Java to talk to my controller?

Luckily, the BSD-licensed JInput project does exactly this. It connects to Microsoft’s XInput interface for example and fills some Java data structures with the native data it gets. Linux and Mac OS X are covered, too, don’t worry.

So I plugged in my game pad (one XBOX-compatible controller) and the way seemed to be clear:

1. get the controllers
2. get their input events
3. and convert them to virtual events for keyboard and mouse.

The library’s native components for the big three OS are delivered in Java archives (at least per Maven). But as you might already know, java.lang.System only loads files directly available on the file system.

Second step: So how do we get around this annoying limitation?

After a quick search, I’ve found wcmatthysen’s ‘mx-native-loader’ which seemed useful as it claims to extract JAR’s and load the native stuff. But it didn’t work, because the libraries of JInput are packed into several ‘jinput-platform-***.jar’ files instead of one big chunk under META-INF/lib as this loader suggests.

So the new helper library called ‘SuperLoader’ works around these circumstances:

1. Create a temporary directory for all nasty native libraries, e.g. with the help of the system property ‘java.io.tmpdir’. It could also be specified by the user directly as it doesn’t really matter where it is.
2. Get all nasty libraries out of the JARs already loaded; iterate over all class path’s URLs and extract them or exclude most of them by using a filter.
3. Extend the existing library path; one thing the other library didn’t do and it’s very annoying to do it manually, so the system property ‘java.library.path’ should be extended.
4. Force the JVM to renew the system paths; one can do it by resetting the field ‘sys_paths’ of the system class loader to null. This forces the System class to really appreciate the new circumstances the next time you request a library.
Now the application pre-loads all native libraries into a temporary folder and when JInput is asked to give a list of the controllers, for example, it doesn’t have to be changed for using JAR files. It simply is able to use System.loadLibrary as anyone would do.
Third step: What’s possible to simulate?
We’ve finally got to read the game pad’s events, so what can we do with it? With AWT’s Robot class, it’s possible since the early Java days to simulate a key press or mouse movements and such. Although the robot requires one to specify the desktop it should work on, it works just as good on multi-monitor systems. The only difference is the offset of all events it generates – an aspect especially important if one wants to click on specific regions of the PC’s screen.
The implemented commands so far are:

• MouseMoveCmd – moves the mouse by some amount horizontally or vertically
• MouseClickCmd – clicks the given mouse button at the current screen position
• KeyComboCmd – presses some keys and releases them in the reversed order

To allow some bit of extensibility, there is an interface which accepts the robot to generate virtual events, the current graphics device and the value given by JInput:

public interface ICommand {
void execute(Robot rbt, GraphicsDevice dev, float value);
}

Its abstract implementation ‘AbstractCmd’ provides one constructor accepting one string. As a first step of processing, the raw string coming from the configuration file is splitted by an empty space into a string array.

Fourth step: Which configuration format can we use?

There are a lot of trendy formats out there, like YAML, JSON, … But Java already provides us with a simple way to achieve this. So the configuration file is parsed with the XML variant of the Java properties mechanism. To build the actual map out of strings connected with their commands, the class ‘com.xafero.strangectrl.cmd.ConfigUtils’

• iterates through all entries,
• searches a command by each entry’s value,
• loads each command by instantiating it with the textual arguments,
• puts the result of key (controller button) and value (associated command) into a new map,
• and produces the actual map used to transform incoming events.
Fifth step: The actual work…
The helper class ‘ControllerPoller’ is a periodically executed TimerTask who is responsible for collecting new JInput events from an arbitrary amount of controllers and inform the caller about every new stuff:

public void run() {
for (Controller controller : controllers) {
if (!controller.poll()) continue;
EventQueue queue = controller.getEventQueue();
Event event = new Event();
while (queue.getNextEvent(event))
callback.onNewEvent(this, controller, event);
}
}

The caller (in this case the so-called ‘App’ living in the system tray) just implements the callback interface and gets all information for free whenever some input occurs:

public static interface IControllerCallback {
void onNewEvent(ControllerPoller p, Controller c, Event e);
}

Left to the ‘App’ is the search for associated commands to the incoming game pad’s events and their execution with the correct parameters. Now we could use it for controlling some game, perhaps an old one like Prince of Persia or something otherwise unplayable with a game pad. But let’s step aside…

Example aside from games: How to configure it for people with constrained movement?
To show just another possible field of application, let’s configure it for someone who can’t press two keys at the same time. An example application should here be a web browser. In the configuration file, there are the following settings:

<!– Button A means now left mouse click –>
<entry key=”Button 0″>mouseClick 1</entry>
<!– Button B will open a new tab –>
<entry key=”Button 1″>keyCombo CONTROL T</entry>
<!– Button X will close an existing tab –>
<entry key=”Button 2″>keyCombo CONTROL W</entry>

The browser in this example doesn’t have to know the game controller, because the operating system will produce new virtual input events, and it will operate as demanded. By using Java and being FOSS, the tool is also customizable and easily understandable in every way (in contrast to some C/C++ code which would be otherwise necessary to emulate input devices).
The source code is available at https://github.com/xafero/StrangeCtrl.
Feel free to use, share or modify any aspect (licensed under GPL v3).

For further information see:

## (Part 2 of 3): Synopsis of articles & videos on Performance tuning, JVM, GC in Java, Mechanical Sympathy, et al

This is a continuation of the previous post titled (Part 1 of 3): Synopsis of articles & videos on Performance tuning, JVM, GC in Java, Mechanical Sympathy, et al.

Without any further ado, lets get started with our next set of blogs and videos, chop…chop…! This time its Martin Thompson’s blog posts and talks. Martin’s first post on Java Garbage collection distilled basically distils the GC process and the underlying components including throwing light on a number of interesting GC flags (-XX:…). In his next talk he does his myth busting shaabang about mechanical sympathy, what people correctly believe in and also some of the misconceptions they have bought into. In the talk on performance testing, Martin takes it further and fuses Java, OS and the hardware to show how understanding aspects of all these can help write better programs.

Java Garbage Collection Distilled by Martin Thompson

There are too many flags to allow tuning the GC to achieve the throughput and latency your application requires. There’s plenty of documentation on the specifics of the bells and whistles around them but none to guide you through them.

– The Tradeoffs – throughput (-XX:GCTimeRatio=99), latency (-XX:MaxGCPauseMillis=<n>) and memory (-Xmx<n>) are the key variables that the collectors depend upon. It is important to note that Hotspot often cannot achieve the above targets. If a low-latency application goes unresponsive for more than a few seconds it can spill disaster. Tradeoffs play out as they
* provide more memory to GC algorithms
* GC can be reduced by containing the live set and keeping heap size small
* frequency of pauses can be reduced by managing heap and generation sizes & controlling application’s object allocation rate
* frequency of large pauses can be reduced by running GC concurrently

GC algorithms are often optimised with the expectation that most objects live for a very short period of time, while relatively few live for very long. Experimentation has shown that generational garbage collectors support much better throughput than non-generational ones – hence used in server JVMs.

– Stop-The-World Events
For GC to occur it is necessary that all the threads of a running application must pause – garbage collectors do this by signalling the threads to stop when they come to a safe-point. Time to safe-point is an important consideration in low-latency applications and can be found using the ‑XX:+PrintGCApplicationStoppedTime flag in addition to the other GC flags. When a STW event occurs a system will undergo significant scheduling pressure as the threads resume when released from safe-points, hence less STWs makes an application more efficient.

– Heap Organisation in Hotspot
Java heap is divided in various regions, an object is created in Eden, and moved into the survivor spaces, and eventually into tenured. PermGen was used to store runtime objects such as classes and static strings. Collectors take help of Virtual spaces to meet throughput & latency targets, and adjusting the region sizes to reach the targets.

– Object Allocation
TLAB (Thread Local Allocation Buffer) is used to allocate objects in Java, which is cheaper than using malloc (takes 10 instructions on most platforms). Rate of minor collection is directly proportional to the rate of object allocation. Large objects (-XX:PretenureSizeThreshold=n) may have to be allocated in Old Gen, but if the threshold is set below the TLAB size, then they will not be created in old gen – (note) does not apply to the G1 collector.

– Minor Collections
Minor collection occurs when Eden becomes full, objects are promoted to the tenured space from Eden once they get old i.e. cross the threshold (-XX:MaxTenuringThreshold).  In minor collection, live reachable objects with known GC roots are copied to the survivor space. Hotspot maintains cross-generational references using a card table. Hence size of the old generation is also a factor in the cost of minor collections. Collection efficiency can be achieved by adjusting the size of Eden to the number of objects to be promoted. These are prone to STW and hence problematic in recent times.

– Major Collections
Major collections collect the old generation so that objects from the young gen can be promoted. Collectors track a fill threshold for the old generation and begin collection when the threshold is passed. To avoid promotion failure you will need to tune the padding that the old generation allows to accommodate promotions (-XX:PromotedPadding=<n>). Heap resizing can be avoided using the –Xms and -Xmx flags. Compaction of old gen causes one of the largest STW pauses an application can experience and directly proportion to the number of live objects in old gen. Tenure space can be filled up slower, by adjusting the survivor space sizes and tenuring threshold but this in turn can cause longer minor collection pause times as a result of increased copying costs between the survivor spaces.

– Serial Collector
It is the simplest collector with the smallest footprint (-XX:+UseSerialGC) and uses a single thread for both minor and major collections.

– Parallel Collector
Comes in two forms (-XX:+UseParallelGC) and (-XX:+UseParallelOldGC) and uses multiple threads for minor collections and a single thread for major collections – since Java 7u4 uses multiple threads for both type of collections. Parallel Old performs very well on a multi-processor system, suitable for batch applications. This collector can be helped by providing more memory, larger but fewer collection pauses. Weigh your bets between the Parallel Old and Concurrent collector depending on how much pause your application can withstand (expect 1 to 5 seconds pauses per GB of live data on modern hardware while old gen is compacted).

– Concurrent Mark Sweep (CMS) Collector
CMS (-XX:+UseConcMarkSweepGC) collector runs in the Old generation collecting tenured objects that are no longer reachable during a major collection. CMS is not a compacting collector causing fragmentation in Old gen over time. Promotion failure will trigger FullGC when a large object cannot fit in Old gen. CMS runs alongside your application taking CPU time. CMS can suffer “concurrent mode failures” when it fails to collect at a sufficient rate to keep up with promotion.

– Garbage First (G1) Collector
G1 (-XX:+UseG1GC) is a new collector introduced in Java 6 and now officially supported in Java 7. It is a generational collector with a partially concurrent collecting algorithm, compacts the Old gen with smaller incremental STW pauses. It divides the heap into fixed sized regions of variable purpose. G1 is target driven on latency (–XX:MaxGCPauseMillis=<n>, default value = 200ms). Collection on the humongous regions can be very costly. It uses “Remembered Sets” to keep track of references to objects from other regions. There is a lot of cost involved with book keeping and maintaining “Remembered Sets”. Similar to CMS, G1 can suffer from an evacuation failure (to-space overflow).

– Alternative Concurrent Collectors
Oracle JRockit Real Time, IBM Websphere Real Time, and Azul Zing are alternative concurrent collectors.  Zing according to the author is the only Java collector that strikes a balance between collection, compaction, and maintains a high-throughput rate for all generations. Zing is concurrent for all phases including during minor collections, irrespective of heap size. For all the concurrent collectors targeting latency you have to give up throughput and gain footprint. Budget for heap size at least 2 to 3 times the live set for efficient operation.

– Garbage Collection Monitoring & Tuning
Important flags to always have enabled to collect optimum GC details:
`-verbose:gc-Xloggc:<filename>-XX:+PrintGCDetails-XX:+PrintGCDateStamps-XX:+PrintTenuringDistribution-XX:+PrintGCApplicationConcurrentTime -XX:+PrintGCApplicationStoppedTime`

Use applications like GCViewer, JVisualVM (with Visual GC plugin) to study the behaviour of your application as a result of GC actions. Run representative load tests that can be executed repeatedly (as you gain knowledge of the various collectors), keep experimenting with different configuration until you reach your throughput and latency targets. jHiccup helps track pauses within the JVM.  As known to us it’s a difficult challenge to strike a balance between latency requirements, high object allocation and promotion rates combined, and that sometimes choosing a commercial solution to achieve this might be a more sensible idea.

Conclusion: GC is a big subject by itself and has a number of components, some of them are constantly replaced and it’s important to know what each one stands for. The GC flags are as much important as the component to which they are related and it’s important to know them and how to use them. Enabling some standard GC flags to record GC logs does not have any significant impact on the performance of the JVM. Using third-party freeware or commercial tools help as long as you follow the authors methodology.

— Highly reading the article multiple times, as Martin has covered lots of details about GC and the collectors which requires close inspection and good understanding.  —

Martin Thompson’s:  Mythbusting modern hardware to gain “Mechanical Sympathy” Video * Slides

He classifies myth into three categories – possible, plausible and busted! In order to get the best out of the hardware you own, you need TO KNOW your hardware. Make tradeoffs as you go along, changing knobs ;), it’s not as scary as you may think.

Good question: do normal developers understand the hardware they program on? Or cannot understand what’s going on? Or do we have the discipline and make the effort to understand the platform we work with?

Myth 1CPUs are not getting faster – clock speed isn’t everything, Sandy bridge architecture is the faster breed. 6 ports to support parallelism (6 ops/cycle). Haswell has 8 ports! Code doing division operations perform slower than any other arithmetic operations. CPUs have both front-end and back-end cycles. It’s getting faster as we are feeding them faster! – PLAUSIBLE

Myth 2 – Memory provides us random access – CPU registers and buffers, internal caches (L1, L2, L3) and memory – mentioned in the order in of speed of access to these areas respectively. Manufacturers have been doing things to CPUs to bring down its operational temperature by performing direct access operations. Writes are less hassle than reads – buffer misses are costly. L1 is organised into cache-lines containing code that the processor will execute – efficiency is achieved by not having cache-line misses. Pre-fetchers help reduce latency and help during reading streaming and predictable data. TLB misses can also cost efficiency (size of TLB = 4K = size of memory page). In short reading memory isn’t anywhere close to reading it randomly but SEQUENTIALLY due to the way the underlying hardware works. Writing highly branched code can cause slow down in execution of the program – keep things together that is, cohesive is the key to efficiency. – BUSTED
Note: TLAB and TLB are two different concepts, google to find out the difference!

Myth 3 – HDD provides random access – spinning disc and an arm moves about reading data. More sectors are placed in the outer tracks than the inner tracks (zone bit recording). Spinning the discs faster isn’t the way to increase HDD performance. 4K is the minimum you can read or write at a time. Seek time in the best disc is 3-6 ms, laptop drives are slower (15 ms). Data transfers take 100-220 Mbytes/sec. Adding a cache can improve writing data into the disc, not so much for reading data from disks. – BUSTED

Myth 4 – SSDs provides random access – Reads and writes are great, works very fast (4K at a time). Delete is not like the real deletes, it’s marked deleted and not really deleted – as you can’t erase at  high resolution hence a whole block needs to be erased at a time (hence marked deleted). All this can cause fragmentation, and GC and compaction is required. Reads are smooth, writes are hindered by fragmentation, GC, compaction, etc…, also to be ware of write-amplification. A few disadvantages when using SSD but overall quite performant. – PLAUSIBLE

Can we understand all of this and write better code?

Conclusion: do not take everything in the space for granted just because it’s on the tin, examine, inspect and investigate for yourself the internals where possible before considering it to be possible, or plausible – in order to write good code and take advantage of these features.

— Great talk and good coverage of the mechanical sympathy topic with some good humour, watch the video for performance statistics gathered on each of the above hardware components  —

“Performance Testing Java Applications” by Martin Thompson

“How to use a profiler?” or “How to use a debugger?”
What is Performance? Could mean two things like throughput or bandwidth (how much can you get through) and latency (how quickly the system responds).

Response time changes as we apply more load on the system. Before we design any system, we need to gather performance requirements i.e. what is the throughput of the system or how fast you want the system to respond (latency)? Does your system scale economically with your business?

Developer time is expensive, hardware is cheap! For anything, you need a transaction budget (with a good break-down of the different components and processes the system is going to go through or goes through).

How can we do performance testing? Apply load to a system and see if the throughput goes up or down? And what is the response time of the system as we apply load. Stress testing is different from load testing (see Load testing) , stress testing (see Stress testing) is a point where things break (collapse of a system), an ideal system will continue with a flat line. Also it’s important to perform load testing not just from one point but from multiple points and concurrently. Most importantly high duration testing is very important – which bring a lot of anomalies to the surface i.e. memory leaks, etc…
Building up a test suite is a good idea, a suite made up of smaller parts. We need to know the basic building blocks of the system we use and what we can get out of it. Do we know the different threshold points of our systems and how much its components can handle? Very important to know the algorithms we use, know how to measure them and use it accordingly.
When should we test performance?
“Premature optimisation is the root of all evil” – Donald Knuth / Tony Hoare

What does optimisation mean? Knowing and choosing your data and working around it for performance. New development practices: we need to test early and often!

From a Performance perspective, “test first” practice is very important, and then design the system gradually as changes can cost a lot in the future.

RedGreenDebugProfileRefactor, a new way of “test first” performance methodology as opposed to RedGreenRefactor methodology only! Earlier and shorter feedback cycle is better than finding something out in the future.

Use “like live” pairing stations, Mac is a bad example to work on if you are working in the Performance space – a Linux machine is a better option.

Performance tests can fail the build – and it should fail a build in your CI system! What should a micro benchmark look like (i.e. calliper)? Division operations in your code can be very expensive, instead use a mask operator!

What about system performance tests? Should we be able to test big and small clients with different ranges. It’s fun to know deep-dive details of the system you work on. A business problem is the core and the most important one to solve and NOT to discuss on what frameworks to use to build it. Please do not use Java serialisation as it is not designed for on the-wire-protocol! Measure performance of a system using a observer rather than measure it from inside the system only.
Performance Testing Lessons – lots of technical stuff and lots of cultural stuff. Technical Lessons – learn how to measure, check out histograms! Do not sample a system, we miss out when things when the system go strange, outliers, etc… – histograms help! Knowing what is going on around the areas where the system takes a long time is important! Capturing time from the OS is a very important as well.
With time you get – accuracy, precision and resolution, and most people mix all of these up. On machines with dual sockets, time might not be synchronised. Quality of the time information is very dependent on the OS you are using. Time might be an issue on virtualised systems, or between two machines. This issue can be resolved, do round-trip times between two systems (note the start and stop clock times) and half them to get a more accurate time. Time can go backwards on you on certain OSes (possibly due to NTP) – instead use monotonic time.
Know your system, its underlying components – get the metrics and study them ! Use a Linux tool like perstat, will give lots of performance and statistics related information on your CPU and OS – branch predictions and cache-misses!

RDTSC is not an ordered-instructions execution system, x86 is an ordered instruction systems and operations do not occur in a unordered fashion.

Theory of constraints! – Always start with number 1 problem on the list, the one that takes most of the time – the bottleneck of your system, the remaining sequence of issues might be dependent on the number 1 issue and are not separate issues!

Trying to create a performance team is an anti-pattern – make the experts help bring the skills out to the rest of the team, and stretch-up their abilities!

Beware of YAGNI – about doing performance tests – smell of excuse!
Commit builds > 3.40 mins = worrying, same for acceptance test build > 15 mins = lowers team confidence.
Test environment should equal to production environment! Easy to get exactly similar hardware these days!
Conclusion: Start with a “test first” performance testing approach when writing applications that are low latency dependent. Know your targets and work towards it. Know your underlying systems all the way from hardware to the development environment. Its not just technical things that matter, cultural things matter as much, when it comes to performance testing. Share and spread the knowledge across the team rather than isolating it to one or two people i.e. so called experts in the team. Its everyone’s responsibility not just a few seniors in the team. Learn more about time across various hardware and operating systems, and between systems.

As it is not practical to review all such videos and articles, a number of them have been provided in the links below for further study. In many cases I have paraphrased or directly quoted what the authors have to say to preserve the message and meaning they wished to convey. A follow-on to this blog post will appear in the same space under the title (Part 3 of 3): Synopsis of articles & videos on Performance tuning, JVM, GC in Java, Mechanical Sympathy, et al on 23rd Dec 2013.

## Type-safing the Observer with mutually recursive bounds

The Observer is known as a behavioral pattern, as it’s used to form relationships between objects at run-time.  The definition provided in the original Gang of Four book on Design Patterns states:

Defines a one-to-many dependency between objects so that when one object changes state, all its dependents are notified and updated automatically.

The Java library implements a non-generic version of the Subject-Observer pattern in the package java.util with the class Observable and the interface Observer. The Observable class contains methods to register observers (addObserver), to indicate that the observable has changed (setChanged), and to notify all observers of any changes (notifyObservers), among others. The notifyObservers method may accept an arbitrary argument of type Object that is to be broadcast to all the observers. The Observer interface specifies the update method that is called by notifyObservers. This method takes two parameters: the first one, of type Observable, is the subject that has changed; the second, of type Object, is the broadcast argument.
We could, of course, create our own classes and not use those provided by the library, but isn’t it wonderful when someone else takes care of maintaining our code? So we’re going to use the provided library for this example.

Let’s see an example. Suppose we have a family with 4 members, mom, dad, a baby and a dog. When the little baby cries, mom changes his diaper, so we have the class Baby as an Observable and the class Mother as an Observer of the Baby:

### public class Baby extends Observable {    private String name;    public Baby(String name) {     this.name = name;    }    public void babyIsCrying() {     setChanged();     notifyObservers(name);    } }

public class Mother implements Observer {
@Override
public void update(Observable subject, Object param) {
String babyName = (String) param;
System.out.println(“Let’s change that diaper on “ + babyName + “!”);
}
}

The father takes the dog for a walk when it barks, so the class Dog is another Observable and the class Father its Observer:
public class Dog extends Observable {
private String name;
public Dog(String name) {
this.name = name;
}
public void barks() {
setChanged();
notifyObservers(name);
}
}
public class Father implements Observer {
@Override
public void update(Observable o, Object arg) {
String dogName = (String) arg;
System.out.println(dogName + “, let’s go to a walk!”);
}
}

At a first glance, everything seems to be working ok, but What if somebody understands wrongly the relationships and adds the Mother as an Observer of the Dog(the compiler wouldn’t complain about it and the runtime will sillently change the diaper on the dog and wouldn’t even notify us about the horrible mistake we did).
Let’s test it:
public class TestingLegacy {
public static void main(String[] args) {
Baby david = new Baby(“David”);
Mother mom = new Mother();

Dog rover = new Dog(“Rover”);
Father john = new Father();

// test mother-baby relationship
david.babyIsCrying();
// test father-dog relationship
rover.barks();

// delete observers to test wrong relatinships
david.deleteObservers();
rover.deleteObservers();

System.out.println(“Add a wrong relationship and test.”);

// add the father as the baby’s observer
// add the mother as the dog’s observer

david.babyIsCrying();
rover.barks();
}
}

The console outputs:
Let’s change that diaper on David!
Rover, let’s go to a walk!
Add a wrong relationship and test.
David, let’s go to a walk!
Let’s change that diaper on Rover!

To ensure that a Subject-Observer relationship is well established, we should use one of the nicest Java 5 feature: generics. They add stability to the code by making more of your bugs detectable at compile time. But there is a small problem, because Observable and Observer are part of the java docs, we can’t and shouldn’t change them. So to be able to add bounds, we will create stub classes with generic signatures but no bodies. We compile the generic client against the generic signatures, but run the code against the legacy class files. This technique is appropriate when the source is not released, or when others are responsible for maintaining the source.
So we can replace Observable and Observer with the type parameters S and O (for Subject and Observer). Then within the update method of the observer, you may call on any method supported by the subject S without first requiring a cast. Also, The appearance of Object in a method signature often indicates an opportunity to generify. So we should add a type parameter, A, corresponding to the argument type.
Here is our first version of stubs with two generic params:
public class Observable<O extends Observer<?, A>, A> {..}
public interface Observer<S extends Observable<?, A>, A> {…}

Now when creating a Mother-Baby relationship things work just fine. What about Mother-Dog? the compiler let’s us do the link, the runtime however throws a ClassCastExeption for this time.
class WrongBoundedBaby extends Observable<WrongBoundedMother, String>
class WrongBoundedMother implements Observer<Dog, String>

Exception in thread “main” java.lang.ClassCastException: bounds.WrongBoundedBaby cannot be cast to bounds.Dog
at bounds.WrongBoundedMother.update(WrongBoundedMother.java:1)
at java.util.Observable.notifyObservers(Observable.java:142)
at bounds.WrongBoundedBaby.babyIsCrying(WrongBoundedBaby.java:28)
at bounds.TestingLegacy.main(TestingLegacy.java:38)

So, not good enough, as the bug won’t be detected until run-time. We need S to be in scope in Observable so that it can be passed as a parameter to Observer, and we need O to be in scope in Observer so that it can be passed as a parameter to Observable.
class Observable<S extends Observable<S, O, A>, O extends Observer<S, O, A>, A>
interface Observer<S extends Observable<S, O, A>, O extends Observer<S, O, A>, A>

This is the last version of our classes:
class Baby extends Observable<Baby, Mother, String>
class Mother implements Observer<Baby, Mother, String>
class Dog extends Observable<Dog, Father, String>
class Father implements Observer<Dog, Father, String>

// COMPILE- error when adding the father as the baby’s observer
// COMPILE- error when adding the mother as the dog’s observer

The Java Generics and Collections book defines this type of bound as mutually recursive.

In conclusion, the Observer is a great design pattern with many implementations, but a junior dev can be easily tricked to use it wrongly. Adding the mutually recursive bounds helps us idetifying the problem at compile time and sleeping good at night when knowing that we did a great job.

For full source code, please see https://github.com/CrisIf/generics .

## Introduction To JUnit Theories

Have you ever read a mathematical theory?

It typically reads something like this:

For all a, b > 0  the following is true: a+b > a and a+b > b

Just typically the statements are more difficult to understand.

There is something interesting about this kind of statement: It holds for EVERY element (or combination of elements) of a rather large (infinite in this case) set.

Compare that to the statement a typical test makes:

`` @Test public void a_plus_b_is_greater_than_a_and_greater_than_b(){   int a = 2;   int b = 3;   assertTrue(a + b > a);   assertTrue(a + b > b); }``

This is just a statement about a single element of the large set we talked about. Not very impressive. Of course we can fix that somewhat by looping over the test (or using parameterized tests):

`` @Test public void a_plus_b_is_greater_than_a_and_greater_than_b_multiple_values() {    List<Integer> values = Arrays.asList(1, 2, 300, 400000);    for (Integer a : values)      for (Integer b : values) {         assertTrue(a + b > a);         assertTrue(a + b > b);      }    }``

Of course this still only tests a few values, but it also became pretty ugly. We are using 9 lines of code to  test what a mathematician writes in a single line! And the main point that this relation ship should hold for any value a,b is completely lost in translation.

But there is hope: JUnit Theories. Let’s see how the test looks like with that nifty tool:

``import org.junit.experimental.theories.DataPoints;import org.junit.experimental.theories.Theories;import org.junit.experimental.theories.Theory;import org.junit.runner.RunWith;import static org.junit.Assert.assertTrue;@RunWith(Theories.class)public class AdditionWithTheoriesTest {  @DataPoints  public static int[] positiveIntegers() {       return new int[]{                        1, 10, 1234567};  }  @Theory  public void a_plus_b_is_greater_than_a_and_greater_than_b(Integer a, Integer b) {      assertTrue(a + b > a);      assertTrue(a + b > b);  }}``

With JUnit Theories the test gets split in two separate parts: a method providing data points i.e. values to be used for tests, and the theory itself. The theory looks almost like a test, but it has a different annotation (@Theory) and it takes parameters. The theories in a class get executed with every possible combination of data points.

This means that if we have more then one theory about our test subject we only have to declare the data points once. So let’s add the following theory, which should be true for addition: a + b = b + a So we add the following theory to our class
@Theory public void addition_is_commutative(Integer a, Integer b) { assertTrue(a + b == b + a); }
This works like a charm and one can start to see that this actually saves some code as well, because we don’t duplicate the data points. But we only test with positive integers, while the commutative property should hold for all integers! Of course our first theory still only holds for positive numbers

There is a solution for this as well: Assume. With assume you can check precondition for your theory. If it isn’t true for a given parameter set, the theory gets skipped for that parameter set. So our test now looks like this:

`` @RunWith(Theories.class) public class AdditionWithTheoriesTest {  @DataPoints  public static int[] integers() {     return new int[]{                   -1, -10, -1234567,1, 10, 1234567};  }  @Theory  public void a_plus_b_is_greater_than_a_and_greater_than_b(Integer a, Integer b) {     Assume.assumeTrue(a >0 && b > 0 );     assertTrue(a + b > a);     assertTrue(a + b > b);  }  @Theory  public void addition_is_commutative(Integer a, Integer b) {     assertTrue(a + b == b + a);  }}``

This makes the tests nicely expressive.

The separation of test data from test/theory implementation can have another positive effect apart from brevity: You might start to think about you test data independent of the actual stuff to test.

Lets do just that. If you want to test a method that takes an integer argument, what integers would be likely to cause problems? This is my proposal:

``  @DataPoints  public static int[] integers() {     return new int[]{                     0, -1, -10, -1234567,1, 10, 1234567, Integer.MAX_VALUE, Integer.MIN_VALUE};}``

Which of course causes a test failure in our example. If you add a positive integer to Integer.MAX_VALUE you get an overflow! So we just learned that our theory in its current form is wrong! Yeah I know this is obvious, but have a look at the tests in your current project. Do all the tests that use Integers test with MIN_VALUE, MAX_VALUE, 0, a positive and a negative value? Yeah, thought so.

What about more complex objects? Strings? Dates? Collections? Or domain objects? With JUnit Theories you can setup test data generators once that create all the scenarios that are prone to create problems and then reuse those in all your tests using theories. It will make your tests more expressive and improve the probability of finding bugs.

I must confess the title of this post is a little bit catchy. I have recently read this blog post and this is a good summary of  discussions & debates about this subject.
But this time I would like to try a different approach to compare those 2 well known data structures: using Hardware Performance Counters.

I will not perform a micro-benchmark, well not directly. I will not time using System.nanoTime(), but rather using HPCs like cache hits/misses.

No need to present those data structures, everybody knows what they are using for and how they are implemented. I am focusing my study on list iteration because, beside adding an element, this is the most common task for a list. And also because the memory access pattern for a list is a good example of CPU cache interaction.

Here my code for measuring list iteration for LinkedList & ArrayList:

``import java.util.ArrayList;import java.util.LinkedList;import java.util.List;import ch.usi.overseer.OverHpc;public class ListIteration{    private static List<String> arrayList = new ArrayList<>();    private static List<String> linkedList = new LinkedList<>();    public static void initializeList(List<String> list, int bufferSize)    {        for (int i = 0; i < 50000; i++)        {            byte[] buffer = null;            if (bufferSize > 0)            {                buffer = new byte[bufferSize];            }            String s = String.valueOf(i);            list.add(s);            // avoid buffer to be optimized away            if (System.currentTimeMillis() == 0)            {                System.out.println(buffer);            }        }    }    public static void bench(List<String> list)    {        if (list.contains("bar"))        {            System.out.println("bar found");        }    }    public static void main(String[] args) throws Exception    {        if (args.length != 2) return;        List<String> benchList = "array".equals(args[0]) ? arrayList : linkedList;        int bufferSize = Integer.parseInt(args[1]);        initializeList(benchList, bufferSize);        HWCounters.init();        System.out.println("init done");        // warmup        for (int i = 0; i < 10000; i++)        {            bench(benchList);        }        Thread.sleep(1000);        System.out.println("warmup done");        HWCounters.start();        for (int i = 0; i < 1000; i++)        {            bench(benchList);        }        HWCounters.stop();        HWCounters.printResults();        HWCounters.shutdown();    }}``

To measure, I am using a class called HWCounters based on overseer library to get Hardware Performance Counters. You can find the code of this class here.

The program take 2 parameters: the first one to choose between ArrayList implementation or LinkedList one, the second one to take a buffer size used in initializeList method. This method fills a list implementation with 50K strings. Each string is newly created just before add it to the list. We may also allocate a buffer based on our second parameter of the program. if 0, no buffer is allocated.
bench method performs a search of a constant string which is not contained into the list, so we fully traverse the list.
Finally, main method, perform initialization of the list, warmups the bench method and measure 1000 runs of this method. Then, we print results from HPCs.

Let’s run our program with no buffer allocation on Linux with 2 Xeon X5680:

`[[email protected]/* <![CDATA[ */!function(t,e,r,n,c,a,p){try{t=document.currentScript||function(){for(t=document.getElementsByTagName('script'),e=t.length;e--;)if(t[e].getAttribute('data-cfhash'))return t[e]}();if(t&&(c=t.previousSibling)){p=t.parentNode;if(a=c.getAttribute('data-cfemail')){for(e='',r='0x'+a.substr(0,2)|0,n=2;a.length-n;n+=2)e+='%'+('0'+('0x'+a.substr(n,2)^r).toString(16)).slice(-2);p.replaceChild(document.createTextNode(decodeURIComponent(e)),c)}p.removeChild(t)}}catch(u){}}()/* ]]> */]# java -cp .:overseer.jar com.ullink.perf.myths.ListIteration array 0init donewarmup doneCycles: 428,711,720Instructions: 776,215,597L2 hits: 5,302,792L2 misses: 23,702,079LLC hits: 42,933,789LLC misses: 73CPU migrations: 0Local DRAM: 0Remote DRAM: 0`

`[[email protected]/* <![CDATA[ */!function(t,e,r,n,c,a,p){try{t=document.currentScript||function(){for(t=document.getElementsByTagName('script'),e=t.length;e--;)if(t[e].getAttribute('data-cfhash'))return t[e]}();if(t&&(c=t.previousSibling)){p=t.parentNode;if(a=c.getAttribute('data-cfemail')){for(e='',r='0x'+a.substr(0,2)|0,n=2;a.length-n;n+=2)e+='%'+('0'+('0x'+a.substr(n,2)^r).toString(16)).slice(-2);p.replaceChild(document.createTextNode(decodeURIComponent(e)),c)}p.removeChild(t)}}catch(u){}}()/* ]]> */]# /java -cp .:overseer.jar com.ullink.perf.myths.ListIteration linked 0init donewarmup doneCycles: 767,019,336Instructions: 874,081,196L2 hits: 61,489,499L2 misses: 2,499,227LLC hits: 3,788,468LLC misses: 0CPU migrations: 0Local DRAM: 0Remote DRAM: 0`

First run is on the ArrayList implementation, second with LinkedList.

• Number of cycles is the number of CPU cycle spent on executing our code. Clearly LinkedList spent much more cycles than ArrayList.
• Instructions is little higher for LinkedList. But it is not so significant here.
• For L2 cache accesses we have a clear difference: ArrayList has significant more L2 misses compared to LinkedList.
• Mechanically, LLC hits are very important for ArrayList.

The conclusion on this comparison is that most of the data accessed during list iteration is located into L2 for LinkedList but into L3 for ArrayList.
My explanation for this is that strings added to the list are created right before. For LinkedList it means that it is local the Node entry that is created when adding the element. We have more locality with the node.

But let’s re-run the comparison with intermediary buffer allocated for each new String added.

`[[email protected]/* <![CDATA[ */!function(t,e,r,n,c,a,p){try{t=document.currentScript||function(){for(t=document.getElementsByTagName('script'),e=t.length;e--;)if(t[e].getAttribute('data-cfhash'))return t[e]}();if(t&&(c=t.previousSibling)){p=t.parentNode;if(a=c.getAttribute('data-cfemail')){for(e='',r='0x'+a.substr(0,2)|0,n=2;a.length-n;n+=2)e+='%'+('0'+('0x'+a.substr(n,2)^r).toString(16)).slice(-2);p.replaceChild(document.createTextNode(decodeURIComponent(e)),c)}p.removeChild(t)}}catch(u){}}()/* ]]> */]# java -cp .:overseer.jar com.ullink.perf.myths.ListIteration array 256init donewarmup doneCycles: 584,965,201Instructions: 774,373,285L2 hits: 952,193L2 misses: 62,840,804LLC hits: 63,126,049LLC misses: 4,416CPU migrations: 0Local DRAM: 824Remote DRAM: 0`

`[[email protected]/* <![CDATA[ */!function(t,e,r,n,c,a,p){try{t=document.currentScript||function(){for(t=document.getElementsByTagName('script'),e=t.length;e--;)if(t[e].getAttribute('data-cfhash'))return t[e]}();if(t&&(c=t.previousSibling)){p=t.parentNode;if(a=c.getAttribute('data-cfemail')){for(e='',r='0x'+a.substr(0,2)|0,n=2;a.length-n;n+=2)e+='%'+('0'+('0x'+a.substr(n,2)^r).toString(16)).slice(-2);p.replaceChild(document.createTextNode(decodeURIComponent(e)),c)}p.removeChild(t)}}catch(u){}}()/* ]]> */]# java -cp .:overseer.jar com.ullink.perf.myths.ListIteration linked 256init donewarmup doneCycles: 5,289,317,879Instructions: 874,350,022L2 hits: 1,487,037L2 misses: 75,500,984LLC hits: 81,881,688LLC misses: 5,826,435CPU migrations: 0Local DRAM: 1,645,436Remote DRAM: 1,042`

Here the results are quite different:

• Cycles are 10 times more important.
• Instructions stay the same as previously
• For cache accesses, ArrayList have more L2 misses/LLC hits, than previous run, but still in the same magnitude order. LinkedList on the contrary have a lot more L2 misses/LLC hits, but moreover a significant number of LLC misses/DRAM accesses. And the difference is here.

With the intermediary buffer, we are pushing away entries and strings, which generate more cache misses and the end also DRAM accesses which is much more slower than hitting caches.
ArrayList is more predictable here since we keep locality of element from each other.

The memory access pattern here is crucial for list iteration performance. ArrayList is more stable than LinkedList in the way that whatever you are doing between each element adding, you are keeping your data  much more local than the LinkedList.
Remember also that, iterating through an array is much more efficient for CPU since it can trigger Hardware Prefetching because access pattern is very predictable.