Monday, December 22, 2014

A persistent KeyValue Server in 40 lines and a sad fact

Advent time again .. picking up Peters well written overview on the uses of Unsafe, i'll have a short fly-by on how low level techniques in Java can save development effort by enabling a higher level of abstraction or allow for Java performance levels probably unknown to many.

My major point is to show that conversion of Objects to bytes and vice versa is an important fundamental, affecting virtually any modern java application.

Hardware enjoys to process streams of bytes, not object graphs connected by pointers as "All memory is tape" (M.Thompson if I remember correctly ..).

Many basic technologies are therefore hard to use with vanilla Java heap objects:
  • Memory Mapped Files - a great and simple technology to persist application data safe, fast & easy.
  • Network communication is based on sending packets of bytes
  • Interprocess communication (shared memory)
  • Large main memory of today's servers (64GB to 256GB). (GC issues)
  • CPU caches work best on data stored as a continuous stream of bytes in memory
so use of the Unsafe class in most cases boil down in helping to transform a java object graph into a continuous memory region and vice versa either using
  • [performance enhanced] object serialization or
  • wrapper classes to ease access to data stored in a continuous memory region.
(Code & examples of this post can be found here)

Serialization based Off-Heap

Consider a retail WebApplication where there might be millions of registered users. We are actually not interested in representing data in a relational database as all needed is a quick retrieve of user related data once he logs in. Additionally one would like to traverse the social graph quickly.

Let's take a simple user class holding some attributes and a list of 'friends' making up a social graph.

easiest way to store this on heap, is a simple huge HashMap.
Alternatively one can use off heap maps to store large amounts of data. An off heap map stores its keys and values inside the native heap, so garbage collection does not need to track this memory. In addition, native heap can be told to automagically get synchronized to disk (memory mapped files). This even works in case your application crashes, as the OS manages write back of changed memory regions.

There are some open source off heap map implementations out there with various feature sets (e.g. ChronicleMap), for this example I'll use a plain and simple implementation featuring fast iteration (optional full scan search) and ease of use.

Serialization is used to store objects, deserialization is used in order to pull them to the java heap again. Pleasantly I have written the (afaik) fastest fully JDK compliant object serialization on the planet, so I'll make use of that.

  • persistence by memory mapping a file (map will reload upon creation). 
  • Java Heap still empty to serve real application processing with Full GC < 100ms. 
  • Significantly less overall memory consumption. A user record serialized is ~60 bytes, so in theory 300 million records fit into 180GB of server memory. No need to raise the big data flag and run 4096 hadoop nodes on AWS ;).

Comparing a regular in-memory java HashMap and a fast-serialization based persistent off heap map holding 15 millions user records, will show following results (on a 3Ghz older XEON 2x6):

consumed Java Heap (MB)Full GC (s)Native Heap (MB)get/put ops per srequired VM size (MB)
OffheapMap (Serialization based)

[test source / blog project] Note: You'll need at least 16GB of RAM to execute them.

As one can see, even with fast serialization there is a heavy penalty (~factor 5) in access performance, anyway: compared to other persistence alternatives, its still superior (1-3 microseconds per "get" operation, "put()" very similar).

Use of JDK serialization would perform at least 5 to 10 times slower (direct comparison below) and therefore render this approach useless.

Trading performance gains against higher level of abstraction: "Serverize me"

A single server won't be able to serve (hundreds of) thousands of users, so we somehow need to share data amongst processes, even better: across machines.

Using a fast implementation, its possible to generously use (fast-) serialization for over-the-network messaging. Again: if this would run like 5 to 10 times slower, it just wouldn't be viable. Alternative approaches require an order of magnitude more work to achieve similar results.

By wrapping the persistent off heap hash map by an Actor implementation (async ftw!), some lines of code make up a persistent KeyValue server with a TCP-based and a HTTP interface (uses kontraktor actors). Of course the Actor can still be used in-process if one decides so later on.

Now that's a micro service. Given it lacks any attempt of optimization and is single threaded, its reasonably fast [same XEON machine as above]:
  • 280_000 successful remote lookups per second 
  • 800_000 in case of fail lookups (key not found)
  • serialization based TCP interface (1 liner)
  • a stringy webservice for the REST-of-us (1 liner).
[source: KVServer, KVClient] Note: You'll need at least 16GB of RAM to execute the test.

A real world implementation might want to double performance by directly putting received serialized object byte[] into the map instead of encoding it twice (encode/decode once for transmission over wire, then decode/encode for offheaping map).

"RestActorServer.Publish(..);" is a one liner to also expose the KVActor as a webservice in addition to raw tcp:

C like performance using flyweight wrappers / structs

With serialization, regular Java Objects are transformed to a byte sequence. One can do the opposite: Create  wrapper classes which read data from fixed or computed positions of an underlying byte array or native memory address. (E.g. see this blog post).

By moving the base pointer its possible to access different records by just moving the the wrapper's offset. Copying such a "packed object" boils down to a memory copy. In addition, its pretty easy to write allocation free code this way. One downside is, that reading/writing single fields has a performance penalty compared to regular Java Objects. This can be made up for by using the Unsafe class.

"flyweight" wrapper classes can be implemented manually as shown in the blog post cited, however as code grows this starts getting unmaintainable.
Fast-serializaton provides a byproduct "struct emulation" supporting creation of flyweight wrapper classes from regular Java classes at runtime. Low level byte fiddling in application code can be avoided for the most part this way.

How a regular Java class can be mapped to flat memory (fst-structs):

Of course there are simpler tools out there to help reduce manual programming of encoding  (e.g. Slab) which might be more appropriate for many cases and use less "magic".

What kind of performance can be expected using the different approaches (sad fact incoming) ?

Lets take the following struct-class consisting of a price update and an embedded struct denoting a tradable instrument (e.g. stock) and encode it using various methods:

a 'struct' in code

Pure encoding performance:

Structsfast-Ser (no shared refs)fast-SerJDK Ser (no shared)JDK Ser

Real world test with messaging throughput:

In order to get a basic estimation of differences in a real application, i do an experiment how different encodings perform when used to send and receive messages at a high rate via reliable UDP messaging:

The Test:
A sender encodes messages as fast as possible and publishes them using reliable multicast, a subscriber receives and decodes them.

Structsfast-Ser (no shared refs)fast-SerJDK Ser (no shared)JDK Ser

(Tests done on I7/Win8, XEON/Linux scores slightly higher, msg size ~70 bytes for structs, ~60 bytes serialization).

Slowest compared to fastest: factor of 82. The test highlights an issue not covered by micro-benchmarking: Encoding and Decoding should perform similar, as factual throughput is determined by Min(Encoding performance, Decoding performance). For unknown reasons JDK serialization manages to encode the message tested like 500_000 times per second, decoding performance is only 80_000 per second so in the test the receiver gets dropped quickly:

***** Stats for receive rate:   80351   per second *********
***** Stats for receive rate:   78769   per second *********
SUB-ud4q has been dropped by PUB-9afs on service 1
fatal, could not keep up. exiting
(Creating backpressure here probably isn't the right way to address the issue ;-)  )

  • a fast serialization allows for a level of abstraction in distributed applications impossible if serialization implementation is either
    - too slow
    - incomplete. E.g. cannot handle any serializable object graph
    - requires manual coding/adaptions. (would put many restrictions on actor message types, Futures, Spore's, Maintenance nightmare)
  • Low Level utilities like Unsafe enable different representations of data resulting in extraordinary throughput or guaranteed latency boundaries (allocation free main path) for particular workloads. These are impossible to achieve by a large margin with JDK's public tool set.
  • In distributed systems, communication performance is of fundamental importance. Removing Unsafe is  not the biggest fish to fry looking at the numbers above .. JSON or XML won't fix this ;-).
  • While the HotSpot VM has reached an extraordinary level of performance and reliability, CPU is wasted in some parts of the JDK like there's no tomorrow. Given we are living in the age of distributed applications and data, moving stuff over the wire should be easy to achieve (not manually coded) and as fast as possible. 

Addendum: bounded latency

A quick Ping Pong RTT latency benchmark showing that java can compete with C solutions easily, as long the main path is allocation free and techniques like described above are employed:

[credits: charts+measurement done with HdrHistogram]

This is an "experiment" rather than a benchmark (so do not read: 'Proven: Java faster than C'), it shows low-level-Java can compete with C in at least this low-level domain.
Of course its not exactly idiomatic Java code, however its still easier to handle, port and maintain compared to a JNI or pure C(++) solution. Low latency C(++) code won't be that idiomatic either ;-)

About me: I am a solution architect freelancing at an exchange company in the area of realtime GUIs, middleware, and low latency CEP (Complex Event Processing).
I am blogging at,
hacking at

This post is part of the Java Advent Calendar and is licensed under the Creative Commons 3.0 Attribution license. If you like it, please spread the word by sharing, tweeting, FB, G+ and so on!

Sunday, December 21, 2014

Doing microservices with micro-infra-spring

We've been working at 4financeit for last couple of months on some open source solutions for microservices. I will be publishing some articles related to microservices and our tools and this is the first of (hopefully) many that I will write in the upcoming weeks (months?) on Too much coding blog.

This article will be an introduction to the micro-infra-spring library showing how you can quickly set up a microservice using our tools.

Saturday, December 20, 2014

How is Java / JVM built ? Adopt OpenJDK is your answer!

Introduction & history
As some of you may already know, starting with Java 7, OpenJDK is the Reference Implementation (RI) to Java. The below timeline gives you an idea about the history of OpenJDK:
OpenJDK history (2006 till date)
If you have wondered about the JDK or JRE binaries that you download from vendors like Oracle, Red Hat, etcetera, then the clue is that these all stem from OpenJDK. Each vendor then adds some extra artefacts that are not open source yet due to security, proprietary or other reasons.

What is OpenJDK made of ?
OpenJDK is made up of a number of repositories, namely corba, hotspot, jaxp, jaxws, jdk, langtools, and nashorn. Between OpenjJDK8 and OpenJDK9 there have been no new repositories introduced, but lots of new changes and restructuring, primarily due to Jigsaw - the modularisation of Java itself [2] [3] [4] [5].
repo composition, language breakdown (metrics are estimated)
Recent history
OpenJDK Build Benchmarks - build-infra (Nov 2011) by Fredrik Öhrström, ex-Oracle, OpenJDK hero!
Fredrik Öhrström visited the LJC [16] in November 2011 where he showed us how to build OpenJDK on the three major platforms, and also distributed a four page leaflet with the benchmarks of the various components and how long they took to build. The new build system and the new makefiles are a result  of the build system being re-written (build-infra). 

Below are screen-shots of the leaflets, a good reference to compare our journey:
Build Benchmarks page 2 [26]

How has Java the language and platform built over the years ?
Java is built by bootstrapping an older (previous) version of Java - i.e. Java is built using Java itself as its building block. Where older components are put together to create a new component which in the next phase becomes the building block. A good example of bootstrapping can be found at Scheme from Scratch [6] or even on Wikipedia [7].

OpenJDK8 [8] is compiled and built using JDK7, similarly OpenJDK9 [9] is compiled and built using JDK8. In theory OpenJDK8 can be compiled using the images created from OpenJDK8, similarly for OpenJDK9 using OpenJDK9. Using a process called bootcycle images - a JDK image of OpenJDK is created and then using the same image, OpenJDK is compiled again, which can be accomplished using a make command option:

$ make bootcycle-images       # Build images twice, second time with newly built JDK

make offers a number of options under OpenJDK8 and OpenJDK9, you can build individual components or modules by naming them, i.e.

$ make [component-name] | [module-name]

or even run multiple build processes in parallel, i.e.

$ make JOBS=<n>                 # Run <n> parallel make jobs

Finally install the built artefact using the install option, i.e.

$ make install

Some myths busted
OpenJDK or Hotspot to be more specific isn't completely written in C/C++, a good part of the code-base is good 'ole Java (see the composition figure above). So you don't have to be a hard-core developer to contribute to OpenJDK. Even the underlying C/C++ code code-base isn't scary or daunting to look at. For example here is an extract of a code snippet from vm/memory/universe.cpp in the HotSpot repo -

if (UseParallelGC) {
   #ifndef SERIALGC
   Universe::_collectedHeap = new ParallelScavengeHeap();
   #else // SERIALGC
       fatal("UseParallelGC not supported in this VM.");
   #endif // SERIALGC

} else if (UseG1GC) {
   #ifndef SERIALGC
   G1CollectorPolicy* g1p = new G1CollectorPolicy();
   G1CollectedHeap* g1h = new G1CollectedHeap(g1p);
   Universe::_collectedHeap = g1h;
   #else // SERIALGC
       fatal("UseG1GC not supported in java kernel vm.");
   #endif // SERIALGC

} else {
   GenCollectorPolicy* gc_policy;

   if (UseSerialGC) {
       gc_policy = new MarkSweepPolicy();
   } else if (UseConcMarkSweepGC) {
       #ifndef SERIALGC
       if (UseAdaptiveSizePolicy) {
           gc_policy = new ASConcurrentMarkSweepPolicy();
       } else {
           gc_policy = new ConcurrentMarkSweepPolicy();
       #else // SERIALGC
           fatal("UseConcMarkSweepGC not supported in this VM.");
       #endif // SERIALGC
   } else { // default old generation
       gc_policy = new MarkSweepPolicy();

   Universe::_collectedHeap = new GenCollectedHeap(gc_policy);
(please note that the above code snippet might have changed since published here)
The things that appears clear from the above code-block are, we are looking at how pre-compiler notations are used to create Hotspot code that supports a certain type of GC i.e. Serial GC or Parallel GC. Also the type of GC policy is selected in the above code-block when one or more GC switches are toggled i.e. UseAdaptiveSizePolicy when enabled selects the Asynchronous Concurrent Mark and Sweep policy. In case of either Use Serial GC or Use Concurrent Mark Sweep GC are not selected, then the GC policy selected is Mark and Sweep policy. All of this and more is pretty clearly readable and verbose, and not just nicely formatted code that reads like English.

Further commentary can be found in the section called Deep dive Hotspot stuff in the Adopt OpenJDK Intermediate & Advance experiences [11] document.

Steps to build your own JDK or JRE
Earlier we mentioned about JDK and JRE images - these are no longer only available to the big players in the Java world, you and I can build such images very easily. The steps for the process have been simplified, and for a quick start see the Adopt OpenJDK Getting Started Kit [12] and Adopt OpenJDK Intermediate & Advance experiences [11] documents. For detailed version of the same steps, please see the Adopt OpenJDK home page [13]. Basically building a JDK image from the OpenJDK code-base boils down to the below commands:

(setup steps have been made brief and some commands omitted, see links above for exact steps)

$ hg clone jdk8  (a)...OpenJDK8
$ hg clone jdk9  (a)...OpenJDK9

$ ./                                    (b)
$ bash configure                                      (c)
$ make clean images                                   (d)

(setup steps have been made brief and some commands omitted, see links above for exact steps)

To explain what is happening at each of the steps above:
(a) We clone the openjdk mercurial repo just like we would using git clone ….
(b) Once we have step (a) completed, we change into the folder created, and run the command, which is equivalent to a git fetch or a git pull, since the step (a) only brings down base files and not all of the files and folders.
(c) Here we run a script that checks for and creates the configuration needed to do the compile and build process
(d) Once step (c) is success we perform a complete compile, build and create JDK and JRE images from the built artefacts

As you can see these are dead-easy steps to follow to build an artefact or JDK/JRE images [step (a) needs to be run only once].

- contribute to the evolution and improvement of the Java the language & platform
- learn about the internals of the language and platform
- learn about the OS platform and other technologies whilst doing the above
- get involved in F/OSS projects
- stay on top the latest changes in the Java / JVM sphere
- knowledge and experience that helps professionally but also these are not readily available from other sources (i.e. books, training, work-experience, university courses, etcetera).
- advancement in career
- personal development (soft skills and networking)

Join the Adopt OpenJDK [14] and Betterrev [15] projects and contribute by giving us feedback about everything Java including these projects. Join the Adoption Discuss mailing list and other OpenJDK related mailing lists to start with, these will keep you updated with latest progress and changes to OpenJDK. Fork any of the projects you see and submit changes via pull-requests.

Thanks and support
Adopt OpenJDK [14] and umbrella projects have been supported and progressed with help of JCP [21], the Openjdk team [22], JUGs like London Java Community [16], SouJava [17] and other JUGs in Brazil, a number of JUGs in Europe i.e. BGJUG (Bulgarian JUG) [18], BeJUG (Belgium JUG) [19], Macedonian JUG [20], and a number of other small JUGs. We hope in the coming time more JUGs and individuals would get involved. If you or your JUG wish to participate please get in touch.

Special thanks to +Martijn Verburg (incepted Adopt OpenJDK), +Richard Warburton+Oleg Shelajev+Mite Mitreski, +Kaushik Chaubal and +Julius G for helping improve the content and quality of this post, and sharing their OpenJDK experience with us.

Please share your comments here or tweet at @theNeomatrix369.

[8] OpenJDK8 
[15] Betterrev
[17] SouJava 

This post is part of the Java Advent Calendar and is licensed under the Creative Commons 3.0 Attribution license. If you like it, please spread the word by sharing, tweeting, FB, G+ and so on!