Advent time again. Picking up Peter's well-written overview of the uses of Unsafe, I'll do a short fly-by on how low-level techniques in Java can save development effort by enabling a higher level of abstraction, or allow for Java performance levels probably unknown to many.
My major point is to show that conversion of objects to bytes and vice versa is an important fundamental, affecting virtually any modern Java application.
Hardware likes to process streams of bytes, not object graphs connected by pointers: “All memory is tape” (M. Thompson, if I remember correctly).
Many basic technologies are therefore hard to use with vanilla Java heap objects:
- Memory Mapped Files – a great and simple technology to persist application data safely, quickly and easily.
- Network communication is based on sending packets of bytes
- Interprocess communication (shared memory)
- Large main memory of today’s servers (64GB to 256GB). (GC issues)
- CPU caches work best on data stored as a contiguous stream of bytes in memory
So use of the Unsafe class in most cases boils down to helping transform a Java object graph into a contiguous memory region and vice versa, using either
- [performance enhanced] object serialization or
- wrapper classes to ease access to data stored in a contiguous memory region.
(source of examples used in this post can be found here, messaging latency test here)
Serialization based Off-Heap
Let’s take a simple user class holding some attributes and a list of ‘friends’ making up a social graph.
The easiest way to store this on heap is a simple, huge HashMap.
Alternatively one can use off heap maps to store large amounts of data. An off heap map stores its keys and values inside the native heap, so garbage collection does not need to track this memory. In addition, native heap can be told to automagically get synchronized to disk (memory mapped files). This even works in case your application crashes, as the OS manages write back of changed memory regions.
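To make the memory mapping part concrete, here is a minimal sketch using only the JDK's `FileChannel.map`. The class name `MappedCounter` and the idea of keeping a single counter in the first 8 bytes of a file are assumptions made up for illustration; they are not taken from the off heap map used in this post.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical example class: a persistent counter backed by a memory mapped file.
class MappedCounter {

    /** Increments a long stored in the first 8 bytes of the file and returns the new value. */
    static long increment(Path file) throws IOException {
        try (FileChannel channel = FileChannel.open(file,
                StandardOpenOption.CREATE, StandardOpenOption.READ, StandardOpenOption.WRITE)) {
            if (channel.size() < Long.BYTES) {
                // New or empty file: write 8 zero bytes so the counter starts at 0.
                channel.write(ByteBuffer.allocate(Long.BYTES), 0);
            }
            MappedByteBuffer region = channel.map(FileChannel.MapMode.READ_WRITE, 0, Long.BYTES);
            long next = region.getLong(0) + 1;
            region.putLong(0, next); // a plain memory write; the OS writes the page back to the file
            region.force();          // optional: flush now instead of waiting for the OS
            return next;
        }
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("mapped-counter", ".bin");
        file.toFile().deleteOnExit();
        System.out.println(increment(file)); // first call on an empty file
        System.out.println(increment(file)); // the value survived closing and re-mapping
    }
}
```

The application only reads and writes memory; persistence is a side effect of the mapping, which is what makes this technique attractive for off heap data structures.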
There are some open source off heap map implementations out there with various feature sets (e.g. ChronicleMap); for this example I’ll use a plain and simple implementation featuring fast iteration (optional full scan search) and ease of use.
Serialization is used to store objects, deserialization is used in order to pull them to the java heap again. Pleasantly I have written the (afaik) fastest fully JDK compliant object serialization on the planet, so I’ll make use of that.
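The principle can be sketched in a few lines. Note the assumptions: this is not the off heap map used for the measurements in this post, it uses plain JDK serialization as a stand-in for fast-serialization, it keeps the key index on the Java heap for brevity (a real off heap map stores keys off heap as well), and the `User` and `OffHeapStore` classes are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical user record: some attributes plus a list of friends.
class User implements Serializable {
    final String name;
    final int age;
    final List<String> friends;

    User(String name, int age, List<String> friends) {
        this.name = name;
        this.age = age;
        this.friends = friends;
    }
}

// Append-only store: values live as serialized bytes in native memory.
class OffHeapStore<V extends Serializable> {
    private final ByteBuffer memory;                           // native memory, not scanned by the GC
    private final Map<String, int[]> index = new HashMap<>();  // key -> {offset, length}, on heap for brevity

    OffHeapStore(int capacityBytes) {
        memory = ByteBuffer.allocateDirect(capacityBytes);
    }

    void put(String key, V value) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(value);                            // object graph -> bytes
        }
        byte[] encoded = bytes.toByteArray();
        int offset = memory.position();
        memory.put(encoded);                                   // throws BufferOverflowException when full
        index.put(key, new int[] {offset, encoded.length});    // an overwritten key leaks its old bytes
    }

    @SuppressWarnings("unchecked")
    V get(String key) throws IOException, ClassNotFoundException {
        int[] slot = index.get(key);
        if (slot == null) {
            return null;
        }
        byte[] encoded = new byte[slot[1]];
        ByteBuffer view = memory.duplicate();                  // independent position, same memory
        view.position(slot[0]);
        view.get(encoded);
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(encoded))) {
            return (V) in.readObject();                        // bytes -> fresh object on the Java heap
        }
    }

    public static void main(String[] args) throws Exception {
        OffHeapStore<User> store = new OffHeapStore<>(1 << 20);
        store.put("u1", new User("u1", 33, new ArrayList<>(Arrays.asList("u2", "u3"))));
        User copy = store.get("u1");
        System.out.println(copy.name + " has " + copy.friends.size() + " friends");
    }
}
```

Every `get` pays for a full decode, which is why the speed of the serialization implementation decides whether this approach is usable at all.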
This setup provides:
- Persistence by memory mapping a file (the map will reload upon creation).
- The Java heap stays nearly empty, leaving it free for real application processing with Full GC < 100ms.
- Significantly less overall memory consumption. A serialized user record is ~60 bytes, so in theory 300 million records take about 18GB (60 bytes × 300 million) and fit easily into 180GB of server memory. No need to raise the big data flag and run 4096 Hadoop nodes on AWS ;).
| | consumed Java Heap (MB) | Full GC (s) | Native Heap (MB) | get/put ops per s | required VM size (MB) |
| OffheapMap (Serialization based) | | | | | |
[test source / blog project] Note: You’ll need at least 16GB of RAM to execute them.
As one can see, even with fast serialization there is a heavy penalty (~factor 5) in access performance. Still, compared to other persistence alternatives it's superior (1-3 microseconds per “get” operation, “put()” very similar).
Using JDK serialization would be at least 5 to 10 times slower (direct comparison below) and would therefore render this approach useless.
Trading performance gains against higher level of abstraction: “Serverize me”
A single server won’t be able to serve (hundreds of) thousands of users, so we somehow need to share data amongst processes, even better: across machines.
Using a fast implementation, it's possible to generously use (fast-) serialization for over-the-network messaging. Again: if this ran 5 to 10 times slower, it just wouldn’t be viable. Alternative approaches require an order of magnitude more work to achieve similar results.
By wrapping the persistent off heap hash map in an Actor implementation (async ftw!), a few lines of code make up a persistent KeyValue server with a TCP-based and an HTTP interface (uses kontraktor actors). Of course the Actor can still be used in-process if one decides so later on.
Now that’s a micro service. Given it lacks any attempt at optimization and is single threaded, it's reasonably fast [same XEON machine as above]:
- 280_000 successful remote lookups per second
- 800_000 in case of failed lookups (key not found)
- serialization based TCP interface (1 liner)
- a stringy webservice for the REST-of-us (1 liner).
[source: KVServer, KVClient] Note: You’ll need at least 16GB of RAM to execute the test.
A real world implementation might want to double performance by directly putting the received serialized object bytes into the map instead of encoding them twice (encode/decode once for transmission over the wire, then decode/encode for the off heap map).
“RestActorServer.Publish(..);” is a one liner that also exposes the KVActor as a webservice in addition to raw TCP.
C like performance using flyweight wrappers / structs
With serialization, regular Java Objects are transformed to a byte sequence. One can do the opposite: Create wrapper classes which read data from fixed or computed positions of an underlying byte array or native memory address. (E.g. see this blog post).
It's possible to access different records just by moving the wrapper’s base offset. Copying such a “packed object” boils down to a memory copy. In addition, it's pretty easy to write allocation free code this way. One downside is that reading/writing single fields has a performance penalty compared to regular Java Objects. This can be made up for by using the Unsafe class.
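A hand-written flyweight might look roughly like the sketch below. It is an illustration under assumptions: the record layout (timestamp, price, quantity) and the class name `PriceUpdateFlyweight` are invented for this example, and it uses the absolute get/put methods of a direct `ByteBuffer` as a safe stand-in where a tuned implementation would use Unsafe.

```java
import java.nio.ByteBuffer;

// Hypothetical fixed-size record: long timestamp | double price | int quantity.
class PriceUpdateFlyweight {
    static final int TIMESTAMP = 0;
    static final int PRICE = 8;
    static final int QUANTITY = 16;
    static final int RECORD_SIZE = 20;

    private final ByteBuffer memory; // holds many records back to back
    private int base;                // byte offset of the record currently viewed

    PriceUpdateFlyweight(ByteBuffer memory) {
        this.memory = memory;
    }

    /** Points this wrapper at another record; no object is allocated. */
    PriceUpdateFlyweight moveTo(int recordIndex) {
        base = recordIndex * RECORD_SIZE;
        return this;
    }

    long getTimestamp() { return memory.getLong(base + TIMESTAMP); }
    double getPrice()   { return memory.getDouble(base + PRICE); }
    int getQuantity()   { return memory.getInt(base + QUANTITY); }

    void set(long timestamp, double price, int quantity) {
        memory.putLong(base + TIMESTAMP, timestamp);
        memory.putDouble(base + PRICE, price);
        memory.putInt(base + QUANTITY, quantity);
    }

    /** Copies the current record to another slot: a plain byte copy, no allocation. */
    void copyTo(int targetIndex) {
        int target = targetIndex * RECORD_SIZE;
        for (int i = 0; i < RECORD_SIZE; i++) {
            memory.put(target + i, memory.get(base + i));
        }
    }

    public static void main(String[] args) {
        ByteBuffer memory = ByteBuffer.allocateDirect(3 * RECORD_SIZE);
        PriceUpdateFlyweight update = new PriceUpdateFlyweight(memory);
        update.moveTo(0).set(1L, 99.5, 100);
        update.moveTo(1).set(2L, 101.5, 250);
        update.copyTo(2); // record 2 is now a byte-for-byte copy of record 1
        System.out.println(update.moveTo(2).getPrice());
    }
}
```

A single wrapper instance serves all records, so iterating over millions of them creates no garbage. The price is the hand-maintained offset table, which is exactly the part that becomes hard to maintain as the number of record types grows.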
“Flyweight” wrapper classes can be implemented manually as shown in the blog post cited; however, as code grows this starts getting unmaintainable.
Fast-serialization provides a byproduct, “struct emulation”, supporting creation of flyweight wrapper classes from regular Java classes at runtime. Low level byte fiddling in application code can be avoided for the most part this way.
Of course there are simpler tools out there to help reduce manual programming of encoding (e.g. Slab) which might be more appropriate for many cases and use less “magic”.
What kind of performance can be expected using the different approaches (sad fact incoming)?
Let's take the following struct-class consisting of a price update and an embedded struct denoting a tradable instrument (e.g. stock) and encode it using various methods:
| Structs | fast-Ser (no shared refs) | fast-Ser | JDK Ser (no shared) | JDK Ser |
Real world test with messaging throughput:
In order to get a basic estimate of the differences in a real application, I ran an experiment on how different encodings perform when used to send and receive messages at a high rate via reliable UDP messaging:
A sender encodes messages as fast as possible and publishes them using reliable multicast; a subscriber receives and decodes them.
| Structs | fast-Ser (no shared refs) | fast-Ser | JDK Ser (no shared) | JDK Ser |
(Tests done on i7/Win8; XEON/Linux scores slightly higher. Message size is ~70 bytes for structs, ~60 bytes for serialization.)
Slowest compared to fastest: factor of 82. The test highlights an issue not covered by micro-benchmarking: encoding and decoding should perform similarly, as actual throughput is determined by Min(Encoding performance, Decoding performance). For unknown reasons JDK serialization manages to encode the message tested about 500_000 times per second, while decoding performance is only 80_000 per second, so in the test the receiver gets dropped quickly:
***** Stats for receive rate: 80351 per second *********
***** Stats for receive rate: 78769 per second *********
SUB-ud4q has been dropped by PUB-9afs on service 1
fatal, could not keep up. exiting
(Creating backpressure here probably isn’t the right way to address the issue 😉 )
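The Min(Encoding performance, Decoding performance) observation can be checked with a rough measurement like the one below. It is a sketch under assumptions: the `Quote` message class and `CodecThroughput` are invented for this example, it measures plain JDK serialization in a single thread without any networking, and the absolute numbers depend heavily on machine and JVM, so they will not match the figures quoted above.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// Hypothetical small message, roughly the shape of a price update.
class Quote implements Serializable {
    final String symbol;
    final double bid;
    final double ask;
    final long time;

    Quote(String symbol, double bid, double ask, long time) {
        this.symbol = symbol;
        this.bid = bid;
        this.ask = ask;
        this.time = time;
    }
}

class CodecThroughput {

    static byte[] encode(Serializable msg) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(msg);
        }
        return bytes.toByteArray();
    }

    static Object decode(byte[] bytes) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(bytes))) {
            return in.readObject();
        }
    }

    /** A sender/receiver pair can only sustain the rate of its slower side. */
    static double sustainableRate(double encodesPerSecond, double decodesPerSecond) {
        return Math.min(encodesPerSecond, decodesPerSecond);
    }

    public static void main(String[] args) throws Exception {
        Quote msg = new Quote("ACME", 10.0, 10.5, System.nanoTime());
        int n = 100_000;
        byte[] bytes = encode(msg);
        for (int i = 0; i < n; i++) {
            decode(encode(msg)); // warm-up so the JIT has compiled both paths
        }
        long t0 = System.nanoTime();
        for (int i = 0; i < n; i++) {
            bytes = encode(msg);
        }
        long t1 = System.nanoTime();
        for (int i = 0; i < n; i++) {
            decode(bytes);
        }
        long t2 = System.nanoTime();
        double encodeRate = n * 1e9 / (t1 - t0);
        double decodeRate = n * 1e9 / (t2 - t1);
        System.out.printf("encode %.0f/s, decode %.0f/s, sustainable %.0f/s%n",
                encodeRate, decodeRate, sustainableRate(encodeRate, decodeRate));
    }
}
```

With the rates from the test above, `sustainableRate(500_000, 80_000)` yields 80_000: the sender's extra encoding speed only fills buffers until the receiver is dropped.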
- A fast serialization allows for a level of abstraction in distributed applications that is impossible if the serialization implementation
– is too slow
– is incomplete, e.g. cannot handle any serializable object graph
– requires manual coding/adaptations (this would put many restrictions on actor message types, Futures, Spores, and be a maintenance nightmare)
- Low Level utilities like Unsafe enable different representations of data resulting in extraordinary throughput or guaranteed latency boundaries (allocation free main path) for particular workloads. These are impossible to achieve by a large margin with JDK’s public tool set.
- In distributed systems, communication performance is of fundamental importance. Removing Unsafe is not the biggest fish to fry looking at the numbers above .. JSON or XML won’t fix this ;-).
- While the HotSpot VM has reached an extraordinary level of performance and reliability, CPU is wasted in some parts of the JDK like there’s no tomorrow. Given we are living in the age of distributed applications and data, moving stuff over the wire should be easy to achieve (not manually coded) and as fast as possible.
A quick Ping Pong RTT latency benchmark showing that Java can compete with C solutions easily, as long as the main path is allocation free and techniques like those described above are employed:
[credits: charts+measurement done with HdrHistogram]
This is an “experiment” rather than a benchmark (so do not read: ‘Proven: Java faster than C’); it shows low-level Java can compete with C in at least this low-level domain.
Of course it's not exactly idiomatic Java code, however it's still easier to handle, port and maintain compared to a JNI or pure C(++) solution. Low latency C(++) code won’t be that idiomatic either 😉
About me: I am a solution architect freelancing at an exchange company in the area of realtime GUIs, middleware, and low latency CEP (Complex Event Processing).
I am blogging at http://java-is-the-new-c.blogspot.de/,
hacking at https://github.com/RuedigerMoeller.
This post is part of the Java Advent Calendar and is licensed under the Creative Commons 3.0 Attribution license. If you like it, please spread the word by sharing, tweeting, FB, G+ and so on!
Peter Lawrey December 22, 2014
The sad fact is that while just about every other serialization is faster, there is no JEP to improve performance.
Rüdiger Möller December 22, 2014
1) None of the "about every other" serializations is JDK compatible *and* faster. As mentioned, a partial implementation isn't all that useful, as a lot of work done in JDK classes (readReplace, read/writeObject, Externalizable) has to get rewritten; also it's impossible to just drop-in-replace in existing applications.
2) Regarding a JEP. Agreed, it's sad, however doing a serialization JEP is a serious amount of work with unknown reward (might get rejected, postponed, whatever). Anyway maybe I should give it a try ..
Olimpiu Pop December 22, 2014
Rudiger, I support your efforts. Go for the JEP, take the matter in your own hands :P.
Alexander Turner December 27, 2014
Seems to me that invoke dynamic, as used in lambda, breaks any chance of a fast serialisation system. I do not see a major problem with having to hand code up serialisation. It just needs to be easier. Maybe, one could have a 'pod' (plain old data) definition for Java and then a fast serialisation standard for that, but a system which is fast and can cope with the full Java object model is just not going to happen.
Rüdiger Möller December 27, 2014
Hi Alexander, you are probably living in a different problem universe, I am targeting distributed systems in the first place..
1) serializing lambdas is tricky, but not a show stopper (requires remote class loading subprotocol or shared classpath amongst all cluster nodes). Worst case one can live with the restriction of not serializing lambdas.
2) Hand coding: It becomes a problem if you run a distributed application with >1000 different message types and a larger team. It's expensive, error prone and hard to maintain to do serialization manually then. Hand coding is in essence dumb: a perfect example for something a computer can do :-).
A second disadvantage is coupling of encoding and application logic. E.g. by changing the serialization codec, we can opt for a shorter wire format when sending stuff to clients over internet/leased lines and for higher throughput when talking via 10Gb ethernet to other cluster nodes. Application code is identical, we just use a different serialization codec under the hood.
"but a system which is fast and can cope with the full Java object model is just not going to happen"
(?!) go checkout examples above and run them :-). fst can cope with the full java object model.
It already happened. Fast-serialization runs in a production system coping with up to 150k events/second matched against ~30k complex filters (continuous queries). Cluster internal event rate is ~twice as high.
As you can see above, full graph object serialization -if done right- can result in performance levels where encoding stops being the major bottleneck. The example above deals with very simple data structures, once data is more complex, serialization gains ground.
Additionally there is zero processing in the messaging example. If some processing is added, throughput difference of structs (=fastest possible way to encode/decode) and serialization gets smaller.
Truth is, if you have a decently complex interlinked data structure, a manually implemented serialization will be slower than fast-serialization (if you implement pointer restoration and cycle detection ad-hoc it will be slower for sure). E.g. we share+update reference data (several MB of a complex model of metadata) across cluster nodes with a single 'sendObject'.
Fast, transparent generic serialization fulfils a very old promise of Java: "The network is the computer". You won't get that by cluttering 30% of your code base with putting and getting fields from messages sent/received.