I had been meaning for a number of months to review a cache of articles and videos on topics like performance tuning, the JVM, GC in Java, Mechanical Sympathy, etc… and finally took the time to do it – maybe this was the point in my intellectual progress when I was required to do such a thing!
Thanks to Attila-Mihaly for giving me the opportunity to write a post for his yearly newsletter Java Advent Calendar, hence a review on various Java related topics fits the bill! The selection of videos and articles is purely random, and based on the order in which they came to my knowledge. My hidden agenda is mainly to go through them to understand and broaden my own knowledge, and at the same time share any insight with others along the way.
I’ll be reviewing one talk by Attila Szegedi and two pieces by Ben Evans (an article and a talk). They speak on the subject of Java performance and the GC. The talk by Attila covers a lot of his experience as an engineer at Twitter – so it’s lots of information out of live experience in the field on production systems. Making use of thin objects instead of fat ones is one of the buzzwords in his talk.
Ben covers performance, the JVM and GC in great depth. He points out people’s misconceptions about performance, the JVM and GC, and the fact that many people don’t have certain run-time flags enabled in production. How does the underlying machinery work, and why does it work the way it does? How efficient is it, and what should one do (and not do) to get good throughput out of it?
Here I go with my commentary. I decided to start with Attila Szegedi’s talk as I quite liked the title…
Everything I Ever Learned About JVM Performance Tuning @Twitter by Attila Szegedi
(video & slides)
At the time of the talk Attila worked for Twitter, where he learnt a lot about the internals of the JVM and the Java language itself – Twitter being an organisation where JVM tuning, optimisation and low latency are de facto practices.
He covers interesting topics like:
– contributors of latency
– finished code not ready for production
– areas of performance tuning (primarily memory tuning and lock contention tuning)
– Memory footprint tuning (OOME, inefficient tuning, FAT data)
– FAT data – a new term coined by him, and how to resolve the issues it creates (pretty in-depth and interesting) – learn about byte allocations to data types in Java / JVM languages.
Compressed object pointers are one of his suggestions, covered as a deep-dive topic (including a pitfall). Certain types in Scala 2.7.7 are inefficient – as revealed by a JVM profiler. Do not use Thrift – it is not a friend of low latency, as its objects are heavy: it adds between 52 and 72 bytes of overhead per object, does not support 32-bit floats, etc… Be careful with thread locals – they stick around and use more resources than expected.
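To get a feel for what "fat" versus "thin" data means, here is a rough back-of-the-envelope sketch of my own (it is not from the talk). The header and reference sizes are assumptions for a 64-bit HotSpot JVM with compressed object pointers; the class name FootprintSketch and its constants are purely illustrative, and real numbers should be measured with a profiler.

```java
// Rough footprint arithmetic for "fat" versus "thin" data.
// ASSUMPTIONS (not from the talk): 64-bit HotSpot with compressed object
// pointers, 12-byte object headers, 16-byte array headers, 4-byte
// references and 8-byte object alignment.
final class FootprintSketch {
    static final long OBJECT_HEADER = 12;
    static final long ARRAY_HEADER = 16;
    static final long REFERENCE = 4;
    static final long ALIGNMENT = 8;

    // Round a size up to the next multiple of the alignment.
    static long align(long bytes) {
        return (bytes + ALIGNMENT - 1) / ALIGNMENT * ALIGNMENT;
    }

    // int[n]: one array header plus 4 bytes per element.
    static long intArrayBytes(int n) {
        return align(ARRAY_HEADER + 4L * n);
    }

    // Integer[n]: an array of references plus one boxed Integer per element
    // (assuming every element is a distinct Integer instance).
    static long boxedArrayBytes(int n) {
        long array = align(ARRAY_HEADER + REFERENCE * n);
        long boxes = n * align(OBJECT_HEADER + 4);
        return array + boxes;
    }

    public static void main(String[] args) {
        int n = 1000;
        System.out.println("int[" + n + "]     ~ " + intArrayBytes(n) + " bytes");
        System.out.println("Integer[" + n + "] ~ " + boxedArrayBytes(n) + " bytes");
    }
}
```

Under these assumptions the boxed array comes out at roughly five times the size of the primitive one, which is the kind of overhead the "fat data" discussion is about.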
Attila shares his insight into the concept of the performance triangle. GC is the biggest threat to JVM performance. The old generation can use a concurrent collector, while the new generation is collected in a stop-the-world (STW) pause; he lists a number of throughput and low-pause collectors.
Improve GC by taking advantage of the adaptive sizing policy, and give it a target to work towards. Use a throughput collector with or without the adaptive policy and benchmark the results. He takes us through the various -XX:+Print… flags and explains their uses. Keep fragmentation low and avoid full GC stops. There is lots of detail on the workings of the GC and what can be done to improve it (tuning both new and old gens).
Latency that is not GC related – thread coordination optimization. Barriers and half-barriers can be used with threads to improve latency – along with some tricks when using atomic values and AtomicReferences. Cassandra slab allocator – helps efficiency and performance – do not write your own memory manager. Attila is no longer a fan of “Soft references” – great in theory but not in practice, as more GC cycles are needed to clear them!
Conclusion: know your code, as it may often be the root of your problems – and frameworks can many a time be the cause of performance issues. Lots can be done to squeeze performance out of the programs we write, if one knows how best to use the fundamental building blocks and data structures of the development environment. It’s a hard game to maintain the best throughput and get the best performance out of the JVM.
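Before changing any collector settings it helps to know which collectors the JVM is actually running. A minimal sketch of my own, using the standard java.lang.management API (the class name CollectorReport is made up for illustration):

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.ArrayList;
import java.util.List;

// Lists the garbage collectors active in this JVM with their running totals.
final class CollectorReport {
    static List<String> describeCollectors() {
        List<String> lines = new ArrayList<>();
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            // Count and time are cumulative since JVM start (-1 if undefined).
            lines.add(gc.getName() + ": collections=" + gc.getCollectionCount()
                    + ", time(ms)=" + gc.getCollectionTime());
        }
        return lines;
    }

    public static void main(String[] args) {
        for (String line : describeCollectors()) {
            System.out.println(line);
        }
    }
}
```

The collector names printed depend on the JVM version and on which collector flags were passed at start-up, so running this before and after a flag change is a quick sanity check that the change took effect.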
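One trick of this kind that I am aware of is AtomicLong.lazySet, which performs an ordered store without the full cost of a volatile write. Whether this is exactly the trick shown in the talk is my assumption; the sketch below (class name SingleWriterCounter is made up) only works when a single thread does the writing.

```java
import java.util.concurrent.atomic.AtomicLong;

// A counter updated by exactly one writer thread and read by any thread.
final class SingleWriterCounter {
    private final AtomicLong value = new AtomicLong();

    // Single writer only: a read followed by lazySet is NOT an atomic
    // increment, so two concurrent writers would lose updates.
    void increment() {
        value.lazySet(value.get() + 1);
    }

    // Readers may briefly observe a slightly stale value.
    long read() {
        return value.get();
    }

    public static void main(String[] args) throws InterruptedException {
        final SingleWriterCounter counter = new SingleWriterCounter();
        Thread writer = new Thread(new Runnable() {
            @Override
            public void run() {
                for (int i = 0; i < 1_000_000; i++) {
                    counter.increment();
                }
            }
        });
        writer.start();
        writer.join(); // join() makes all of the writer's updates visible here
        System.out.println("final count = " + counter.read());
    }
}
```

The trade-off is that readers may see the new value a little later than with a volatile write, which is usually acceptable for statistics and progress counters.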
— Recommend watching the video, lots more covered than the synopsis above —
9 Fallacies of Java Performance by Ben Evans (blog)
In this article Ben goes about busting old myths and assumptions about Java, its performance, GC, etc… Areas covered being:
1) Java is slow
2) A single line of Java means anything in isolation
3) A micro-benchmark means what you think it does
4) Algorithmic slowness is the most common cause of performance problems
5) Caching solves everything
6) All apps need to be concerned about Stop-The-World
7) Hand-rolled Object Pooling is appropriate for a wide range of apps
8) CMS is always a better choice of GC than Parallel Old
9) Increasing the heap size will solve your memory problem
– JIT compiled code is as fast as C++ in many cases
– JIT compiler can optimize away dead and unused code, even on the basis of profiling data. In JVMs like JRockit, the JIT can decompose object operations.
– For best results don’t prematurely optimize, instead correct your performance hot spots.
– Richard Feynman once said: “The first principle is that you must not fool yourself – and you are the easiest person to fool” – something to keep in mind when thinking of writing Java micro-benchmarks.
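To illustrate how easy it is to fool yourself with a micro-benchmark, here is a naive timing loop of my own (not from Ben's article). It warms the code up first and feeds every result into a sink so that the JIT cannot discard the work as dead code. All names are illustrative, and for real measurements a harness such as JMH is the usual recommendation.

```java
// A deliberately simple timing harness. It is only a sketch: results still
// vary with JIT compilation, GC activity and the machine it runs on.
final class NaiveBenchmark {
    static long sumOfSquares(int n) {
        long sum = 0;
        for (int i = 1; i <= n; i++) {
            sum += (long) i * i;
        }
        return sum;
    }

    // Times `iterations` calls and accumulates every result into sink[0],
    // so the computation has an observable effect and cannot be eliminated.
    static long timeNanos(int iterations, int n, long[] sink) {
        long start = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            sink[0] += sumOfSquares(n);
        }
        return System.nanoTime() - start;
    }

    public static void main(String[] args) {
        long[] sink = new long[1];
        timeNanos(20_000, 1_000, sink);               // warm-up, result ignored
        long elapsed = timeNanos(20_000, 1_000, sink); // measured run
        System.out.println("elapsed ns: " + elapsed + " (sink=" + sink[0] + ")");
    }
}
```

Dropping the warm-up run or the sink typically changes the reported time a great deal, which is exactly why a single number from such a loop should be treated with suspicion.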
The point being that the ideas people have in their minds about Java are often the opposite of reality. Ben is basically asking everyone to revisit those ideas and draw conclusions from facts, not from assumptions or old beliefs.
– GC, database access, misconfiguration, etc… are more likely to cause application slowness than algorithms.
– Measure, don’t guess! Use empirical production data to uncover the true causes of performance problems.
– Don’t just add a cache to redirect the problem elsewhere and add complexity to the system, but collect basic usage statistics (miss rate, hit rate, etc.) to prove that the caching layer is actually adding value.
– If the users haven’t complained or you are not in the low-latency stack – don’t worry about STOP-THE-WORLD pauses (circa 200 ms depending on the heap size).
– Object pooling is very difficult and should only be used when GC pauses are unacceptable, and intelligent attempts at tuning and refactoring have been unable to reduce pauses to an acceptable level.
– To check whether CMS is the correct GC strategy for you, first determine that STW pauses from Parallel Old are unacceptable and can’t be tuned. Ben stresses: be sure that all metrics are obtained on a production-equivalent system.
– Understanding the dynamics of object allocation and lifetime before changing heap size or tuning other parameters is essential. Acting without measuring can make matters worse. The tenuring distribution information from the garbage collector is especially important here.
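The caching point in the list above asks for basic usage statistics. As a minimal sketch of my own, here is a small LRU cache built on LinkedHashMap that counts hits and misses. The class name CountingCache and its methods are hypothetical, it is not thread-safe, and it assumes the loader never returns null.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

// A tiny LRU cache that records hits and misses so its value can be proven.
final class CountingCache<K, V> {
    private final Map<K, V> entries;
    private final Function<K, V> loader;
    private long hits;
    private long misses;

    CountingCache(final int capacity, Function<K, V> loader) {
        this.loader = loader;
        // accessOrder=true keeps the least recently used entry first.
        this.entries = new LinkedHashMap<K, V>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
                return size() > capacity;
            }
        };
    }

    V get(K key) {
        V value = entries.get(key);
        if (value != null) {
            hits++;
            return value;
        }
        misses++;
        value = loader.apply(key); // the slow path the cache is meant to avoid
        entries.put(key, value);
        return value;
    }

    long hits() { return hits; }

    long misses() { return misses; }

    // Fraction of lookups served from the cache; 0.0 before any lookup.
    double hitRate() {
        long total = hits + misses;
        return total == 0 ? 0.0 : (double) hits / total;
    }

    public static void main(String[] args) {
        CountingCache<String, Integer> cache = new CountingCache<>(2, String::length);
        for (String key : new String[] {"a", "a", "bb", "ccc", "a"}) {
            cache.get(key);
        }
        System.out.println("hits=" + cache.hits() + " misses=" + cache.misses()
                + " hitRate=" + cache.hitRate());
    }
}
```

A persistently low hit rate is a sign that the cache is adding complexity without adding value.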
Conclusion: The GC subsystem has incredible potential for tuning and for producing data to guide that tuning. Use a tool to analyse the logs – either handwritten scripts and some graph generation, or a visual tool such as the (open-source) GCViewer or a commercial product.
Visualizing Java GC by Ben Evans (video & slides)
The talk starts with the misunderstandings and shortcomings in people’s understanding of GC. It’s not just Mark & Sweep. Many run-times these days have GC! There are two schools of thought – GC & reference counting! Manual memory management requires a high level of precision, and humans make mistakes where machines do not. True GC is incredibly efficient, reference counting is expensive – pioneered by Java (comments from +Gil Tene: On the correctness side, I’d be careful saying “pioneered by Java” for anything in GC. Java’s GC semantics are fairly classic, and present no new significant problems that predating environments did not. Most core GC techniques used in JVMs were researched and well known in other environments (smalltalk, lisp, etc.) and are also available in other Runtimes. While it is fair to say that from a practical perspective, JVMs tend to have the most mature GC mechanisms these days, that’s because Java is a natural place to apply new GC techniques that actually work. But innovation and pioneering in GC is not strongly tied to Java.)
The allocation list is where all objects are rooted from.
You can’t get an accurate picture of all the objects of a running live application at any given point in time without stopping the application, that’s why we have STW (Stop-The-World)! (comments from +Gil Tene: In addition, the notion that “you can’t get an accurate picture of all the objects of a running object at any given point of time of a running live application without stopping the application that’s why we have STW (Stop-The-World)!” is wrong. Concurrent marking and concurrent compaction are very real things that achieve just that without stopping the application. “Just needs some good engineering”, and “you just can’t do X” are very different things.)
Golden rules of GC
– must collect all the garbage (sensitive rule)
– must never collect a live object
(trick: but they are never created equal)
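The two golden rules can be illustrated with a toy mark-and-sweep over a small object graph. This is my own simplified sketch and not how HotSpot is implemented; the ToyHeap and Node names are made up. Everything reachable from the roots is marked live, and everything else is swept.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.Collections;
import java.util.Deque;
import java.util.IdentityHashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Set;

// A toy heap: nodes hold references to other nodes.
final class ToyHeap {
    static final class Node {
        final String name;
        final List<Node> refs = new ArrayList<>();
        Node(String name) { this.name = name; }
    }

    // Mark phase: everything reachable from the roots is live.
    static Set<Node> mark(Collection<Node> roots) {
        Set<Node> live = Collections.newSetFromMap(new IdentityHashMap<Node, Boolean>());
        Deque<Node> pending = new ArrayDeque<>(roots);
        while (!pending.isEmpty()) {
            Node node = pending.pop();
            if (live.add(node)) {           // first visit only, so cycles terminate
                pending.addAll(node.refs);
            }
        }
        return live;
    }

    // Sweep phase: remove every unmarked node from the heap and return the
    // garbage. A marked (live) node is never removed.
    static List<Node> sweep(List<Node> heap, Set<Node> live) {
        List<Node> garbage = new ArrayList<>();
        for (Iterator<Node> it = heap.iterator(); it.hasNext();) {
            Node node = it.next();
            if (!live.contains(node)) {
                garbage.add(node);
                it.remove();
            }
        }
        return garbage;
    }

    public static void main(String[] args) {
        Node a = new Node("a"), b = new Node("b"), c = new Node("c"), d = new Node("d");
        a.refs.add(b);
        c.refs.add(d);
        d.refs.add(c); // an unreachable cycle: c <-> d
        List<Node> heap = new ArrayList<>(Arrays.asList(a, b, c, d));
        List<Node> garbage = sweep(heap, mark(Collections.singletonList(a)));
        System.out.println("collected " + garbage.size() + ", kept " + heap.size());
    }
}
```

The unreachable cycle between c and d is collected here, whereas plain reference counting would never free it because each node still holds a reference to the other.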
Hotspot is a C/C++/Assembly application. The heap is a contiguous block of memory with different memory pools – Young Gen, Old Gen, and PermGen pools. Objects are created by application (mutator) threads and removed by GC. Applications are not slow due to GC all the time.
PermGen – not desirable, going away in Java 8 (known issue: causes OOMEs), to be replaced by Metaspace outside the heap (native memory).
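To see which memory pools a particular JVM actually has, and whether each one counts as heap or non-heap, the standard management API can list them. This is a small sketch of my own (the class name PoolReport is made up); the pool names it prints differ between JVM versions and collectors.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryType;
import java.lang.management.MemoryUsage;

// Prints every memory pool of the running JVM with its type and usage.
final class PoolReport {
    static long countPools(MemoryType type) {
        long count = 0;
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            if (pool.getType() == type) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            MemoryUsage usage = pool.getUsage(); // may be null for an invalid pool
            long used = usage == null ? -1 : usage.getUsed();
            System.out.println(pool.getName() + " [" + pool.getType() + "] used=" + used);
        }
    }
}
```

On a Java 8 or later HotSpot JVM I would expect a Metaspace pool to show up as non-heap, while older JVMs list a permanent generation pool instead.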
GC is based on ‘Weak generational hypothesis’ – objects die young, or die old – found out through empirical research. (comments from +Michael Barker: I think this statement: “GC is based on ‘Weak generational hypothesis’ – objects die young, or die old – found out through empirical research.” is not correct. I think I can guess at what you mean, but you may want to consider rewording it so that it is not misleading. There are GC implementations in real world VMs that are not generational collectors.
comments from +Kirk Pepperdine: Indeed. The ParcPlace VM had 7 different memory spaces that have a strong resemblance to today’s generational spaces. There is Eden with two hemi spaces plus 4 other spaces for different types of long lived data.)
Re-worded version: GC in the JVM is based on ‘Weak generational hypothesis’ – objects die young, or die old – found out through empirical research.
Tenuring threshold is the number of GCs an object survives before it gets moved to the Old Gen (Tenured space). JavaFX is bundled with jdk7u6 and up.
Source code of the JavaFX Memory Visualizer, written in Java and replacing the Flash version – https://github.com/kittylyst/jfx-mem – written using FlexML (FXML). The talk gives an extensive explanation of how the program is written in FlexML, a nice programming language – it uses the builder pattern in combination with DSL-like expressions. The program models the way GC works and how objects are created, destroyed and moved about the different pools.
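The idea of a tenuring threshold can be shown with a toy simulation. This is my own simplification and not HotSpot's actual algorithm: each object is described only by how many young collections it stays reachable for, and it is promoted once its age reaches the threshold. The class and method names are made up.

```java
// Toy model of ageing and promotion in a generational collector.
final class TenuringSketch {
    // survivable[i] = number of young collections object i stays reachable for.
    // Returns how many objects end up promoted to the old generation.
    static int promotedCount(int[] survivable, int threshold) {
        int promoted = 0;
        for (int collections : survivable) {
            int age = 0;
            // Each young collection the object survives increases its age by one.
            while (age < collections && age < threshold) {
                age++;
            }
            if (age == threshold) {
                promoted++; // old enough: moved to the old generation
            }
        }
        return promoted;
    }

    public static void main(String[] args) {
        int[] survivable = {0, 0, 1, 3, 20}; // most objects die young
        System.out.println("threshold 4 -> promoted " + promotedCount(survivable, 4));
        System.out.println("threshold 2 -> promoted " + promotedCount(survivable, 2));
    }
}
```

Lowering the threshold promotes medium-lived objects sooner, which fills the old generation faster; a threshold of zero promotes everything that survives at all.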
List of mandatory flags, which do not have any performance impact
All the information needed about an executing application and its GC is recorded by the above flags. The talk also covers basic heap sizing flags. The old advice to set the minimum and maximum heap sizes to the same value no longer applies in recent versions of the JDK. There are also more than 200 flags for the GC and VM, not including all the undocumented ones.
GC log files are useful for post-processing, but sometimes are not recorded correctly. MXBeans impact the running application and do not give more information than the log files.
GC log files have a general format giving information on changes in allocation, occupancy, tenuring info, collection info, etc… There has been an explosion of GC log file formats and there is not much tooling out there. Many of the free tools offer some sort of dashboard-like output showing various GC-related metrics; the commercial ones take a better approach and give more useful information in general.
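The list of flags from the talk is not reproduced in this post. As a related sketch of my own, the snippet below reports which -X, -XX and -verbose options the running JVM was actually started with, which is handy for checking that logging flags really are enabled in production. The flags named in the code comment are my assumption of typical JDK 7/8-era GC logging options, not necessarily the ones on Ben's list, and the class name FlagReport is made up.

```java
import java.lang.management.ManagementFactory;
import java.util.ArrayList;
import java.util.List;

// Reports the JVM options this process was launched with.
// ASSUMPTION: on JDK 7/8 one would typically hope to see GC logging options
// such as -verbose:gc, -Xloggc:<file>, -XX:+PrintGCDetails and
// -XX:+PrintTenuringDistribution in this list.
final class FlagReport {
    static List<String> jvmOptions() {
        List<String> options = new ArrayList<>();
        for (String arg : ManagementFactory.getRuntimeMXBean().getInputArguments()) {
            // Covers -X..., -XX:... and -verbose:... options.
            if (arg.startsWith("-X") || arg.startsWith("-verbose")) {
                options.add(arg);
            }
        }
        return options;
    }

    public static void main(String[] args) {
        List<String> options = jvmOptions();
        if (options.isEmpty()) {
            System.out.println("no -X / -XX / -verbose options were passed");
        }
        for (String option : options) {
            System.out.println(option);
        }
    }
}
```

An empty result on a production machine is a quick hint that GC logging has not been switched on.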
Premature promotion – under pressure of creation of new objects, objects are moved directly from YG to OG without going through the Survivor spaces.
Use tools, measure and don’t guess!
Conclusion: know the facts and find out the details if they are not known, but do not guess or assume. False conceptions have at times led to assumptions and an incorrect understanding of the JVM and the GC process. Don’t just change flags or use tools; know why you are doing so and what they do. For example, switching on GC logging (with the appropriate flags enabled) does not have a visible impact on the performance of the JVM but is a boon in the medium to long run.
— Highly recommend watching the video, lots more covered than the synopsis above, Ben has explained GC in the simplest form one could, covering many important details —