(Part 3 of 3): Synopsis of articles & videos on Performance tuning, JVM, GC in Java, Mechanical Sympathy, et al

This is a continuation of the previous post titled (Part 2 of 3): Synopsis of articles & videos on Performance tuning, JVM, GC in Java, Mechanical Sympathy, et al.

In our first review, The Atlassian guide to GC tuning is an extensive post covering the methodology and things to keep in mind when tuning GC, practical examples are given and references to important resources are also made in the process. The next one How NOT to measure latency by Gil Tene, he discusses some common pitfalls encountered in measuring and characterizing latency, demonstrating and discussing some false assumptions and measurement techniques that lead to dramatically incorrect reporting results, and covers simple ways to sanity check and correct these situations.  Finally Kirk Pepperdine in his post Poorly chosen Java HotSpot Garbage Collection Flags and how to fix them! throws light on some JVM flags – he starts with some 700 flags and boils it down to merely 7 flags. Also cautions you to not just draw conclusions or to take action in a whim but consult and examine – i.e. measure don’t guess!


By default JVM tunings attempt to provide acceptable performance in majority of the cases, but it may not always give the desired results depending on application behaviour. Benchmark your application before applying any GC or JVM related settings. There are a number of environment, OS, hardware related factors to taken into consideration and many a times any tuning to JVM may be linked to these factors and if they chance the tuning may need to be revisited. Beware and accept the fact that any tuning has its upper limit in a given environment, if your goals are still not met, then you need to improve your environment. Always monitor your application to see if your tuning goals are still in scope.

Choosing Performance Goals – GC for the Oracle JVM can be reduced by targeting three goals ie. latency (mean & maximum), throughput, and footprint. To reach the tuning goals, there are principles that guide you to tuning the GC i.e. Minor GC reclamation, GC maximise memory, two of three (pick 2 of 3 of these goals, and sacrifice the other).

Preparing your environment for GC – 

Some of the items of interest are load the application with work (apply load to the application as it would be in production), turn on GC logging, set data sampling windows, determine memory footprint, rules of thumb for generation sizes, determine systemic requirements.

Iteratively change the JVM parameters keeping the environment its setting in tact. An example of turning on GC logging is:

java -XX:+PrintGC -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Xloggc:<file> …
java -XX:+PrintGC -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:<file> …

Sampling windows are important to help specifically diagnose an application’s runtime rather than the load-time or time it or the JVM takes to stabilise before it can execute anything. Set the heap sizes either by guessing (give it maximum and then pull back) or by using results accumulated in the GC logs. Also check for OOME which can be an indicator if heap size or standard memory needs increasing. There are a number of rules of thumb for generation sizes (a number of them depend on sizes of other pools). 

Understanding the Throughput collector and Behaviour based Tuning
Hotspot uses behaviour based tuning with parallel collectors including the MarkAndSweep collector, they are designed to keep three goals in mind i.e. maximum pause time, throughput and footprint. The-XX:+PrintAdaptiveSizePolicy JVM flag prints details of this behaviour for analysis. Even though there are options to set the maximum pause time for any GC event there is no guarantee it will be performed. It also uses the best of 2 (from 3) goals – i.e. only 2 goals are focussed at a time, and there is interaction between goals, and sometimes there is oscillation between them.

Time to tune
There is a cycle to follow when doing this, i.e. 
  1. Determine desired behaviour.
  2. Measure behaviour before change.
  3. Determine indicated change.
  4. Make change.
  5. Measure behaviour after change.
  6. If behaviour not met, try again.
Repeat the above till the results are met (or nearly there).

Java VisualVM with Visual GC plugin for live telemetry (note there’s an observer effect when using VisualVM, which can have an impact on your readings). GCViewer by Tagtruam industries, useful for post-hoc review of GC performance based on the GC log. 

Conclusions: a lot of factors need to be taken into consideration when tuning GC. A meticulous plan needs to be considered, keeping an eye on your goal throughout the process and once achieved, you continue to monitor your application to see if your performance goals are still intact. Adjust your requirements after studying the results of your actions such that they both meet at some point.

— The authors cover the methodology in good detail and give practical advise and steps to follow in order to tune your application – the hands-on and practical aspects of the blog should definitely be read. —

The author starts with saying  “some of the mistakes others including the author has made” – plenty of wrong ways to do things and what to avoid doing.

WHY and HOW – you measure latency, HOW only makes sense if you understand the WHY. There is public domain code / tools available that can be used to measure latency.

Don’t use statistics to measure latency!

Classic way of looking at latency – response time (latency) is a function of load! Its important to know what this response time – is it average, worst case, 99 percentile, etc… based on different types of behaviours?

Hiccups – some sort of accumulated work that gets performed in bursts (spikes). Important to look at these behaviours when looking at response time and average response time. They are not evenly spread around, but may look like periodic freezes, shifts from one mode/behaviour to another (no sequence).

Common fallacies:
– computers run application code continously
– response time can be measure as a work units over time
– response time exhibits a normal distribution
– “glitches” or “semi-random omissions” in measurement don’t have a big effect

Hiccups are hidden knowingly or unknowingly – i.e. averages and standard deviations can help!
Many compute the 99 percentile or another figure from averages and distributions. A better way to deal with it is to measure it using data – for e.g. plot a histogram. You need to have a SLA for the entire spread and plot the distribution across 0 to 100 percentile.

Latency tells us how long something should take! True real-time is being slow and steady!
Author gives examples of types of applications with different latency needs i.e. The Olympics, Pacemaker, Low latency trading, Interactive applications.

A very important way to establishing requirements – an extensive interview process, to help the client come up with a realistic figures for the SLA requirements. We need to find out what is the worst the client is willing to accept and for how long or how often is that acceptable, to help construct a percentile table.

Question is how fast can the system run, smoothly without hitting an obstacles or crashes! The author further makes a comparison of performance between the HotSpot JVM and C4.

Coordinated omission problem – what is it? data that it is not emitted in a random way but in a coordinated way that skews the actual data. Delays in request times does not get measured due to the manner in which response time is recorded from within the system – which omits random uncoordinated data. The maximum value (time) in the data usually is the key value to start with to compute and compare the other percentiles and check if it matches with that computed with coordinated emission in it. Percentiles are useful, compute them.

Graphs with sudden bumps and more smooth lines tend to have coordinated emission in it.

HdrHistogram helps plot histograms – a configurable precision system, telling it the range and precision to cover. It has built in compensation for Coordinated Omission – its OpenSource and available on Github under the CCO license.

jHiccup also plots histogram to show how your computer system has been behaving (run-time and freeze time), also records various percentiles, maximum and minimum values, distribution, good modes, bad modes, etc…. Also opensource under COO license.

Measuring throughput without a latency requirement, the information is meaningless.  Mistakes in measurement/analysis can lead to orders-of-magnitude errors and lead to bad business decisions.

— The author has shown some cool graphs and illustrations of how we measure things wrong and what to look out for when assessing data or even graphs. Gives a lot of tips, on how NOT to measure latency. —

Poorly chosen Java HotSpot Garbage Collection Flags and how to fix them! by Kirk Pepperdine

There are more than 700 product flags defined in the HotSpot JVM and not many have a good idea of what effect it might have when used in runtime, let be the combination of two or more together.

Whenever you think you know a flag, it often pays if you check out the source code underlying that flag to get a more accurate idea.

Identifying redundant flags
The author enlists a huge list of flags for us to determine if we need them or not. It see a list of flags available with your version of java, try this command, to get a long list of options:

$ java -XX:+PrintFlagsFinal  

You can narrow down the list by eliminating deprecated flags and ones with a default value, leaving us with a much shorter list.

DisableExplicitGC, ExplicitGCInvokesConcurrent, and RMI Properties
Calling System.gc() isn’t a great idea and one way to disable that functionality from code that uses it, is to use the DisableExplicitGC flag as one of the arguments of your VM options flags. Or instead use the ExplicitGCInvokesConcurrent flag to invoke the Concurrent Mark and Sweep cycle instead of the Full GC cycle (invoked by System.gc()). You can also set the interval period between complete GC cycles using the sun.rmi.dgc.client.gcInterval and sun.rmi.dgc.server.gcInterval properties.

Xmx, Xms, NewRatio, SurvivorRatio and TargetSurvivorRatio
The -Xmx flag is used to set the maximum heap size, and all JVMs are meant to support this flag. By adjusting the minimum and maximum heap sizes such that the JVM can’t resize the heap size at runtime disables the properties of the adaptive sizing properties.

Use GC logs to help make memory pool size decisions. There is a ratio to maintain between Young Gen and Old Gen and within Young Gen between Eden and the Survivor spaces. The SurvivorRatio flag is one such flag that helps decide how much space will be used by the survivor spaces in Young gen and the rest will be left for Eden. These ratios are key towards determining the frequency and efficiency of the collections in Young Gen. TargetSurvivorRatio is another ratio to examine and cross-check before using.

PermSize, MaxPermSize

These options are removed starting JDK 8 in favour of Meta space, but in previous versions it helped in setting PermGen’s resizing abilities. Restricting its adaptive nature might lead to longer pauses of old generational spaces. 

UseConcMarkSweepGC, CMSParallelRemarkEnabled, UseCMSInitiatingOccupancyOnly CMSInitiatingOccupancyFraction

UseConcMarkSweepGC makes the JVM select the CMS collector, CMS works with ParNew   instead of the PSYoung and PSOld collectors for the respective generational spaces. 

CMSInitiatingOccupancyFraction (IOF) is flag that helps the JVM determine how to maintain the occupancy of data (by examining the current rate of promotion) in the Old Gen space, sometimes leading to STW, FullGCs or CMF (concurrent mode failure). 

UseCMSInitiatingOccupancyOnly tells the JVM to use the IOF without taking the rate of promotion into account.

CMSParallelRemarkEnabled helps parallelise the fourth phase of the CMS (single threaded by default), but use this option only after benchmarking your current production systems.


CMSClassUnloadingEnabled ensures that classes loaded in the PermGen space is regularly unloaded reducing situations of out-of-PermGen-space. Although this could lead to longer concurrent and remark (pause) collection times.


AggressiveOpts as the name suggests helps improve performance by switching on and off certain flags but this strategy hasn’t change across various builds of Java and one must examine benchmarks of their system before using this flag in production.

Next Steps
Of the number of flags examined the below are more interesting flags to look at:
Again examine the GC logs to see if any of the above should be used or not, if used, what values are appropriate to set. Quite often the JVM does get it right, right out of the box, hence getting it wrong can be detrimental to your application performance. Its easy to mess up then to get it right, when you have so many flags with dependent and overlapping functionalities.

Conclusion:  Refer to benchmarks and GC logs for making decisions on enabling and disabling of flags and the values to set when enabled. Lots of flags, easy to get them wrong, only use them if they are absolutely needed, remember the Hotspot JVM has built in machinery to adapt (adaptive policy), consult an expert if you still want to.

— The author has done justice by giving us a gist of what the large number of JVM flags and recommend reading the article to learn about specifics of certain flags.. —

As it is not practical to review all such videos and articles, a number of them have been provided in the links below for further study. In many cases I have paraphrased or directly quoted what the authors have to say to preserve the message and meaning they wished to convey. 

A few other authors have written articles related to these subjects, for the Java Advent Calendar, see below,

Using Intel Performance Counters To Tune Garbage Collection
How NOT to measure latency

Feel free to post your comments below or tweet at @theNeomatrix369!

    Useful resources

      This post is part of the Java Advent Calendar and is licensed under the Creative Commons 3.0 Attribution license. If you like it, please spread the word by sharing, tweeting, FB, G+ and so on!

      (Part 2 of 3): Synopsis of articles & videos on Performance tuning, JVM, GC in Java, Mechanical Sympathy, et al

      This is a continuation of the previous post titled (Part 1 of 3): Synopsis of articles & videos on Performance tuning, JVM, GC in Java, Mechanical Sympathy, et al.

      Without any further ado, lets get started with our next set of blogs and videos, chop…chop…! This time its Martin Thompson’s blog posts and talks. Martin’s first post on Java Garbage collection distilled basically distils the GC process and the underlying components including throwing light on a number of interesting GC flags (-XX:…). In his next talk he does his myth busting shaabang about mechanical sympathy, what people correctly believe in and also some of the misconceptions they have bought into. In the talk on performance testing, Martin takes it further and fuses Java, OS and the hardware to show how understanding aspects of all these can help write better programs. 

      Java Garbage Collection Distilled by Martin Thompson

      There are too many flags to allow tuning the GC to achieve the throughput and latency your application requires. There’s plenty of documentation on the specifics of the bells and whistles around them but none to guide you through them.

      – The Tradeoffs – throughput (-XX:GCTimeRatio=99), latency (-XX:MaxGCPauseMillis=<n>) and memory (-Xmx<n>) are the key variables that the collectors depend upon. It is important to note that Hotspot often cannot achieve the above targets. If a low-latency application goes unresponsive for more than a few seconds it can spill disaster. Tradeoffs play out as they
      * provide more memory to GC algorithms
      * GC can be reduced by containing the live set and keeping heap size small
      * frequency of pauses can be reduced by managing heap and generation sizes & controlling application’s object allocation rate
      * frequency of large pauses can be reduced by running GC concurrently

      – Object Lifetimes
      GC algorithms are often optimised with the expectation that most objects live for a very short period of time, while relatively few live for very long. Experimentation has shown that generational garbage collectors support much better throughput than non-generational ones – hence used in server JVMs.

      – Stop-The-World Events
      For GC to occur it is necessary that all the threads of a running application must pause – garbage collectors do this by signalling the threads to stop when they come to a safe-point. Time to safe-point is an important consideration in low-latency applications and can be found using the ‑XX:+PrintGCApplicationStoppedTime flag in addition to the other GC flags. When a STW event occurs a system will undergo significant scheduling pressure as the threads resume when released from safe-points, hence less STWs makes an application more efficient.

      – Heap Organisation in Hotspot
      Java heap is divided in various regions, an object is created in Eden, and moved into the survivor spaces, and eventually into tenured. PermGen was used to store runtime objects such as classes and static strings. Collectors take help of Virtual spaces to meet throughput & latency targets, and adjusting the region sizes to reach the targets.

      – Object Allocation
      TLAB (Thread Local Allocation Buffer) is used to allocate objects in Java, which is cheaper than using malloc (takes 10 instructions on most platforms). Rate of minor collection is directly proportional to the rate of object allocation. Large objects (-XX:PretenureSizeThreshold=n) may have to be allocated in Old Gen, but if the threshold is set below the TLAB size, then they will not be created in old gen – (note) does not apply to the G1 collector.

      – Minor Collections
      Minor collection occurs when Eden becomes full, objects are promoted to the tenured space from Eden once they get old i.e. cross the threshold (-XX:MaxTenuringThreshold).  In minor collection, live reachable objects with known GC roots are copied to the survivor space. Hotspot maintains cross-generational references using a card table. Hence size of the old generation is also a factor in the cost of minor collections. Collection efficiency can be achieved by adjusting the size of Eden to the number of objects to be promoted. These are prone to STW and hence problematic in recent times.

      – Major Collections
      Major collections collect the old generation so that objects from the young gen can be promoted. Collectors track a fill threshold for the old generation and begin collection when the threshold is passed. To avoid promotion failure you will need to tune the padding that the old generation allows to accommodate promotions (-XX:PromotedPadding=<n>). Heap resizing can be avoided using the –Xms and -Xmx flags. Compaction of old gen causes one of the largest STW pauses an application can experience and directly proportion to the number of live objects in old gen. Tenure space can be filled up slower, by adjusting the survivor space sizes and tenuring threshold but this in turn can cause longer minor collection pause times as a result of increased copying costs between the survivor spaces.

      – Serial Collector
      It is the simplest collector with the smallest footprint (-XX:+UseSerialGC) and uses a single thread for both minor and major collections.

      – Parallel Collector
      Comes in two forms (-XX:+UseParallelGC) and (-XX:+UseParallelOldGC) and uses multiple threads for minor collections and a single thread for major collections – since Java 7u4 uses multiple threads for both type of collections. Parallel Old performs very well on a multi-processor system, suitable for batch applications. This collector can be helped by providing more memory, larger but fewer collection pauses. Weigh your bets between the Parallel Old and Concurrent collector depending on how much pause your application can withstand (expect 1 to 5 seconds pauses per GB of live data on modern hardware while old gen is compacted).

      – Concurrent Mark Sweep (CMS) Collector
      CMS (-XX:+UseConcMarkSweepGC) collector runs in the Old generation collecting tenured objects that are no longer reachable during a major collection. CMS is not a compacting collector causing fragmentation in Old gen over time. Promotion failure will trigger FullGC when a large object cannot fit in Old gen. CMS runs alongside your application taking CPU time. CMS can suffer “concurrent mode failures” when it fails to collect at a sufficient rate to keep up with promotion.

      – Garbage First (G1) Collector
      G1 (-XX:+UseG1GC) is a new collector introduced in Java 6 and now officially supported in Java 7. It is a generational collector with a partially concurrent collecting algorithm, compacts the Old gen with smaller incremental STW pauses. It divides the heap into fixed sized regions of variable purpose. G1 is target driven on latency (–XX:MaxGCPauseMillis=<n>, default value = 200ms). Collection on the humongous regions can be very costly. It uses “Remembered Sets” to keep track of references to objects from other regions. There is a lot of cost involved with book keeping and maintaining “Remembered Sets”. Similar to CMS, G1 can suffer from an evacuation failure (to-space overflow).

      – Alternative Concurrent Collectors
      Oracle JRockit Real Time, IBM Websphere Real Time, and Azul Zing are alternative concurrent collectors.  Zing according to the author is the only Java collector that strikes a balance between collection, compaction, and maintains a high-throughput rate for all generations. Zing is concurrent for all phases including during minor collections, irrespective of heap size. For all the concurrent collectors targeting latency you have to give up throughput and gain footprint. Budget for heap size at least 2 to 3 times the live set for efficient operation.

      – Garbage Collection Monitoring & Tuning
      Important flags to always have enabled to collect optimum GC details:


      Use applications like GCViewer, JVisualVM (with Visual GC plugin) to study the behaviour of your application as a result of GC actions. Run representative load tests that can be executed repeatedly (as you gain knowledge of the various collectors), keep experimenting with different configuration until you reach your throughput and latency targets. jHiccup helps track pauses within the JVM.  As known to us it’s a difficult challenge to strike a balance between latency requirements, high object allocation and promotion rates combined, and that sometimes choosing a commercial solution to achieve this might be a more sensible idea.

      Conclusion: GC is a big subject by itself and has a number of components, some of them are constantly replaced and it’s important to know what each one stands for. The GC flags are as much important as the component to which they are related and it’s important to know them and how to use them. Enabling some standard GC flags to record GC logs does not have any significant impact on the performance of the JVM. Using third-party freeware or commercial tools help as long as you follow the authors methodology.

      — Highly reading the article multiple times, as Martin has covered lots of details about GC and the collectors which requires close inspection and good understanding.  —

      Martin Thompson’s:  Mythbusting modern hardware to gain “Mechanical Sympathy” Video * Slides

      He classifies myth into three categories – possible, plausible and busted! In order to get the best out of the hardware you own, you need TO KNOW your hardware. Make tradeoffs as you go along, changing knobs ;), it’s not as scary as you may think.

      Good question: do normal developers understand the hardware they program on? Or cannot understand what’s going on? Or do we have the discipline and make the effort to understand the platform we work with?

      Myth 1CPUs are not getting faster – clock speed isn’t everything, Sandy bridge architecture is the faster breed. 6 ports to support parallelism (6 ops/cycle). Haswell has 8 ports! Code doing division operations perform slower than any other arithmetic operations. CPUs have both front-end and back-end cycles. It’s getting faster as we are feeding them faster! – PLAUSIBLE

      Myth 2 – Memory provides us random access – CPU registers and buffers, internal caches (L1, L2, L3) and memory – mentioned in the order in of speed of access to these areas respectively. Manufacturers have been doing things to CPUs to bring down its operational temperature by performing direct access operations. Writes are less hassle than reads – buffer misses are costly. L1 is organised into cache-lines containing code that the processor will execute – efficiency is achieved by not having cache-line misses. Pre-fetchers help reduce latency and help during reading streaming and predictable data. TLB misses can also cost efficiency (size of TLB = 4K = size of memory page). In short reading memory isn’t anywhere close to reading it randomly but SEQUENTIALLY due to the way the underlying hardware works. Writing highly branched code can cause slow down in execution of the program – keep things together that is, cohesive is the key to efficiency. – BUSTED
      Note: TLAB and TLB are two different concepts, google to find out the difference!

      Myth 3 – HDD provides random access – spinning disc and an arm moves about reading data. More sectors are placed in the outer tracks than the inner tracks (zone bit recording). Spinning the discs faster isn’t the way to increase HDD performance. 4K is the minimum you can read or write at a time. Seek time in the best disc is 3-6 ms, laptop drives are slower (15 ms). Data transfers take 100-220 Mbytes/sec. Adding a cache can improve writing data into the disc, not so much for reading data from disks. – BUSTED

      Myth 4 – SSDs provides random access – Reads and writes are great, works very fast (4K at a time). Delete is not like the real deletes, it’s marked deleted and not really deleted – as you can’t erase at  high resolution hence a whole block needs to be erased at a time (hence marked deleted). All this can cause fragmentation, and GC and compaction is required. Reads are smooth, writes are hindered by fragmentation, GC, compaction, etc…, also to be ware of write-amplification. A few disadvantages when using SSD but overall quite performant. – PLAUSIBLE

      Can we understand all of this and write better code?

      Conclusion: do not take everything in the space for granted just because it’s on the tin, examine, inspect and investigate for yourself the internals where possible before considering it to be possible, or plausible – in order to write good code and take advantage of these features.

      — Great talk and good coverage of the mechanical sympathy topic with some good humour, watch the video for performance statistics gathered on each of the above hardware components  —

      “Performance Testing Java Applications” by Martin Thompson

      “How to use a profiler?” or “How to use a debugger?” 
      What is Performance? Could mean two things like throughput or bandwidth (how much can you get through) and latency (how quickly the system responds).

      Response time changes as we apply more load on the system. Before we design any system, we need to gather performance requirements i.e. what is the throughput of the system or how fast you want the system to respond (latency)? Does your system scale economically with your business?

      Developer time is expensive, hardware is cheap! For anything, you need a transaction budget (with a good break-down of the different components and processes the system is going to go through or goes through).

      How can we do performance testing? Apply load to a system and see if the throughput goes up or down? And what is the response time of the system as we apply load. Stress testing is different from load testing (see Load testing) , stress testing (see Stress testing) is a point where things break (collapse of a system), an ideal system will continue with a flat line. Also it’s important to perform load testing not just from one point but from multiple points and concurrently. Most importantly high duration testing is very important – which bring a lot of anomalies to the surface i.e. memory leaks, etc…
      Building up a test suite is a good idea, a suite made up of smaller parts. We need to know the basic building blocks of the system we use and what we can get out of it. Do we know the different threshold points of our systems and how much its components can handle? Very important to know the algorithms we use, know how to measure them and use it accordingly.
      When should we test performance? 
      “Premature optimisation is the root of all evil” – Donald Knuth / Tony Hoare

      What does optimisation mean? Knowing and choosing your data and working around it for performance. New development practices: we need to test early and often!

      From a Performance perspective, “test first” practice is very important, and then design the system gradually as changes can cost a lot in the future.

      RedGreenDebugProfileRefactor, a new way of “test first” performance methodology as opposed to RedGreenRefactor methodology only! Earlier and shorter feedback cycle is better than finding something out in the future.

      Use “like live” pairing stations, Mac is a bad example to work on if you are working in the Performance space – a Linux machine is a better option. 

      Performance tests can fail the build – and it should fail a build in your CI system! What should a micro benchmark look like (i.e. calliper)? Division operations in your code can be very expensive, instead use a mask operator!

      What about concurrency testing? Is it just about performance? Invariants? Contention? 

      What about system performance tests? Should we be able to test big and small clients with different ranges. It’s fun to know deep-dive details of the system you work on. A business problem is the core and the most important one to solve and NOT to discuss on what frameworks to use to build it. Please do not use Java serialisation as it is not designed for on the-wire-protocol! Measure performance of a system using a observer rather than measure it from inside the system only.
      Performance Testing Lessons – lots of technical stuff and lots of cultural stuff. Technical Lessons – learn how to measure, check out histograms! Do not sample a system, we miss out when things when the system go strange, outliers, etc… – histograms help! Knowing what is going on around the areas where the system takes a long time is important! Capturing time from the OS is a very important as well. 
      With time you get – accuracy, precision and resolution, and most people mix all of these up. On machines with dual sockets, time might not be synchronised. Quality of the time information is very dependent on the OS you are using. Time might be an issue on virtualised systems, or between two machines. This issue can be resolved, do round-trip times between two systems (note the start and stop clock times) and half them to get a more accurate time. Time can go backwards on you on certain OSes (possibly due to NTP) – instead use monotonic time.
      Know your system, its underlying components – get the metrics and study them ! Use a Linux tool like perstat, will give lots of performance and statistics related information on your CPU and OS – branch predictions and cache-misses!

      RDTSC is not an ordered-instructions execution system, x86 is an ordered instruction systems and operations do not occur in a unordered fashion.

      Theory of constraints! – Always start with number 1 problem on the list, the one that takes most of the time – the bottleneck of your system, the remaining sequence of issues might be dependent on the number 1 issue and are not separate issues!

      Trying to create a performance team is an anti-pattern – make the experts help bring the skills out to the rest of the team, and stretch-up their abilities!

      Beware of YAGNI – about doing performance tests – smell of excuse!
      Commit builds > 3.40 mins = worrying, same for acceptance test build > 15 mins = lowers team confidence.
      Test environment should equal to production environment! Easy to get exactly similar hardware these days!
      Conclusion: Start with a “test first” performance testing approach when writing applications that are low latency dependent. Know your targets and work towards it. Know your underlying systems all the way from hardware to the development environment. Its not just technical things that matter, cultural things matter as much, when it comes to performance testing. Share and spread the knowledge across the team rather than isolating it to one or two people i.e. so called experts in the team. Its everyone’s responsibility not just a few seniors in the team. Learn more about time across various hardware and operating systems, and between systems.

      As it is not practical to review all such videos and articles, a number of them have been provided in the links below for further study. In many cases I have paraphrased or directly quoted what the authors have to say to preserve the message and meaning they wished to convey. A follow-on to this blog post will appear in the same space under the title (Part 3 of 3): Synopsis of articles & videos on Performance tuning, JVM, GC in Java, Mechanical Sympathy, et al on 23rd Dec 2013.

      Feel free to post your comments below or tweet at @theNeomatrix369!

      Useful resources

      This post is part of the Java Advent Calendar and is licensed under the Creative Commons 3.0 Attribution license. If you like it, please spread the word by sharing, tweeting, FB, G+ and so on!

      (Part 1 of 3): Synopsis of articles & videos on Performance tuning, JVM, GC in Java, Mechanical Sympathy, et al

      I have been contemplating for a number of months about reviewing a cache of articles and videos on topics like Performance tuning, JVM, GC in Java, Mechanical Sympathy, etc… and finally took the time to do it – may be this was the point in my intellectual progress when was I required to do such a thing!
      Thanks to Attila-Mihaly for giving me the opportunity to write a post for his yearly newsletter Java Advent Calendar, hence a review on various Java related topics fits the bill! The selection of videos and articles are purely random, and based on the order in which they came to my knowledge. My hidden agenda is to mainly go through them to understand and broaden my own knowledge at the same time share any insight with others along the way.

      I’ll be covering three reviews of talks by Attila Szegedi (1 talk) and Ben Evans (2 talks). They speak on the subject of Java Performance and the GC. The first talk by Attila covers a lot of his experience as an Engineer at Twitter – so its lots of information out of live experience in the field on production systems. Making use of thin objects instead of fat ones is one of the buzzwords in his talk.

      Ben in his two talks covers Performance, JVM and GC in great depth. He points out about people’s misconception about Performance, the JVM and GC, things that people don’t have certain run-time flags enabled in production.  How the underlying machinery works, why it works the way it works?How efficient the machinery is and what best to do and not to do to get good throughput out of it?

      Here I go with my commentary, I decided to start with Attila Szegedi’s talk as I quite liked the title…..

      Everything I Ever Learned About JVM Performance Tuning @Twitter by Attila Szegedi
      (video & slides)

      Attila at the time of the talk worked for Twitter where he learnt a lot about the internals of the JVM and the Java language itself – Twitter being an organisation where tuning, optimising JVMs, low-latency are defacto practises.
      He covers interesting topics like:
      – contributors of latency
      – finished code not ready for production
      – areas of performance tuning (primarily memory tuning and lock contention tuning)
      – Memory footprint tuning (OOME, inefficient tuning, FAT data)
      – FAT data – a new terminology coined by him, and how to resolve issues created by it (pretty indepth and interesting) – learn about byte allocations to data types in the Java / JVM languages.
      Some deep dive topics like compressed object pointers, are one of the suggestions (including a pit-fall). Certain types in Scala 2.7.7 are inefficient – as revealed by a JVM profiler. Do not use Thrift – as it is not a friend of low-latency, as they are heavy – adds between 52 to 72 bytes of overhead per object, does not support 32-bit floats, etc… Be careful with thread locals – sticks around and uses more resources than expected.
      Performance triangle, Attila shares his insight into this concept. GC is the biggest threat of the JVM. Old gen uses ConcCollector, while the new gen goes through the STW process, and enlists a number of throughput and low-pause collectors.
      Improve GC by taking advantage of the Adaptive sizing policy, and give it a target to work on. Use a throughput collector with or without the adaptive policy and benchmark the results.  He takes us through the various -XX: +Print… flags and explains its uses. Keep fragmentations low and avoid full GC stops. Lots of detail on the workings of the GC and what can be done to improve GC (tuning both new and old gens).
      Latency that are not GC related – thread coordination optimization. Barriers and half-barriers can be used when using threads to improve latency – along with some tricks when using the Atomic values & AtomicReferences. Cassandra slab allocator – helps efficiency and performance – do not write your own memory manager. Attila is no longer a fan of “Soft references” – although great in theory but not in practice, more GC cycles are needed to clear them!

      Conclusion: know your code as often they may be the root to your problems – frameworks can many a times be the cause of performance issues. Lots of things can be done to squeeze performance out of the programs written, if one knows how to best use the fundamental building blocks of data structures of your development environment. Its a hard game to maintain the best throughput and get the best performance out of the JVM.

      — Recommend watching the video, lots more covered than the synopsis above  —

      9 Fallacies of Java Performance by Ben Evans (blog)

      In this article Ben goes about busting old myths and assumptions about Java, its performance, GC, etc… Areas covered being:
      1) Java is slow, 2) A single line of Java means anything in isolation, 3) A micro-benchmark means what you think it does , 4) Algorithmic slowness is the most common cause of performance problems, 5) Caching solves everything, 6)  All apps need to be concerned about Stop-The-World, 7) Hand-rolled Object Pooling is appropriate for a wide range of apps, 8) CMS is always a better choice of GC than Parallel Old, 9) Increasing the heap size will solve your memory problem
      – JIT compiled code is as fast as C++ in many cases
      – JIT compiler can optimize away dead and unused code, even on the basis of profiling data. In JVMs like JRockit, the JIT can decompose object operations.
      – For best results don’t prematurely optimize, instead correct your performance hot spots. 
      – Richard Feynman once said: “The first principle is that you must not fool yourself – and you are the easiest person to fool” – something to keep in mind when thinking of writing Java micro-benchmarks.
      The points being the ideas people have in their minds about Java is but the opposite of the reality of things. Basically suggesting the masses to revisit the ideas and make conclusions based on sheer facts and not assumptions or old beliefs.
      – GC, database access, misconfiguration, etc… are likely to cause application slowness as compared to algorithms.
      – Measure, don’t guess ! Use empirical production data to uncover the true causes of performance problems.
      – Don’t just add a cache to redirect the problem elsewhere and add complexity to the system, but collect basic usage statistics (miss rate, hit rate, etc.) to prove that the caching layer is actually adding value.
      – If the users haven’t complained or you are not in the low-latency stack – don’t worry about STOP-THE-WORLD pauses (circa 200 ms depending on the heap size).
      – Object pooling is very difficult and should only be used when GC pauses are unacceptable, and intelligent attempts at tuning and refactoring have been unable to reduce pauses to an acceptable level.
      – Check if CMS is your correct GC strategy, you should first determine that STW pauses from Parallel Old are unacceptable and can’t be tuned. Ben stresses: be sure that all metrics are obtained on a production-equivalent system.  
      – Understanding the dynamics of object allocation and lifetime before changing heap size or tuning other parameters is essential. Acting without measuring can make matters worse. The tenuring distribution information from the garbage collector is especially important here.
      Conclusion: The GC subsystem has incredible potential for tuning and for producing data to guide tuning, and then to use a tool to analyse the logs – either handwritten scripts and some graph generation, or a visual tool such as the (open-source) GCViewer or a commercial product.

      Visualizing Java GC by Ben Evans (video & slides)

      Misunderstanding or shortcomings in people’s understanding of GC. Its not just Mark & Sweep. Many run-times these days have GC! Two schools of thoughts – GC & Reference counting! Humans make mistakes as compared to machines which requires high levels of precision. True GC is incredibly efficient, reference counting is expensive – pioneered by Java (comments from +Gil Tene: On the correctness side, I’d be careful saying “pioneered by Java” for anything in GC. Java’s GC semantics are fairly classic, and present no new significant problems that predating environments did not. Most core GC techniques used in JVMs were researched and well known in other environments (smalltalk, lisp, etc.) and are also available in other Runtimes. While it is fair to say that from a practical perspective, JVMs tend to have the most mature GC mechanisms these days, that’s because Java is a natural place to apply new GC techniques that actually work. But innovation and pioneering in GC is not strongly tied to Java.)

      The allocation list is where all objects are rooted from. You can’t get an accurate picture of all the objects of a running object at any given point of time of a running live application without stopping the application that’s why we have STW (Stop-The-World)(comments from +Gil Tene: In addition, the notion that “you can’t get an accurate picture of all the objects of a running object at any given point of time of a running live application without stopping the application that’s why we have STW (Stop-The-World)!” is wrong. Concurrent marking and concurrent compaction are very real things that achieve just that without stopping the application. “Just needs some good engineering”, and “you just can’t do X” are very different things.)

      Golden rules of GC
      – must collect all the garbage (sensitive rule)
      – must never collect a live object
      (trick: but they are never created equal)

      Hotspot is C/C++/Assembly application. Heap is a contiguous block of memory with different memory pools – Young Gen, Old Gen, and PermGen pools. Objects are created by application (mutator) threads and removed by GC. Applications are not slow due to GC all the time.
      PermG – not desirable, going away in Java 8 (known issue: causes OOME exceptions), to be replaced by Metaspace outside the heap (native memory).
      GC is based on ‘Weak generational hypothesis’ – objects die young, or die old – found out through empirical research. (comments from +Michael Barker: I think this statement:
      “GC is based on ‘Weak generational hypothesis’ – objects die young, or die old – found out through empirical research.”
      Is not correct.  I think I can guess at what you mean, but you may want to consider rewording it so that it is not misleading.  There are GC implementations in real world VMs that are not generational collectors.

      comments from +Kirk Pepperdine: Indeed. the ParcPlace VM had 7 different memory spaces that have a strong resemblance to the todays generational spaces. There is Eden with two hemi spaces plus 4 other spaces for different types of long lived data.)
      Re-worded version: GC in the JVM is based on ‘Weak generational hypothesis’ – objects die young, or die old – found out through empirical research.

      Tenuring threshold is the number of GC you survive before your get moved to the Old Gen (Tenuring space). JavaFX is bundled with jdk7u6 and up.
      Source code of JavaFX Memory Visualizer written in Java replacing the Flash version  – https://github.com/kittylyst/jfx-mem – written using FlexML (FXML). An extensive explanation of how the program is written in FlexML, a nice programming language – uses the builder pattern in combination with DSL like expressions. The program models the way GC works and how objects are created, destroyed and moved about the different pools. 
      List of mandatory flags, which do not have any performance impact
      -verbose: gc
      All the information needed about an executing application and GC are recorded by the above. Also covers basic heap sizing flags. Setting the heap flags to equal do not apply anymore since recent versions of the JDK. Also there’s more than 200 flags to the GC and VM not including all the undocumented ones.
      GC log files are useful for post-processing, but sometimes are not recorded correctly. MXBeans impact the running application but also do not give more information than the log files.
      GC log files have a general format giving information on change of allocation, occupancy, tenuring info, collection info, etc…,  – explosion of GC log file formats and not much tooling out there. Many of the free tools cover some sort of dashboard like output showing various GC related metrics, the commercial versions have a better approach and useful information in general.

      Premature promotion – under pressure of creation of new objects, objects are moved directly from YG to OG without going through the Survivor spaces.

      Use tools, measure and don’t guess!

      Conclusion: know the facts and find out details if they are not known but do not guess or assume. False conceptions have lead to assumptions and incorrect understanding of the JVM and the GC process at times. Don’t just changes flags or use tools, know why to and what they do. For e.g. switching on GC logging (with appropriate flags enabled) does not have a visible impact on the performance of the JVM but is a boon in the medium to long run.

      — Highly recommend watching the video, lots more covered than the synopsis above, Ben has explained GC in the simplest form one could, covering many important details  —

      As it is not practical to review all such videos and articles, a number of them have been provided in the links below for further study. In many cases I have paraphrased or directly quoted what the authors have to say to preserve the message and meaning they wished to convey. A follow-on to this blog post will appear in the same space under the titled (Part 2 of 3): Synopsis of articles & videos on Performance tuning, JVM, GC in Java, Mechanical Sympathy, et al on 19th Dec 2013.


      Thanks to +Gil Tene, +Michael Barker, @Ryan Rawson, +Kirk Pepperdine, and +Richard Warburton for read the post and providing using feedback.

      Feel free to post your comments below or tweet at @theNeomatrix369!

      Useful resources

        This post is part of the Java Advent Calendar and is licensed under the Creative Commons 3.0 Attribution license. If you like it, please spread the word by sharing, tweeting, FB, G+ and so on!