(Part 3 of 3): Synopsis of articles & videos on Performance tuning, JVM, GC in Java, Mechanical Sympathy, et al

This is a continuation of the previous post titled (Part 2 of 3): Synopsis of articles & videos on Performance tuning, JVM, GC in Java, Mechanical Sympathy, et al.

In our first review, The Atlassian guide to GC tuning is an extensive post covering the methodology and things to keep in mind when tuning GC; practical examples are given and references to important resources are made along the way. In the next one, How NOT to measure latency, Gil Tene discusses common pitfalls encountered in measuring and characterising latency, demonstrates false assumptions and measurement techniques that lead to dramatically incorrect results, and covers simple ways to sanity-check and correct these situations. Finally, Kirk Pepperdine, in his post Poorly chosen Java HotSpot Garbage Collection Flags and how to fix them!, throws light on JVM flags – he starts with some 700 flags and boils them down to merely 7. He also cautions you not to draw conclusions or act on a whim, but to consult and examine – i.e. measure, don't guess!


Background

By default, JVM tunings attempt to provide acceptable performance in the majority of cases, but depending on application behaviour this may not always give the desired results. Benchmark your application before applying any GC or JVM related settings. There are a number of environment, OS and hardware related factors to take into consideration; tuning is often linked to these factors, and if they change, the tuning may need to be revisited. Accept the fact that any tuning has its upper limit in a given environment – if your goals are still not met, then you need to improve the environment itself. Always monitor your application to check whether your tuning goals are still being met.

Choosing Performance Goals – GC tuning for the Oracle JVM is driven by three goals, i.e. latency (mean & maximum), throughput, and footprint. To reach the tuning goals, there are principles that guide you: the minor GC reclamation principle (most garbage should be reclaimed in minor GCs), the GC maximise memory principle (give the GC as much memory as you can), and the two-of-three principle (pick 2 of these 3 goals, and sacrifice the other).

Preparing your environment for GC – 

Some of the items of interest are: load the application with work (apply load to the application as it would be in production), turn on GC logging, set data sampling windows, determine the memory footprint, apply rules of thumb for generation sizes, and determine systemic requirements.

Iteratively change the JVM parameters, keeping the environment and its settings intact. An example of turning on GC logging is:

java -XX:+PrintGC -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Xloggc:<file> …
or
java -XX:+PrintGC -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:<file> …

Sampling windows are important to help diagnose an application's steady-state runtime specifically, rather than its load time or the time the JVM takes to stabilise before it can execute anything. Set the heap sizes either by guessing (give it the maximum and then pull back) or by using the results accumulated in the GC logs. Also check for OOMEs, which can be an indicator that the heap size or other memory pools need increasing. There are a number of rules of thumb for generation sizes (many of them depend on the sizes of other pools).

Understanding the Throughput collector and Behaviour based Tuning
Hotspot uses behaviour-based tuning with its parallel collectors (including the parallel mark-and-sweep collector); they are designed with three goals in mind, i.e. maximum pause time, throughput and footprint. The -XX:+PrintAdaptiveSizePolicy JVM flag prints details of this behaviour for analysis. Even though there are options to set the maximum pause time for any GC event, there is no guarantee it will be honoured. The collector also works on only 2 of the 3 goals at a time; there is interaction between the goals, and sometimes oscillation between them.
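As an illustrative sketch (the targets below are placeholders, not recommendations – derive real values from your own measurements), the throughput collector can be given pause-time and throughput goals and asked to report its adaptive decisions:

java -XX:+UseParallelGC -XX:MaxGCPauseMillis=200 -XX:GCTimeRatio=99 -XX:+PrintAdaptiveSizePolicy …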

Time to tune
There is a cycle to follow when doing this, i.e. 
  1. Determine desired behaviour.
  2. Measure behaviour before change.
  3. Determine indicated change.
  4. Make change.
  5. Measure behaviour after change.
  6. If behaviour not met, try again.
Repeat the above till the results are met (or nearly there).

Tools
Java VisualVM with the Visual GC plugin for live telemetry (note there's an observer effect when using VisualVM, which can have an impact on your readings), and GCViewer by Tagtraum Industries, useful for post-hoc review of GC performance based on the GC log.

Conclusions: a lot of factors need to be taken into consideration when tuning GC. A meticulous plan is needed, keeping an eye on your goal throughout the process; once it is achieved, continue to monitor your application to see if your performance goals are still being met. Adjust your requirements after studying the results of your actions, such that the two eventually meet.

— The authors cover the methodology in good detail and give practical advice and steps to follow in order to tune your application – the hands-on and practical aspects of the blog should definitely be read. —

How NOT to measure latency by Gil Tene

The author starts by saying he will cover “some of the mistakes others, including the author, have made” – plenty of wrong ways to do things and what to avoid doing.

WHY and HOW you measure latency – HOW only makes sense if you understand the WHY. There are public domain code/tools available that can be used to measure latency.

Don’t use statistics to measure latency!

The classic way of looking at latency: response time (latency) is a function of load! It's important to know what this response time represents – is it the average, worst case, 99th percentile, etc. – and across which types of behaviour?

Hiccups – some sort of accumulated work that gets performed in bursts (spikes). It is important to look for these behaviours when examining response time and average response time. They are not evenly spread around; they may look like periodic freezes, or shifts from one mode/behaviour to another (with no obvious sequence).

Common fallacies:
– computers run application code continuously
– response time can be measured as work units over time
– response time exhibits a normal distribution
– “glitches” or “semi-random omissions” in measurement don’t have a big effect

Hiccups get hidden, knowingly or unknowingly – averages and standard deviations are very good at hiding them!
Many compute the 99th percentile or another figure from averages and assumed distributions. A better way is to measure it from the data – e.g. plot a histogram. You need to have an SLA for the entire spread and plot the distribution across the 0 to 100th percentile.

Latency tells us how long something should take! True real-time is being slow and steady!
Author gives examples of types of applications with different latency needs i.e. The Olympics, Pacemaker, Low latency trading, Interactive applications.

A very important way of establishing requirements: an extensive interview process to help the client come up with realistic figures for the SLA requirements. We need to find out the worst the client is willing to accept, and for how long or how often that is acceptable, to help construct a percentile table.

The question is how fast the system can run, smoothly, without hitting obstacles or crashes! The author further compares the performance of the HotSpot JVM with C4.

The coordinated omission problem – what is it? Data that is not omitted randomly, but in a coordinated way that skews the actual results. Delays in request times do not get measured because of the way response time is recorded from within the system – the measurement omits data in a coordinated, not random, fashion. The maximum value (time) in the data is usually the key value to start with: use it to compute and compare the other percentiles, and check whether they match those computed from data with coordinated omission in it. Percentiles are useful – compute them.

Graphs with sudden bumps alongside otherwise smooth lines tend to indicate coordinated omission in the data.

HdrHistogram helps plot histograms – a configurable-precision system where you tell it the range and precision to cover. It has built-in compensation for coordinated omission – it's open source and available on GitHub under the CC0 license.
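A minimal sketch of recording latencies with HdrHistogram from Java (the request rate and latency values below are assumptions for illustration):

import org.HdrHistogram.Histogram;

public class LatencyRecording {
    public static void main(String[] args) {
        // Track values from 1 ns up to 1 hour, at 3 significant decimal digits
        Histogram histogram = new Histogram(3600L * 1000 * 1000 * 1000, 3);

        // Assumed steady request rate of one request per millisecond; the
        // "expected interval" lets HdrHistogram back-fill the samples a stalled
        // measuring loop would have missed (coordinated omission compensation)
        long expectedIntervalNanos = 1_000_000L;
        long measuredLatencyNanos = 5_000_000L; // e.g. one observed 5 ms response

        histogram.recordValueWithExpectedInterval(measuredLatencyNanos, expectedIntervalNanos);

        System.out.println("99th percentile (ns): " + histogram.getValueAtPercentile(99.0));
        histogram.outputPercentileDistribution(System.out, 1000.0); // values scaled to µs
    }
}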

jHiccup also plots histograms to show how your computer system has been behaving (run time and freeze time), and records various percentiles, maximum and minimum values, distribution, good modes, bad modes, etc. Also open source, under the CC0 license.

Measuring throughput without a latency requirement yields meaningless information. Mistakes in measurement/analysis can lead to orders-of-magnitude errors and bad business decisions.


— The author shows some cool graphs and illustrations of how we measure things wrong and what to look out for when assessing data or graphs, and gives a lot of tips on how NOT to measure latency. —

Poorly chosen Java HotSpot Garbage Collection Flags and how to fix them! by Kirk Pepperdine


There are more than 700 product flags defined in the HotSpot JVM, and not many people have a good idea of what effect any one of them might have at runtime, let alone a combination of two or more together.

Whenever you think you know a flag, it often pays if you check out the source code underlying that flag to get a more accurate idea.

Identifying redundant flags
The author walks through a huge list of flags for us to determine whether we need them or not. To see the list of flags available with your version of Java, try this command, which prints a long list of options:

$ java -XX:+PrintFlagsFinal -version

You can narrow down the list by eliminating deprecated flags and ones with a default value, leaving us with a much shorter list.
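A useful trick when narrowing down: HotSpot marks flags whose values differ from the default with ':=' in the PrintFlagsFinal output, so you can list just the flags that have been changed:

$ java -XX:+PrintFlagsFinal -version | grep ':='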

DisableExplicitGC, ExplicitGCInvokesConcurrent, and RMI Properties
Calling System.gc() isn't a great idea, and one way to disable that functionality in code that uses it is to pass the DisableExplicitGC flag in your VM options. Alternatively, use the ExplicitGCInvokesConcurrent flag so that System.gc() invokes a Concurrent Mark and Sweep cycle instead of a Full GC cycle. You can also set the interval between the complete GC cycles that RMI triggers using the sun.rmi.dgc.client.gcInterval and sun.rmi.dgc.server.gcInterval properties.
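For illustration, a sketch combining these options (the one-hour RMI interval is an assumed value, not a recommendation):

java -XX:+ExplicitGCInvokesConcurrent -Dsun.rmi.dgc.client.gcInterval=3600000 -Dsun.rmi.dgc.server.gcInterval=3600000 …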

Xmx, Xms, NewRatio, SurvivorRatio and TargetSurvivorRatio
The -Xmx flag is used to set the maximum heap size, and all JVMs are meant to support this flag. Setting the minimum (-Xms) and maximum heap sizes to the same value prevents the JVM from resizing the heap at runtime, which disables adaptive sizing.

Use GC logs to help make memory-pool size decisions. There is a ratio to maintain between Young Gen and Old Gen, and within Young Gen between Eden and the Survivor spaces. The SurvivorRatio flag is one such flag: it decides how much space is given to the survivor spaces in Young Gen, with the rest left for Eden. These ratios are key in determining the frequency and efficiency of the collections in Young Gen. TargetSurvivorRatio is another ratio to examine and cross-check before using.
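A hedged example of pinning the heap and shaping the generations (all values are placeholders to validate against your GC logs, not recommendations):

java -Xms2g -Xmx2g -XX:NewRatio=2 -XX:SurvivorRatio=6 -XX:TargetSurvivorRatio=50 …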

PermSize, MaxPermSize

These options were removed starting with JDK 8 in favour of Metaspace, but in previous versions they controlled PermGen's sizing. Restricting PermGen's adaptive resizing might lead to longer pauses of old generational spaces.

UseConcMarkSweepGC, CMSParallelRemarkEnabled, UseCMSInitiatingOccupancyOnly CMSInitiatingOccupancyFraction

UseConcMarkSweepGC makes the JVM select the CMS collector; CMS works with ParNew instead of the PSYoungGen and PSOldGen collectors for the respective generational spaces.

CMSInitiatingOccupancyFraction (IOF) is a flag that tells the JVM at what Old Gen occupancy to start a CMS cycle; the JVM otherwise decides this by also examining the current rate of promotion. Getting it wrong can lead to STW Full GCs or CMF (concurrent mode failure).

UseCMSInitiatingOccupancyOnly tells the JVM to use the IOF without taking the rate of promotion into account.

CMSParallelRemarkEnabled parallelises the fourth (remark) phase of CMS (single-threaded by default), but use this option only after benchmarking on your current production systems.
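Putting the CMS-related flags together, a sketch (the 70% occupancy figure is an assumption to benchmark, not a recommendation):

java -XX:+UseConcMarkSweepGC -XX:+UseCMSInitiatingOccupancyOnly -XX:CMSInitiatingOccupancyFraction=70 -XX:+CMSParallelRemarkEnabled …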

CMSClassUnloadingEnabled

CMSClassUnloadingEnabled ensures that classes loaded in the PermGen space are regularly unloaded, reducing out-of-PermGen-space situations. It can, however, lead to longer concurrent and remark (pause) collection times.

AggressiveOpts

AggressiveOpts, as the name suggests, aims to improve performance by toggling a set of experimental flags; the exact set depends on the build of Java, so one must examine benchmarks of one's own system before using this flag in production.

Next Steps
Of the number of flags examined, the ones below are the more interesting to look at:
-ea 
-mx4G
-XX:MaxPermSize=300M
-XX:+UseConcMarkSweepGC
-XX:+CMSClassUnloadingEnabled
-XX:SurvivorRatio=4
-XX:+DisableExplicitGC
Again, examine the GC logs to see whether any of the above should be used, and if so, what values are appropriate. Quite often the JVM gets it right, right out of the box, and getting it wrong can be detrimental to your application's performance. It's easier to mess up than to get it right when you have so many flags with dependent and overlapping functionality.




Conclusion: refer to benchmarks and GC logs when deciding whether to enable or disable flags and what values to set. There are lots of flags and it is easy to get them wrong; only use them if they are absolutely needed – remember the HotSpot JVM has built-in machinery to adapt (the adaptive sizing policy) – and consult an expert if you still want to go ahead.



— The author has done justice by giving us a gist of what the large number of JVM flags do; we recommend reading the article to learn about the specifics of certain flags. —

As it is not practical to review all such videos and articles, a number of them have been provided in the links below for further study. In many cases I have paraphrased or directly quoted what the authors have to say to preserve the message and meaning they wished to convey. 

A few other authors have written articles related to these subjects, for the Java Advent Calendar, see below,

Using Intel Performance Counters To Tune Garbage Collection
How NOT to measure latency


Feel free to post your comments below or tweet at @theNeomatrix369!

    Useful resources

      This post is part of the Java Advent Calendar and is licensed under the Creative Commons 3.0 Attribution license. If you like it, please spread the word by sharing, tweeting, FB, G+ and so on!

      (Part 2 of 3): Synopsis of articles & videos on Performance tuning, JVM, GC in Java, Mechanical Sympathy, et al

      This is a continuation of the previous post titled (Part 1 of 3): Synopsis of articles & videos on Performance tuning, JVM, GC in Java, Mechanical Sympathy, et al.

      Without any further ado, let's get started with our next set of blogs and videos, chop…chop…! This time it's Martin Thompson's blog posts and talks. Martin's first post, Java Garbage Collection Distilled, basically distils the GC process and its underlying components, including throwing light on a number of interesting GC flags (-XX:…). In his next talk he does his myth-busting shebang about mechanical sympathy: what people correctly believe in, and some of the misconceptions they have bought into. In the talk on performance testing, Martin takes it further and fuses Java, OS and hardware to show how understanding aspects of all three can help write better programs.


      Java Garbage Collection Distilled by Martin Thompson


      There are too many flags to allow tuning the GC to achieve the throughput and latency your application requires. There’s plenty of documentation on the specifics of the bells and whistles around them but none to guide you through them.

      – The Tradeoffs – throughput (-XX:GCTimeRatio=99), latency (-XX:MaxGCPauseMillis=<n>) and memory (-Xmx<n>) are the key variables the collectors depend upon. It is important to note that Hotspot often cannot achieve the given targets. If a low-latency application goes unresponsive for more than a few seconds it can spell disaster. The tradeoffs play out as follows:
      * provide more memory to GC algorithms
      * GC can be reduced by containing the live set and keeping heap size small
      * frequency of pauses can be reduced by managing heap and generation sizes & controlling application’s object allocation rate
      * frequency of large pauses can be reduced by running GC concurrently

      – Object Lifetimes
      GC algorithms are often optimised with the expectation that most objects live for a very short period of time, while relatively few live for very long. Experimentation has shown that generational garbage collectors support much better throughput than non-generational ones – hence used in server JVMs.

      – Stop-The-World Events
      For GC to occur it is necessary that all the threads of a running application pause – garbage collectors achieve this by signalling the threads to stop when they reach a safe-point. Time to safe-point is an important consideration in low-latency applications and can be found using the -XX:+PrintGCApplicationStoppedTime flag in addition to the other GC flags. When a STW event occurs, the system undergoes significant scheduling pressure as the threads resume from their safe-points, so fewer STW events make an application more efficient.

      – Heap Organisation in Hotspot
      The Java heap is divided into various regions; an object is created in Eden, moved into the survivor spaces, and eventually into tenured. PermGen was used to store runtime structures such as classes and static strings. Collectors make use of virtual spaces to meet throughput & latency targets, adjusting the region sizes to reach those targets.

      – Object Allocation
      TLABs (Thread Local Allocation Buffers) are used to allocate objects in Java, which is cheaper than using malloc (about 10 instructions on most platforms). The rate of minor collections is directly proportional to the rate of object allocation. Large objects (-XX:PretenureSizeThreshold=<n>) may have to be allocated straight into Old Gen, but if the threshold is set below the TLAB size then they will not be created in Old Gen – (note) this does not apply to the G1 collector.

      – Minor Collections
      Minor collection occurs when Eden becomes full; objects are promoted from the young generation to the tenured space once they get old enough, i.e. cross the threshold (-XX:MaxTenuringThreshold). In a minor collection, live reachable objects with known GC roots are copied to the survivor space. Hotspot maintains cross-generational references using a card table, hence the size of the old generation is also a factor in the cost of minor collections. Collection efficiency can be achieved by adjusting the size of Eden to the number of objects to be promoted. Minor collections are STW events, which is increasingly problematic as heaps grow.

      – Major Collections
      Major collections collect the old generation so that objects from the young gen can be promoted. Collectors track a fill threshold for the old generation and begin collection when the threshold is passed. To avoid promotion failure you will need to tune the padding that the old generation allows to accommodate promotions (-XX:PromotedPadding=<n>). Heap resizing can be avoided using the -Xms and -Xmx flags. Compaction of old gen causes one of the largest STW pauses an application can experience, directly proportional to the number of live objects in old gen. The tenured space can be filled more slowly by adjusting the survivor space sizes and the tenuring threshold, but this in turn can cause longer minor collection pause times as a result of increased copying costs between the survivor spaces.

      – Serial Collector
      It is the simplest collector with the smallest footprint (-XX:+UseSerialGC) and uses a single thread for both minor and major collections.

      – Parallel Collector
      Comes in two forms: (-XX:+UseParallelGC) uses multiple threads for minor collections and a single thread for major collections, while (-XX:+UseParallelOldGC) uses multiple threads for both types of collection, and is used since Java 7u4. Parallel Old performs very well on a multi-processor system and is suitable for batch applications. This collector can be helped by providing more memory: larger but fewer collection pauses. Weigh your bets between Parallel Old and a concurrent collector depending on how much pause your application can withstand (expect 1 to 5 seconds of pause per GB of live data on modern hardware while old gen is compacted).

      – Concurrent Mark Sweep (CMS) Collector
      CMS (-XX:+UseConcMarkSweepGC) collector runs in the Old generation collecting tenured objects that are no longer reachable during a major collection. CMS is not a compacting collector causing fragmentation in Old gen over time. Promotion failure will trigger FullGC when a large object cannot fit in Old gen. CMS runs alongside your application taking CPU time. CMS can suffer “concurrent mode failures” when it fails to collect at a sufficient rate to keep up with promotion.

      – Garbage First (G1) Collector
      G1 (-XX:+UseG1GC) is a new collector introduced in Java 6 and officially supported from Java 7. It is a generational collector with a partially concurrent collecting algorithm that compacts the Old gen with smaller incremental STW pauses. It divides the heap into fixed-size regions of variable purpose. G1 is target-driven on latency (-XX:MaxGCPauseMillis=<n>, default value = 200ms). Collection of the humongous regions can be very costly. It uses “Remembered Sets” to keep track of references to objects from other regions; there is a lot of bookkeeping cost involved in maintaining them. Similar to CMS, G1 can suffer from an evacuation failure (to-space overflow).

      – Alternative Concurrent Collectors
      Oracle JRockit Real Time, IBM Websphere Real Time, and Azul Zing are alternative concurrent collectors.  Zing according to the author is the only Java collector that strikes a balance between collection, compaction, and maintains a high-throughput rate for all generations. Zing is concurrent for all phases including during minor collections, irrespective of heap size. For all the concurrent collectors targeting latency you have to give up throughput and gain footprint. Budget for heap size at least 2 to 3 times the live set for efficient operation.

      – Garbage Collection Monitoring & Tuning
      Important flags to always have enabled to collect optimum GC details:

      -verbose:gc
      -Xloggc:<filename>
      -XX:+PrintGCDetails
      -XX:+PrintGCDateStamps
      -XX:+PrintTenuringDistribution
      -XX:+PrintGCApplicationConcurrentTime
      -XX:+PrintGCApplicationStoppedTime
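      Put together on a single (sketch) command line, with myapp.jar standing in for your application:

      java -verbose:gc -Xloggc:gc.log -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintTenuringDistribution -XX:+PrintGCApplicationConcurrentTime -XX:+PrintGCApplicationStoppedTime -jar myapp.jar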


      
      
      Use applications like GCViewer and JVisualVM (with the Visual GC plugin) to study the behaviour of your application as a result of GC actions. Run representative load tests that can be executed repeatedly (as you gain knowledge of the various collectors), and keep experimenting with different configurations until you reach your throughput and latency targets. jHiccup helps track pauses within the JVM. As we know, it is a difficult challenge to strike a balance between latency requirements and high object allocation and promotion rates combined; sometimes choosing a commercial solution to achieve this might be the more sensible idea.

      Conclusion: GC is a big subject by itself with a number of components, some of which are constantly being replaced, and it's important to know what each one stands for. The GC flags are as important as the components they relate to, and it's important to know them and how to use them. Enabling the standard GC flags to record GC logs does not have any significant impact on the performance of the JVM. Using third-party freeware or commercial tools helps, as long as you follow the author's methodology.

      — Highly recommend reading the article multiple times, as Martin has covered lots of details about GC and the collectors which require close inspection and good understanding. —



      Martin Thompson’s:  Mythbusting modern hardware to gain “Mechanical Sympathy” Video * Slides

      He classifies myths into three categories – possible, plausible and busted! In order to get the best out of the hardware you own, you need TO KNOW your hardware. Make tradeoffs as you go along, changing knobs ;) – it's not as scary as you may think.

      Good questions: do normal developers understand the hardware they program on? Can we understand what's going on? And do we have the discipline to make the effort to understand the platform we work with?

      Myth 1 – CPUs are not getting faster – clock speed isn't everything; the Sandy Bridge architecture is the faster breed, with 6 ports to support parallelism (6 ops/cycle), and Haswell has 8 ports! Code doing division performs slower than any other arithmetic operation. CPUs have both front-end and back-end cycles. CPUs are getting faster because we are feeding them faster! – PLAUSIBLE

      Myth 2 – Memory provides us random access – CPU registers and buffers, internal caches (L1, L2, L3) and main memory – listed in decreasing order of access speed. Manufacturers have been doing things to CPUs to bring down their operational temperature by performing direct access operations. Writes are less hassle than reads – buffer misses are costly. L1 is organised into cache-lines holding what the processor will execute – efficiency is achieved by avoiding cache-line misses. Pre-fetchers help hide latency when reading streaming, predictable data. TLB misses can also cost efficiency (one TLB entry covers a 4K memory page). In short, memory access is nowhere near random: due to the way the underlying hardware works, SEQUENTIAL access is dramatically faster. Writing highly branched code can slow down execution of the program – keep related things together; cohesion is the key to efficiency. – BUSTED
      Note: TLAB and TLB are two different concepts, google to find out the difference!
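      To make the myth concrete, here is a minimal, non-rigorous sketch (no warm-up, single run, assumed array size) that typically shows sequential traversal beating the same work done in a shuffled order by a large factor:

      import java.util.Random;

      public class MemoryAccessDemo {
          static final int N = 1 << 24; // 16M ints (64 MB), assumed to exceed the L3 cache

          public static void main(String[] args) {
              int[] data = new int[N];
              int[] sequential = new int[N];
              int[] shuffled = new int[N];
              for (int i = 0; i < N; i++) sequential[i] = i;
              System.arraycopy(sequential, 0, shuffled, 0, N);
              Random random = new Random(42);
              for (int i = N - 1; i > 0; i--) { // Fisher-Yates shuffle of the visit order
                  int j = random.nextInt(i + 1);
                  int tmp = shuffled[i]; shuffled[i] = shuffled[j]; shuffled[j] = tmp;
              }
              System.out.println("sequential: " + timeSum(data, sequential) + " ms");
              System.out.println("shuffled:   " + timeSum(data, shuffled) + " ms");
          }

          static long timeSum(int[] data, int[] visitOrder) {
              long start = System.nanoTime();
              long sum = 0;
              for (int index : visitOrder) sum += data[index];
              if (sum == 42) System.out.print(""); // keep the JIT from eliminating the loop
              return (System.nanoTime() - start) / 1_000_000;
          }
      }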

      Myth 3 – HDD provides random access – a spinning disc and an arm that moves about reading data. More sectors are placed on the outer tracks than the inner tracks (zone bit recording). Spinning the discs faster isn't the way to increase HDD performance. 4K is the minimum you can read or write at a time. Seek time on the best discs is 3-6 ms; laptop drives are slower (15 ms). Data transfers run at 100-220 MB/sec. Adding a cache can improve writing data to the disc, but not so much reading data from it. – BUSTED

      Myth 4 – SSDs provide random access – reads and writes are great and work very fast (4K at a time). Deletes are not real deletes: a page is marked deleted rather than erased, since you can't erase at high resolution – a whole block needs to be erased at a time. All this can cause fragmentation, requiring GC and compaction. Reads are smooth; writes are hindered by fragmentation, GC, compaction, etc., and beware of write amplification. A few disadvantages when using SSDs, but overall quite performant. – PLAUSIBLE


      Can we understand all of this and write better code?

      Conclusion: do not take anything in this space for granted just because it's on the tin – examine, inspect and investigate the internals for yourself where possible before considering a claim possible or plausible, in order to write good code and take advantage of these features.

      — Great talk and good coverage of the mechanical sympathy topic with some good humour, watch the video for performance statistics gathered on each of the above hardware components  —


      “Performance Testing Java Applications” by Martin Thompson


      People tend to ask “How to use a profiler?” or “How to use a debugger?” – but first: what is Performance? It could mean one of two things: throughput or bandwidth (how much you can get through) and latency (how quickly the system responds).

      Response time changes as we apply more load on the system. Before we design any system, we need to gather performance requirements i.e. what is the throughput of the system or how fast you want the system to respond (latency)? Does your system scale economically with your business?

      Developer time is expensive, hardware is cheap! For anything, you need a transaction budget (with a good break-down of the different components and processes the system is going to go through or goes through).

      How can we do performance testing? Apply load to a system and see whether the throughput goes up or down, and what the response time of the system is as load is applied. Stress testing is different from load testing (see Load testing): stress testing (see Stress testing) pushes to the point where things break (collapse of the system); an ideal system would continue with a flat line. It is also important to apply load not just from one point but from multiple points concurrently. Most importantly, long-duration testing matters a lot – it brings anomalies such as memory leaks to the surface.
      Building up a test suite is a good idea – a suite made up of smaller parts. We need to know the basic building blocks of the systems we use and what we can get out of them. Do we know the different threshold points of our systems and how much their components can handle? It is very important to know the algorithms we use, to know how to measure them, and to use them accordingly.
      When should we test performance? 
      “Premature optimisation is the root of all evil” – Donald Knuth / Tony Hoare

      What does optimisation mean? Knowing and choosing your data and working around it for performance. New development practices: we need to test early and often!

      From a Performance perspective, “test first” practice is very important, and then design the system gradually as changes can cost a lot in the future.

      RedGreenDebugProfileRefactor, a new way of “test first” performance methodology as opposed to RedGreenRefactor methodology only! Earlier and shorter feedback cycle is better than finding something out in the future.

      Use “like live” pairing stations, Mac is a bad example to work on if you are working in the Performance space – a Linux machine is a better option. 

      Performance tests can fail the build – and they should fail the build in your CI system! What should a micro-benchmark look like (e.g. Caliper)? Division operations in your code can be very expensive; where sizes are powers of two, use a mask operation instead, as sketched below!
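      A tiny sketch of the mask trick (assuming, as ring-buffer designs do, that the size is a power of two):

      public class RingIndex {
          private static final int SIZE = 1024;      // must be a power of two
          private static final int MASK = SIZE - 1;

          // Equivalent to (sequence % SIZE) for non-negative sequences, but a
          // bitwise AND costs far less than a division instruction.
          static int indexOf(long sequence) {
              return (int) (sequence & MASK);
          }
      }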

      What about concurrency testing? Is it just about performance? Invariants? Contention? 

      What about system performance tests? We should be able to test big and small clients across different ranges. It's fun to know the deep-dive details of the system you work on. The business problem is the core and most important thing to solve – NOT the discussion of which frameworks to use to build it. Please do not use Java serialisation, as it was not designed as an on-the-wire protocol! Measure the performance of a system from an outside observer, not only from inside the system.
      Performance Testing Lessons – lots of technical stuff and lots of cultural stuff. Technical lessons: learn how to measure – check out histograms! Do not just sample a system; sampling misses what happens when the system goes strange – outliers, etc. Histograms help! Knowing what is going on in the regions where the system takes a long time is important. Capturing time correctly from the OS is very important as well.
      With time you get accuracy, precision and resolution, and most people mix these up. On machines with dual sockets, time might not be synchronised between them. The quality of the time information is very dependent on the OS you are using. Time might be an issue on virtualised systems, or between two machines. The latter issue can be worked around: record round-trip times between the two systems (noting the start and stop clock times) and halve them to get a more accurate one-way time. Time can go backwards on you on certain OSes (possibly due to NTP) – use monotonic time instead.
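      A small sketch of why the monotonic clock matters for interval measurement (the workload is a placeholder):

      public class MonotonicTiming {
          public static void main(String[] args) {
              long start = System.nanoTime(); // monotonic: never steps backwards
              long sum = 0;
              for (int i = 0; i < 1_000_000; i++) sum += i; // placeholder workload
              long elapsedMicros = (System.nanoTime() - start) / 1_000;
              // System.currentTimeMillis() follows the wall clock, which NTP may
              // step backwards, so deltas taken from it can be negative or skewed.
              System.out.println("elapsed: " + elapsedMicros + " µs (sum=" + sum + ")");
          }
      }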
      Know your system and its underlying components – get the metrics and study them! Use a Linux tool like perf stat, which will give you lots of performance statistics on your CPU and OS – branch predictions and cache misses!

      RDTSC is not a serialising instruction – x86 executes instructions out of order, so RDTSC itself can be reordered relative to the operations you are trying to time.

      Theory of constraints! Always start with the number 1 problem on the list – the one that takes the most time, the bottleneck of your system; the remaining issues might be dependent on the number 1 issue rather than being separate issues!

      Trying to create a separate performance team is an anti-pattern – instead have the experts bring those skills out to the rest of the team, and stretch everyone's abilities!

      Beware of using YAGNI as an excuse to skip performance tests!
      Commit builds > 3.40 mins are worrying; the same goes for acceptance test builds > 15 mins – they lower team confidence.
      The test environment should equal the production environment – it is easy to get exactly the same hardware these days!
      Conclusion: start with a “test first” performance testing approach when writing low-latency applications. Know your targets and work towards them. Know your underlying systems all the way from the hardware to the development environment. It's not just technical things that matter; cultural things matter as much when it comes to performance testing. Share and spread the knowledge across the team rather than isolating it in one or two so-called experts – it's everyone's responsibility, not just a few seniors'. Learn more about time across various hardware and operating systems, and between systems.

      As it is not practical to review all such videos and articles, a number of them have been provided in the links below for further study. In many cases I have paraphrased or directly quoted what the authors have to say to preserve the message and meaning they wished to convey. A follow-on to this blog post will appear in the same space under the title (Part 3 of 3): Synopsis of articles & videos on Performance tuning, JVM, GC in Java, Mechanical Sympathy, et al on 23rd Dec 2013.


      Feel free to post your comments below or tweet at @theNeomatrix369!

      Useful resources

      This post is part of the Java Advent Calendar and is licensed under the Creative Commons 3.0 Attribution license. If you like it, please spread the word by sharing, tweeting, FB, G+ and so on!

      Performance tuning – measure don’t guess

      In my performance tuning career I have given the advice to measure and not guess more often than I can recall. And in many cases the target of this advice has given up after looking at the monolithic 500,000 LOC legacy application they are working on.

      In this post we share some simple tools and concepts for setting up the initial measurement, so you have a baseline from which to start more fine-grained performance tuning.

      Setting the goal

      Pretty much any non-trivial application can be optimized forever – in both bad and good ways. Bad examples include tweaking random parts of the application and then praying for the best. Whatever the reason, I have seen tens and tens of developers “being pretty sure that exactly this line of code needs attention” without basing their gut feeling on any measurements.

      The second category is almost as dangerous, as you can spend hundreds of man-years tuning your application “to be ready for prime time”. The definition of prime time might vary, but are you really sure you are going to accommodate terabytes of media, tens of millions of records in your database, and hundreds of thousands of concurrent users with less than 100ms latency? Unless you are aiming to put Google out of business, you most likely will not. Worse yet, spending so much time preparing for prime time is a darn good way to ensure it never arrives. Instead of squeezing out the extra 1ms in latency nobody is likely to notice, maybe that time should have been spent ironing out the layout bug annoying your end users on Safari instead?

      So how do you set a meaningful target? For this you need to understand the business you are in. Real-time strategy games and first-person shooters tend to have different requirements than online shopping carts. As, last time I checked, Java was not going strong in the gaming industry, let's assume you are dealing with a typical Java EE application with a web-based front-end.

      In this case, you should now start segmenting your application's functionality into different categories, which, based on the research, are similar to the following:

      • 0.1 seconds gives the feeling of instantaneous response — that is, the outcome feels like it was caused by the user, not the computer.
      • 1 second keeps the user’s flow of thought seamless. Users can sense a delay, and thus know the computer is generating the outcome, but they still feel in control of the overall experience and that they’re moving freely rather than waiting on the computer. This degree of responsiveness is needed for good navigation.
      • 10 seconds keeps the user’s attention. From 1–10 seconds, users definitely feel at the mercy of the computer and wish it was faster, but they can handle it. After 10 seconds, they start thinking about other things, making it harder to get their brains back on track once the computer finally does respond.

      So you might have functionality such as adding products to the shopping carts or browsing through the recommended items which you need to squeeze into the “instant” bucket to provide the best overall user experience resulting in better conversion rates. Initially I am pretty sure your application is nowhere near this criteria, so feel free to replace the 100ms and 1,000ms threshold with something you actually can achieve, such as 300 and 1,500ms for example.

      On the other hand, you most likely have some operations which are expensive enough, such as the products search or user account registration which might fall into the second bucket.

      And last, you have the functionality which you can toss into the last bucket, such as generating PDF from the invoice. Note that you also might end up with fourth category – for example if you are generating the bills of delivery for your warehouse, your business might be completely OK if batch processing the daily bills takes 20 minutes.

      Pay attention: business people have a tendency to toss everything into the “instant” bucket. It is your task to explain the effort required and ask “if you could have only three operations completing under our threshold, which should they be?”.

      Now you should have your application's functionality sorted into categories, such as the following:

      Requirement                          Category
      Browse products                      instant
      Add product to a shopping cart       instant
      Search product                       seamless
      Register new account                 attention
      Generate PDF invoice                 attention
      Generate daily bills of delivery     slow

      Understanding the current situation

      Your next goal is to understand where you will perform your measurements. For end users, the experience is complete when the page has rendered the result of the operation in their browser, which includes network latency and browser DOM operations, for example.
      As this is more complex to measure, let us assume your site has been well optimized against the Google Page Speed or YSlow recommendations, and for simplicity's sake let's focus on the elements directly under your control.

      In most cases, the last piece of infrastructure still under your control will be your web server, such as Apache HTTPD or nginx. If you have logging enabled, you will have access to something similar to the following in your nginx access.log:

      82.192.41.11 - - [04/Dec/2013:12:16:11 +0200]  "POST /register HTTP/1.1" 301 184 "https://myshop.com/home" "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:25.0) Gecko/20100101 Firefox/25.0" 0.428
      82.192.41.11 - - [04/Dec/2013:12:16:12 +0200] "POST /searchProduct HTTP/1.1" 200 35 "https://myshop.com/accountCreated" "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:25.0) Gecko/20100101 Firefox/25.0" 3.008
      82.192.41.11 - - [04/Dec/2013:12:16:12 +0200] "GET /product HTTP/1.1" 302 0 "https://myshop.com/products/searchProducts" "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:25.0) Gecko/20100101 Firefox/25.0" 0.623
      82.192.41.11 - - [04/Dec/2013:12:16:13 +0200] "GET /product HTTP/1.1" 200 35 "https://myshop.com/product/123221" "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:25.0) Gecko/20100101 Firefox/25.0" 0.828
      82.192.41.11 - - [04/Dec/2013:12:16:13 +0200] "GET /product HTTP/1.1" 200 35 "https://myshop.com/product/128759" "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:25.0) Gecko/20100101 Firefox/25.0" 1.038
      82.192.41.11 - - [04/Dec/2013:12:16:13 +0200] "GET /product HTTP/1.1" 200 35 "https://myshop.com/product/128773" "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:25.0) Gecko/20100101 Firefox/25.0" 0.627
      82.192.41.11 - - [04/Dec/2013:12:16:14 +0200] "GET /addToShoppingCart HTTP/1.1" 200 35 "https://myshop.com/product/128773" "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:25.0) Gecko/20100101 Firefox/25.0" 2.808
      82.192.41.11 - - [04/Dec/2013:12:16:14 +0200] "GET /purchase HTTP/1.1" 302 0 "https://myshop.com/addToShoppingCart" "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:25.0) Gecko/20100101 Firefox/25.0" 3.204
      82.192.41.11 - - [04/Dec/2013:12:16:16 +0200] "GET /viewPDFInvoice HTTP/1.1" 200 11562 "https://myshop.com/purchase" "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:25.0) Gecko/20100101 Firefox/25.0" 3.018

      Let's investigate what we have found in the log. The snippet extracted contains the actions of one user completing a full transaction on our web site. The user has created an account, searched for products, browsed through the results, found a product he liked, added it to a shopping cart, completed the purchase and generated a PDF invoice. Those actions can be identified in the request column (e.g. “POST /searchProduct HTTP/1.1”). The next important part is the last column, containing the total time it took for a particular request to complete. In the case of /searchProduct, the search took 3.008 seconds to complete.

      Note that by default nginx does not log the request time, so you might need to modify your log_format by adding $request_time to the pattern.
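      For reference, a log_format along these lines would produce entries like the ones above (the format name timed_combined is arbitrary):

      log_format timed_combined '$remote_addr - $remote_user [$time_local] '
                                '"$request" $status $body_bytes_sent '
                                '"$http_referer" "$http_user_agent" $request_time';
      access_log /var/log/nginx/access.log timed_combined;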

      Now, with a little bit of grep/sed/excel magic, you will have something similar to the following at your fingertips:

      Requirement                          Category    Mean     90%
      Browse products                      instant     0.734    0.902
      Add product to a shopping cart       instant     2.422    3.490
      Search product                       seamless    2.800    3.211
      Register new account                 attention   0.428    0.480
      Generate PDF invoice                 attention   3.441    4.595
      Generate daily bills of delivery     slow

      Picking the target for optimization

      From here you get your list of optimization targets. The suspects are obvious – they fall into the categories where you violate the agreement. In the case above you have problems with three requirements: Browse products, Add product to a shopping cart and Search product all violate your performance requirements.

      This article does not focus on actual optimisation tasks, but in order to give food for thought – what should you do next, knowing that adding products to the shopping cart takes way too much time?

      The next steps are definitely more application-specific than the prior tasks. Also, the tooling can vary – you might decide to go with APM or monitoring products. But if you do not have the budget, or just do not wish to fight with the procurement office, a simple solution is to add a logging trail at component boundaries using AOP.

      For example, you can add something like the following (here sketched with System.nanoTime() and SLF4J for the timing and logging) to your application code to monitor the time spent in the service layer.

      import org.aspectj.lang.ProceedingJoinPoint;
      import org.aspectj.lang.annotation.Around;
      import org.aspectj.lang.annotation.Aspect;
      import org.slf4j.Logger;
      import org.slf4j.LoggerFactory;
      import org.springframework.stereotype.Component;

      @Aspect
      @Component
      public class TimingAspect {

          private static final Logger log = LoggerFactory.getLogger(TimingAspect.class);

          // Times every public method of any *Service type under com.myshop
          @Around("execution(* com.myshop..*Service+.*(..))")
          public Object time(ProceedingJoinPoint joinPoint) throws Throwable {
              long start = System.nanoTime();
              try {
                  return joinPoint.proceed();
              } finally {
                  log.info("{} took {} ms", joinPoint.getSignature().toShortString(),
                          (System.nanoTime() - start) / 1_000_000);
              }
          }
      }

      Decomposing the request time further gives you the required insight to understand where the bottlenecks are hiding. More often than not you will quickly discover a violating query, faulty cache configuration or poorly decomposed functionality enabling you to net your first quick wins in a matter of days or even hours.

      This post is written by @iNikem from Plumbr and is part of the Java Advent Calendar. The post is licensed under the Creative Commons 3.0 Attribution license. If you like it, please spread the word by sharing, tweeting, FB, G+ and so on!

      (Part 1 of 3): Synopsis of articles & videos on Performance tuning, JVM, GC in Java, Mechanical Sympathy, et al

      I have been contemplating for a number of months reviewing a cache of articles and videos on topics like performance tuning, the JVM, GC in Java, mechanical sympathy, etc., and finally took the time to do it – maybe this was the point in my intellectual progress when I was required to do such a thing!
      Thanks to Attila-Mihaly for giving me the opportunity to write a post for his yearly newsletter Java Advent Calendar; a review of various Java related topics fits the bill! The selection of videos and articles is purely random, based on the order in which they came to my knowledge. My hidden agenda is mainly to go through them to understand and broaden my own knowledge, and at the same time share any insight with others along the way.

      I'll be covering reviews of three talks, by Attila Szegedi (1 talk) and Ben Evans (2 talks). They speak on the subject of Java performance and GC. The first talk, by Attila, draws on his experience as an engineer at Twitter – lots of information from live, in-the-field experience on production systems. Making use of thin objects instead of fat ones is one of the buzzwords in his talk.

      Ben, in his two talks, covers performance, the JVM and GC in great depth. He points out people's misconceptions about performance, the JVM and GC – for instance, that people don't have certain run-time flags enabled in production. How does the underlying machinery work, and why does it work the way it does? How efficient is the machinery, and what is best to do, and not to do, to get good throughput out of it?

      Here I go with my commentary, I decided to start with Attila Szegedi’s talk as I quite liked the title…..


      Everything I Ever Learned About JVM Performance Tuning @Twitter by Attila Szegedi
      (video & slides)

      Attila at the time of the talk worked for Twitter, where he learnt a lot about the internals of the JVM and the Java language itself – Twitter being an organisation where tuning, optimising JVMs and low latency are de facto practices.
      He covers interesting topics like:
      – contributors of latency
      – finished code not ready for production
      – areas of performance tuning (primarily memory tuning and lock contention tuning)
      – Memory footprint tuning (OOME, inefficient tuning, FAT data)
      – FAT data – a new terminology coined by him – and how to resolve issues created by it (pretty in-depth and interesting): learn about byte allocations to data types in the Java / JVM languages.
      Some deep-dive topics like compressed object pointers are among the suggestions (including a pitfall). Certain types in Scala 2.7.7 are inefficient – as revealed by a JVM profiler. Do not use Thrift – it is not a friend of low latency, as Thrift objects are heavy, adding between 52 and 72 bytes of overhead per object, and do not support 32-bit floats, etc. Be careful with thread locals – they stick around and use more resources than expected.
      The performance triangle: Attila shares his insight into this concept. GC is the biggest threat to the JVM. In his setup the old gen uses the concurrent collector while the new gen goes through the STW process, and he lists a number of throughput and low-pause collectors.
      Improve GC by taking advantage of the adaptive sizing policy and giving it a target to work towards. Use a throughput collector with or without the adaptive policy and benchmark the results. He takes us through the various -XX:+Print… flags and explains their uses. Keep fragmentation low and avoid full GC stops. Lots of detail on the workings of the GC and what can be done to improve it (tuning both new and old gens).
      Latencies that are not GC-related: thread coordination optimisation. Barriers and half-barriers can be used with threads to improve latency – along with some tricks when using atomic values & AtomicReferences. The Cassandra slab allocator helps efficiency and performance – do not write your own memory manager. Attila is no longer a fan of soft references – great in theory but not in practice, as more GC cycles are needed to clear them!

      Conclusion: know your code, as it is often the root of your problems – frameworks can many a time be the cause of performance issues. Lots can be done to squeeze performance out of the programs you write, if you know how best to use the fundamental building blocks and data structures of your development environment. It's a hard game to maintain the best throughput and get the best performance out of the JVM.

      — Recommend watching the video, lots more covered than the synopsis above  —


      9 Fallacies of Java Performance by Ben Evans (blog)

      In this article Ben goes about busting old myths and assumptions about Java, its performance, GC, etc… Areas covered being:
      1) Java is slow, 2) A single line of Java means anything in isolation, 3) A micro-benchmark means what you think it does , 4) Algorithmic slowness is the most common cause of performance problems, 5) Caching solves everything, 6)  All apps need to be concerned about Stop-The-World, 7) Hand-rolled Object Pooling is appropriate for a wide range of apps, 8) CMS is always a better choice of GC than Parallel Old, 9) Increasing the heap size will solve your memory problem
      – JIT compiled code is as fast as C++ in many cases
      – JIT compiler can optimize away dead and unused code, even on the basis of profiling data. In JVMs like JRockit, the JIT can decompose object operations.
      – For best results don’t prematurely optimize, instead correct your performance hot spots. 
      – Richard Feynman once said: “The first principle is that you must not fool yourself – and you are the easiest person to fool” – something to keep in mind when thinking of writing Java micro-benchmarks.
      The point is that the ideas people hold in their minds about Java are often the opposite of reality. The author basically suggests the masses revisit those ideas and draw conclusions based on sheer facts, not assumptions or old beliefs.
      – GC, database access, misconfiguration, etc… are likely to cause application slowness as compared to algorithms.
      – Measure, don’t guess ! Use empirical production data to uncover the true causes of performance problems.
      – Don’t just add a cache to redirect the problem elsewhere and add complexity to the system, but collect basic usage statistics (miss rate, hit rate, etc.) to prove that the caching layer is actually adding value.
      – If the users haven't complained and you are not in the low-latency stack, don't worry about STOP-THE-WORLD pauses (circa 200 ms, depending on the heap size).
      – Object pooling is very difficult and should only be used when GC pauses are unacceptable, and intelligent attempts at tuning and refactoring have been unable to reduce pauses to an acceptable level.
      – Before concluding that CMS is your correct GC strategy, first determine that STW pauses from Parallel Old are unacceptable and can't be tuned. Ben stresses: be sure that all metrics are obtained on a production-equivalent system.
      – Understanding the dynamics of object allocation and lifetime before changing heap size or tuning other parameters is essential. Acting without measuring can make matters worse. The tenuring distribution information from the garbage collector is especially important here.
      Conclusion: the GC subsystem has incredible potential for tuning and for producing data to guide tuning; use a tool to analyse the logs – either handwritten scripts with some graph generation, or a visual tool such as the (open-source) GCViewer or a commercial product.


      Visualizing Java GC by Ben Evans (video & slides)

      There are misunderstandings and shortcomings in people's understanding of GC – it's not just Mark & Sweep. Many run-times these days have GC! Two schools of thought: GC & reference counting! Humans make mistakes, compared to machines which work at high levels of precision. True GC is incredibly efficient, reference counting is expensive – pioneered by Java (comments from +Gil Tene: On the correctness side, I’d be careful saying “pioneered by Java” for anything in GC. Java’s GC semantics are fairly classic, and present no new significant problems that predating environments did not. Most core GC techniques used in JVMs were researched and well known in other environments (smalltalk, lisp, etc.) and are also available in other Runtimes. While it is fair to say that from a practical perspective, JVMs tend to have the most mature GC mechanisms these days, that’s because Java is a natural place to apply new GC techniques that actually work. But innovation and pioneering in GC is not strongly tied to Java.)

      The allocation list is where all objects are rooted from. You can't get an accurate picture of all the objects of a live running application at any given point in time without stopping the application – that's why we have STW (Stop-The-World) pauses. (comments from +Gil Tene: In addition, the notion that “you can’t get an accurate picture of all the objects of a running object at any given point of time of a running live application without stopping the application that’s why we have STW (Stop-The-World)!” is wrong. Concurrent marking and concurrent compaction are very real things that achieve just that without stopping the application. “Just needs some good engineering”, and “you just can’t do X” are very different things.)

      Golden rules of GC
      – must collect all the garbage (sensitive rule)
      – must never collect a live object
      (trick: but they are never created equal)

      Hotspot is a C/C++/assembly application. The heap is a contiguous block of memory with different memory pools – the Young Gen, Old Gen, and PermGen pools. Objects are created by application (mutator) threads and removed by the GC. Applications are not slow because of GC all the time.
      PermGen is not desirable and is going away in Java 8 (known issue: causes OOME exceptions), to be replaced by Metaspace, which lives outside the heap (in native memory).
      GC is based on ‘Weak generational hypothesis’ – objects die young, or die old – found out through empirical research. (comments from +Michael Barker: I think this statement:
      “GC is based on ‘Weak generational hypothesis’ – objects die young, or die old – found out through empirical research.”
      Is not correct.  I think I can guess at what you mean, but you may want to consider rewording it so that it is not misleading.  There are GC implementations in real world VMs that are not generational collectors.

      comments from +Kirk Pepperdine: Indeed. the ParcPlace VM had 7 different memory spaces that have a strong resemblance to the todays generational spaces. There is Eden with two hemi spaces plus 4 other spaces for different types of long lived data.)
      Re-worded version: GC in the JVM is based on ‘Weak generational hypothesis’ – objects die young, or die old – found out through empirical research.

      The tenuring threshold is the number of GCs an object survives before it gets moved to the Old Gen (tenured space). JavaFX is bundled with jdk7u6 and up.
      Source code of the JavaFX Memory Visualizer, written in Java to replace the Flash version – https://github.com/kittylyst/jfx-mem – written using FXML. An extensive explanation is given of how the program is written in FXML, which uses the builder pattern in combination with DSL-like expressions. The program models the way GC works and how objects are created, destroyed and moved about the different pools.
      List of mandatory flags, which do not have any performance impact
      -verbose:gc
      -Xloggc:<pathtofile>
      -XX:+PrintGCDetails
      -XX:+PrintTenuringDistribution
      All the information needed about an executing application and its GC is recorded by the above. He also covers basic heap-sizing flags. Setting the initial and maximum heap flags to equal values no longer applies as of recent versions of the JDK. Also, there are more than 200 flags for the GC and VM, not including all the undocumented ones.
      GC log files are useful for post-processing, but sometimes are not recorded correctly. MXBeans impact the running application and yet give no more information than the log files.
      GC log files have a general format giving information on changes in allocation, occupancy, tenuring, collections, etc. – there is an explosion of GC log file formats and not much tooling out there. Many of the free tools provide some sort of dashboard-like output showing various GC-related metrics; the commercial versions have a better approach and more useful information in general.

      Premature promotion – under the pressure of new object creation, objects are moved directly from YG to OG without going through the survivor spaces.

      Use tools, measure and don’t guess!

      Conclusion: know the facts and find out details when they are not known, but do not guess or assume. False conceptions have led to assumptions and incorrect understanding of the JVM and the GC process at times. Don't just change flags or use tools – know why, and what they do. E.g. switching on GC logging (with the appropriate flags enabled) does not have a visible impact on the performance of the JVM, but is a boon in the medium to long run.


      — Highly recommend watching the video, lots more covered than the synopsis above, Ben has explained GC in the simplest form one could, covering many important details  —

      As it is not practical to review all such videos and articles, a number of them have been provided in the links below for further study. In many cases I have paraphrased or directly quoted what the authors have to say to preserve the message and meaning they wished to convey. A follow-on to this blog post will appear in the same space under the title (Part 2 of 3): Synopsis of articles & videos on Performance tuning, JVM, GC in Java, Mechanical Sympathy, et al on 19th Dec 2013.

      Thanks

      Thanks to +Gil Tene, +Michael Barker, @Ryan Rawson, +Kirk Pepperdine, and +Richard Warburton for reading the post and providing useful feedback.

      Feel free to post your comments below or tweet at @theNeomatrix369!

      Useful resources


        This post is part of the Java Advent Calendar and is licensed under the Creative Commons 3.0 Attribution license. If you like it, please spread the word by sharing, tweeting, FB, G+ and so on!