A Serpentine Path To Music

A snapshot of the ffmpeg visualisation for the
Goldberg Variations as synthesised by Sonic Field.

Art And Programming Collide: An Introduction To The Post

Whilst there are a number of ways of making music with Java, as far as I can tell, Sonic Field is the only fully fledged, end-to-end synthesis system written in pure Java (no JNI and no calls out to native servers etc.) which does not use the javax synthesiser extensions (and so does not rely on the operating system to supply the actual synthesis engine).

Also, in many other ways, its heavyweight, uncompromisingly mathematical approach to music is unique. This project has consumed far too much of my time for the better part of three years; what can be said about this ab-initio synthesis system in just one post, and what can we learn about Java programming from the project?

The answers all start with 'why?':

  • Why write a mathematical engine in Java?
  • Why control it from a dynamic language (Python)?
  • Why make music using programs at all?
  • Why use ab-initio synthesis (and what is it)?

Maybe, most interesting of all:
  • Why is large scale musical mathematics so computationally demanding?
I will attempt to answer all these questions and shed some light into very ill-frequented corners of Java/JVM programming. My approach will not be to answer each in turn; rather, these questions will form the armature upon which what follows is constructed.

Sonic Field, and how to achieve complex audio synthesis on the JVM, proved to be too large a subject for one post. I am extremely lucky to have been given the chance to make two. This post is the hard work; we will cover the difficult subjects like threading, memory management and some digital audio concepts. The next post is a lot more fun; it will look at why I chose Python and offer some thoughts on creativity.

Now let us look at how digital sound is represented and manipulated.

Believe it or not, there are two ways right from the start. However, everyone (including myself with Sonic Field) picks the same one. Sound is made of waves. Waves can be represented mathematically using trigonometrical functions like sin and cos. We can then apply non-linear differential equations to those mathematical forms to produce more complex equations. Keep adding more and more transformations and eventually we end up with one equation which represents an entire piece of music. This never happens because it is far, far too hard! Such an approach is referred to as 'analytic' and is practical only for the simplest of sounds.

The spectrogram of the piece of music
I rendered with Sonic Field for the
Java Advent Calendar
More tractable are numerical approaches. Here we consider sound as a series of samples. Just like the frames that make up a video, sound is generated and manipulated one sample at a time. We must understand this is a compromise, and it brings with it a whole series of computational challenges of its own.

A quick note on licensing: Sonic Field is free as freedom:


Sonic Field Audio Processing And Synthesis Program
Copyright (C) 2012-2014 Dr Alexander J Turner

This program is free software: you can redistribute it and/or modify
it under the terms of the GNU Affero General Public License as
published by the Free Software Foundation, either version 3 of the
License, or (at your option) any later version.

This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU Affero General Public License for more details.

You should have received a copy of the GNU Affero General Public License
along with this program. If not, see <http://www.gnu.org/licenses/>.
All the source code, including any Sonic Field patch scripts, in this post is covered by the above licence. It only seems fair that anyone reading this should respect it, as I am giving it away for free on GitHub!

Why samples are a challenge:

Consider that we want audio to be modelled up to 20kHz, i.e. we want to generate sounds with frequencies up to 20 000 cycles per second. To generate that with a sampled, digital system we need 40 000 samples per second. This is intuitively easy to understand: a sound goes 'up and down' in a wave, so we need an up sample and a down sample for each cycle of the wave.
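
To make the arithmetic concrete, here is a minimal sketch in plain Java (illustrative only, not Sonic Field code) generating one second of a sine wave one sample at a time:

public class SamplingSketch
{
    public static void main(String[] args)
    {
        // One second of a 440 Hz sine at 40 000 samples per second. Raising
        // the frequency towards 20 000 Hz leaves only two samples per cycle,
        // the limit of what this sample rate can represent.
        double sampleRate = 40000.0;
        double frequency = 440.0;
        double[] samples = new double[(int) sampleRate]; // one second of audio
        for (int i = 0; i < samples.length; ++i)
        {
            samples[i] = Math.sin(2.0 * Math.PI * frequency * i / sampleRate);
        }
        System.out.println("Samples per cycle: " + sampleRate / frequency);
    }
}
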
Now let us consider making a sound. We can start with a sine wave. This is a pure sound, and it is very boring indeed. How about making a more interesting sound? Well, real sounds consist of many different sine waves mixed together. A note at concert pitch A is 440Hz; if this were played on a trumpet we might get a sequence of frequencies like this:
440, 880, 1320, 1760, 2200, 2640, 3080, 3520, 3960, 4400 and so on.
At the beginning of our piece we can see
that even a complex note is made from
bands of frequencies. 

Each frequency will (usually) be slightly less intense than the one beneath, but they are all there. Now let us just consider 4400 or 8800; what happens when we pass that through a distortion filter to make it sound cooler? Well, we will end up adding frequencies which are multiples of the frequencies which made up the original trumpet sound. All these multiples are called harmonics.

Bored? We need some code! Sonic Field is controlled using Jython, the Python interpreter which runs on the JVM. I hope the reason for this will become obvious, but I will discuss it explicitly in the second post. Here is a piece of code which makes the basic sound of a trumpet (well, at least the trumpet rank from an organ):


def trumpetBase(length,freq,z=1.0):
    voxA=[]
    hc=1.0
    freq=float(freq)
    while hc*freq<20000:
        hf=hc*freq
        voxA.append(sf.NumericVolume(sf.PhasedSineWave(length,hf,random.random()),(1.0/hc)**z))
        hc+=1

    vox=sf.Mix(voxA)
    vox=sf.Clean(vox)
    vox=polish(sf.FixSize(vox),freq)
    return sf.FixSize(vox)

OK, so you might not know Python (this being a Java Advent Calendar and all) so I will explain a little of what is happening here. Simply put, there is a while loop which creates sine waves of different frequencies to form the harmonics of the trumpet sound. The phase of each sine wave is randomised (it really does not matter if you do not know what that means). The loop accumulates the sine waves, then we mix them together and finally fix the volume (FixSize) so the signal deviates +-1 from zero.
Not long now before we deep dive into Java, but a few more sentences to set the scene, going back to that distortion. What if I want the trumpet to sound more fragile and bright?

sig=sf.Power(trumpetBase(10000,440),1.25)
Now I have created a new 'signal' (i.e. a sequence of samples) which is a distortion of the original. In so doing, multiples of the original harmonics will have been introduced. 5x4400 is over 20 000, so our 40 000 samples per second will no longer be enough. What we need to do is sample at a higher rate. But eventually we are defeated by this approach as we add distortion upon distortion. The solution is to sample at a higher rate and then filter out the high frequencies periodically. We perform synthesis by repeating two basic steps: process, then filter. The process adds frequencies and the filter takes out everything above what we need.
OK, so what rate? It seems that for most of my work 96 000 samples per second is about right. More makes near no difference to the sound quality, but as I go below this number a horrid, harsh effect called 'aliasing' starts to be unavoidable. This is caused by trying to model a frequency higher than half the sample rate.
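The folding arithmetic is easy to sketch in plain Java (illustrative only, not Sonic Field code): any component pushed above the Nyquist frequency, i.e. half the sample rate, is heard mirrored back into the representable band.

public class AliasSketch
{
    public static void main(String[] args)
    {
        // A component above the Nyquist frequency (sampleRate / 2) 'folds'
        // back and is heard at the mirrored frequency instead.
        double sampleRate = 96000.0;
        double nyquist = sampleRate / 2.0;  // 48 000 Hz
        double generated = 50000.0;         // produced by distortion, say
        double heard = generated <= nyquist ? generated : sampleRate - generated;
        System.out.println(generated + " Hz is heard as " + heard + " Hz");
    }
}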

Here is a video I made talking about, explaining and illustrating aliasing.
Please note that the Sonic Field patch shown predates the use of Python
and so is in a bespoke language I wrote called SFPL, which is
no longer used.
Finally we have started to see why making music with Java is so hard. 96 000 samples per second is a lot! In my research anything other than double precision floating point mathematics adds noise and reduces the overall quality of the sound. So, that is 96 000 * 8 bytes for 1 second of music. Basically, a single note can take many megabytes of memory to store.

How Sonic Field Represents Sounds:

The simplest way would be to just use a Java array of doubles. Sadly this is nothing like enough. I use a 16 Gig machine and it would run out of memory very quickly if I used simple arrays. The reason is that sounds are made of sounds (as we saw before). Each of those sine waves in the trumpet sound might be 10 seconds long (about 7.5 megabytes at 96 000 samples per second) and there could be 40 harmonics, so that is around 300 megabytes. Then we want to generate notes in multiple threads; for 8 threads that might be 8 notes at once, which is about 2.4 gigabytes. That is just to create a 10 second note which is relatively simple. More complex and longer notes (especially in ambient music, where notes can last a very long time) quickly exhaust the capacity of my machine. I could use a bigger machine and address say 128 Gig; that would do for most music, but again, such a machine is likely to have more cores and the parallel memory use issue appears again. Basically, sound processing is memory-lopsided compared to most other forms of processing.
Sonic Field addresses this issue with two techniques: memory mapped files and generators. Of these, generators are probably the more interesting programmatically because they are surprisingly hard to implement and do not work as well as we might hope.


package com.nerdscentral.audio;

import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicLong;

import com.nerdscentral.audio.pitch.CubicInterpolator;
import com.nerdscentral.sython.SFMaths;
import com.nerdscentral.sython.SFPL_RuntimeException;

public abstract class SFSignal implements AutoCloseable
{
    private final static AtomicLong uniqueId = new AtomicLong(0);
    private final long myId;
    protected final AtomicInteger referenceCount = new AtomicInteger(1);

    public abstract boolean isKilled();

    public abstract SFSignal replicate();

    private static ThreadLocal<String> pythonStack = new ThreadLocal<>();

    public final SFData replicateEmpty()
    {
        return SFData.build(this.getLength());
    }

    public abstract double getSample(int index);

    protected SFSignal()
    {
        myId = uniqueId.incrementAndGet();
    }

    /**
     * Returns a linearly interpolated sample based on the samples either side of the passed index. This is used for super
     * sampling or pitch effects.
     *
     * @param index
     * @return
     */
    public final double getSampleLinear(double index)
    {
        double s = SFMaths.floor(index);
        double e = SFMaths.ceil(index);
        if (s < 0 || e >= getLength())
        {
            return 0;
        }
        if (s == e) return getSample((int) s);
        double a = getSample((int) s);
        double b = getSample((int) e);
        return ((index - s) * b + (e - index) * a);
    }

    /**
     * Returns a cubic interpolated sample based on the samples either side of the passed index. This is used for super
     * sampling or pitch effects. Cubic interpolation uses two samples either side of the required point and so at the ends of
     * the sample this will fall back to linear interpolation.
     *
     * @param index
     * @return
     */
    public final double getSampleCubic(double index)
    {
        int s = (int) SFMaths.floor(index);
        int e = (int) SFMaths.ceil(index);
        if (s < 0 || e >= getLength())
        {
            return 0;
        }
        if (s > getLength() - 3 || index < 1)
        {
            if (s == e) return getSample(s);
            double a = getSample(s);
            double b = getSample(e);
            return ((index - s) * b + (e - index) * a);
        }
        return CubicInterpolator.getValue(getSample(s - 1), getSample(s), getSample(s + 1), getSample(s + 2), index - s);
    }

    public abstract double setSample(int index, double value);

    public abstract int getLength();

    public abstract void setAt(int pos, SFSignal data) throws SFPL_RuntimeException;

    public abstract void setFrom(int pos, SFSignal dataIn) throws SFPL_RuntimeException;

    public abstract double[] getDataInternalOnly();

    /**
     * Get a globally unique identifier for this SFSignal
     *
     * @return
     */
    public final long getUniqueId()
    {
        return myId;
    }

    public abstract void release();

    @Override
    public void close() throws RuntimeException
    {
        int c = referenceCount.decrementAndGet();
        if (c == 0) release();
        else if (c < 0) throw new RuntimeException(Messages.getString("SFSignal.1")); //$NON-NLS-1$
    }

    public void incrReference()
    {
        referenceCount.incrementAndGet();
    }

    public SFSignal __pos__()
    {
        incrReference();
        return this;
    }

    public SFSignal __neg__()
    {
        close();
        return this;
    }

    public int getReference()
    {
        return referenceCount.get();
    }

    public void decrReference()
    {
        referenceCount.decrementAndGet();
    }

    public static String getPythonStack()
    {
        return pythonStack.get();
    }

    public static void setPythonStack(String ps)
    {
        SFSignal.pythonStack.set(ps);
    }

    public void clear()
    {
        // NOOP on root class
    }
}

[This AGPL code is on Github]

This is the abstract root class for all signal data in Sonic Field. It is a bit of a mess to be honest: the perfect example of a one-person class which has ended up a little unfocused. However, it works, and concrete subclasses of it make up the generators and the memory mapped file handling. We can also see a bunch of code for integrating with Python in a nice way and for explicit resource management; normal Java garbage collection is no use for memory mapped file based storage!

So what is a generator?

The basic architecture of synthesis ends up being (you can argue theoretically about this, but in reality this is what happens):

  1. Make a signal
  2. Manipulate that signal making a new signal
  3. Repeat until the wanted sound appears

The obvious architecture is then to:

  1. Signals are samples.
  2. Read samples.
  3. Apply an algorithm.
  4. Write samples.

We represent samples as indexed sequences of double. Therefore, for the above model, each sound manipulator algorithm reads from a sequence and writes out to a sequence. However, this involves a huge amount of memory access, and memory access is very slow. When we run out of memory the operating system starts flushing data to the memory mapped files to cope; reading and writing then become even slower. Consider a manipulation like this:

  1. Distort a signal using the ‘Saturate’ algorithm
  2. Halve the resulting volume
  3. Mix with another signal

We do not need to create intermediate data storage for these steps; they can be chained together. This is what generators do: they chain with other generators, overriding the getSample method. Here is the saturate implementation:

 
public static class Translator extends SFSingleTranslator
{
    protected Translator(SFSignal input)
    {
        super(input);
    }

    @Override
    public double getSample(int index)
    {
        double x = getInputSample(index);
        double y = x >= 0 ? x / (x + 1) : x / (1 - x);
        return y;
    }
}
Here the base class SFSingleTranslator implements the ability to contain and access the signal source. Mix is a little more complex as it has multiple source signals, but again, it is hardly expensive:

static class Translator extends SFMultipleTranslator
{
    protected Translator(List<SFSignal> input)
    {
        super(input);
    }

    @Override
    public double getSample(int index)
    {
        double d = 0;
        for (int mindex = 0; mindex < getNMembers(); ++mindex)
        {
            d += getInputSample(mindex, index);
        }
        return d;
    }
}
I hope it is clear how each of these can use the previous as a source and so no storage is required.
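For orientation, here is a sketch of what the single-input base class might look like; this is my paraphrase of the idea using the SFSignal type shown above, not the exact Sonic Field source, so treat the names and details as illustrative.

// Illustrative sketch only: a single-input translator holds its upstream
// signal and forwards sample reads to it, so a chain of such translators
// computes lazily with no intermediate storage.
public abstract class SingleTranslatorSketch
{
    private final SFSignal input;

    protected SingleTranslatorSketch(SFSignal input)
    {
        this.input = input;
    }

    protected final double getInputSample(int index)
    {
        return input.getSample(index);
    }

    public int getLength()
    {
        return input.getLength();
    }
}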

Why generators are not the solution to everything:

Generators seem so cool that we might think signal processing is a solved problem: just make all manipulations generators and the reading/writing issue only arises at the end of signal generation, when we write our finished music to disk. Sadly, the real world is more complex. Whilst this ideal concept might work for a restricted notion of real time signal manipulation, in the complex world of non-realtime manipulation in Sonic Field we have non-sequential access and multi-threading to contend with.
First, let us consider the simplest form of non-sequential access: multiple access to the same signal. Whilst the previous algorithms were simple, some are not. If we consider a 6 pole infinite impulse response filter, we are looking at something which will take hundreds or thousands of cycles to compute one output sample. We do not want to accidentally do this twice! However, generators can cause exactly that to happen. Consider that we split a signal into two (for left and right):

# Read the signal
signal=sf.ReadFile("mono.wav")
# Filter it
signal=sf.ButterworthLowPass(signal,1000,6)
# Split it
left=+signal
right=signal
# Add Haas effect
left  = sf.Concatenate(sf.Silence(30),left)
# Write it out
sf.WriteFile32((left,right),"stereo.wav")
If the Butterworth low pass filter is a generator then it will get run twice, once for the left and once for the right. This would be insanely expensive. In the case of expensive algorithms we want to write out the result, not keep recomputing it. So, for low cost transformations Sonic Field uses generators, and for high cost ones it does indeed translate to an SFData, i.e. a memory mapped file.
It gets even more complex than that; we also have some processing which can take place in many, many non-sequential subsections. A processor might cut a signal into pieces, for granular synthesis for example. In granular synthesis we take sound, chop it into tiny pieces and then manipulate each piece individually. Here is a video I did describing the technique:

In cases like this we also do not want the signal being non-sequentially processed to be a generator. To cope with such situations we have the 'realise' method, which converts generators to 'real' signals but has no effect on those which are already storage backed.

Similarly, consider white noise. This is 'random', so one might think it could easily be created from a generator. However, we want the same point in a signal (say sample 112121) to always have the same value each time we read it. So, in effect, white noise must be 'recorded' into a real data store. Finally we have the filters which have internal state. Things like low pass filters are implemented as state machines. That state depends on the order in which samples pass through the filter, so they cannot be implemented as generators: samples in Sonic Field are random access exactly because we might want to do things like granulate!
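
For illustration, 'realise' amounts to something like the following sketch; the real method is in the Sonic Field source, this version simply copies a possibly lazy generator into storage using the SFSignal methods shown above.

// Illustrative sketch only: force a (possibly lazy) generator into real,
// storage backed data by copying it sample by sample.
public static SFSignal realiseSketch(SFSignal in)
{
    SFSignal out = in.replicateEmpty(); // storage backed SFData
    for (int i = 0; i < in.getLength(); ++i)
    {
        out.setSample(i, in.getSample(i));
    }
    return out;
}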

How Java Is Terrible At Memory Management (and approaching a solution):

Now we have some background on why managing audio data is such a challenge, we are ready to see why Java is a bit of a catastrophe for managing memory. Java has garbage collection, which means we do not know how much memory is actually being used. There are some forms of instrumentation which are supposed to tell us the amount of memory in use, but they are not very 'real time'. In a C++ based system (for example) we could intercept malloc and know, to the byte, the amount of memory being used from the heap. This luxury is not given to us in Java. I tried, oh I tried so hard, to get a reliable way of estimating when the JVM was about to throw an out of memory error; but as far as I can tell, there is no fail-safe way of doing so.

However, all is not lost, as there is a usable work-around. Indeed, the work-around is always evolving, and I would not be shocked if it became almost as effective as a hand crafted memory management system in C++, but with the platform independence of Java. It is not there yet, but it is getting there.

Memory Mapped Signal Pages

Memory mapped files are somewhat supported in Java. It is frustrating that we can only memory map byte buffers and not double arrays; nevertheless, we can use them with the appropriate accessor methods. So, what Sonic Field does is have 2 megabyte 'chunks' which are ByteBuffers created by memory mapping a file. Using explicit resource management (thanks to Java 7) we know when a chunk is being used and when it is free. When one is freed it is added to a ConcurrentLinkedDeque of free chunks. This means our chunks are recyclable. Let us take a look at some of the code which does this:


static
{
    tempDir = new File(System.getProperty(SONIC_FIELD_TEMP));
    String swapLimitraw = System.getProperty(SONIC_FIELD_SWAP_LIMIT);
    if (swapLimitraw == null)
    {
        swapLimit = MAX_IN_RAM; // 64 megabytes or 87 seconds at 96000
    }
    else
    {
        swapLimit = Long.parseLong(swapLimitraw);
    }
    chunkLen = swapLimit / 2;

    String pid = ManagementFactory.getRuntimeMXBean().getName();
    try
    {
        if (tempDir != null)
        {
            coreFile = File.createTempFile("SonicFieldSwap" + pid, ".mem", tempDir); //$NON-NLS-1$ //$NON-NLS-2$
        }
        else
        {
            coreFile = File.createTempFile("SonicFieldSwap" + pid, ".mem"); //$NON-NLS-1$//$NON-NLS-2$
        }
        coreFile.deleteOnExit();
        // Now create the actual file
        coreFileAccessor = new RandomAccessFile(coreFile, "rw"); //$NON-NLS-1$
    }
    catch (IOException e)
    {
        throw new RuntimeException(e);
    }
    channelMapper = coreFileAccessor.getChannel();
}

[This AGPL code is on Github]

Here we can see the creation of the swap file (the name I use for the memory mapped file). The most important thing is that we get the channel for the swap file. This is the object we can then use for memory mapping. Now we can see the code which creates the mapping for a new SFData object:


private void makeMap(long size) throws IOException
{
    long countDown = size;
    int chunkCount = 0;
    while (countDown > 0)
    {
        ++chunkCount;
        countDown -= chunkLen;
    }

    countDown = size;
    chunks = new ByteBuffer[chunkCount];
    chunkCount = 0;
    while (countDown > 0)
    {
        ByteBuffer chunk = freeChunks.poll();
        if (chunk == null) break;
        chunks[chunkCount] = chunk;
        ++chunkCount;
        countDown -= chunkLen;
    }
    if (countDown > 0)
    {
        synchronized (coreFile)
        {
            long from = coreFile.length();
            while (countDown > 0)
            {
                ByteBuffer chunk = channelMapper.map(MapMode.READ_WRITE, from, chunkLen);
                chunk.order(ByteOrder.nativeOrder());
                chunks[chunkCount] = chunk;
                ++chunkCount;
                from += chunkLen;
                countDown -= chunkLen;
            }
        }
    }
}

[This AGPL code is on Github]

It is lovely and simple! We just use up chunks from the free 'stack' and, if there are not enough to represent the whole signal, we create a few more until we have enough. I use a deque as a stack because the most recently added chunk is also the least likely to have been 'swapped out' by the operating system; optimum performance is achieved by reusing the same chunks over and over. There is no concurrent stack in the standard JDK, so the deque makes a good choice.

It is worth noting that recycling the most recently used chunk, rather than using a first in, first out approach, gave a huge performance improvement on my machine. With first in, first out the system would run fast until swapping started and then slow right down as it constantly read back in swapped-out chunks. This was a complete waste of time, because the swapped-out chunks contained data that was not going to be used again. By recycling last in, first out, this performance issue disappeared and Sonic Field achieved very efficient use of the operating system file cache and virtual memory systems.
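
The policy difference is tiny in code; with a ConcurrentLinkedDeque it is just a choice of which end to take from. A sketch, not the exact Sonic Field source:

import java.nio.ByteBuffer;
import java.util.concurrent.ConcurrentLinkedDeque;

// Illustrative sketch: the same deque gives last in, first out recycling
// (reuse the 'hottest' chunk) or first in, first out (churn through all
// chunks, defeating the OS file cache) depending on which end we poll.
public class ChunkRecyclingSketch
{
    private static final ConcurrentLinkedDeque<ByteBuffer> freeChunks = new ConcurrentLinkedDeque<>();

    static ByteBuffer takeLifo()
    {
        return freeChunks.pollFirst(); // most recently freed chunk - cache friendly
    }

    static ByteBuffer takeFifo()
    {
        return freeChunks.pollLast(); // oldest chunk - likely swapped out
    }

    static void free(ByteBuffer chunk)
    {
        freeChunks.push(chunk); // push == addFirst
    }
}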

Using explicit resource management, we can ensure that chunks which are no longer required get reused:


@Override
public void close() throws RuntimeException
{
    int c = referenceCount.decrementAndGet();
    if (c == 0) release();
    else if (c < 0) throw new RuntimeException(Messages.getString("SFSignal.1")); //$NON-NLS-1$
}

Yes, Sonic Field uses reference counting. This is tied into the explicit resource management in that SFSignal implements AutoCloseable. I will discuss the mechanism behind this a little more when we discuss the way Java interacts with Python. However, what matters for now is that when the reference count hits zero, release() is called:


@Override
public void release()
{
    for (ByteBuffer chunk : chunks)
    {
        freeChunks.push(chunk);
    }
    chunks = null;
    resourceTracker.remove(this);
}

Release is responsible for putting the now unused chunks back onto the free deque. As the free deque is concurrent, we do not have to use synchronization blocks anywhere other than when creating new chunks.

Heavy disk IO as the OS starts serialising
blocks of the memory mapped file to disk

Now I can hear you screaming at me. My ears are ringing and it has nothing to do with this noisy train I am on. "But you should use a direct double buffer," you say. Yes, nio has double buffers, so we could use put() and get() rather than putDouble() and getDouble(). Surely these are 'blessed' in some magical way to make them perform? Nope; that approach is very slightly slower than using ByteBuffers directly. In other words, getDouble and putDouble (or something very similar) are being used 'under the covers'. Here is a simple test which creates a memory mapped signal, reads from it, then creates another, and so on:

from java.lang import System
sig=sf.WhiteNoise(10000)
t0=System.nanoTime()
for x in range(1,10):
    sig=sf.Swap(sig)
    for z in range(1,5):
        sig1=sf.Realise(sf.Pcnt10(+sig))
    sig=sig1
        
t1=System.nanoTime()
for y in range(1,5):
    t0=System.nanoTime()
    for x in range(1,10):
        sig=sf.Swap(sig)
        for z in range(1,5):
            sig1=sf.Realise(sf.Pcnt10(+sig))
        sig=sig1
    t1=System.nanoTime()
    print t1-t0

Times using a double buffer:
1497275000
1474564000
1410076000
1474493000
Sonic Field using 48 Gig of swap file
on a machine with 16 Gig of RAM

Times using a byte buffer only:

1375515000
1376532000
1478389000
1438900000
On my machine the memory mapped files would not have had to actually resort to disk (SSD, actually) as there was plenty of RAM available. Making sure the byte order is correct for the current platform (in my case LITTLE_ENDIAN, which one can get from ByteOrder.nativeOrder()) does help a little, which is interesting as it shows some form of optimisation going on.
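
For reference, the access pattern is just this (a sketch; Sonic Field wraps it in accessor methods): each double occupies eight bytes, so sample n lives at byte offset n * 8 within its chunk.

import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Illustrative sketch of reading and writing doubles in a ByteBuffer;
// a memory mapped buffer is accessed the same way.
public class BufferAccessSketch
{
    public static void main(String[] args)
    {
        ByteBuffer chunk = ByteBuffer.allocateDirect(1024 * 8);
        chunk.order(ByteOrder.nativeOrder());   // match the platform, as in the text
        int n = 42;
        chunk.putDouble(n * 8, 0.5);            // write sample n
        double sample = chunk.getDouble(n * 8); // read it back
        System.out.println(sample);
    }
}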

A Quick Note On Caching

To anyone who has worked in software performance tuning, seeing a system with serialisation and deserialisation and writing to memory mapped files will immediately cause the word 'cache' to appear in bright neon in their mind's eye. This is certainly the case for me, and until recently Sonic Field did have a caching scheme which simultaneously helped a bit, caused immense complexity and produced constant out of memory errors. As of now I have given up on that approach. Basically, the serialisation costs are painful but acceptable, especially considering that they can be reduced by ever more aggressive use of generators. Caching systems add an overhead in themselves and so they offer less benefit than one might expect. One day I might dream up a better solution, but for now this seems to be reliable, if not exactly stellar performance wise.

To finish this section on memory management, here is a piece of music which required a 48 Gig swap file. This entire 32.5 minute soundscape, including all the reverberation, harmonic excitation and delay effects, was done in Sonic Field using double precision floating point mathematics at 96 000 samples per second. When I put it like that, I guess it is not such a shock that it took a lot of resources!

What Is Arbitrary Complexity Ab-initio Synthesis?

A quick lesson in synthesis. The pioneers of electronic music made notes and sounds. Later came the work of the legendary Dr Robert Moog, who created the concept of what we now call a synthesiser. As always, we could include many others in this creation, some working with him and some in parallel, but let us just think about his work. He realised that very flexible sound creation could be achieved with oscillators, which create tones, and filters, which remove unwanted tones. These were 'analogue' electronic circuits which produced a limited number of tones and could filter them in a limited number of ways to produce sounds in 'real time'. The complexity of the sound was controlled by the complexity of the synthesiser. However, limitations in analogue technology set an upper limit to complexity: every circuit added some distortion and some noise, and eventually these artefacts would overwhelm the desired sound, and so the upper complexity limit was set.

These early pioneers created what we now think of as the 'principles of synthesis', and by so doing formed a wonderful working model for people to learn from, but also restricted creativity. Computational and digital synthesis tends to continue to follow the same path as the original analogue work. It is real time (press a key and out comes sound) and restricted (there is a fixed number of digital sound sources, filters and so on). 'Arbitrary complexity' simply means that there is no upper limit to the number of filters, oscillators and such that can be used. One key concept behind Sonic Field (which I have not yet quite gotten around to implementing) is that it can function in a cloud environment and coordinate an arbitrarily large number of nodes to create an arbitrarily complex piece of music or soundscape. Whilst the implementation is not quite there yet, the current architecture is very much suited to this approach.

There exists a completely different sort of digital sound creation which some people call synthesis (but I do not). This is sample-loop based. Here sounds are recorded or created digitally and stored as a 'sample' of that sound. These samples are then played back in a loop to create the notes heard. This is not 'from first principles' because the sounds already existed. Creating sounds by the direct manipulation of mathematical equations, or the direct application of electronic circuits, creates the sounds from first principles. This is where 'ab-initio' comes from.

So Why Java?

I get asked this quite a bit when I discuss Sonic Field with people. The first question is often: why not do this in C++? There are many answers, and as this post is in a Java Advent Calendar I do not intend to defend Java. I think the question comes from an incorrect view that Java is somehow non-performant; I often hear people saying Java is interpreted, even now! My view is that Java has some huge specific benefits for this work. They centre around multi-threading, clustering and portability:
  1. I originally intended to run Sonic Field on top of hadoop so processing could be distributed in a cloud. I might well still do something like this. Java’s serialisation system and the maturity of the tooling around cloud computing in Java makes it an ideal candidate.
  2. Heavy multi-threading is more mature in Java than in C++. At the time I started the project, C++11 was only just coming on line, so the contrast was even more stark.
  3. Being able to run Sonic Field anywhere and not having to recompile it for different platforms has been a huge benefit. I have run it on Windows, Linux and (almost exclusively for a couple of years now) on the Mac. I have never had to change a line of code to do these things.
  4. I like doing strange things which people do not expect; proving Java can handle large scale mathematical calculations has a perverse fun streak running through it!
On personal reflection, I do not think the project would have survived this long if it had been in C++. The effort of maintaining the build and updating from one version of gcc to the next, and across to Visual Studio, would have caused me to give up. Seriously, the memory management issues might one day start me porting off the JVM, but I still hold hope that I will keep coming up with fixes for them rather than give up on this amazing platform; current evidence is that the memory mapped files approach is 'good enough' and I live in hope of coming up with either more tweaks to it or an even more effective approach.

An Aside On Threading

I said Java was good for threading. If you know anything about the C implementation of Python you will know it is effectively single threaded (the interpreter is, through a thing called the global interpreter lock). Jython is very multi-threaded. One of the driving concerns in Sonic Field has always been parallel execution, be that via clustering or local threading. So I think it is time we took a look at the current threading model in Sonic Field (which changes almost as often as the memory management).

Now, this is a bit complex for a single post like this, so I will skim over a lot of the details. Nevertheless, I think this mechanism is a good demonstration of how Jython and Java can work well together.

import threading
import time
from java.util.concurrent import Executors, TimeUnit
from java.util.concurrent import Callable, Future
from java.lang import System
from java.lang import ThreadLocal
from java.lang import Thread
from java.util.concurrent.locks import ReentrantLock

SF_MAX_CONCURRENT = int(System.getProperty("synthon.threads"))
SF_MAX_CONQUEUE = SF_MAX_CONCURRENT

print "Concurrent Threads: " + SF_MAX_CONCURRENT.__str__()
SF_POOL = Executors.newCachedThreadPool()

class sf_callable(Callable):
    def __init__(self,toDo):
        self.toDo=toDo

    def call(self):
        return self.toDo()

class sf_futureWrapper(Future):
    def __init__(self,toDo):
        self.toDo=toDo
        self.gotten=False

    def __iter__(self):
        return iter(self.get())

    def get(self):
        if self.gotten:
            return self.result
        else:
            self.result=self.toDo.get()
            self.gotten=True
            return self.result

    def __pos__(self):
        obj=self.get()
        return +obj

    def __neg__(self):
        obj=self.get()
        return -obj

class sf_getter(Future):
    def __init__(self,toDo):
        self.toDo=toDo
        self.result=self.toDo()

    def get(self):
        return self.result

    def __iter__(self):
        return iter(self.get())

    def __pos__(self):
        obj=self.get()
        return +obj

    def __neg__(self):
        obj=self.get()
        return -obj

class sf_taskQueue(ThreadLocal):
    def initialValue(self):
        return []

SF_TASK_QUEUE=sf_taskQueue()

class sf_superFuture(Future):
    def __init__(self,toDo):
        self.toDo=toDo
        queue=SF_TASK_QUEUE.get()
        queue.append(self)
        if len(queue)>SF_MAX_CONQUEUE:
            self.submitAll()

    def submit(self):
        count=SF_POOL.getActiveCount()
        if count<SF_MAX_CONCURRENT:
            task=sf_callable(self.toDo)
            self.future=sf_futureWrapper(SF_POOL.submit(task))
        else:
            self.future=sf_getter(self.toDo)

    def submitAll(self):
        queue=SF_TASK_QUEUE.get()
        while(len(queue)):
            queue.pop().submit()

    def get(self):
        self.submitAll()
        while not hasattr(self,'future'):
            Thread.yield()
        r = self.future.get()
        return r

    def __iter__(self):
        return iter(self.get())

    def __pos__(self):
        obj=self.get()
        return +obj

    def __neg__(self):
        obj=self.get()
        return -obj

def sf_do(toDo):
    return sf_superFuture(toDo)

def shutdown_and_await_termination(pool, timeout):
    pool.shutdown()
    try:
        if not pool.awaitTermination(timeout, TimeUnit.SECONDS):
            pool.shutdownNow()
            if not pool.awaitTermination(timeout, TimeUnit.SECONDS):
                print >> sys.stderr, "Pool did not terminate"
    except InterruptedException, ex:
        pool.shutdownNow()
        Thread.currentThread().interrupt()

def shutdownConcurrnt():
    shutdown_and_await_termination(SF_POOL, 5)
This mechanism launches futures to process tasks. It does this using closures, which are really easy to create in Python. It gets much more complex than one might think, though, because of the issue of recursive launching (which I will touch on here in a bit, but which is covered in more detail in my post 'The Embrace Of Meh'). Sonic Field aims to have a simple approach to threading. The very idea that such a thing exists is a little naive! However, it does get some way towards this goal. Here is an example. Consider that we have a function sf.FrequencyDomain(), which is very expensive, and a function mixer() which cross multiplies the outputs of this expensive processor.


# Form A
def mixer(siga,sigb):
    return sf.CrossMultiply(
        sf.FrequencyDomain(siga),
        sf.FrequencyDomain(sigb)
    )

We can convert that into running in parallel simply by making the bulk of the work happen in a closure:


# Form B
def mixer(siga,sigb):
    def mixerInner():
        return sf.CrossMultiply(
            sf.FrequencyDomain(siga),
            sf.FrequencyDomain(sigb)
        )
    return sf_do(mixerInner)

I love this because it requires so little syntax and just works. sf_do creates a future for the execution of the closure. The snag comes when we want to use that future, because in the Java world we need the result (a signal), not a future. How can we fix this type issue? Despite the parallel submission code being written in Python, it uses java.util.concurrent.Future from the JDK. This means we can dispatch dynamically on the instance type of arguments. All Sonic Field processors convert their arguments dynamically from Object to whatever they require. They do this via the static methods on the com.nerdscentral.synthon.Caster class, the key one of those methods being this:


public static Object checkAutoTranslation(Object o) throws SFPL_RuntimeException
{
    if (o instanceof Future)
    {
        Future doer = (Future) o;
        try
        {
            Object d = doer.get();
            return checkAutoTranslation(d);
        }
        catch (Throwable t)
        {
            throw new SFPL_RuntimeException(t);
        }
    }

    if (o == null)
    {
        throw new SFPL_RuntimeException(Messages.getString("Caster.12")); //$NON-NLS-1$
    }

    return o;
}

The conversion occurs in the call to doer.get().

Naturally, tasks can submit tasks, so the system must be recursive. It will keep calling get() on Futures until it finds a non-Future result. However, you might have noticed that the code for creating futures is not as simple as it might be. What is a 'SuperFuture'? This addresses a huge challenge. There is a finite number of threads available on any computer; when we have tasks submitting new tasks, each in a new thread, we can exhaust the maximum number easily. The classic solution is to use a thread pool. Sadly, if we have say 16 threads in a pool and one task submits 2, then those submit 2 each, and those submit 2 each, we have 1+2+4+8=15 threads. If any of the new leaf tasks then submit tasks we will see the pool exhausted. This means the new tasks queue up waiting for the pool to empty. But the pool never empties, because it is full of blocked threads which are waiting for the completion of the very tasks which are queued up.

SuperFutures work around this because they can either result in a true Future, which is asynchronously executed, or they can, if the pool is getting dangerously full, be executed serially. I.e. a SuperFuture converts into either a Future (parallel) or a Getter (serial) when it is executed. Its execution is delayed until an input queue hits a particular size or the result of the SuperFuture is requested by another Future or Getter. This delay effect helps fine grained tasks end up as Getters and their parents as Futures.

I have not implemented a full work stealing system because the complexity of working out the dependencies in the Python puts me off. I want the system to be easy to use and I have not worked out an easy to use work stealing design yet; so for now the SuperFuture approach appears to work OK.

Wrapping It Up?

There is so much I wanted to discuss and did not! For example, I wanted to discuss explicit resource management in much more detail. I would love to have discussed caching in the FFT. How about sub-normal floating point performance? Immutable data structures and threading for large scale mathematics?

What I hope I have shown is how some of the 'heavy engineering' in Sonic Field has allowed something one would often not associate with Java (huge scale mathematics) to work rather cleanly in this language. Sonic Field has taught me so much about practical threading issues, real memory management challenges and effective, flexible processing architecture. I hope that at least some of these concepts will be of use to you, the reader.

This post is part of the Java Advent Calendar and is licensed under the Creative Commons 3.0 Attribution license. If you like it, please spread the word by sharing, tweeting, FB, G+ and so on!

Big Data the ‘reactive’ way

A metatrend going on in the IT industry is a shift from query-based, batch oriented systems to (soft) realtime updated systems. While this is often associated only with financial trading, there are many other examples, such as "just-in-time" logistics systems, airlines doing realtime pricing of passenger seats based on demand and load, C2C auction systems like eBay, real time traffic control and many more.

It is likely this trend will continue, as the (commercial) value of information is time dependent: value decreases with the age of information.

Automated trading in the finance sector is just a forerunner in this area, because a time advantage of some microseconds can be worth millions of dollars. It is natural that real time processing systems evolve faster in this domain.

However, big parts of traditional IT infrastructure are not designed for reactive, event based systems. From query based databases to the request-response based HTTP protocol, the common paradigm is to store and query data "when needed".

Current Databases are static and query-oriented

Current approaches to data management, such as SQL and NOSQL databases, focus on data transactions and static queries of data. Databases provide convenience in slicing and dicing data, but they do not support keeping the results of complex queries up to date in real time. Rising NOSQL databases still focus on computing a static result.
Databases are clearly not “reactive”.

Current Messaging Products provide poor query/filtering options

Current messaging products are weak at filtering. Messages are separated into different streams (or topics), so clients can do a raw preselection on the data received. However, this frequently means a client application receives something like 10 times more data than needed, doing the fine grained filtering 'on top'.
A big disadvantage is that the topic approach builds filter capabilities "into" the system's data design.
E.g. if a stock exchange system splits streams on a per-stock basis, a client application still needs to subscribe to all streams in order to provide a dynamically updated list of "most active" stocks. Querying usually means "replay + search the complete message history".

A scalable, “continuous query” distributed Datagrid. 

I had the enjoyment of doing the conceptual and technical design for a large scale realtime system, so I'd like to share a generic, scalable solution for continuous query processing at high volume and large scale.

It is common for real-time processing systems to be designed "event sourced". This means persistence is replaced by journaling transactions: system state is kept in memory, and the transaction journal is required only for historic analysis and crash recovery.
Client applications do not query, but listen to event streams instead. A common issue with event sourced systems is the problem of the "late joining client". A late client would have to replay the whole system event journal in order to get an up-to-date snapshot of the system state.
In order to support late joining clients, a kind of "Last Value Cache" (LVC) component is required. The LVC holds current system state and allows late joiners to bootstrap by querying.
In a high performance, large data system, the LVC component becomes a bottleneck as the number of clients rises.

Generalizing the Last Value Cache: Continuous Queries

In a continuous query data cache, a query result is kept up to date automatically. Queries are replaced by subscriptions.

subscribe * from Orders where
   symbol in ['ALV', 'BMW'] and
   volume > 1000 and
   owner='MyCompany'

creates a message stream which initially performs a query operation and thereafter updates the result set whenever a data change affects the query result (transparently to the client application). The system ensures each subscriber receives exactly the change notifications necessary to keep its "live" query results up-to-date.
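
On the gateway side, such a subscription boils down to a plain Java predicate re-evaluated against change notifications. A minimal sketch, with the Order accessors invented purely for illustration:

import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Illustrative sketch: the subscription above expressed as a plain Java
// predicate which a gateway node can evaluate against each change
// notification. The Order interface is invented for this example.
public class SubscriptionSketch
{
    interface Order
    {
        String symbol();
        long volume();
        String owner();
    }

    static final Set<String> SYMBOLS = new HashSet<>(Arrays.asList("ALV", "BMW"));

    static boolean matches(Order o)
    {
        return SYMBOLS.contains(o.symbol()) && o.volume() > 1000 && "MyCompany".equals(o.owner());
    }
}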

A distributed continuous query system: the LVC nodes hold the data. Transactions are sent to them on a message bus (red). The LVC nodes compute the actual difference caused by a transaction and send change notifications on a message bus (blue). This enables "processing nodes" to keep a mirror of their relevant data partition up-to-date. External clients connected via TCP/HTTP do not listen to the message bus (because multicast is not an option in a WAN). "Subscription processors" keep the clients' continuous queries up-to-date by listening to the (blue) message bus and dispatching only the required change notifications to each client's point-to-point connection.

Differences in data access patterns compared to static data management:

  • High write volume
    Real time systems create a high volume of write accesses/changes in data.
  • Fewer full table scans
    Only late-joining clients or changes of a query's conditions require a full data scan. Because continuous queries make "refreshing" a query result obsolete, the read/write ratio is ~1:1 (if one counts the change notification resulting from a transaction as a "read access").
  • The majority of load is generated when evaluating the queries of active continuous subscriptions with each change of data. Consider a transaction load of 100,000 changes per second with 10,000 active continuous queries: this requires 100,000 * 10,000 = 1 billion evaluations of query conditions per second. And that is still an underestimate: when a record gets updated, it must be tested whether the record matched a query's condition before the update and whether it matches after the update. A record's update may thus result in an add (because it matches after the change), a remove (because it no longer matches after the change), an update, or a skip notification to a query subscription, as the sketch below shows.
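
The before/after test in the last point reduces to four cases; a sketch (names invented for illustration):

// Illustrative sketch of the before/after evaluation: for one subscription,
// a record update turns into an add, remove, update or skip notification.
public class ChangeClassifierSketch
{
    enum Change { ADD, REMOVE, UPDATE, SKIP }

    static Change classify(boolean matchedBefore, boolean matchesAfter)
    {
        if (!matchedBefore && matchesAfter) return Change.ADD;    // enters the result set
        if (matchedBefore && !matchesAfter) return Change.REMOVE; // leaves the result set
        if (matchedBefore && matchesAfter) return Change.UPDATE;  // stays, but changed
        return Change.SKIP;                                       // never matched
    }
}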

Data Cluster Nodes (“LastValueCache Nodes”)

Data is organized in tables, column oriented. Each table's data is evenly partitioned amongst all data grid nodes (= last value cache nodes = "LVC nodes"). By adding data nodes to the cluster, capacity is increased and snapshot queries (initializing a subscription) are sped up by increased concurrency.

There are three basic transactions/messages processed by the data grid nodes:

  • AddRow(table,newRow), 
  • RemoveRow(table,rowId), 
  • UpdateRow(table, rowId, diff). 

The data grid nodes provide a lambda-like (row iterator) interface supporting the iteration of a table's rows using plain Java code. This can be used to perform map-reduce jobs and, as a specialization, the initial query required by newly subscribing clients. Since the ongoing computation of continuous queries is done in the "Gateway" nodes, the load of the data nodes and the number of clients correlate only weakly.
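
Such a row-iterator interface might look like the following sketch; all names here are invented for illustration, not the system's actual API:

// Illustrative sketch of a row iterator: snapshot queries and map-reduce
// style jobs are just visitors applied to a table's rows in plain Java.
interface Row
{
    Object get(String column);
}

interface RowVisitor
{
    void visit(Row row);
}

interface Table
{
    void forEachRow(RowVisitor visitor);
}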

All transactions processed by a data grid node are (re-)broadcasted using multicast “Change Notification” messages.

Gateway Nodes

Gateway nodes track subscriptions/connections to client applications. They listen to the global stream of change notifications and check whether a change influences the result of a continuous query (=subscription). This is very CPU intensive.

Two things make this work:

  1. By using plain Java to define a query, query conditions profit from JIT compilation; there is no need to parse and interpret a query language. HotSpot is one of the best optimizing JIT compilers on the planet.
  2. Since multicast is used for the stream of global changes, one can add additional Gateway nodes with almost no impact on the throughput of the cluster.

Processor (or Mutator) Nodes

These nodes implement logic on top of the cluster data. E.g. a statistics processor runs a continuous query for each table, incrementally counts the number of rows of each table and writes the results back to a "statistics" table, so a monitoring client application can subscribe to realtime data of current table sizes. Another example would be a "matcher processor" in a stock exchange, listening to the orders for a stock; if orders match, it removes them and adds a trade to the "trades" table.
If one sees the whole cluster as kind of a “giant spreadsheet”, processors implement the formulas of this spreadsheet.

Scaling Up

  • With data size:
    increase the number of LVC nodes.
  • With the number of clients:
    increase the number of subscription processor nodes.
  • With transactions per second:
    scale up processor nodes and LVC nodes.

Of course, the system relies heavily on the availability of a "real" multicast messaging bus. Any point-to-point oriented or broker-oriented networking/messaging will be a massive bottleneck.

Conclusion

Building real time processing software backed by a continuous query system simplifies application development a lot.
  • It's model-view-controller at large scale.
    Astonishing: patterns used in GUI applications for decades have not regularly been extended to the backing data storage systems.
  • Any server side processing can be partitioned in a natural way. A processor node creates an in-memory mirror of its data partition using continuous queries. Processing results are streamed back to the data grid. Computing intensive jobs, e.g. risk computation of derivatives, can be scaled by adding processor instances subscribing to distinct partitions of the data ("sharding").
  • The size of the code base reduces significantly (both business logic and front-end).
    A lot of code in handcrafted systems deals with keeping data up to date.

About me

I am a technical architect/senior developer consultant at a European company involved heavily in stock and derivative trading systems.
Other blog posts of mine can be found here.

This post is part of the Java Advent Calendar and is licensed under the Creative Commons 3.0 Attribution license. If you like it, please spread the word by sharing, tweeting, FB, G+ and so on!

Performance tuning – measure don’t guess

In my performance tuning career I have given the advice to measure and not guess more often than I can recall. And in many cases the target of this advice has given up after looking at the monolithic 500,000 LOC legacy application they are working on.

In this post I am going to share some simple tools and concepts for setting up the initial measurement, so you have a baseline from which to start more fine-grained performance tuning.

Setting the goal

Pretty much any non-trivial application can be optimized forever, both in a bad and in a good way. Bad examples include tweaking random parts of the application and then praying for the best. Whatever the reason might be, I have seen tens and tens of developers "being pretty sure that exactly this line of code needs attention" without basing their gut feeling on any measurements.

The second category is almost as dangerous, as you can spend hundreds of man-years tuning your application "to be ready for the prime time". The definition of prime time might vary, but are you really sure you are going to accommodate terabytes of media, tens of millions of records in your database, and have to serve hundreds of thousands of concurrent users with less than 100ms latency? Unless you are aiming to put Google out of business, you most likely will not. Worse yet, spending so much time preparing for the prime time is a darn good way to ensure it never arrives. Instead of spending this time squeezing out an extra 1ms of latency nobody is likely to notice, maybe the same time should have been spent ironing out that layout bug annoying the end users on Safari instead?

So how do you set a meaningful target? For this you need to understand the business you are in. Real-time strategy games and first person shooters tend to have different requirements from online shopping carts. As, last time I checked, Java was not going strong in the gaming industry, let's assume you are dealing with a typical Java EE application with a web based front-end.

In this case, you should start by segmenting your application into different categories, which, based on the research, are similar to the following:

  • 0.1 seconds gives the feeling of instantaneous response — that is, the outcome feels like it was caused by the user, not the computer.
  • 1 second keeps the user’s flow of thought seamless. Users can sense a delay, and thus know the computer is generating the outcome, but they still feel in control of the overall experience and that they’re moving freely rather than waiting on the computer. This degree of responsiveness is needed for good navigation.
  • 10 seconds keeps the user’s attention. From 1–10 seconds, users definitely feel at the mercy of the computer and wish it was faster, but they can handle it. After 10 seconds, they start thinking about other things, making it harder to get their brains back on track once the computer finally does respond.

So you might have functionality, such as adding products to the shopping cart or browsing through the recommended items, which you need to squeeze into the "instant" bucket to provide the best overall user experience, resulting in better conversion rates. Initially, I am pretty sure your application is nowhere near these criteria, so feel free to replace the 100ms and 1,000ms thresholds with something you actually can achieve, such as 300ms and 1,500ms for example.

On the other hand, you most likely have some operations which are expensive enough, such as the products search or user account registration which might fall into the second bucket.

And last, you have the functionality which you can toss into the last bucket, such as generating a PDF of the invoice. Note that you also might end up with a fourth category; for example, if you are generating the bills of delivery for your warehouse, your business might be completely OK with batch processing the daily bills in 20 minutes.

Pay attention that business people have a tendency to toss everything into the "instant" bucket. Now it is your task to explain the effort required and to ask "if you had only three operations which completed under our threshold, what should those be?"

Now you should have your application functionality categorized into different categories, such as the following:

Requirement                        Category
Browse products                    instant
Add product to a shopping cart     instant
Search product                     seamless
Register new account               attention
Generate PDF invoice               attention
Generate daily bills of delivery   slow

Understanding the current situation

Your next goal is to understand where you will perform your measurements. For end users, the experience is complete when the page has rendered the result of the operation in their browser, which takes into account the network latency and browser DOM operations, for example.
As this is more complex to measure, let us assume your site has been well optimized against the Google Page Speed or YSlow recommendations and, for simplicity's sake, let's focus on the elements directly under your control.

In most cases, the last piece of infrastructure still under your control will be your web server, such as Apache HTTPD or nginx. If you have logging enabled, you will have access to something similar to the following in your nginx access.log:

82.192.41.11 - - [04/Dec/2013:12:16:11 +0200]  "POST /register HTTP/1.1" 301 184 "https://myshop.com/home" "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:25.0) Gecko/20100101 Firefox/25.0" 0.428
82.192.41.11 - - [04/Dec/2013:12:16:12 +0200] "POST /searchProduct HTTP/1.1" 200 35 "https://myshop.com/accountCreated" "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:25.0) Gecko/20100101 Firefox/25.0" 3.008
82.192.41.11 - - [04/Dec/2013:12:16:12 +0200] "GET /product HTTP/1.1" 302 0 "https://myshop.com/products/searchProducts" "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:25.0) Gecko/20100101 Firefox/25.0" 0.623
82.192.41.11 - - [04/Dec/2013:12:16:13 +0200] "GET /product HTTP/1.1" 200 35 "https://myshop.com/product/123221" "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:25.0) Gecko/20100101 Firefox/25.0" 0.828
82.192.41.11 - - [04/Dec/2013:12:16:13 +0200] "GET /product HTTP/1.1" 200 35 "https://myshop.com/product/128759" "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:25.0) Gecko/20100101 Firefox/25.0" 1.038
82.192.41.11 - - [04/Dec/2013:12:16:13 +0200] "GET /product HTTP/1.1" 200 35 "https://myshop.com/product/128773" "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:25.0) Gecko/20100101 Firefox/25.0" 0.627
82.192.41.11 - - [04/Dec/2013:12:16:14 +0200] "GET /addToShoppingCart HTTP/1.1" 200 35 "https://myshop.com/product/128773" "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:25.0) Gecko/20100101 Firefox/25.0" 2.808
82.192.41.11 - - [04/Dec/2013:12:16:14 +0200] "GET /purchase HTTP/1.1" 302 0 "https://myshop.com/addToShoppingCart" "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:25.0) Gecko/20100101 Firefox/25.0" 3.204
82.192.41.11 - - [04/Dec/2013:12:16:16 +0200] "GET /viewPDFInvoice HTTP/1.1" 200 11562 "https://myshop.com/purchase" "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:25.0) Gecko/20100101 Firefox/25.0" 3.018

Let's investigate now what we have found in the log. The snippet we have extracted contains the actions of one user completing a full transaction on our web site. The user has created an account, searched for products, browsed through the results, found a product he liked, added it to a shopping cart, completed the purchase and generated a PDF of the invoice. Those actions can be detected in the request column, such as "POST /searchProduct HTTP/1.1". The next important part is the last column, containing the total time it took for a particular request to complete. In the case of /searchProduct, it took 3.008 seconds to complete the search.

Note that by default nginx does not log request times, so you might need to modify your log_format by adding $request_time to the pattern.
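For example (a sketch only; the format name “timed” and the log path are placeholders for your own setup), a log_format matching the lines above could look like this in nginx.conf:

http {
    log_format timed '$remote_addr - $remote_user [$time_local] '
                     '"$request" $status $body_bytes_sent '
                     '"$http_referer" "$http_user_agent" $request_time';

    access_log /var/log/nginx/access.log timed;
}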

Now, with a little bit of grep/sed/excel magic, you will have something similar to the following at your fingertips:

Requirement                      | Category  | Mean  | 90%
Browse products                  | instant   | 0.734 | 0.902
Add product to a shopping cart   | instant   | 2.422 | 3.490
Search product                   | seamless  | 2.800 | 3.211
Register new account             | attention | 0.428 | 0.480
Generate PDF invoice             | attention | 3.441 | 4.595
Generate daily bills of delivery | slow      |       |
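If you prefer code to spreadsheet magic, a small helper along the following lines can compute the mean and the 90th percentile for one endpoint (a sketch; RequestTimeStats is a hypothetical class, and it assumes $request_time is the last field, as in the log_format above):

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class RequestTimeStats {
    public static void main(String[] args) throws IOException {
        String path = args.length > 0 ? args[0] : "access.log";
        String endpoint = args.length > 1 ? args[1] : "/searchProduct";
        List<Double> times = new ArrayList<Double>();
        for (String line : Files.readAllLines(Paths.get(path), StandardCharsets.UTF_8)) {
            if (!line.contains("\"POST " + endpoint) && !line.contains("\"GET " + endpoint)) {
                continue;
            }
            // $request_time is the last whitespace-separated field in our log_format
            String[] fields = line.trim().split("\\s+");
            times.add(Double.valueOf(fields[fields.length - 1]));
        }
        if (times.isEmpty()) {
            System.out.println("No matching requests for " + endpoint);
            return;
        }
        Collections.sort(times);
        double sum = 0;
        for (double t : times) {
            sum += t;
        }
        int p90 = (int) Math.ceil(times.size() * 0.9) - 1; // index of the 90th percentile
        System.out.printf("%s mean=%.3f 90%%=%.3f%n", endpoint, sum / times.size(), times.get(p90));
    }
}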

Picking the target for optimization

From here you get your list of optimisation targets. The suspects are obvious: they fall into the categories where the agreement is violated. In the case above there are problems with three requirements: Browse products, Add product to a shopping cart and Search product all violate your performance requirements.

This article does not focus on the actual optimisation tasks, but to give some food for thought: what should you do next, knowing that adding products to the shopping cart takes far too much time?

The next steps are definitely more application specific than the prior ones, and the tooling can vary as well; you might now decide to go with an APM or monitoring product. But if you do not have the budget, or just do not wish to fight with the procurement office, a simple solution is to add a logging trail at component boundaries using AOP.

For example, you can add something like the following to your application code to monitor the time spent in the service layer.

import org.aspectj.lang.ProceedingJoinPoint;
import org.aspectj.lang.annotation.Around;
import org.aspectj.lang.annotation.Aspect;
import org.springframework.stereotype.Component;

@Aspect
@Component
public class TimingAspect {

    // Intercepts every public method of *Service classes under com.myshop
    @Around("execution(* com.myshop..*Service+.*(..))")
    public Object time(ProceedingJoinPoint joinPoint) throws Throwable {
        // StopWatch is your own timing utility; keyed by method name here
        StopWatch.WatchStatus watch = StopWatch.getInstance().start(joinPoint.getSignature().getName());
        try {
            return joinPoint.proceed();
        } finally {
            watch.stop(); // stop even when the intercepted method throws
        }
    }
}
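For the aspect to fire, AspectJ auto-proxying must be enabled in your Spring configuration, e.g. via @EnableAspectJAutoProxy on a @Configuration class or <aop:aspectj-autoproxy/> in XML; StopWatch here stands in for whatever timing utility you already use.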

Decomposing the request time further gives you the insight required to understand where the bottlenecks are hiding. More often than not you will quickly discover an offending query, a faulty cache configuration or poorly decomposed functionality, enabling you to net your first quick wins in a matter of days or even hours.

This post is written by @iNikem from Plumbr and is part of the Java Advent Calendar. The post is licensed under the Creative Commons 3.0 Attribution license. If you like it, please spread the word by sharing, tweeting, FB, G+ and so on!

Applying ForkJoin – from optimal to fast

JDK 7 is well into the hands of developers by now, and most people have heard of ForkJoin, yet not so many have had the time or the chance to try it in their daily work.

It caused, and probably still causes, a bit of confusion about how it is any different from a normal thread pool.[1]

My goal in this article is to present a more elaborate, yet still simple example of ForkJoin usage through a code example.

I time and measure the performance of a serial vs. a thread pool vs. a ForkJoin approach.

Here is the GitHub repository up front: https://github.com/fbunau/javaadvent-forkjoin

Practical problem

Imagine we have some sort of component in our system that keeps the last price of a stock for every millisecond of time.

This could be held in memory as an array of integers (if we count the price in bps, i.e. basis points).

The clients of this component make queries like: what is the moment in time between time1 and time2 when the price was the lowest?

This could either be an automated algorithm or just someone in a GUI making rectangle selections.

[Image: the 7 queries in the example]

Let us also imagine that we get many such queries from a client batched up in a Task.

They may be batched up to reduce network traffic and round-trip time.
We have different sizes of tasks that the component might get: up to 10 queries (someone with a GUI), up to 100, ..., up to 1,000,000 (some automated algorithm). We have many such clients for the component, each producing tasks of different sizes. See Task.TaskType.

Core problem and solution

The problem we have to solve, at its core, is the RMQ (range minimum query) problem. Here is Wikipedia on it [2]:

“Given an array of objects taken from a well-ordered set (such as numbers), a Range Minimum Query (or RMQ) from i to j asks for the position of a minimum element in the sub-array A[i, j].”

“For example, when A = [0, 5, 2, 5, 4, 3, 1, 6, 3], then the answer to the range minimum query for A[3, 8] = [2, 5, 4, 3, 1, 6] is 7, as A[7] = 1.” (Note the 1-based indexing.)

There exists an efficient data structure for solving this problem called the “segment tree”.

I won’t go into detail on it, as it is excellently covered in this classic Topcoder article [3]. The structure itself is not important for this ForkJoin example; I chose it because it is more interesting than a simple sum, and because its essence is in the spirit of fork-join: it divides the task to be computed and then joins the results!

The data structure has O(N) initialization time and O(log N) query time, where N is the number of elements in our price-per-time-unit array.
A task T then contains M such queries to be made.
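For reference, a minimal segment tree for RMQ might look like the sketch below (illustrative only, not the repository's actual implementation; it returns positions using 0-based indexing):

public class SegmentTree {

    private final int n;
    private final int[] a;
    private final int[] tree; // tree[v] holds the index of the minimum in v's segment

    public SegmentTree(int[] values) {
        this.a = values;
        this.n = values.length;
        this.tree = new int[4 * n]; // safe upper bound on tree size
        build(1, 0, n - 1);
    }

    private void build(int v, int lo, int hi) {
        if (lo == hi) {
            tree[v] = lo;
            return;
        }
        int mid = (lo + hi) >>> 1;
        build(2 * v, lo, mid);
        build(2 * v + 1, mid + 1, hi);
        tree[v] = a[tree[2 * v]] <= a[tree[2 * v + 1]] ? tree[2 * v] : tree[2 * v + 1];
    }

    // Index of the minimum value in a[i..j] (inclusive, 0-based); O(log n)
    public int query(int i, int j) {
        return query(1, 0, n - 1, i, j);
    }

    private int query(int v, int lo, int hi, int i, int j) {
        if (j < lo || hi < i) return -1;        // no overlap
        if (i <= lo && hi <= j) return tree[v]; // total overlap
        int mid = (lo + hi) >>> 1;
        int left = query(2 * v, lo, mid, i, j);
        int right = query(2 * v + 1, mid + 1, hi, i, j);
        if (left == -1) return right;
        if (right == -1) return left;
        return a[left] <= a[right] ? left : right;
    }

    public static void main(String[] args) {
        int[] a = {0, 5, 2, 5, 4, 3, 1, 6, 3};
        // 0-based equivalent of the quoted example: the minimum of a[2..7] is a[6] == 1
        System.out.println(new SegmentTree(a).query(2, 7)); // prints 6
    }
}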

In an academic Computer Science approach you would just say that we will process each task with this efficient data structure, and the complexity per task will be O(N + M log N).

You can’t get more efficient than that!? Yes, in a theoretical von Neumann machine, but in practice you can.

An easy mistake is to conclude that, because O(n/4) == O(n), constant factors do not count when writing a program. But they do!
Stop and think: is it the same to wait 10 or 40 minutes / hours / years?

Going parallel

So, thinking about the problem to be solved, how can we make it faster? Since every computing device now has multiple cores for computation, let's put them to good use and do more things at once.
We can do that easily using the Fork/Join framework.

I was first tempted to tweak the RMQ data structure and execute its operations in parallel; I attacked something that was already O(log N). It was a big failure: the overhead of having the scheduler micromanage such short-running logic is far too high.

The answer, in the end, was to attack the M constant factor: the number of queries in a task.

Thread pool

Before presenting how a ForkJoin solution might be applied, let's imagine how we might apply a thread pool. See: TaskProcessorPool.java

We can have a pool of 4 workers; when we have a Task to do, we add it to the queue. As soon as a worker is available, it retrieves a pending task from the head of the queue and executes it.
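A minimal sketch of this approach (not the repository's actual TaskProcessorPool.java; the task body is a stand-in for processing one batch of queries):

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class TaskPoolSketch {
    public static void main(String[] args) throws InterruptedException {
        // A fixed pool of 4 workers, all pulling from one shared queue
        ExecutorService pool = Executors.newFixedThreadPool(4);
        for (int i = 0; i < 16; i++) {
            final int taskId = i;
            pool.submit(new Runnable() {
                @Override
                public void run() {
                    // Stand-in for processing one batch of RMQ queries
                    System.out.println("task " + taskId + " on " + Thread.currentThread().getName());
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES);
    }
}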

While this is fine when the tasks have roughly the same, relatively medium and predictable size, it runs into problems when the tasks to be executed differ greatly in size. One worker might be choked up with a long-running task while the others sit doing nothing.

In this image the thread pool does only 9 out of 16 possible units of work in the 4 units of time (56% efficiency), if no more tasks are added to the queue.

Fork Join

Fork join is useful when you are in a problem domain where you can split the task to be solved into smaller ones.

What is special about a fork-join pool is that it is a work-stealing thread pool.

Each worker thread maintains a local deque of tasks. When taking a new task for execution, it can either:

  • split the task into smaller ones
  • execute the task if it’s small enough

When a thread has no tasks left in its local deque, it ‘steals’: it pops a task from the back of another, randomly chosen thread's deque and puts it in its own. There is a high chance that this stolen task has not yet been split, so the thief will have quite some work on its hands.

Compared to the thread pool, instead of the other threads waiting for new work to arrive, they split the existing task into smaller ones and help the busy thread with its large task.

Here is the original paper by Doug Lea for a more detailed explanation: http://gee.cs.oswego.edu/dl/papers/fj.pdf

Coming back to our example, a large batch of operations can be split into multiple batches with a smaller number of operations each. See: TaskProcessorFJ.java

Most problems consist of a linear series of operations like this one; it does not have to be a specially parallel problem for which we need a specialised parallel algorithm in order to leverage the cores of the processor.

How much do you split? You split a task until you reach a threshold below which splitting generally makes no sense any more, e.g. when the cost of splitting, handing the job to a thread and context switching exceeds the cost of simply executing the task as it is.

For a big XXL task we have to do 1,000,000 query operations. We could split this into two tasks of 500,000 operations each and do them in parallel. Is 500,000 still large? Yes, so we split further. I have chosen a group of 10,000 operations as the threshold under which there is no use in splitting and we just execute the queries on the current thread.

Fork/Join does not split all the tasks upfront, but rather as it works through them.
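Here is a minimal sketch of that splitting logic (not the repository's actual TaskProcessorFJ.java; minPosition is a hypothetical stand-in for the segment tree query, implemented as a linear scan for brevity):

import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveAction;

public class QueryBatchTask extends RecursiveAction {

    private static final int THRESHOLD = 10_000; // below this, just execute

    private final int[] prices;    // price per millisecond
    private final int[][] queries; // each query is {from, to}, inclusive
    private final int[] results;   // one answer slot per query
    private final int start, end;  // half-open range of queries this task owns

    QueryBatchTask(int[] prices, int[][] queries, int[] results, int start, int end) {
        this.prices = prices;
        this.queries = queries;
        this.results = results;
        this.start = start;
        this.end = end;
    }

    @Override
    protected void compute() {
        if (end - start <= THRESHOLD) {
            // Small enough: execute directly on the current thread
            for (int q = start; q < end; q++) {
                results[q] = minPosition(queries[q][0], queries[q][1]);
            }
        } else {
            // Too big: split in two and let the pool work-steal
            int mid = (start + end) >>> 1;
            invokeAll(new QueryBatchTask(prices, queries, results, start, mid),
                      new QueryBatchTask(prices, queries, results, mid, end));
        }
    }

    // Hypothetical stand-in for the segment tree query; linear scan for brevity
    private int minPosition(int from, int to) {
        int best = from;
        for (int i = from + 1; i <= to; i++) {
            if (prices[i] < prices[best]) {
                best = i;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        int[] prices = {3, 1, 4, 1, 5, 9, 2, 6};
        int[][] queries = {{0, 3}, {2, 7}};
        int[] results = new int[queries.length];
        new ForkJoinPool().invoke(new QueryBatchTask(prices, queries, results, 0, queries.length));
        System.out.println(results[0] + ", " + results[1]); // prints 1, 3
    }
}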

Performance results

I ran 4 iterations of each processor implementation on my i5-2500 CPU @ 3.30GHz, which has 4 cores / 4 threads, after a clean reboot.
Here are the results:

Doing 4 runs for each of the 3 processors. Pls wait ...
TaskProcessorSimple: 7963
TaskProcessorSimple: 7757
TaskProcessorSimple: 7748
TaskProcessorSimple: 7744
TaskProcessorPool: 3933
TaskProcessorPool: 2906
TaskProcessorPool: 4477
TaskProcessorPool: 4160
TaskProcessorFJ: 2498
TaskProcessorFJ: 2498
TaskProcessorFJ: 2524
TaskProcessorFJ: 2511
Test completed.

Conclusions

Even if you have chosen the optimal data structure, your code is not fast until you use all the resources you have, i.e. until you exploit all the cores.

ForkJoin is definitely an improvement over the thread pool in certain problem domains, and it is worth exploring where it can be applied; we will get to see more and more parallel code.
You can buy a 12 core / 24 thread processor today. Now we just have to write the software to exploit the great hardware that we have, and that we will get in the future.

The code is here: https://github.com/fbunau/javaadvent-forkjoin if you want to play with it.

Thanks for your time, drop some comments if you see any errors or have things to add.

Meta: this post is part of the Java Advent Calendar and is licensed under the Creative Commons 3.0 Attribution license. If you like it, please spread the word by sharing, tweeting, FB, G+ and so on!

Using Intel Performance Counters To Tune Garbage Collection


Introduction

I have to admit that I was shocked. Indeed, quite shaken, when I realised this advent calendar post would be about garbage collection. The topic of GC has raised such passion amongst the advocates of Java and those who believe memory management should be manual. Many an article has been written regarding tiny, subtle changes in strange-looking command line arguments which have fractions of a percentage point of performance impact on Java applications. How could I add to this huge body of work? I hope this post will not add to the GC hot air but rather be a breath of fresh air instead.

Let us not look at the CPU time consumed by the garbage collector, or at pause times; how about looking at a hidden yet potentially critical aspect of memory management in general, and garbage collection in particular: data caching.

Data caching is one of the major challenges of modern computer software design (the others being instruction caching and multi-core work distribution). Modern CPUs run so fast that main memory has no hope of keeping up. The way some of this catastrophic performance penalty can be clawed back is caching: memory is pulled in parallel into high speed cache memory, and the CPU then accesses this cache. If we are lucky, and the code causes the CPU to read and write the same memory a few times (inside a loop, for example), then the CPU can happily access the cache and is saved from waiting for loads and stores to and from main memory.

“How does garbage collection impact the performance of caches?” one might ask. There are many ways, some of them very subtle; however, here is a grab bag of some important ones. The garbage collector traverses references in memory. This causes cache lines (blocks of memory in the cache) to contain the memory surrounding the reference and hence no longer hold other data which the program is using. Whilst we call it a garbage collector, it is actually an allocator, a mover and a collector, and this really matters when we think about data caching:

  • Allocation: Based on hardware rules, memory addresses are matched to cache lines. If pieces of memory share a cache line but are actually accessed from different threads, we get an effect called false sharing (a small demonstration follows after this list). However, if little bits of data are spread out but accessed from the same thread, we get poor cache utilisation.
  • Moving: Objects are not left in one place throughout their lifetimes. The garbage collector avoids memory fragmentation by moving objects around. This has the interesting effect of guaranteeing that the cache lines associated with an object will no longer be associated with it after a move.
  • Collecting: The funny thing is that collecting is the easy bit. It can be as simple as just marking the memory as available for reuse. It is the traversal of the object graphs (from multiple roots) to find out what can be collected which causes data cache line loads, and thus evicts lines from the cache which were being read from or written to by user code.

So we can now see that the design of the garbage collector is critical to the behaviour of the data cache. Swapping which collector we use will not only have an impact on GC pauses and other obvious issues; it will also affect, at a low level and in a fundamental way, all user code.
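As promised, here is a toy sketch that provokes the false sharing effect mentioned above (a hypothetical demo class; results vary with hardware, and AtomicLongArray is used so that every increment really hits memory). Two threads hammering adjacent counters will typically run measurably slower than the same threads hammering counters padded onto separate cache lines:

import java.util.concurrent.atomic.AtomicLongArray;

public class FalseSharingDemo {

    static long timeRun(final AtomicLongArray counters, final int i0, final int i1)
            throws InterruptedException {
        Thread t1 = new Thread(new Runnable() {
            public void run() { for (int n = 0; n < 50_000_000; n++) counters.getAndIncrement(i0); }
        });
        Thread t2 = new Thread(new Runnable() {
            public void run() { for (int n = 0; n < 50_000_000; n++) counters.getAndIncrement(i1); }
        });
        long start = System.nanoTime();
        t1.start(); t2.start();
        t1.join(); t2.join();
        return (System.nanoTime() - start) / 1_000_000; // milliseconds
    }

    public static void main(String[] args) throws InterruptedException {
        AtomicLongArray counters = new AtomicLongArray(32);
        // Indices 0 and 1 almost certainly share a 64-byte cache line;
        // indices 0 and 16 are 128 bytes apart and almost certainly do not.
        System.out.println("adjacent counters (ms): " + timeRun(counters, 0, 1));
        System.out.println("padded counters   (ms): " + timeRun(counters, 0, 16));
    }
}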

An Example

I am not going to present an exhaustive scientific paper on this concept. The purpose of this post is to show an alternative way of approaching JVM tuning. So, I ran a simple, short, multi-threaded patch in my personal synthesiser program Sonic Field. The patch uses feedback, resonance and a bunch of other concepts to synthesise string instruments, and then convolution to place the sounds in an acoustic environment.

The reason for picking Sonic Field is not just that it is of reasonable complexity, highly threaded and uses Spring, but that I recently found I could get better performance from it using the CMS garbage collector. Latency with Sonic Field is of no interest because it is a batch processor. However, the standard Java 7 garbage collector interacted badly with the way Sonic Field writes swap files out to disk when running low on memory. I tried CMS because it keeps memory use down the whole time (in theory – don't flame me) by constantly attempting small garbage collections alongside the user threads.

If we put all this together, we might well come up with a reasonable theory: “The CMS garbage collector might give fewer pauses and might be able to keep memory use down, but in so doing it will almost certainly cause more data cache misses.” Constantly traversing the reference graph in memory to try and collect dead objects is going to cause cache loads, and those loads will cause other data to be flushed from the cache (it has a finite size). Thus, when user threads come to read again, they will incur more cache misses, and so on.

Does it matter? The answer will be entirely down to the application, the hardware and the load on the application. I am not, I repeat not, advocating one garbage collector over another! Nevertheless, it is a question I would like to answer, so let's answer it for my little test patch.

These data cache effects of the garbage collector are not visible from the normal JVM profiling tools. This means they do not get discussed much in the JVM community, and they get considered in JVM tuning even less. However, there is a tool (actually several, but I am going to talk about the easiest to use) which can shed some light on the topic. I am talking about Intel's PCM (Performance Counter Monitor). It can be used for code tuning as well, but I thought talking about the GC would be more fun today.

A Worked Example

pcm is just a command line tool. We pass it the command line for running Java in quotes, and it takes its measurements. With other tooling, the performance counters can be used to get all sorts of further detail about an application. The benefit of the pcm command line tool is its simplicity and lack of intrusion into the overall application run. The disadvantage is that it will also measure the JVM and application warm-up phases. However, for server style applications or batch processors (like Sonic Field) these overheads are usually trivial compared to the actual application run.

I ran my patch on my personal Macbook Pro Retina (2012) with 16GB of RAM. The JVM was:

java version "1.8.0-ea"

Java(TM) SE Runtime Environment (build 1.8.0-ea-b61)

Java HotSpot(TM) 64-Bit Server VM (build 25.0-b05, mixed mode)

Readings from pcm are simply written to standard out when the application exits. I compared runs with no settings for the garbage collector (i.e. the default) and with my currently preferred set of tweaks. To be honest, I am not sure the tweaks are optimal; I kind of lifted them from a bunch of online articles… Here is the launch script for Java:

/Users/alexanderturner/x/IntelPerformanceCounterMonitorV2.5.1 2/pcm.x "java -Xmx12G -Xms12G  -DsonicFieldTemp=/Users/alexanderturner/temp -DsonicFieldThreads=12 -DsonicFieldSwapLimit=4.0  -XX:+UseConcMarkSweepGC -XX:+UseCompressedOops -XX:ParallelGCThreads=8 -XX:+CMSParallelRemarkEnabled -XX:CMSInitiatingOccupancyFraction=60 -XX:+UseCMSInitiatingOccupancyOnly -classpath bin:spring-framework-3.1.2.RELEASE/dist/org.springframework.asm-3.1.2.RELEASE.jar:spring/spring-framework-3.1.2.RELEASE/dist/org.springframework.beans-3.1.2.RELEASE.jar:spring-framework-3.1.2.RELEASE/dist/org.springframework.core-3.1.2.RELEASE.jar:spring/spring/spring-framework-3.1.2.RELEASE/dist/org.springframework.context-3.1.2.RELEASE.jar:
spring/spring-framework-3.1.2.RELEASE/dist/org.springframework.context-support-3.1.2.RELEASE.jar:
spring/spring-framework-3.1.2.RELEASE/dist/org.springframework.expression-3.1.2.RELEASE.jar:
spring/spring-framework-3.1.2.RELEASE/dist/org.springframework.test-3.1.2.RELEASE.jar:
spring/otherJars/commons-logging-1.1.1.jar com.nerdscentral.sfpl.RenderRunner $1"

Hopefully it is clear just how simple running Java under Intel Performance Counter Monitor v2 really is. So, here is the output:

Standard GC

Core (SKT) EXEC IPC FREQ AFREQ L3MISS L2MISS L3HIT L2HIT L3CLK L2CLK READ WRITE TEMP
0 0.53 0.75 0.70 1.31 422 M 621 M 0.32 0.32 0.14 0.01 N/A N/A 32
2 0.56 0.77 0.73 1.31 346 M 466 M 0.26 0.31 0.11 0.01 N/A N/A 28
1 0.22 0.69 0.32 1.31 144 M 192 M 0.25 0.28 0.11 0.01 N/A N/A 32
3 0.21 0.68 0.31 1.31 135 M 171 M 0.21 0.28 0.10 0.01 N/A N/A 28
4 0.55 0.77 0.71 1.31 332 M 410 M 0.19 0.38 0.11 0.01 N/A N/A 22
7 0.18 0.68 0.26 1.30 124 M 134 M 0.08 0.30 0.11 0.00 N/A N/A 27
5 0.19 0.68 0.29 1.31 133 M 155 M 0.14 0.30 0.11 0.00 N/A N/A 22
6 0.61 0.79 0.78 1.32 343 M 382 M 0.10 0.35 0.10 0.00 N/A N/A 27

----------------------------------------------------------------------------------------------------

SKT 0 0.38 0.75 0.51 1.31 1982 M 2533 M 0.22 0.33 0.11 0.01 N/A N/A 22

Instructions retired: 2366 G ; Active cycles: 3166 G ; Time (TSC): 773 Gticks ; C0 (active,non-halted) core residency: 39.04 %

TOTAL * 0.38 0.75 0.51 1.31 1982 M 2533 M 0.22 0.33 0.11 0.01 N/A N/A N/A

PHYSICAL CORE IPC: 1.49 => corresponds to 37.36 % utilization for cores in active state
Instructions per nominal CPU cycle: 0.76 => corresponds to 19.12 % core utilization over time interval
C1 core residency: 23.92 %; C3 core residency: 0.01 %; C6 core residency: 0.00 %; C7 core residency: 37.02 %
C2 package residency: 0.00 %; C3 package residency: 0.00 %; C6 package residency: 0.00 %; C7 package residency: 0.00 %

Concurrent Mark Sweep

For this run the default collector was swapped for CMS using the following flags (as in the launch script above):

-XX:+UseConcMarkSweepGC
-XX:+UseCompressedOops
-XX:ParallelGCThreads=8
-XX:+CMSParallelRemarkEnabled
-XX:CMSInitiatingOccupancyFraction=60
-XX:+UseCMSInitiatingOccupancyOnly

Core (SKT) EXEC IPC FREQ AFREQ L3MISS L2MISS L3HIT L2HIT L3CLK L2CLK READ WRITE TEMP
0 0.53 0.69 0.76 1.31 511 M 781 M 0.35 0.35 0.17 0.02 N/A N/A 26
2 0.54 0.71 0.75 1.31 418 M 586 M 0.29 0.40 0.14 0.01 N/A N/A 29
1 0.31 0.66 0.47 1.30 207 M 285 M 0.27 0.26 0.11 0.01 N/A N/A 26
3 0.30 0.66 0.46 1.30 198 M 258 M 0.23 0.27 0.11 0.01 N/A N/A 29
4 0.59 0.73 0.81 1.31 397 M 504 M 0.21 0.46 0.12 0.01 N/A N/A 29
7 0.30 0.66 0.45 1.30 186 M 204 M 0.09 0.29 0.11 0.00 N/A N/A 30
5 0.30 0.66 0.45 1.30 188 M 225 M 0.16 0.28 0.11 0.01 N/A N/A 29
6 0.58 0.73 0.79 1.31 414 M 466 M 0.11 0.49 0.13 0.00 N/A N/A 30

----------------------------------------------------------------------------------------------------

SKT 0 0.43 0.70 0.62 1.31 2523 M 3313 M 0.24 0.38 0.13 0.01 N/A N/A 25
Instructions retired: 2438 G ; Active cycles: 3501 G ; Time (TSC): 708 Gticks ; C0 (active,non-halted) core residency: 47.22 %

----------------------------------------------------------------------------------------------------
TOTAL * 0.43 0.70 0.62 1.31 2523 M 3313 M 0.24 0.38 0.13 0.01 N/A N/A N/A

PHYSICAL CORE IPC: 1.39 => corresponds to 34.83 % utilization for cores in active state
Instructions per nominal CPU cycle: 0.86 => corresponds to 21.51 % core utilization over time interval
C1 core residency: 17.84 %; C3 core residency: 0.01 %; C6 core residency: 0.01 %; C7 core residency: 34.92 %
C2 package residency: 0.00 %; C3 package residency: 0.00 %; C6 package residency: 0.00 %; C7 package residency: 0.00 %

All the information given here is of interest; however, there is so much of it that I figure the best thing to do is cut to the chase and test my assertion about the CMS collector. To do that we can look at just two lines from the output of each run:

Default:
SKT 0 0.38 0.75 0.51 1.31 1982 M 2533 M 0.22 0.33 0.11 0.01 N/A N/A 22
Instructions retired: 2366 G ; Active cycles: 3166 G ; Time (TSC): 773 Gticks ; C0 (active,non-halted) core residency: 39.04 %

CMS:
SKT 0 0.43 0.70 0.62 1.31 2523 M 3313 M 0.24 0.38 0.13 0.01 N/A N/A 25
Instructions retired: 2438 G ; Active cycles: 3501 G ; Time (TSC): 708 Gticks ; C0 (active,non-halted) core residency: 47.22 %

Discussion

We can see that under the CMS collector there were substantially more cache misses: L2 misses were 30% greater and L3 misses were up 27% over the default collector. Nevertheless, the total time taken in giga-ticks (708 CMS vs 773 default) shows that all this extra cache missing did not hurt overall performance at all. I guess this means that a lot more research could, and should, be done before drawing any conclusions as to the correct approach for this application!
If you leave this post thinking that I did not fully discuss the subject, you are correct. My intention here has been to get the reader interested in thinking about this aspect of Java performance, and to open the door to a new approach.

Meta: this post is part of the Java Advent Calendar and is licensed under the Creative Commons 3.0 Attribution license. If you like it, please spread the word by sharing, tweeting, FB, G+ and so on!