Groovy and Data Science

Introduction

In the world of data science, Python is the most popular programming language. There are plenty of other contenders in the non-JVM world including C++, R, MATLAB, Julia, and JavaScript. Python is often preferred for its friendly syntax, many built-in capabilities, and widely available libraries.

JVM languages have also been widely used for data science. Languages like Java and Scala are often considered for data science projects since running on the JVM can bring many benefits. Groovy sits alongside those languages as a great alternative to consider for your next data science project. It is sometimes referred to as the Python of the JVM. It also offers a friendly syntax, many built-in capabilities and can use a wide variety of available libraries.

In this blog post, we’ll examine several common data science and machine learning tasks and see how to perform them using Groovy and JVM libraries. Along the way, we’ll mention beneficial aspects that Groovy brings and highlight advantages, such as speed and the ability to scale, that come from using the JVM.

We’ll cover the following data science activities:

  • Using dataframes and visualization libraries to explore candle ratings and reviews.
  • Predicting house price using linear regression.
  • Classifying Iris flowers using traditional algorithms and neural networks.
  • Clustering single-malt Scotch whiskies by flavor characteristics.
  • Various natural language processing tasks.
  • Detecting objects within images.

Key takeaways for Groovy are:

  • Groovy offers a friendly Java-like syntax with dynamic or static typing capabilities. Its metaprogramming capabilities often simplify the code.
  • Groovy aligns closely with Java, which has multiple benefits:
    • The learning curve is reduced. Data scientists can cut and paste most Java examples when learning a new library and add Groovy idioms over time.
    • As the JVM evolves, Groovy automatically obtains new features and improvements by piggy-backing on the great work of the JVM developers.
    • No special Groovy support is needed for frameworks. Frameworks which offer Java support automatically offer Groovy support. Additional Groovy enhancements can be added if desired.
  • Groovy data science implementations can take advantage of the many options for scaling that exist on the JVM.

About Groovy

Apache Groovy is a multi-faceted programming language for the JVM. Its goal is to provide a Java-like experience to users of the language but allow for greatly simplified code in many scenarios.

As an example, we could write the following Java program to calculate Fibonacci numbers using matrix manipulation:

Java Fibonacci calculation using Apache Commons Math

We can set up Groovy to know about this library and customize the output, to instead allow code and execution like this:

Groovy Fibonacci calculation using Apache Commons Math

This simplifies the code and cognitive load for the data scientist yet makes identical calls to our matrix library. The output is a lot prettier too!
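The two listings above appear as images in the original post. As a rough, hypothetical stand-in (not the article’s exact code, and without the output customization it describes), the underlying matrix calculation with Apache Commons Math might look something like this in Groovy:

// Minimal sketch: Fibonacci via matrix powers with Apache Commons Math
// (requires a commons-math jar, e.g. commons-math3, on the classpath).
// fib(n) is the off-diagonal entry of the n-th power of [[1, 1], [1, 0]].
import org.apache.commons.math3.linear.MatrixUtils

def m = MatrixUtils.createRealMatrix([[1d, 1d], [1d, 0d]] as double[][])
(1..10).each { n ->
    println "fib($n) = ${m.power(n).getEntry(0, 1) as long}"
}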

We’ll see a little more output customization when we get to the natural language examples.

We just illustrated a matrix example. Data scientists will likely end up using matrices all the time but might rarely do so directly. They will often be used under the covers by higher level algorithms. Readers interested in matrices can have a look at a Groovy example that creates neural networks by hand using matrices for digit recognition. The following blog post may also be of interest. It looks at a range of additional matrix calculations including one exciting area which is speeding up matrix calculations using the (currently incubating) Vector API.

Before diving further into our examples, it is worthwhile briefly talking about typing. Groovy was originally designed as a dynamically-typed complement to Java. Groovy’s dynamic nature allows the language to be augmented at runtime using techniques similar to those found in Python, Ruby, Smalltalk and Clojure. Groovy also has a static nature allowing improved compile-time type checking similar to Java, Scala and Kotlin. Both the dynamic and static natures offer extensibility options.
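As a small, contrived illustration of those two natures (not code from the original post), the same calculation can be written dynamically or with compile-time checking:

// dynamic by default: types are resolved at runtime
def dynamicMean(xs) { xs.sum() / xs.size() }

// opt in to static type checking and static compilation
@groovy.transform.CompileStatic
double staticMean(List<Double> xs) { (xs.sum() as double) / xs.size() }

assert dynamicMean([1, 2, 3]) == 2
assert staticMean([1d, 2d, 3d]) == 2d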

While it isn’t the focus of this blog, Groovy has great support for writing Domain Specific Languages (DSLs). As an example, here is a line of code that might be used to control a Mars rover robot:

move right by 2.m at 5.cm/s

To compile such code, we might declare a move method, and use metaprogramming to define m and cm properties for numbers, among other things. We call this process “defining our DSL”. We have options to leverage Groovy’s dynamic or static natures when designing the DSL. If we have a very dynamic DSL, we can catch accidental (or malicious) incorrect commands for the rover, e.g. we might throw an exception or return some error code for the following command:

move forward by 2.kgs

If we have a type-rich DSL, attempts to compile the above line might result in a compile-time error like this:

[Static type checking] - Cannot call by(Quantity<Length>) with arguments [Quantity<Mass>]

Another aspect we might want to incorporate into our rover DSL is speed limiting the rover for energy conservation or safety reasons. Perhaps the speed should be limited to 5 cm/s, so that the following line would be considered an invalid rover command:

move right by 2.m at 6.cm/s

We could put the appropriate defensive programming guards into our move method to detect invalid speeds at runtime. However, the type checker itself is also extensible, so we can bake such constraints into the type system if we choose, in which case we might see a compile-time error like this:

[Static type checking] - Speed of 6 is too fast!

There are numerous ways to encode such a constraint into a type system. The approach shown here puts the burden of doing such an encoding on the DSL designer, not the DSL user. To see more about how to design such DSLs, including incorporating Java’s Units of Measurement API 2.0 (JSR 385) see this blog post.
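To make the metaprogramming side of this a little more concrete, here is a minimal, hypothetical sketch of the dynamic flavour of such a DSL. The names and the metre-only handling are illustrative assumptions; the real rover DSL in the linked post is richer, covering units and speed limits:

// a toy, dynamic take on "move right by 2.m" (illustrative only)
enum Direction { left, right, forward, backward }

class Mover {
    Direction dir
    void by(Number metres) { println "moving $dir by $metres m" }
}

// metaprogramming: give integers an 'm' (metres) property
Integer.metaClass.getM = { delegate }

Mover move(Direction dir) { new Mover(dir: dir) }

def right = Direction.right
move right by 2.m          // command chain for: move(right).by(2.m)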

Most of our examples use only the built-in metaprogramming enhancements in Groovy. These simplify code using lists, maps and Strings among other things. We will show one example later of more specialised metaprogramming when we look at using Apache Beam for scaling linear regression. It is worthwhile keeping in mind that if we have many similar data science scripts to write, creating a DSL may further increase productivity when writing those scripts.

For more information on Apache Groovy, you can visit the project website, read more about Groovy’s history, and see some more information about Groovy and data science.

Data Science Libraries

When performing data science tasks, Groovy has many useful built-in general purpose features, but there are many libraries you’ll probably want to use for more data science or machine learning specific tasks. The following table provides a non-exhaustive list of such libraries that we use or mention in this blog.

Technologies/libraries covered:

  • Data manipulation: Weka, Tablesaw, Apache POI, Apache Camel, Apache Commons CSV, Encog, Datavec, Tribuo
  • Data science algorithms: Weka, Smile, Encog, Tribuo, DeepLearning4J, Deep Netts, Apache Commons Math
  • Scaling data science: Apache Spark, Apache Ignite, Apache Beam, Apache Wayang (incubating), GPars, Spark-NLP, DJL with Tensorflow, DeepLearning4J with Apache MXNet, GraalVM
  • Visualization: XChart, Tablesaw Plot.ly, JavaFX, GroovyFX

Candles

An example that looks at reading spreadsheets, using dataframes, and creating graphs.

An interesting series of tweets around scented candles emerged about a year into the pandemic. One of the symptoms of COVID was loss of smell. About the time that infection rates were increasing, complaints about the lack of scent in scented candles were also increasing. Several folks explored the data in more detail, as shown in the following tweet.

Tweet about scented candle reviews

Let’s explore the same data using Groovy.

We’ll look first at the review data contained in the spreadsheet Scented_all.xlsx, which is on the classpath.

var url = getClass().classLoader.getResource('Scented_all.xlsx')
var table = new XlsxReader().read(builder(url).build())

Here we are using the Tablesaw library which provides a dataframe abstraction and has an add-on for reading Excel spreadsheet files.

For most of these code snippets we have not shown the relevant imports (they are in the complete listing in the repo). For this example, we use import aliasing to make the code more succinct. Since aliasing may be less familiar to some readers, we’ll show one example import:

import static tech.tablesaw.api.StringColumn.create as sCol

This is the same as a standard static import but also renames (or rather provides an alias for) the method. This is handy for our example where we’d otherwise have multiple create methods, and we’d only be able to have one of them as a static import. Now we can use the imported sCol method to create a new String column in our table and later we’ll use similarly defined dCol and bCol aliases for creating double and Boolean columns.
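For completeness, the dCol and bCol aliases used below would presumably be declared in the same way, e.g.:

import static tech.tablesaw.api.DoubleColumn.create as dCol
import static tech.tablesaw.api.BooleanColumn.create as bCol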

Our table has a Date column already but we’ll create an additional Month column containing just the month name as a string:

var monthCol = sCol('Month', table.column('Date').collect {
    it.month.toString()
})

Then we’ll create an additional Noscent Boolean column. Values in that column will be true if the review text matches any of a number of regex patterns:

var candidates = ['[Nn]o scent', '[Nn]o smell', '[Ff]aint smell',
    '[Ff]aint scent', "[Cc]an't smell", '[Dd]oes not smell like',
    "[Dd]oesn't smell like", '[Cc]annot smell', "[Dd]on't smell",
    '[Ll]ike nothing']
var noScentCol = bCol('Noscent', table.column('Review').collect {
    review -> candidates.any { review =~ it }
})

We’ll add our newly created columns to our table.

table.addColumns(monthCol, noScentCol)

Next, let’s collect the reviews which happened after COVID started and summarize the counts per month and counts of negative reviews per month.

var start2020 = LocalDateTime.of(2020, JANUARY, 1, 0, 0)
var byMonth2020 = table
    .where(r -> r.dateTimeColumn('Date').isAfter(start2020))
    .sortAscendingOn('Date')
    .summarize('Noscent', countTrue, count)
    .by('Month')

Next, we’ll count the proportion of “noscent” to total reviews.

double[] nsprop = byMonth2020.collect {
    it.getDouble('Number True [Noscent]') /
        it.getDouble('Count [Noscent]')
}

Now, we’ll calculate the standard error and high and low values for the error bars. Some libraries might be able to show error bars automatically. That’s not the case here, but it’s easy enough to create them ourselves:

var indices = 0..<byMonth2020.size()
double[] se = indices.collect {
    sqrt(nsprop[it] * (1 - nsprop[it]) /
    byMonth2020[it].getDouble('Count [Noscent]')) }
double[] barLower = indices.collect { nsprop[it] - se[it] }
double[] barHigher = indices.collect { nsprop[it] + se[it] }
byMonth2020.addColumns(dCol('nsprop', nsprop),
                       dCol('barLower', barLower),
                       dCol('barHigher', barHigher))

Now, we graph the results of our calculations:

var title = 'Proportion of top 5 scented candles on Amazon mentioning lack of scent by month 2020'
var layout = Layout.builder(title, 'Month',
    'Proportion of reviews')
    .showLegend(false).width(1000).height(500).build()
var trace = BarTrace.builder(
    byMonth2020.categoricalColumn('Month'),
    byMonth2020.nCol('nsprop'))
    .orientation(VERTICAL).opacity(0.5).build()
var errors = ScatterTrace.builder(
    byMonth2020.categoricalColumn('Month'),
    byMonth2020.nCol('barLower'), byMonth2020.nCol('barHigher'),
    byMonth2020.nCol('barLower'), byMonth2020.nCol('barHigher'))
    .type("candlestick").opacity(0.5).build()
var chart = new Figure(layout, trace, errors)
var parentDir = new File(url.file).parentFile
Plot.show(chart, new File(parentDir, 'ReviewBarchart.html'))

Our example uses the Tablesaw Plot.ly integration which fires open a browser page showing the following chart:

Barchart of customer reviews for scented candles during 2020

We can use similar code to look at how ratings for the top 3 best-selling candles have changed before and after COVID for scented and unscented candles. This results in the following graphs:

Ratings for scented candles during 2020
Ratings for unscented candles during 2020

While this analysis doesn’t attempt to analyze all reasons for the change in candle ratings, we can see that the ratings for the scented candles drop off more dramatically than unscented ones once COVID infections increased.

As a final topic for this candle example, data scientists are often familiar with SQL, so instead of using Tablesaw’s Excel integration and subsequent table aggregation functions, we could just as easily read the spreadsheet using Apache POI (details here) and calculate the “noscent” proportions using Groovy’s language integrated query capability (also known as Ginq or GQuery).

from row in table
where row.Date > start2020
groupby row.Month
orderby row.Date
select row.Month,
       agg(_g.toList().count{ it.row.NoScent }) / count(row.Date)

We could similarly go on to calculate the error bars and display our results graphically.

Linear Regression

An example covering reading CSV files, using ordinary least squares, some additional graphing options including GroovyFX, and how to scale regression using Apache Beam.

Regression analysis is widely used for prediction and forecasting. It provides a statistical process for determining the relationship between some independent variables (or features) and some dependent variable (or desired outcome). Linear regression looks for a linear relationship between such variables.

For us, house price is the desired outcome, and we’ll look for a relationship with features like number of bedrooms, number of bathrooms, square feet of living space, and others. We’ll use the Kaggle dataset for King County between May 2014 and May 2015.

house

Preliminary steps

In our Candle example, we dived right in and started working with the data. In general, we might want to explore the data first and potentially perform some clean-up to remove anomalous data and work out how to handle potentially missing data.

Let’s look at the data again with Tablesaw:

var file = getClass().classLoader.getResource('kc_house_data.csv').file
Table rows = Table.read().csv(file)
println rows.shape()
println rows.structure()
println rows.column("bedrooms").summary().print()
println rows.where(rows.column("bedrooms").isGreaterThan(10))

It has this output:

kc_house_data.csv: 21613 rows X 21 cols
      Structure of kc_house_data.csv       
 Index  |   Column Name   |  Column Type  |
-------------------------------------------
     0  |             id  |         LONG  |
     1  |           date  |       STRING  |
     2  |          price  |       DOUBLE  |
     3  |       bedrooms  |      INTEGER  |
     4  |      bathrooms  |       DOUBLE  |
     5  |    sqft_living  |      INTEGER  |
     6  |       sqft_lot  |      INTEGER  |
     7  |         floors  |       DOUBLE  |
   ...  |            ...  |          ...  |
         Column: bedrooms          
 Measure   |        Value         |
-----------------------------------
    Count  |               21613  |
      sum  |               72854  |
     Mean  |   3.370841623097218  |
      Min  |                   0  |
      Max  |                  33  |
    Range  |                  33  |
 Variance  |  0.8650150097573497  |
 Std. Dev  |   0.930061831147451  |

kc_house_data.csv                                                                                                                                          
    id     | price  | bedrooms | bathrooms | sqft_living | sqft_lot | floors | ...
----------------------------------------------------------------------------------
1773100755 | 520000 |       11 |         3 |        3000 |     4960 |      2 | ...
2402100895 | 640000 |       33 |      1.75 |        1620 |     6000 |      1 | ...

The summary for the bedroom feature showed a maximum value of 33, so we displayed all houses with more than 10 bedrooms. Given the number of bathrooms and the sqft_living size, the second of these appears to be an anomaly in the data. Possibly someone typed 33, rather than 3, when entering the bedroom value.

Let’s remove all properties with more than 30 bedrooms and examine the number of bedrooms as a histogram. We’ll use Apache Commons CSV to read the CSV file, Apache Commons Math to collate our histogram and produce statistics, and GroovyFX for our graph.

var full = getClass().classLoader.getResource('kc_house_data.csv').file
var csv = CSV.withFirstRecordAsHeader().parse(new FileReader(full))
var all = csv.collect { it.bedrooms.toInteger() }.findAll{ it < 30 }
var stats = new SummaryStatistics()
all.each{ stats.addValue(it as double) }
println stats.summary
var dist = new EmpiricalDistribution(all.max()).tap{ load(all as double[]) }
var bins = dist.binStats.withIndex().collectMany { v, i -> [i.toString(), v.n] }

start {
    stage(title: 'Number of bedrooms histogram', show: true, width: 800, height: 600) {
        scene {
            barChart(title: 'Bedroom count', barGap: 0, categoryGap: 2) {
                series(name: 'Number of properties', data: bins)
            }
        }
    }
}

Which has this output:

StatisticalSummaryValues:
n: 21612
min: 0.0
max: 11.0
mean: 3.3694706644456733
std dev: 0.907981787328914
variance: 0.8244309261210092
sum: 72821.0

And produces the following graph.

number of bedrooms histogram

If we have more heavy-duty data integration needs, we could consider incorporating Apache Camel into our workflow. We might, for instance, use it when exploring the data for outliers. We might also want to become a little more systematic in finding our outliers by using ZScores as found in Apache Commons Math, or a Support Vector Machine anomaly detector as found in Tribuo.

Once we are happy with exploring and potentially cleaning the data, we can move into building and using our prediction model.

Ordinary least squares

Ordinary least squares finds our regression relationship by minimizing residual errors.

ordinary least squares
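In other words, for the simple single-feature case we are choosing the intercept and slope that minimize the sum of squared residuals (the standard formulation, stated here for reference):

$$\min_{\beta_0,\,\beta_1}\ \sum_{i=1}^{n}\bigl(y_i - (\beta_0 + \beta_1 x_i)\bigr)^2$$

Multiple linear regression, used a little later, simply adds one coefficient per feature.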

The work is already done for us; we just need to use the appropriate regression class.

Let’s start by exploring a model with just the bedrooms feature as our independent variable. We’ll use Apache Commons CSV, Apache Commons Math, and GroovyFX.

var feature = 'bedrooms'
var nonOutliers = feature == 'bedrooms' ? { it[0] < 30 } : { true }
var file = getClass().classLoader.getResource('kc_house_data.csv').file
var csv  = CSV.withFirstRecordAsHeader().parse(new FileReader(file))
var all  = csv.collect {
    [it[feature].toDouble(), it.price.toDouble()] }.findAll(nonOutliers)
var reg = new SimpleRegression().tap{ addData(all as double[][]) }
def (min, max) = all.transpose().with{ [it[0].min(), it[0].max()] }
var predicted = [[min, reg.predict(min)], [max, reg.predict(max)]]
start {
  stage(title: "Price vs $feature", show: true, width: 800, height: 600) {
    scene {
      lineChart(stylesheets: resource('/style.css')) {
        series(name: 'Actual', data: all)
        series(name: 'Predicted', data: predicted)
      }
    }
  }
}

This produces the following graph:

plot of price vs bedrooms

You should note that the data is spread widely on this graph and hence our model won’t be particularly good at predicting house prices.

To improve our model, we can also use multi linear regression which factors multiple features into the model. The algorithm automatically adjusts the coefficients for each feature. Features with a large positive impact on price will have a large coefficient. Features with a negative impact on price will have a negative coefficient. Features which aren’t really related to price will have a coefficient close to zero.

Let’s try multi-regression with Smile. Smile supports dataframes, various machine learning algorithms, NLP, and visualization. Here we’ll use its OLS regression class.

var price = table.column('price').toDoubleArray()
var model = OLS.fit(Formula.lhs('price'), table)
var predicted = model.predict(table)
double[][] data = [price, predicted].transpose()

var from = [price.toList().min(), predicted.min()].min()
var to = [price.toList().max(), predicted.max()].max()
var pts = [[from, from], [to, to]]
var ideal = LinePlot.of(pts as double[][], DASH, RED)

ScatterPlot.of(data, BLUE).canvas().with {
    title = 'Actual vs predicted price'
    setAxisLabels('Actual', 'Predicted')
    add(ideal)
    window()
}

We tell the fit method that price is our dependent variable. It will attempt to find the relationship between that variable and all other variables.

It produces this output:

regression graph for price versus all other features

Here we are using Smile’s Java Swing-based visualization capabilities.

Note that the spread is much smaller than for simple regression, but it is still fairly wide. What this tells us is that multi regression is much better than simple regression but, overall, predicting house prices based solely on this raw data is hard.

Other algorithms

Ordinary least squares is only one algorithm we have up our sleeve for regression. We might consider Support Vector Machines (SVM), Stochastic Gradient Descent (SGD), or Classification and Regression Trees (CART). Unfortunately, for our dataset, all models have similar prediction capability.

Scaling options

For our small dataset, scaling is not a high priority. For datasets with millions of rows or hundreds of features, scaling becomes paramount. The good news is that numerous options exist for us to scale linear regression on the JVM.

Two great options are to use Apache Spark (covered next) or Apache Ignite (as shown here) to run our regression calculations. The standard ordinary least squares algorithm isn’t particularly well suited to parallel distribution, but alternative algorithms which are better suited for execution within parallel clusters are included as part of those platforms’ machine learning libraries.

We can slightly adapt ordinary least squares to get reasonable results with concurrent evaluation. We essentially place random subsets of the data across our clusters, and later average out the slopes and intercepts found within the models from each cluster. This adapted algorithm can be used with any framework which supports concurrent execution. For instance, this example shows how to code that algorithm using GPars. If you are interested in GPars, you might want to also check out this blog post which goes into further GPars examples including how to use it with virtual threads.
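As a rough, framework-free illustration of that chunk-and-average idea (toy data and illustrative names, not code from the linked examples), a plain Groovy sketch might look like this:

import org.apache.commons.math3.stat.regression.SimpleRegression
import java.util.stream.Collectors

// toy data: y ≈ 3 + 2x plus some noise
def rand = new Random(42)
def data = (1..10_000).collect { double x = it / 100; [x, 3 + 2 * x + rand.nextGaussian()] }

// fit one simple OLS model per random chunk (each chunk stands in for one cluster's work)
def fitOls = { List chunk ->
    def reg = new SimpleRegression()
    chunk.each { reg.addData(it[0] as double, it[1] as double) }
    [reg.intercept, reg.slope]
}

def chunkModels = data.shuffled().collate(1000)
        .parallelStream()                          // stand-in for distributing across a cluster
        .map(fitOls)
        .collect(Collectors.toList())

// average the per-chunk intercepts and slopes to form the combined model
def (intercept, slope) = chunkModels.transpose()*.average()
printf 'intercept=%.3f, slope=%.3f%n', intercept, slope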

For this blog post, we are going to show how to implement the adapted algorithm with Apache Beam, but first we’ll look at the cluster-friendly algorithm that comes with Apache Spark’s machine learning library.

Scaling with Apache Spark

Apache Spark is an open-source unified analytics engine for large-scale data processing. We’ll use the spark-mllib component to calculate our regression in a cluster:

def spark = builder().config('spark.master', 'local[8]').appName('HousePrices').orCreate
def file = HousePricesSpark.classLoader.getResource('kc_house_data.csv').file
int k = 5
Dataset<Row> ds = spark.read()
    .format('csv')
    .options('header': 'true', 'inferSchema': 'true')
    .load(file)
double[] splits = [80, 20]
def (training, test) = ds.randomSplit(splits)
String[] colNames = ds.columns().toList() - ['id', 'date', 'price']
def assembler = new VectorAssembler(inputCols: colNames, outputCol: 'features')
Dataset<Row> dataset = assembler.transform(training)
def lr = new LinearRegression(labelCol: 'price', maxIter: 10)
def model = lr.fit(dataset)
println 'Coefficients:'
println model.coefficients().values()[1..-1].collect { sprintf '%.2f', it }.join(', ')
def testSummary = model.evaluate(assembler.transform(test))
printf 'RMSE: %.2f%n', testSummary.rootMeanSquaredError
printf 'r2: %.2f%n', testSummary.r2
spark.stop()

We’ll split our data into training and test datasets, using the test dataset to see how well our model performs. When run, we’ll see the following output (we show the coefficients for our model and the root mean squared error):

22/12/05 16:49:00 INFO SparkContext: Running Spark version 3.3.1
22/12/05 16:49:01 INFO SparkContext: Submitted application: HousePrices
...
41979.78, 80853.89, 0.15, 5412.83, 564343.22, 53834.10, 24817.09, 93195.29, -80662.68, -80694.28, -2713.58, 19.02, -628.67, 594468.23, -228397.19, 21.23, -0.42
RMSE: 187242.12
r2: 0.70
...
22/12/05 16:49:09 INFO SparkContext: Successfully stopped SparkContext

Scaling with Apache Beam

Apache Beam provides a unified batch and streaming data processing framework which works with multiple languages like Java, Python and Groovy. It lets workloads be run across different runners like Apache Spark, Apache Flink and numerous others. We’ll show two implementations which use the native Java runner.

First, we define some helper methods which split our data into chunks to be run on different clusters and combine the results when we are done:

def features = [
        'price', 'bedrooms', 'bathrooms', 'sqft_living', 'sqft_living15',
        'lat', 'sqft_above', 'grade', 'view', 'waterfront', 'floors'
]

def readCsvChunks = new DoFn<String, double[][]>() {
    @ProcessElement
    void processElement(@Element String path, OutputReceiver<double[][]> receiver) throws IOException {
        def chunkSize = 6000
        def table = Read.csv(new File(path).toPath(), CSV.withFirstRecordAsHeader())
        table = table.select(*features)
        table = table.stream().filter { it.apply('bedrooms') <= 30 }.collect(DataFrame.collect())
        def idxs = 0..<table.nrows()
        for (nextChunkIdxs in idxs.shuffled().collate(chunkSize)) {
            def all = table.toArray().toList()
            receiver.output(all[nextChunkIdxs] as double[][])
        }
    }
}

def fitModel = new DoFn<double[][], double[]>() {
    @ProcessElement
    void processElement(@Element double[][] rows, OutputReceiver<double[]> receiver) throws IOException {
        def model = OLS.fit(Formula.lhs('price'), DataFrame.of(rows, features as String[])).coefficients()
        receiver.output(model)
    }
}

def evalModel = { double[][] chunk, double[] model ->
    double intercept = model[0]
    double[] coefficients = model[1..-1]
    def predicted = chunk.collect { row -> intercept + dot(row[1..-1] as double[], coefficients) }
    def residuals = chunk.toList().indexed()
        .collect { idx, row -> predicted[idx] - row[0] }
    def rmse = sqrt(sumSq(residuals as double[]) / chunk.size())
    [rmse, residuals.average(), chunk.size()] as double[]
}

def model2out = new DoFn<double[], String>() {
    @ProcessElement
    void processElement(@Element double[] ds, OutputReceiver<String> out) {
        out.output("** intercept: ${ds[0]}, coeffs: ${ds[1..-1].join(', ')}".toString())
    }
}

def stats2out = new DoFn<double[], String>() {
    @ProcessElement
    void processElement(@Element double[] ds, OutputReceiver<String> out) {
        out.output("** rmse: ${ds[0]}, mean: ${ds[1]}, count: ${ds[2]}".toString())
    }
}

With these helper methods in place, we can define our execution pipeline:

var csvChunks = p
        .apply(Create.of(filename))
        .apply('Create chunks', ParDo.of(readCsvChunks))
var model = csvChunks
        .apply('Fit chunks', ParDo.of(fitModel))
        .apply(Combine.globally(new MeanDoubleArrayCols()))
var modelView = model
        .apply(View.<double[]>asSingleton())
csvChunks
        .apply(ParDo.of(new EvaluateModel(modelView, evalModel)).withSideInputs(modelView))
        .apply(Combine.globally(new AggregateModelStats()))
        .apply('Log stats', ParDo.of(stats2out))
        .apply(Log.ofElements())
model
        .apply('Log model', ParDo.of(model2out))
        .apply(Log.ofElements())

If we apply a little bit of Groovy metaprogramming, we can tweak the execution pipeline to look like this:

var csvChunks = p
        | Create.of(filename)
        | 'Create chunks' >> ParDo.of(readCsvChunks)
var model = csvChunks
        | 'Fit chunks' >> ParDo.of(fitModel)
        | Combine.globally(new MeanDoubleArrayCols())
var modelView = model
        | View.<double[]>asSingleton()

csvChunks
        | ParDo.of(new EvaluateModel(modelView, evalModel)).withSideInputs(modelView)
        | Combine.globally(new AggregateModelStats())
        | 'Log stats' >> ParDo.of(stats2out)
        | Log.ofElements()
model
        | 'Log model' >> ParDo.of(model2out)
        | Log.ofElements()

It may seem like a small difference, but this code now looks very similar to the Python code that achieves the same thing. This could be a great productivity gain for projects which have a mix of Python and Groovy Beam code.
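For the curious, the kind of metaprogramming that could enable the pipe-style syntax might look roughly like the sketch below. This is a hypothetical illustration; the helper in the actual repo may be structured quite differently:

import org.apache.beam.sdk.Pipeline
import org.apache.beam.sdk.transforms.PTransform
import org.apache.beam.sdk.values.PCollection

class NamedTransform {                     // packages up "'name' >> transform"
    String name
    PTransform transform
}

String.metaClass.rightShift = { PTransform t ->
    new NamedTransform(name: delegate, transform: t)
}

// '|' simply delegates to apply(), passing the name along when one was given
Pipeline.metaClass.or = { arg ->
    arg instanceof NamedTransform ? delegate.apply(arg.name, arg.transform) : delegate.apply(arg)
}
PCollection.metaClass.or = { arg ->
    arg instanceof NamedTransform ? delegate.apply(arg.name, arg.transform) : delegate.apply(arg)
}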

Classification

An example covering reading CSV files, using traditional and neural network based classification algorithms, a glimpse at Jupyter Notebook options. Neural network solutions use Encog, Eclipse DeepLearning4J, and Deep Netts. Speeding up classification using GraalVM is also explored.

A classic data science dataset captures the flower characteristics of Iris flowers: the width and length of the sepals and petals for three species (Setosa, Versicolor, and Virginica).

The Iris project in the groovy-data-science repo is dedicated to this example. It includes a number of Groovy scripts and a Jupyter/BeakerX notebook highlighting this example, comparing and contrasting various libraries and various classification algorithms.

Let’s look at how to classify the flowers using Weka’s decision tree algorithm:

def file = getClass().classLoader.getResource('iris_data.csv').file as File
def species = ['Iris-setosa', 'Iris-versicolor', 'Iris-virginica']
def loader = new CSVLoader(file: file)
def model = new J48()
def allInstances = loader.dataSet
allInstances.classIndex = 4
model.buildClassifier(allInstances)
println model

Which has this output:

J48 pruned tree
------------------

Petal width <= 0.6: Iris-setosa (50.0)
Petal width > 0.6
|   Petal width <= 1.7
|   |   Petal length <= 4.9: Iris-versicolor (48.0/1.0)
|   |   Petal length > 4.9
|   |   |   Petal width <= 1.5: Iris-virginica (3.0)
|   |   |   Petal width > 1.5: Iris-versicolor (3.0/1.0)
|   Petal width > 1.7: Iris-virginica (46.0/1.0)

Number of Leaves  :     5
Size of the tree :      9

This model can be visualized as follows:

Iris flower classification decision tree

Feel free to browse the other examples and the Jupyter/BeakerX notebook if you are interested in exploring additional classification techniques like naive Bayes or logistic regression.

iris calculations with Groovy using jupyter notebook

For this blog, let’s dive further into just the deep learning classification examples.

Deep Learning

We’ll look at solutions using Encog, Eclipse DeepLearning4J and Deep Netts (with standard Java and as a native image using GraalVM) but first a brief introduction.

About Deep Learning

Deep learning falls under the branches of machine learning and artificial intelligence. It involves multiple layers (hence the “deep”) of an artificial neural network. There are lots of ways to configure such networks and the details are beyond the scope of this blog post, but we can give some basic details. We will have four input nodes corresponding to the measurements of our four characteristics. We will have three output nodes corresponding to each possible class (species). We will also have one or more additional layers in between.

neural network node

Each node in this network mimics to some degree a neuron in the human brain. Again, we’ll simplify the details. Each node has multiple inputs, which are given a particular weight, as well as an activation function which will determine whether our node “fires”. Training the model is a process which works out what the best weights should be.
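As a toy illustration (not taken from any of the libraries below), the computation performed by a single node is just a weighted sum of its inputs plus a bias, passed through the activation function:

// one node: activation(weights · inputs + bias)
double[] inputs  = [5.1, 3.5, 1.4, 0.2]              // e.g. the four iris measurements
double[] weights = [0.4, -0.2, 0.1, 0.3]             // learned during training
double bias = 0.05
def activation = { double x -> Math.tanh(x) }        // tanh is used in some examples below

double weightedSum = (0..<inputs.length).sum { inputs[it] * weights[it] } + bias
println activation(weightedSum)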

neural network layers

The math involved for converting inputs to output for any node isn’t too hard. We could write it ourselves (as shown here using matrices and Apache Commons Math for a digit recognition example) but luckily we don’t have to. The libraries we are going to use do much of the work for us. They typically provide a fluent API which lets us specify, in a somewhat declarative way, the layers in our network.

If you want to see more content about deep learning, consider also checking out this earlier JVM Advent blog post.

Just before exploring our examples, we should pre-warn folks that while we do time the running of the examples, no attempt was made to rigorously ensure that the examples were identical across the different technologies. The different technologies support slightly different ways to set up their respective network layers. The parameters were tweaked so that when run there was typically at most one or two errors in the validation. Also, the initial parameters for the runs can be set with random or pre-defined seeds. When random ones are used, each run will have slightly different errors. We’d need to do some additional alignment of the examples and use a framework like JMH if we wanted to get a more rigorous time comparison between the technologies. Nevertheless, it should give a very rough guide as to the speed of the various technologies.

Encog

Encog is a pure Java machine learning framework that was created in 2008. There is also a C# port for .Net users. It is a simple framework that supports a number of advanced algorithms not found elsewhere but isn’t as widely used as other, more recent frameworks.

The complete source code for our Iris classification example using Encog is here, but the critical piece is:

def model = new EncogModel(data).tap {
    selectMethod(data, TYPE_FEEDFORWARD)
    report = new ConsoleStatusReportable()
    data.normalize()
    holdBackValidation(0.3, true, 1001) // test with 30%
    selectTrainingType(data)
}

def bestMethod = model.crossvalidate(5, true) // 5-fold cross-validation
println "Training error: " + pretty(calculateRegressionError(bestMethod, model.trainingDataset<)) println "Validation error: " + pretty(calculateRegressionError(bestMethod, model.validationDataset))

When we run the example, we see:

paulk@pop-os:/extra/projects/iris_encog$ time groovy -cp "build/lib/*" IrisEncog.groovy 
1/5 : Fold #1
1/5 : Fold #1/5: Iteration #1, Training Error: 1.43550735, Validation Error: 0.73302237
1/5 : Fold #1/5: Iteration #2, Training Error: 0.78845427, Validation Error: 0.73302237
...
5/5 : Fold #5/5: Iteration #163, Training Error: 0.00086231, Validation Error: 0.00427126
5/5 : Cross-validated score:0.10345818553910753
Training error:  0.0009
Validation error:  0.0991
Prediction errors:
predicted: Iris-virginica, actual: Iris-versicolor, normalized input: -0.0556, -0.4167,  0.3898,  0.2500
Confusion matrix:      Iris-setosa     Iris-versicolor      Iris-virginica
       Iris-setosa              19                   0                   0
   Iris-versicolor               0                  15                   1
    Iris-virginica               0                   0                  10
real	0m3.073s
user	0m9.973s
sys	0m0.367s

We won’t explain all of the stats, but it basically says we have a pretty good model with low errors in prediction. If you look at the green and purple points in the notebook image earlier in this blog, you’ll see there are some points which are going to be hard to predict correctly all the time. The confusion matrix shows that the model predicted one flower incorrectly on the validation dataset.

One very nice aspect of this library is that it is a single jar dependency!

Eclipse DeepLearning4j

Eclipse DeepLearning4j is a suite of tools for running deep learning on the JVM. It has support for scaling up to Apache Spark as well as some integration with Python at a number of levels. It also provides integration with GPUs and C++ libraries for native performance.

The complete source code for our Iris classification example using DeepLearning4J is here, with the main part shown below:

MultiLayerConfiguration conf = new NeuralNetConfiguration.Builder()
    .seed(seed)
    .activation(Activation.TANH) // global activation
    .weightInit(WeightInit.XAVIER)
    .updater(new Sgd(0.1))
    .l2(1e-4)
    .list()
    .layer(new DenseLayer.Builder().nIn(numInputs).nOut(3).build())
    .layer(new DenseLayer.Builder().nIn(3).nOut(3).build())
    .layer(new OutputLayer.Builder(LossFunctions.LossFunction.NEGATIVELOGLIKELIHOOD)
        .activation(Activation.SOFTMAX) // override activation with softmax for this layer
        .nIn(3).nOut(numOutputs).build())
    .build()

def model = new MultiLayerNetwork(conf)
model.init()

model.listeners = new ScoreIterationListener(100)

1000.times { model.fit(train) }

def eval = new Evaluation(3)
def output = model.output(test.features)
eval.eval(test.labels, output)
println eval.stats()

When we run this example, we see:

paulk@pop-os:/extra/projects/iris_dl4j$ time groovy -cp "build/lib/*" IrisDl4j.groovy 
[main] INFO org.nd4j.linalg.factory.Nd4jBackend - Loaded [CpuBackend] backend
[main] INFO org.nd4j.nativeblas.NativeOpsHolder - Number of threads used for linear algebra: 4
[main] INFO org.nd4j.nativeblas.Nd4jBlas - Number of threads used for OpenMP BLAS: 4
[main] INFO org.nd4j.linalg.api.ops.executioner.DefaultOpExecutioner - Backend used: [CPU]; OS: [Linux]
...
[main] INFO org.deeplearning4j.optimize.listeners.ScoreIterationListener - Score at iteration 0 is 0.9707752535968273
[main] INFO org.deeplearning4j.optimize.listeners.ScoreIterationListener - Score at iteration 100 is 0.3494968712782093
...
[main] INFO org.deeplearning4j.optimize.listeners.ScoreIterationListener - Score at iteration 900 is 0.03135504326480282
========================Evaluation Metrics========================
 # of classes:    3
 Accuracy:        0.9778
 Precision:       0.9778
 Recall:          0.9744
 F1 Score:        0.9752
Precision, recall & F1: macro-averaged (equally weighted avg. of 3 classes)
=========================Confusion Matrix=========================
  0  1  2
----------
 18  0  0 | 0 = 0
  0 14  0 | 1 = 1
  0  1 12 | 2 = 2
Confusion matrix format: Actual (rowClass) predicted as (columnClass) N times
==================================================================
real	0m5.856s
user	0m25.638s
sys	0m1.752s

Again the stats tell us that the model is good. There is only one error in the confusion matrix for our testing dataset. DeepLearning4J does have an impressive range of technologies that can be used to enhance performance in certain scenarios. For this example, I enabled AVX (Advanced Vector Extensions) support but didn’t try using the CUDA/GPU support nor make use of any Apache Spark integration. The GPU option might have sped up the application but given the size of the dataset and the amount of calculations needed to train our network, it probably wouldn’t have sped up much.

What does this tell us? For this little example, the overheads of putting the plumbing in place to access native C++ implementations and so forth outweighed the gains. Those features generally would come into their own for much larger datasets or massive amounts of calculations; tasks like intensive video processing spring to mind.

A downside of the impressive scaling options is the added complexity. The code was slightly more complex (around 30% greater line count) than the other options we are comparing. This stems from requirements that would apply if we did want to make use of the Spark integration, even though we didn’t here. The good news is that once the work is done, if we did want to use Spark, that would now be relatively straightforward.

The other increase in complexity is the number of jar files needed in the classpath. I went with the easy option of using the nd4j-native-platform dependency plus added the org.nd4j:nd4j-native:1.0.0-M2:linux-x86_64-avx2 dependency for AVX support. This made my life easy but brought in over 170 jars including many for unneeded platforms. Having all those jars is great if users on other platforms want to try the example but it can be a little troublesome with certain tooling that breaks with long command lines on certain platforms. I could certainly do some more work to shrink those dependency lists if it became a real problem.

[For the interested reader, the groovy-data-science repo has other DeepLearning4J examples. The Weka library can wrap DeepLearning4J as shown for this Iris example here. There are also two variants of the digit recognition example we alluded to earlier using one and two layer neural networks.]

Deep Netts

Deep Netts is a company offering a range of products and services related to deep learning. Here we are using the free, open-source Deep Netts community edition pure Java deep learning library. It provides support for the Java Visual Recognition API (JSR381). The expert group from JSR381 released their final spec earlier this year, so hopefully we’ll see more compliant implementations soon.

The complete source code for our Iris classification example using Deep Netts is here and the important part is below:

var splits = dataSet.split(0.7d, 0.3d)  // 70/30% split
var train = splits[0]
var test = splits[1]
var neuralNet = FeedForwardNetwork.builder()
    .addInputLayer(numInputs)
    .addFullyConnectedLayer(5, ActivationType.TANH)
    .addOutputLayer(numOutputs, ActivationType.SOFTMAX)
    .lossFunction(LossType.CROSS_ENTROPY)
    .randomSeed(456)
    .build()
neuralNet.trainer.with {
    maxError = 0.04f
    learningRate = 0.01f
    momentum = 0.9f
    optimizer = OptimizerType.MOMENTUM
}
neuralNet.train(train)
new ClassifierEvaluator().with {
    println "CLASSIFIER EVALUATION METRICS\n${evaluate(neuralNet, test)}"
    println "CONFUSION MATRIX\n$confusionMatrix"
}

When we run this command, we see:

paulk@pop-os:/extra/projects/iris_graalvm$ time groovy -cp "build/lib/*" Iris.groovy 
16:49:27.089 [main] INFO deepnetts.core.DeepNetts - ------------------------------------------------------------------------
16:49:27.091 [main] INFO deepnetts.core.DeepNetts - TRAINING NEURAL NETWORK
16:49:27.091 [main] INFO deepnetts.core.DeepNetts - ------------------------------------------------------------------------
16:49:27.100 [main] INFO deepnetts.core.DeepNetts - Epoch:1, Time:6ms, TrainError:0.8584314, TrainErrorChange:0.8584314, TrainAccuracy: 0.5252525
16:49:27.103 [main] INFO deepnetts.core.DeepNetts - Epoch:2, Time:3ms, TrainError:0.52278274, TrainErrorChange:-0.33564866, TrainAccuracy: 0.52820516
...
16:49:27.911 [main] INFO deepnetts.core.DeepNetts - Epoch:3031, Time:0ms, TrainError:0.029988592, TrainErrorChange:-0.015680967, TrainAccuracy: 1.0
TRAINING COMPLETED
16:49:27.911 [main] INFO deepnetts.core.DeepNetts - Total Training Time: 820ms
16:49:27.911 [main] INFO deepnetts.core.DeepNetts - ------------------------------------------------------------------------
CLASSIFIER EVALUATION METRICS
Accuracy: 0.95681506 (How often is classifier correct in total)
Precision: 0.974359 (How often is classifier correct when it gives positive prediction)
F1Score: 0.974359 (Harmonic average (balance) of precision and recall)
Recall: 0.974359 (When it is actually positive class, how often does it give positive prediction)
CONFUSION MATRIX
                       none    Iris-setosa Iris-versicolor Iris-virginica
           none           0              0               0              0
    Iris-setosa           0             14               0              0
Iris-versicolor           0              0              18              1
 Iris-virginica           0              0               0             12
real	0m3.160s
user	0m10.156s
sys	0m0.483s

This is faster than DeepLearning4j and similar to Encog. This is to be expected given our small data set and isn’t indicative of performance for larger problems.

Another plus is the dependency list. It isn’t quite the single jar situation as we saw with Encog, but not far off. There is the Deep Netts jar, the JSR381 VisRec API in a separate jar, and a handful of logging jars.

Deep Netts with GraalVM

Another technology we might want to consider if performance is important to us is GraalVM. GraalVM is a high-performance JDK distribution designed to speed up the execution of applications written in Java and other JVM languages. We’ll look at creating a native version of our Iris Deep Netts application. We used GraalVM 22.1.0 Java 17 CE and Groovy 4.0.3. We’ll cover just the basic steps, but there are other places for additional setup info and troubleshooting help like here, here and here.

Groovy has two natures. Its dynamic nature supports adding methods at runtime through metaprogramming and interacting with method dispatch processing through missing method interception and other tricks. Some of these tricks make heavy use of reflection and dynamic class loading and cause problems for GraalVM which is trying to determine as much information as it can at compile time. Groovy’s static nature has a more limited set of metaprogramming capabilities but allows bytecode much closer to Java to be produced. Luckily, we aren’t relying on any dynamic Groovy tricks for our example. We’ll compile it up using static mode:

paulk@pop-os:/extra/projects/iris_graalvm$ groovyc -cp "build/lib/*" --compile-static Iris.groovy

Next we build our native application:

paulk@pop-os:/extra/projects/iris_graalvm$ native-image  --report-unsupported-elements-at-runtime \
   --initialize-at-run-time=groovy.grape.GrapeIvy,deepnetts.net.weights.RandomWeights \
   --initialize-at-build-time --no-fallback  -H:ConfigurationFileDirectories=conf/  -cp ".:build/lib/*" Iris

We told GraalVM to initialize GrapeIvy at runtime (to avoid needing Ivy jars in the classpath since Groovy will lazily load those classes only if we use @Grab statements). We also did the same for the RandomWeights class to avoid it being locked into a random seed fixed at compile time.

Now we are ready to run our application:

paulk@pop-os:/extra/projects/iris_graalvm$ time ./iris
...
CLASSIFIER EVALUATION METRICS
Accuracy: 0.93460923 (How often is classifier correct in total)
Precision: 0.96491224 (How often is classifier correct when it gives positive prediction)
F1Score: 0.96491224 (Harmonic average (balance) of precision and recall)
Recall: 0.96491224 (When it is actually positive class, how often does it give positive prediction)
CONFUSION MATRIX
                       none    Iris-setosa Iris-versicolor Iris-virginica
           none           0              0               0              0
    Iris-setosa           0             21               0              0
Iris-versicolor           0              0              20              2
 Iris-virginica           0              0               0             17
real    0m0.131s
user    0m0.096s
sys     0m0.029s

We can see here that the speed has dramatically increased. This is great, but we should note that using GraalVM often involves some tricky investigation, especially for Groovy which is dynamic by default. There are a few features of Groovy which won’t be available when using Groovy’s static nature, and some libraries might be problematic. As an example, Deep Netts has log4j2 as one of its dependencies. At the time of writing, there are still issues using log4j2 with GraalVM. We excluded the log4j-core dependency and used log4j-to-slf4j backed by logback-classic to sidestep this problem.

Clustering

Looks at K-Means and other algorithms for clustering as well as using Apache Wayang and Apache Ignite for scaling clustering.

In an attempt to find the perfect single-malt Scotch whiskey, the whiskies produced from 86 distilleries have been ranked by expert tasters according to 12 criteria (Body, Sweetness, Malty, Smoky, Fruity, etc.).

whiskey bottles

While those rankings might prove interesting reading to some Whiskey advocates, it is difficult to draw many conclusions from the raw data alone. Clustering is a well-established area of statistical modelling where data is grouped into clusters. Members within a cluster should be similar to each other and different from the members of other clusters. Clustering is an unsupervised learning method. The categories are not predetermined but instead represent natural groupings which are found as part of the clustering process.

K-Means is the most common form of centroid clustering. The K represents the number of clusters to find. If we imagine points in 2D space, for k=3, we would start by picking 3 random points as our starting centroids.

KMeans step1

We allocate all points to their closest centroid:

KMeans step2

Given this allocation, we re-calculate each centroid from all of its points:

KMeans step3

We repeat this process until either a stable centroid selection is found, or we have reached a certain number of iterations. For our case, we don’t have two dimensions but twelve. This makes it a little harder to visualize. We’ll cover that topic shortly.
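Before looking at the library versions, here is a toy, hand-rolled sketch of those two steps in 2D (for illustration only; the real examples below rely on Smile and, later, Wayang):

// toy 2D KMeans: assign points to their nearest centroid, then re-average the centroids
def points = [[1d, 1d], [1.5d, 2d], [0.5d, 1.5d], [8d, 8d], [9d, 9d], [8.5d, 9.5d]]
def centroids = [[0d, 0d], [5d, 5d], [10d, 10d]]     // k = 3 starting guesses

def dist = { a, b -> Math.sqrt((0..<a.size()).sum { (a[it] - b[it]) ** 2 } as double) }

10.times {
    // step 1: allocate every point to its closest centroid
    def byCluster = points.groupBy { p -> centroids.indexOf(centroids.min { c -> dist(p, c) }) }
    // step 2: recalculate each centroid as the average of its allocated points
    byCluster.each { idx, pts -> centroids[idx] = pts.transpose()*.average() }
}
println centroids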

Let’s first look at how we might use Tablesaw and Smile to create our KMeans model.

def file = getClass().classLoader.getResource('whiskey.csv').file
def helper = new TablesawUtil(file)
def rows = Table.read().csv(file)

def cols = ['Body', 'Sweetness', 'Smoky', 'Medicinal', 'Tobacco', 'Honey',
'Spicy', 'Winey', 'Nutty', 'Malty', 'Fruity', 'Floral']
def data = rows.as().doubleMatrix(*cols)

def pca = PCA.fit(data)
def dims = 3
pca.projection = dims
def projected = pca.project(data)
def clusters = KMeans.fit(data, 5)
def labels = clusters.y.collect { 'Cluster ' + (it + 1) }
rows = rows.addColumns(
    *(0..<dims).collect { idx ->
        DoubleColumn.create("PCA${idx+1}", (0..<data.size()).collect{
            projected[it][idx]
        })
    },
    StringColumn.create('Cluster', labels),
    DoubleColumn.create('Centroid', [10] * labels.size())
)
def centroids = pca.project(clusters.centroids)
def toAdd = rows.emptyCopy(1)
(0..<centroids.size()).each { idx ->
    toAdd[0].setString('Cluster', 'Cluster ' + (idx+1))
    (1..3).each { toAdd[0].setDouble('PCA' + it, centroids[idx][it-1]) }
    toAdd[0].setDouble('Centroid', 50)
    rows.append(toAdd)
}

def title = 'Clusters x Principal Components w/ centroids'
def type = dims == 2 ? ScatterPlot : Scatter3DPlot
helper.show(type.create(title, rows, *(1..dims).collect { "PCA$it" }, 'Centroid', 'Cluster'), 'KMeansClustersPcaCentroids')

There are a few points to note here. In order to display a graph, we need to reduce the number of dimensions. We use a technique called Principal Component Analysis (PCA) to do that.

The output will be:

PCA plot of whiskey clusters

Scaling Options

We can scale clustering in numerous ways. We might decide to use Apache Spark directly (shown here) since it has a clusterable K-Means implementation in its spark-mllib module. Let’s instead explore Apache Wayang over the top of either a Java runner or Apache Spark. We’ll also look at using Apache Ignite.

Scaling with Apache Wayang

Apache Wayang (incubating) is an API for big data cross-platform processing. It provides an abstraction over other platforms like Apache Spark and Apache Flink as well as a default built-in stream-based “platform”. The goal is to provide a consistent developer experience when writing code regardless of whether a light-weight or highly-scalable platform may eventually be required. Execution of the application is specified in a logical plan which is again platform agnostic. Wayang will transform the logical plan into a set of physical operators to be executed by specific underlying processing platforms.

We’ll start with defining a Point record:

record Point(double[] pts) implements Serializable {
    static Point fromLine(String line) {
        new Point(line.split(',')[2..-1]*.toDouble() as double[])
    }
}

Our class is Serializable (more on that later) and contains a fromLine factory method to help us make points from a CSV file. We’ll do that ourselves rather than rely on other libraries which could assist. It’s not a 2D or 3D point for us but 12D corresponding to the 12 criteria. We just use a double array, so any dimension would be supported but the 12 comes from the number of columns in our data file.

We’ll define a related TaggedPointCounter record. It’s like a Point but tracks a cluster Id and count used when clustering the “points”:

record TaggedPointCounter(double[] pts, int cluster, long count) implements Serializable {
    TaggedPointCounter plus(TaggedPointCounter that) {
        new TaggedPointCounter((0..<pts.size()).collect{
            pts[it] + that.pts[it]
        } as double[], cluster, count + that.count)
    }

    TaggedPointCounter average() {
        new TaggedPointCounter(pts.collect{ double d -> d/count } as double[], cluster, 0)
    }
}

We have plus and average methods which will be helpful in the map/reduce parts of the algorithm.

Another aspect of the KMeans algorithm is assigning points to the cluster associated with their nearest centroid. For two dimensions, recalling Pythagoras’ theorem, this would be the square root of x squared plus y squared, where x and y are the distances of a point from the centroid in the x and y dimensions respectively. We’ll do the same across all dimensions.
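Expressed as a formula (the standard Euclidean distance, included here for reference), the distance between a point p and a centroid c in n dimensions is:

$$d(p, c) = \sqrt{\sum_{i=1}^{n} (p_i - c_i)^2}$$

We define the following helper class to capture this part of the algorithm: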

class SelectNearestCentroid implements ExtendedSerializableFunction<Point, TaggedPointCounter> {
    Iterable<TaggedPointCounter> centroids

    void open(ExecutionContext context) {
        centroids = context.getBroadcast("centroids")
    }

    TaggedPointCounter apply(Point p) {
        def minDistance = Double.POSITIVE_INFINITY
        def nearestCentroidId = -1
        for (c in centroids) {
            def distance = sqrt((0..<p.pts.size()).collect{ p.pts[it] - c.pts[it] }.sum{ it ** 2 } as double)
            if (distance < minDistance) {
                minDistance = distance
                nearestCentroidId = c.cluster
            }
        }
        new TaggedPointCounter(p.pts, nearestCentroidId, 1)
    }
}

In Wayang parlance, the SelectNearestCentroid class is a UDF, a User-Defined Function. It represents some chunk of functionality where an optimization decision can be made about where to run the operation.

Once we get to using Spark, the classes in the map/reduce part of our algorithm will need to be serializable. Method closures in dynamic Groovy aren’t serializable. We have a few options to avoid using them. I’ll show one approach here which is to use some helper classes in places where we might typically use method references. Here are the helper classes:

class Cluster implements SerializableFunction<TaggedPointCounter, Integer> {
    Integer apply(TaggedPointCounter tpc) { tpc.cluster() }
}

class Average implements SerializableFunction<TaggedPointCounter, TaggedPointCounter> {
    TaggedPointCounter apply(TaggedPointCounter tpc) { tpc.average() }
}

class Plus implements SerializableBinaryOperator<TaggedPointCounter> {
    TaggedPointCounter apply(TaggedPointCounter tpc1, TaggedPointCounter tpc2) { tpc1.plus(tpc2) }
}

Now we are ready for our KMeans script:

int k = 5
int iterations = 20

// read in data from our file
def url = WhiskeyWayang.classLoader.getResource('whiskey.csv').file
def pointsData = new File(url).readLines()[1..-1].collect{ Point.fromLine(it) }
def dims = pointsData[0].pts.size()

// create some random points as initial centroids
def r = new Random()
def initPts = (1..k).collect { (0..<dims).collect { r.nextGaussian() + 2 } as double[] }

// create planbuilder with Java and Spark enabled
def configuration = new Configuration()
def context = new WayangContext(configuration)
        .withPlugin(Java.basicPlugin())
        .withPlugin(Spark.basicPlugin())
def planBuilder = new JavaPlanBuilder(context, "KMeans ($url, k=$k, iterations=$iterations)")

def points = planBuilder
        .loadCollection(pointsData).withName('Load points')

def initialCentroids = planBuilder
        .loadCollection((0..<k).collect{ idx -> new TaggedPointCounter(initPts[idx], idx, 0) })
        .withName("Load random centroids")

def finalCentroids = initialCentroids
        .repeat(iterations, currentCentroids -> points
                .map(new SelectNearestCentroid())
                .withBroadcast(currentCentroids, "centroids").withName("Find nearest centroid")
                .reduceByKey(new Cluster(), new Plus()).withName("Add up points")
                .map(new Average()).withName("Average points")
                .withOutputClass(TaggedPointCounter))
        .withName("Loop").collect()

println 'Centroids:'
finalCentroids.each { c ->
    println "Cluster$c.cluster: ${c.pts.collect{ sprintf('%.3f', it) }.join(', ')}"
}

Here, k is the desired number of clusters, and iterations is the number of times to iterate through the KMeans loop. The pointsData variable is a list of Point instances loaded from our data file. We’d use the readTextFile method instead of loadCollection if our data set was large. The initPts variable is some random starting positions for our initial centroids. Being random, and given the way the KMeans algorithm works, it is possible that some of our clusters may have no points assigned.

Our algorithm works by assigning, at each iteration, all the points to their closest current centroid and then calculating the new centroids given those assignments. Finally, we output the results.

Using Wayang with the Java streams-backed platform

As we mentioned earlier, Wayang selects which platform(s) will run our application. It has numerous capabilities whereby cost functions and load estimators can be used to influence and optimize how the application is run. For our simple example, it is enough to know that even though we specified Java or Spark as options, Wayang knows that for our small data set, the Java streams option is the way to go.

Since we prime the algorithm with random data, we expect the results to be slightly different each time the script is run, but here is one output:

> Task :WhiskeyWayang:run
Centroids:
Cluster0: 2.548, 2.419, 1.613, 0.194, 0.097, 1.871, 1.742, 1.774, 1.677, 1.935, 1.806, 1.613
Cluster2: 1.464, 2.679, 1.179, 0.321, 0.071, 0.786, 1.429, 0.429, 0.964, 1.643, 1.929, 2.179
Cluster3: 3.250, 1.500, 3.250, 3.000, 0.500, 0.250, 1.625, 0.375, 1.375, 1.375, 1.250, 0.250
Cluster4: 1.684, 1.842, 1.211, 0.421, 0.053, 1.316, 0.632, 0.737, 1.895, 2.000, 1.842, 1.737 ...

Which, if plotted, looks like this:

WhiskeyWayang Centroid Spider Plot

If you are interested, check out the examples in the repo links at the end of this article to see the code for producing this centroid spider plot, or the Jupyter/BeakerX notebook in this project’s GitHub repo.

Using Wayang with Apache Spark

Given our small dataset size and no other customization, Wayang will choose the Java streams based solution. We could use Wayang optimization features to influence which processing platform it chooses, but to keep things simple, we’ll just disable the Java streams platform in our configuration by making the following change in our code:

Disabling the Whiskey Wayang Java runner
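In code terms, a minimal sketch of that change (reusing the same configuration object as before) is to register only the Spark plugin, so the Java streams platform is no longer a candidate:

// only Spark remains available to Wayang's optimizer
def context = new WayangContext(configuration)
        .withPlugin(Spark.basicPlugin())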

Now when we run the application, the output will be something like this (a solution similar to before but with 1000+ extra lines of Spark and Wayang log information – truncated for presentation purposes):

[main] INFO org.apache.spark.SparkContext - Running Spark version 3.3.0
[main] INFO org.apache.spark.util.Utils - Successfully started service 'sparkDriver' on port 62081.
...
Centroids:
Cluster4: 1.414, 2.448, 0.966, 0.138, 0.034, 0.862, 1.000, 0.483, 1.345, 1.690, 2.103, 2.138
Cluster0: 2.773, 2.455, 1.455, 0.000, 0.000, 1.909, 1.682, 1.955, 2.091, 2.045, 2.136, 1.818
Cluster1: 1.762, 2.286, 1.571, 0.619, 0.143, 1.714, 1.333, 0.905, 1.190, 1.952, 1.095, 1.524
Cluster2: 3.250, 1.500, 3.250, 3.000, 0.500, 0.250, 1.625, 0.375, 1.375, 1.375, 1.250, 0.250
Cluster3: 2.167, 2.000, 2.167, 1.000, 0.333, 0.333, 2.000, 0.833, 0.833, 1.500, 2.333, 1.667
...
[shutdown-hook-0] INFO org.apache.spark.SparkContext - Successfully stopped SparkContext
[shutdown-hook-0] INFO org.apache.spark.util.ShutdownHookManager - Shutdown hook called

A goal of Apache Wayang is to allow developers to write platform-agnostic applications. While this is mostly true, the abstractions aren’t perfect. As an example, if I know I am only using the streams-backed platform, I don’t need to worry about making any of my classes serializable (which is a Spark requirement). In our example, we could have omitted the “implements Serializable” part of the TaggedPointCounter record, and we could have used a method reference TaggedPointCounter::average instead of our Average helper class. This isn’t meant as a criticism of Wayang; after all, if you want to write cross-platform UDFs, you might expect to have to follow some rules. Instead, it is meant to indicate that abstractions often have leaks around the edges. Sometimes those leaks can be used beneficially; other times they are traps waiting for unknowing developers.

To summarise, if using the Java streams-backed platform, you can run the application on JDK17 (which uses native records) as well as JDK11 and JDK8 (where Groovy provides emulated records). Also, we could make numerous simplifications if we desired. When using the Spark processing platform, the potential simplifications aren’t applicable, and we can run on JDK8 and JDK11 (Spark isn’t yet supported on JDK17).

Scaling with Apache Ignite

ignite logoApache Ignite is a distributed database for high-performance computing with in-memory speed. It makes a cluster (or grid) of nodes appear like an in-memory cache.

This explanation drastically simplifies Ignite’s feature set. Ignite can be used as:

  • an in-memory cache with special features like SQL querying and transactional properties
  • an in-memory data-grid with advanced read-through & write-through capabilities on top of one or more distributed databases
  • an ultra-fast and horizontally scalable in-memory database
  • a high-performance computing engine for custom or built-in tasks including machine learning

It is mostly this last capability that we will use. Ignite’s Machine Learning API has purpose-built, cluster-aware machine learning and deep learning algorithms for Classification, Regression, Clustering, and Recommendation among others. We’ll use the distributed K-means Clustering algorithm from their library.

Apache Ignite architecture

Apache Ignite has special capabilities for reading data into the cache. We could use IgniteDataStreamer or IgniteCache.loadCache() and load data from files, stream sources, various database sources and so forth. This is particularly relevant when using a cluster.
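As a hedged sketch (not part of this example), streaming rows into an already created cache on a started node might look something like this, where the ignite instance, the 'WHISKEY' cache name, and the data array are illustrative and assumed to exist:

// stream key/value pairs into an existing cache; the streamer batches and
// distributes the puts across the cluster
ignite.dataStreamer('WHISKEY').withCloseable { streamer ->
    data.indices.each { i -> streamer.addData(i, data[i]) }
}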

For our little example, our data is in a relatively small CSV file and we will be using a single node, so we’ll just read our data using Apache Commons CSV:

var file = getClass().classLoader.getResource('whiskey.csv').file as File
var rows = file.withReader {r -> RFC4180.parse(r).records*.toList() }
var data = rows[1..-1].collect{ it[2..-1]*.toDouble() } as double[][]

We’ll configure our single node Ignite data cache using code (but we could place the details in a configuration file in more complex scenarios):

var cfg = new IgniteConfiguration(
peerClassLoadingEnabled: true,
discoverySpi: new TcpDiscoverySpi(
ipFinder: new TcpDiscoveryMulticastIpFinder(
addresses: ['127.0.0.1:47500..47509']
)
)
)

Next, we’ll create a few helper variables:

var features = ['Body', 'Sweetness', 'Smoky', 'Medicinal', 'Tobacco',
                'Honey', 'Spicy', 'Winey', 'Nutty', 'Malty', 'Fruity', 'Floral']
var pretty = this.&sprintf.curry('%.4f')
var dist = new EuclideanDistance()
var vectorizer = new DoubleArrayVectorizer().labeled(FIRST)

Now we start the node, populate the cache, run our k-means algorithm, and print the result.

Ignition.start(cfg).withCloseable { ignite ->
    println ">>> Ignite grid started for data: ${data.size()} rows X ${data[0].size()} cols"
    var dataCache = ignite.createCache(new CacheConfiguration<Integer, double[]>(
          name: "TEST_${UUID.randomUUID()}",
          affinity: new RendezvousAffinityFunction(false, 10)))
    data.indices.each { int i -> dataCache.put(i, data[i]) }
    var trainer = new KMeansTrainer().withDistance(dist).withAmountOfClusters(5)
    var mdl = trainer.fit(ignite, dataCache, vectorizer)
println ">>> KMeans centroids:\n${features.join(', ')} var centroids = mdl.centers*.all() centroids.each { c -> println c*.get().collect(pretty).join(', ') dataCache.destroy()

Here is the output:

[18:13:11]    __________  ________________
[18:13:11]   /  _/ ___/ |/ /  _/_  __/ __/
[18:13:11]  _/ // (7 7    // /  / / / _/
[18:13:11] /___/\___/_/|_/___/ /_/ /___/
[18:13:11]
[18:13:11] ver. 2.14.0#20220929-sha1:951e8deb
[18:13:11] 2022 Copyright(C) Apache Software Foundation
...
[18:13:11] Configured plugins:
[18:13:11]   ^-- ml-inference-plugin 1.0.0
[18:13:14] Ignite node started OK (id=f731e4ab)
...
>>> Ignite grid started for data: 86 rows X 13 cols
>>> KMeans centroids
Body, Sweetness, Smoky, Medicinal, Tobacco, Honey, Spicy, Winey, Nutty, Malty, Fruity, Floral
2.7037, 2.4444, 1.4074, 0.0370, 0.0000, 1.8519, 1.6667, 1.8519, 1.8889, 2.0370, 2.1481, 1.6667
1.8500, 1.9000, 2.0000, 0.9500, 0.1500, 1.1000, 1.5000, 0.6000, 1.5500, 1.7000, 1.3000, 1.5000
1.2667, 2.1333, 0.9333, 0.1333, 0.0000, 1.0667, 0.8000, 0.5333, 1.8000, 1.7333, 2.2667, 2.2667
3.6667, 1.5000, 3.6667, 3.3333, 0.6667, 0.1667, 1.6667, 0.5000, 1.1667, 1.3333, 1.1667, 0.1667
1.5000, 2.8889, 1.0000, 0.2778, 0.1667, 1.0000, 1.2222, 0.6111, 0.5556, 1.7778, 1.6667, 2.0000
[18:13:15] Ignite node stopped OK [uptime=00:00:00.663]

We can plot the centroid characteristics in a spider plot.

Whiskey clusters with Apache Ignite

Natural Language Processing

Covers various natural language processing examples including detecting the language in use, parts of speech, entities, sentiment analysis, and universal sentence encoding using Apache OpenNLP, Stanford CoreNLP, and Datumbox. Also looks at scaling natural language processing using Spark NLP and DJL with TensorFlow.

Natural Language Processing is certainly a large and sometimes complex topic with many aspects. Some of those aspects deserve entire blogs in their own right. For this blog, we will briefly look at a few simple use cases illustrating where you might be able to use NLP technology in your own project.

Language Detection

opennlp logoKnowing what language some text represents can be a critical first step to subsequent processing. Let’s look at how to predict the language using a pre-built model and Apache OpenNLP. Here, ResourceHelper is a utility class used to download and cache the model. The first run may take a little while as it downloads the model. Subsequent runs should be fast. Here we are using a well-known model referenced in the OpenNLP documentation.

def helper = new ResourceHelper('https://dlcdn.apache.org/opennlp/models/langdetect/1.8.3/')
def model = new LanguageDetectorModel(helper.load('langdetect-183'))
def detector = new LanguageDetectorME(model)

[ spa: 'Bienvenido a Madrid', fra: 'Bienvenue à Paris',
dan: 'Velkommen til København', bul: 'Добре дошли в София'
].each { k, v ->
assert detector.predictLanguage(v).lang == k
}

The LanguageDetectorME class lets us predict the language. In general, the predictor may not be accurate on small samples of text but it was good enough for our example. We’ve used the language code as the key in our map and we check that against the predicted language.
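Although not needed for this example, the same detector can also return a ranked list of candidate languages with confidence scores, which can be handy for flagging ambiguous snippets. A small optional sketch:

// top three candidate languages for one phrase, with confidences
detector.predictLanguages('Bienvenue à Paris').take(3).each {
    println "$it.lang ${sprintf('%.3f', it.confidence)}"
}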

A more complex scenario is training your own model. Let’s look at how to do that with Datumbox. Datumbox has a pre-trained models zoo but its language detection model didn’t seem to work well for the small snippets in the next example, so we’ll train our own model. First, we’ll define our datasets:

def loader = getClass().classLoader
def datasets = [
  English: loader.getResource("training.language.en.txt").toURI(),
  French: loader.getResource("training.language.fr.txt").toURI(),
  German: loader.getResource("training.language.de.txt").toURI(),
  Spanish: loader.getResource("training.language.es.txt").toURI(),
  Indonesian: loader.getResource("training.language.id.txt").toURI()
]

The German (de) training dataset comes from the Datumbox examples. The training datasets for the other languages are from Kaggle.

We set up the training parameters needed by our algorithm:

def trainingParams = new TextClassifier.TrainingParameters(
numericalScalerTrainingParameters: null,
featureSelectorTrainingParametersList: [new ChisquareSelect.TrainingParameters()],
textExtractorParameters: new NgramsExtractor.Parameters(),
modelerTrainingParameters: new MultinomialNaiveBayes.TrainingParameters()
)

Here, we’ll use a Naïve Bayes model with Chisquare feature selection.

Next we create our algorithm, train it with our training dataset, and then validate it against the training dataset. We’d normally want to split the data into training and testing datasets, to give us a more reliable estimate of our model’s accuracy (one way to do that is sketched after the output below). But for simplicity, while still illustrating the API, we’ll train and validate with our entire dataset:

def config = Configuration.configuration
def classifier = MLBuilder.create(trainingParams, config)
classifier.fit(datasets)
def metrics = classifier.validate(datasets)
println "Classifier Accuracy (using training data): $metrics.accuracy"

When run, we see the following output:

Classifier Accuracy (using training data): 0.9975609756097561
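As mentioned above, we would normally hold out some data for testing. Here is a hedged sketch (not part of the original example) of one simple way to split each language file 80/20 into temporary training and testing files:

def rand = new Random(42)
def trainSets = [:], testSets = [:]
datasets.each { lang, uri ->
    def lines = new File(uri).readLines()
    Collections.shuffle(lines, rand)
    def cut = (lines.size() * 0.8) as int
    // write the two splits to temporary files and keep their URIs
    def train = File.createTempFile("train.$lang", '.txt')
    def test = File.createTempFile("test.$lang", '.txt')
    train.text = lines[0..<cut].join('\n')
    test.text = lines[cut..-1].join('\n')
    trainSets[lang] = train.toURI()
    testSets[lang] = test.toURI()
}

We could then call fit(trainSets) and validate(testSets) instead of using the full dataset for both.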

Our test dataset will consist of some hard-coded illustrative phrases. Let’s use our model to predict the language for each phrase:

println 'Classifying                   Predicted   Probability'
[ 'Bienvenido a Madrid', 'Bienvenue à Paris', 'Welcome to London',
'Willkommen in Berlin', 'Selamat Datang di Jakarta'
].each { txt ->
    def r = classifier.predict(txt)
    def predicted = r.YPredicted
    def probability = sprintf '%6.2f', r.YPredictedProbabilities.get(predicted)
    println "${txt.padRight(30)}${predicted.center(10)}$probability"
}

When run, it has this output:

Classifying                   Predicted   Probability
Bienvenido a Madrid            Spanish    0.83
Bienvenue à Paris               French    0.71
Welcome to London              English    1.00
Willkommen in Berlin            German    0.84
Selamat Datang di Jakarta     Indonesian  1.00

Given these phrases are very short, it is nice to get them all correct, and the probabilities all seem reasonable for this scenario.

Parts of Speech

Parts of speech (POS) analysers examine each part of a sentence (the words and potentially punctuation) in terms of the role it plays. A typical analyser will assign or annotate words with their role, for example identifying nouns, verbs, adjectives and so forth. This can be a key early step for tools like the voice assistants from Amazon, Apple and Google.

We’ll start by looking at a perhaps lesser-known library, Nlp4j, before looking at some others. In fact, there are multiple Nlp4j libraries. We’ll use the one from nlp4j.org, which seems to be the most active and recently updated.

This library uses the Stanford CoreNLP library under the covers for its English POS functionality. The library has the concept of documents, and annotators that work on documents. Once annotated, we can print out all of the discovered words and their annotations:

var doc = new DefaultDocument()
doc.putAttribute('text', 'I eat sushi with chopsticks.')
var ann = new StanfordPosAnnotator()
ann.setProperty('target', 'text')
ann.annotate(doc)
println doc.keywords.collect{ k -> "${k.facet - 'word.'}(${k.str})" }.join(' ')

When run, we see the following output:

PRP(I) VBP(eat) NN(sushi) IN(with) NNS(chopsticks) .(.)

The annotations, also known as tags or facets, for this example are as follows:

  • PRP: Personal pronoun
  • VBP: Present tense verb
  • NN: Noun, singular
  • IN: Preposition
  • NNS: Noun, plural

The documentation for the libraries we are using gives a more complete list of such annotations.

A nice aspect of this library is support for other languages, in particular, Japanese. The code is very similar but uses a different annotator:

doc = new DefaultDocument()
doc.putAttribute('text', '私は学校に行きました。')
ann = new KuromojiAnnotator()
ann.setProperty('target', 'text')
ann.annotate(doc)
println doc.keywords.collect{ k -> "${k.facet}(${k.str})" }.join(' ')

When run, we see the following output:

名詞(私) 助詞(は) 名詞(学校) 助詞(に) 動詞(行き) 助動詞(まし) 助動詞(た) 記号(。)

Before progressing, we’ll highlight the result visualization capabilities of the GroovyConsole. This feature lets us write a small Groovy script which converts results to any Swing component. In our case we’ll convert lists of annotated strings to a JLabel component containing HTML including colored annotation boxes. The details aren’t included here but can be found in the repo. We need to copy that file into our ~/.groovy folder and then enable script visualization as shown here:

Enabling visualization in the Groovy Console

Then we should see the following when running the script:

Result of run in Groovy Console with visualization enabled

The visualization is purely optional but adds a nice touch. If using Groovy in notebook environments like Jupyter/BeakerX, there might be visualization tools in those environments too.

Let’s look at a larger example using the Smile library.

First, the sentences that we’ll examine:

def sentences = [
'Paul has two sisters, Maree and Christine.',
'No wise fish would go anywhere without a porpoise',
'His bark was much worse than his bite',
'Turn on the lights to the main bedroom',
"Light 'em all up",
'Make it dark downstairs'
]

A couple of those sentences might seem a little strange but they are selected to show off quite a few of the different POS tags.

Smile has a tokenizer class which splits a sentence into words. It handles numerous cases like contractions and abbreviations (“e.g.”, “’tis”, “won’t”). Smile also has a POS class based on the hidden Markov model and a built-in model is used for that class. Here is our code using those classes:

def tokenizer = new SimpleTokenizer(true)
sentences.each {
    def tokens = Arrays.stream(tokenizer.split(it)).toArray(String[]::new)
    def tags = HMMPOSTagger.default.tag(tokens)*.toString()
    println tokens.indices.collect{tags[it] == tokens[it] ? tags[it] : "${tags[it]}(${tokens[it]})" }.join(' ')
}

We run the tokenizer for each sentence. Each token is then displayed directly or with its tag if it has one.

Running the script gives this visualization:

NNP(Paul) VBZ(has) CD(two) NNS(sisters) , NNP(Maree) CC(and) NNP(Christine) .
DT(No) JJ(wise) NN(fish) MD(would) VB(go) RB(anywhere) IN(without) DT(a) NN(porpoise)
PRP$(His) NN(bark) VBD(was) RB(much) JJR(worse) IN(than) PRP$(his) NN(bite)
VB(Turn) IN(on) DT(the) NNS(lights) TO(to) DT(the) JJ(main) NN(bedroom)
NNP(Light) PRP(’em) RB(all) RB(up)
VB(Make) PRP(it) JJ(dark) NN(downstairs)

[Note: the scripts in the repo just print to stdout, which is perfect when using the command-line or IDEs. The visualization in the GroovyConsole kicks in only for the actual result. So, if you are following along at home and want to use the GroovyConsole, you’d change the each to collect and remove the println, and you should be good for visualization.]

The OpenNLP code is very similar:

def tokenizer = SimpleTokenizer.INSTANCE
sentences.each {
    String[] tokens = tokenizer.tokenize(it)
    def posTagger = new POSTaggerME('en')
    String[] tags = posTagger.tag(tokens)
    println tokens.indices.collect{tags[it] == tokens[it] ? tags[it] : "${tags[it]}(${tokens[it]})" }.join(' ')
}

OpenNLP allows you to supply your own POS model but downloads a default one if none is specified (a sketch of the explicit-model variant appears at the end of this section).

When the script is run, it has this visualization:

PROPN(Paul) VERB(has) NUM(two) NOUN(sisters) PUNCT(,) PROPN(Maree) CCONJ(and) PROPN(Christine) PUNCT(.)
DET(No) ADJ(wise) NOUN(fish) AUX(would) VERB(go) ADV(anywhere) ADP(without) DET(a) NOUN(porpoise)
PRON(His) NOUN(bark) AUX(was) ADV(much) ADJ(worse) ADP(than) PRON(his) NOUN(bite)
VERB(Turn) ADP(on) DET(the) NOUN(lights) ADP(to) DET(the) ADJ(main) NOUN(bedroom)
NOUN(Light) PUNCT(’) NOUN(em) ADV(all) ADP(up)
VERB(Make) PRON(it) ADJ(dark) NOUN(downstairs)


The observant reader may have noticed some slight differences in the tags used in this library. They are essentially the same but using slightly different names. This is something to be aware of when swapping between POS libraries or models. Make sure you look up the documentation for the library/model you are using to understand the available tag types.
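Finally, as flagged earlier, OpenNLP also lets you supply a POS model explicitly rather than relying on the automatic download. A minimal hedged sketch, where the model file path is purely illustrative:

// load a previously downloaded model file and hand it to the tagger
def model = new POSModel(new File('/path/to/en-pos-maxent.bin'))
def posTagger = new POSTaggerME(model)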

Entity Detection

Named entity recognition (NER) seeks to identify and classify named entities in text. Categories of interest might be persons, organizations, locations, dates, etc. It is another technology used in many fields of NLP.

We’ll start with our sentences to analyse:

String[] sentences = [
"A commit by Daniel Sun on December 6, 2020 improved Groovy 4's language integrated query.",
"A commit by Daniel on Sun., December 6, 2020 improved Groovy 4's language integrated query.",
'The Groovy in Action book by Dierk Koenig et. al. is a bargain at $50, or indeed any price.',
'The conference wrapped up yesterday at 5:30 p.m. in Copenhagen, Denmark.',
'I saw Ms. May Smith waving to June Jones.',
'The parcel was passed from May to June.',
'The Mona Lisa by Leonardo da Vinci has been on display in the Louvre, Paris since 1797.'
]

For this example, we’ll use some well-known pre-trained models, focusing on the person, money, date, time, and location models:

def base = 'http://opennlp.sourceforge.net/models-1.5'
def modelNames = ['person', 'money', 'date', 'time', 'location']
def finders = modelNames.collect { model ->
    new NameFinderME(DownloadUtil.downloadModel(new URL("$base/en-ner-${model}.bin"), TokenNameFinderModel))
}

We’ll now tokenize our sentences:

def tokenizer = SimpleTokenizer.INSTANCE
sentences.each { sentence ->
    String[] tokens = tokenizer.tokenize(sentence)
    Span[] tokenSpans = tokenizer.tokenizePos(sentence)
    def entityText = [:]
    def entityPos = [:]
    finders.indices.each {fi ->
        // could be made smarter by looking at probabilities and overlapping spans
        Span[] spans = finders[fi].find(tokens)
        spans.each { span ->
            def se = span.start..<span.end
            def pos = (tokenSpans[se.from].start)..<(tokenSpans[se.to].end)
            entityPos[span.start] = pos
            entityText[span.start] = "$span.type(${sentence[pos]})"
        }
    }
    entityPos.keySet().sort().reverseEach {
        def pos = entityPos[it]
        def (from, to) = [pos.from, pos.to + 1]
        sentence = sentence[0..<from] + entityText[it] + sentence[to..-1]
    }
    println sentence
}

And when visualized, shows this:

A commit by person(Daniel Sun) on date(December 6, 2020) improved Groovy 4's language integrated query.
A commit by person(Daniel) on Sun., date(December 6, 2020) improved Groovy 4's language integrated query.
The Groovy in Action book by person(Dierk Koenig) et. al. is a bargain at money($50), or indeed any price.
The conference wrapped up date(yesterday) at time(5:30 p.m.) in location(Copenhagen), location(Denmark).
I saw Ms. person(May Smith) waving to person(June Jones).
The parcel was passed from date(May to June).
The Mona Lisa by person(Leonardo da Vinci) has been on display in the Louvre, location(Paris) since date(1797).


We can see here that most examples have been categorized as we might expect. We’d have to improve our model for it to do a better job on the “May to June” example.
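One ingredient for such an improvement could be the per-span probabilities hinted at in the code comment earlier. A hedged sketch (reusing the tokenizer, finders, modelNames and sentences from above) that prints each finder's spans and probabilities for the ambiguous "May to June" sentence:

String[] tokens = tokenizer.tokenize(sentences[5])
finders.indices.each { fi ->
    Span[] spans = finders[fi].find(tokens)
    double[] probs = finders[fi].probs(spans)
    spans.indices.each {
        println "${modelNames[fi]}: ${spans[it]} ${sprintf('%.2f', probs[it])}"
    }
}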

Scaling Entity Detection

For large problems, we can also run our named entity detection algorithms on platforms like Spark NLP which adds NLP functionality to Apache Spark. We’ll use glove_100d embeddings and the onto_100 NER model.

var assembler = new DocumentAssembler(inputCol: 'text', outputCol: 'document', cleanupMode: 'disabled')
var tokenizer = new Tokenizer(inputCols: ['document'] as String[], outputCol: 'token')
var embeddings = WordEmbeddingsModel.pretrained('glove_100d').tap {
    inputCols = ['document', 'token'] as String[]
    outputCol = 'embeddings'
}
var model = NerDLModel.pretrained('onto_100', 'en').tap {
    inputCols = ['document', 'token', 'embeddings'] as String[]
    outputCol ='ner'
}
var converter = new NerConverter(inputCols: ['document', 'token', 'ner'] as String[], outputCol: 'ner_chunk')
var pipeline = new Pipeline(stages: [assembler, tokenizer, embeddings, model, converter] as PipelineStage[])
var spark = SparkNLP.start(false, false, '16G', '', '', '')
var text = [
    "The Mona Lisa is a 16th century oil painting created by Leonardo. It's held at the Louvre in Paris."
]
var data = spark.createDataset(text, Encoders.STRING()).toDF('text')
var pipelineModel = pipeline.fit(data)
var transformed = pipelineModel.transform(data)
transformed.show()
use(SparkCategory) {
    transformed.collectAsList().each { row ->
        def res =  row.text
        def chunks = row.ner_chunk.reverseIterator()
        while (chunks.hasNext()) {
            def chunk = chunks.next()
            int begin = chunk.begin
            int end = chunk.end
            def entity = chunk.metadata.get('entity').get()
            res = res[0..<begin] + "$entity($chunk.result)" + res[end<..-1]
        }
        println res
    }
}

There is no need for us to go into all of the details here. In summary, the code sets up a pipeline that transforms our input sentences, via a series of steps, into chunks, where each chunk corresponds to a detected entity. Each chunk has a start and ending position, and an associated tag type.

This may not seem like it is much different to our earlier examples, but if we had large volumes of data and we were running in a large cluster, the work could be spread across worker nodes within the cluster.

Here we have used a utility SparkCategory class which makes accessing the information in Spark Row instances a little nicer in terms of Groovy shorthand syntax. We can use row.text instead of row.get(row.fieldIndex('text')). Here is the code for this utility class:

class SparkCategory {
    static get(Row r, String field) { r.get(r.fieldIndex(field)) }
}

If doing more than this simple example, the use of SparkCategory could be made implicit through various standard Groovy techniques.
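One hedged option (not used in the article) is a small runtime metaprogramming tweak, so that row.text works everywhere without wrapping code in use(SparkCategory) { ... }:

// enhancing an interface such as Row requires globally enabled ExpandoMetaClass
ExpandoMetaClass.enableGlobally()
Row.metaClass.propertyMissing = { String name -> delegate.get(delegate.fieldIndex(name)) }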

When we run our script, we see the following output:

22/08/07 12:31:39 INFO SparkContext: Running Spark version 3.3.0
...
glove_100d download started this may take some time.
Approximate size to download 145.3 MB
...
onto_100 download started this may take some time.
Approximate size to download 13.5 MB
...
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|                text|            document|               token|          embeddings|                 ner|           ner_chunk|
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|The Mona Lisa is ...|[{document, 0, 98...|[{token, 0, 2, Th...|[{word_embeddings...|[{named_entity, 0...|[{chunk, 0, 12, T...|
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
PERSON(The Mona Lisa) is a DATE(16th century) oil painting created by PERSON(Leonardo). It's held at the FAC(Louvre) in GPE(Paris).

The result has the following visualization:

PERSON(The Mona Lisa) is a DATE(16th century) oil painting created by PERSON(Leonardo). It’s held at the FAC(Louvre) in GPE(Paris).

Here FAC is facility (buildings, airports, highways, bridges, etc.) and GPE is Geo-Political Entity (countries, cities, states, etc.).

Sentiment Analysis

corenlp logoSentiment analysis is an NLP technique used to determine whether data is positive, negative, or neutral. Stanford CoreNLP has default models it uses for this purpose:

def doc = new Document('''
StanfordNLP is fantastic!
Groovy is great fun!
Math can be hard!
''')
for (sent in doc.sentences()) {
println "${sent.toString().padRight(40)} ${sent.sentiment()}"
}

Which has the following output:

[main] INFO edu.stanford.nlp.parser.common.ParserGrammar - Loading parser from serialized file edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz ... done [0.6 sec].
[main] INFO edu.stanford.nlp.sentiment.SentimentModel - Loading sentiment model edu/stanford/nlp/models/sentiment/sentiment.ser.gz ... done [0.1 sec].
StanfordNLP is fantastic!                POSITIVE
Groovy is great fun!                     VERY_POSITIVE
Math can be hard!                        NEUTRAL

In addition to using pre-trained models, we can also train our own. Let’s start with two datasets:

def datasets = [
positive: getClass().classLoader.getResource("rt-polarity.pos").toURI(),
negative: getClass().classLoader.getResource("rt-polarity.neg").toURI()
]

Initially, we’ll use Datumbox which, as we saw earlier, requires training parameters for our algorithm:

def trainingParams = new TextClassifier.TrainingParameters(
numericalScalerTrainingParameters: null,
featureSelectorTrainingParametersList: [new ChisquareSelect.TrainingParameters()],
textExtractorParameters: new NgramsExtractor.Parameters(),
modelerTrainingParameters: new MultinomialNaiveBayes.TrainingParameters()
)

We now create our algorithm, train it with our training dataset, and for illustrative purposes validate against the training dataset:

def config = Configuration.configuration
TextClassifier classifier = MLBuilder.create(trainingParams, config)
classifier.fit(datasets)
def metrics = classifier.validate(datasets)
println "Classifier Accuracy (using training data): $metrics.accuracy"

The output is shown here:

[main] INFO com.datumbox.framework.core.common.dataobjects.Dataframe$Builder - Dataset Parsing positive class
[main] INFO com.datumbox.framework.core.common.dataobjects.Dataframe$Builder - Dataset Parsing negative class
...
Classifier Accuracy (using training data): 0.8275959103273615

Now we can test our model against several sentences:

['Datumbox is divine!', 'Groovy is great fun!', 'Math can be hard!'].each {
    def r = classifier.predict(it)
    def predicted = r.YPredicted
    def probability = sprintf '%4.2f', r.YPredictedProbabilities.get(predicted)
    println "Classifing: '$it',  Predicted: $predicted,  Probability: $probability"
}

Which has this output:

...
[main] INFO com.datumbox.framework.applications.nlp.TextClassifier - predict()
...
Classifying: 'Datumbox is divine!',  Predicted: positive,  Probability: 0.83
Classifying: 'Groovy is great fun!',  Predicted: positive,  Probability: 0.80
Classifying: 'Math can be hard!',  Predicted: negative,  Probability: 0.95

We can do the same thing but with OpenNLP. First, we collect our input data. OpenNLP is expecting it in a single dataset with tagged examples:

def trainingCollection = datasets.collect { k, v ->
new File(v).readLines().collect{"$k $it".toString() }
}.sum()

Now, we’ll train two model variants: one uses naïve Bayes, the other maximum entropy (maxent).

def variants = [
        Maxent    : new TrainingParameters(),
        NaiveBayes: new TrainingParameters((CUTOFF_PARAM): '0', (ALGORITHM_PARAM): NAIVE_BAYES_VALUE)
]
def models = [:]
variants.each{ key, trainingParams ->
    def trainingStream = new CollectionObjectStream(trainingCollection)
    def sampleStream = new DocumentSampleStream(trainingStream)
    println "\nTraining using $key"
    models[key] = DocumentCategorizerME.train('en', sampleStream, trainingParams, new DoccatFactory())
}

Now we run sentiment predictions on our sample sentences using both variants:

def sentences = ['OpenNLP is fantastic!', 'Groovy is great fun!', 'Math can be hard!']  // as seen in the output below
def w = sentences*.size().max()
variants.each { key, params ->
    def categorizer = new DocumentCategorizerME(models[key])
    println "\nAnalyzing using $key"
    sentences.each {
        def result = categorizer.categorize(it.split('[ !]'))
        def category = categorizer.getBestCategory(result)
        def prob = sprintf '%4.2f', result[categorizer.getIndex(category)]
        println "${it.padRight(w)} $category ($prob)}"
    }
}

When we run this we get:

Training using Maxent ...done.
...
Training using NaiveBayes ...done.
...
Analyzing using Maxent
OpenNLP is fantastic! positive (0.64)
Groovy is great fun!  positive (0.74)
Math can be hard!     negative (0.61)
Analyzing using NaiveBayes
OpenNLP is fantastic! positive (0.72)
Groovy is great fun!  positive (0.81)
Math can be hard!     negative (0.72)

The models here appear to have lower probability levels compared to the model we trained for Datumbox. We could try tweaking the training parameters further if this was a problem. We’d probably also need a bigger testing set to convince ourselves of the relative merits of each model. Some models can be over-trained on small datasets and perform very well with data similar to their training datasets but perform much worse for other data.

Universal Sentence Encoding

This example is inspired by the UniversalSentenceEncoder example in the DJL examples module. It looks at using the universal sentence encoder model from TensorFlow Hub via the Deep Java Library (DJL) API.

First we define a translator. The Translator interface allows us to specify pre- and post-processing functionality.

class MyTranslator implements NoBatchifyTranslator<String[], double[][]> {
    @Override
    NDList processInput(TranslatorContext ctx, String[] raw) {
        var factory = ctx.NDManager
        var inputs = new NDList(raw.collect(factory::create))
        new NDList(NDArrays.stack(inputs))
    }

    @Override
    double[][] processOutput(TranslatorContext ctx, NDList list) {
        long numOutputs = list.singletonOrThrow().shape.get(0)
        NDList result = []
        for (i in 0..<numOutputs) {
            result << list.singletonOrThrow().get(i)
        }
        result*.toFloatArray() as double[][]
    }
}

Here, we manually pack our input sentences into the required n-dimensional data types, and extract our output calculations into a 2D double array.

Next, we create our predict method by first defining the criteria for our prediction algorithm. We are going to use our translator, use the TensorFlow engine, use a predefined sentence encoder model from TensorFlow Hub, and indicate that we are creating a text embedding application:

def predict(String[] inputs) {
    String modelUrl = "https://storage.googleapis.com/tfhub-modules/google/universal-sentence-encoder/4.tar.gz"
    Criteria<String[], double[][]> criteria = Criteria.builder()
            .optApplication(Application.NLP.TEXT_EMBEDDING)
            .setTypes(String[], double[][])
            .optModelUrls(modelUrl)
            .optTranslator(new MyTranslator())
            .optEngine("TensorFlow")
            .optProgress(new ProgressBar())
            .build()

    try (var model = criteria.loadModel(); var predictor = model.newPredictor()) {
        predictor.predict(inputs)
    }
}

Now, let’s define our input strings:

String[] inputs = [
"Cycling is low impact and great for cardio",
"Swimming is low impact and good for fitness",
"Palates is good for fitness and flexibility",
"Weights are good for strength and fitness",
"Orchids can be tricky to grow",
"Sunflowers are fun to grow",
"Radishes are easy to grow",
"The taste of radishes grows on you after a while",
]
var k = inputs.size()

Then, we’ll use our predictor method to calculate the embeddings for each sentence. We’ll print out the embeddings and also calculate the dot product of the embeddings. The dot product (the same as the inner product for this case) reveals how related the sentences are.
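The dot helper used below isn't shown in the article; a simple stand-in (assuming plain double arrays of equal length) could be:

// sum of element-wise products of the two embedding vectors
double dot(double[] a, double[] b) {
    double sum = 0
    for (i in 0..<a.length) {
        sum += a[i] * b[i]
    }
    sum
}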

var embeddings = predict(inputs)

var z = new double[k][k]
for (i in 0..<k) {
    println "Embedding for: ${inputs[i]}\n${embeddings[i]}"
    for (j in 0..<k) {
        z[i][j] = dot(embeddings[i], embeddings[j])
    }
}

Finally, we’ll use the Heatmap class from Smile to present a nice display highlighting what the data reveals:

new Heatmap(inputs, inputs, z, Palette.heat(20).reverse()).canvas().with {
    title = 'Semantic textual similarity'
    setAxisLabels('', '')
    window()
}

The output shows us the embeddings:

Loading:     100% |========================================|
2022-08-07 17:10:43.212697: ... This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2
...
2022-08-07 17:10:52.589396: ... SavedModel load for tags { serve }; Status: success: OK...
...
Embedding for: Cycling is low impact and great for cardio
[-0.02865048497915268, 0.02069241739809513, 0.010843578726053238, -0.04450441896915436, ...]
...
Embedding for: The taste of radishes grows on you after a while
[0.015841705724596977, -0.03129228577017784, 0.01183396577835083, 0.022753292694687843, ...]

Embeddings are an indication of similarity. Two sentences with similar meaning typically have similar embeddings.

Our heatmap is shown below:

Heatmap showing sentence similarity

This graphic shows that our first four sentences are somewhat related, as are the last four sentences, but that there is minimal relationship between those two groups.

Interested readers may also like to see this earlier JVM Advent blog post about OpenNLP.

Object detection

Looks at detecting objects within images using DJL and Apache MXNet.

Our final problem looks at using Apache Groovy with the Deep Java Library (DJL) and backed by the Apache MXNet engine to detect objects within an image.

About Deep Java Library (DJL) & Apache MXNet

DJL logoDJL is engine agnostic, so it’s capable of supporting different backends including Apache MXNet, PyTorch, TensorFlow and ONNX Runtime. We’ll use the default engine, which for our application (at the time of writing) is Apache MXNet.

mxnet logoApache MXNet provides the underlying engine. It has support for imperative and symbolic execution, distributed training of your models using multi-GPU or multi-host hardware, and multiple language bindings. Groovy is fully compatible with the Java binding.

Using DJL with Groovy

Groovy uses the Java binding. Consider looking at the DJL beginner tutorials for Java – they will work almost unchanged for Groovy.

For our example, the first thing we need to do is download the image we want to run the object detection model on:

Path tempDir = Files.createTempDirectory("resnetssd")
def imageName = 'dog-ssd.jpg'
Path localImage = tempDir.resolve(imageName)
def url = new URL("https://s3.amazonaws.com/model-server/inputs/$imageName")
DownloadUtils.download(url, localImage, new ProgressBar())
Image img = ImageFactory.instance.fromFile(localImage)

It happens to be a well-known, already available image. We’ll store a local copy of the image in a temporary directory, and we’ll use a utility class that comes with DJL to provide a nice progress bar while the image is downloading. DJL provides its own image classes, so we’ll create an instance using the appropriate class from the downloaded image.

Next we want to configure our neural network layers:

def criteria = Criteria.builder()
        .optApplication(Application.CV.OBJECT_DETECTION)
        .setTypes(Image, DetectedObjects)
        .optFilter("backbone", "resnet50")
        .optEngine(Engine.defaultEngineName)
        .optProgress(new ProgressBar())
        .build()

DJL supports numerous model applications including image classification, word recognition, sentiment analysis, linear regression, and others. We’ll select object detection. This kind of application looks for the bounding box of known objects within an image. The types configuration option identifies that our input will be an image and the output will be detected objects. The filter option indicates that we will be using ResNet-50 (a 50-layers deep convolutional neural network often used as a backbone for many computer vision tasks). We set the engine to be the default engine which happens to be Apache MXNet. We also configure an optional progress bar to provide feedback of progress while our model is running.
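As an optional aside (not part of the original example), DJL can list the models available from the zoos on the classpath, which can help when choosing filters such as the backbone; this assumes the relevant ModelZoo import:

// print every model artifact known to the loaded model zoos
ModelZoo.listModels().each { application, artifacts ->
    artifacts.each { println "$application -> $it" }
}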

Now that we have our configuration sorted, we’ll use it to load a model and then use the model to make object predictions:

def detection = criteria.loadModel().withCloseable { model ->
    model.newPredictor().predict(img)
}
detection.items().each { println it }
img.drawBoundingBoxes(detection)

For good measure, we’ll draw the bounding boxes into our image.

Next, we save our image into a file and display it using Groovy’s SwingBuilder.

Path imageSaved = tempDir.resolve('detected.png')
imageSaved.withOutputStream { os -> img.save(os, 'png') }
def saved = ImageIO.read(imageSaved.toFile())
new SwingBuilder().edt {
    frame(title: "$detection.numberOfObjects detected objects",
            size: [saved.width, saved.height],
            defaultCloseOperation: DISPOSE_ON_CLOSE,
            show: true) { label(icon: imageIcon(image: saved)) }
}

Building and running our application

Our code is stored in a source file called ObjectDetect.groovy.

The example uses Gradle for the build technology:

apply plugin: 'groovy'
apply plugin: 'application'

repositories {
    mavenCentral()
}

application {
    mainClass = 'ObjectDetect'
}

dependencies {
    implementation "ai.djl:api:0.18.0"
    implementation "org.apache.groovy:groovy:4.0.4"
    implementation "org.apache.groovy:groovy-swing:4.0.4"
    runtimeOnly "ai.djl:model-zoo:0.18.0"
    runtimeOnly "ai.djl.mxnet:mxnet-engine:0.18.0"
    runtimeOnly "ai.djl.mxnet:mxnet-model-zoo:0.18.0"
    runtimeOnly "ai.djl.mxnet:mxnet-native-auto:1.8.0"
    runtimeOnly "org.apache.groovy:groovy-nio:4.0.4"
    runtimeOnly "org.slf4j:slf4j-jdk14:1.7.36"
}

We run the application with the gradle run task:

paulk@pop-os:/extra/projects/groovy-data-science$ ./gradlew DLMXNet:run
> Task :DeepLearningMxnet:run
Downloading: 100% |████████████████████████████████████████| dog-ssd.jpg
Loading:     100% |████████████████████████████████████████|
...
class: "car", probability: 0.99991, bounds: [x=0.611, y=0.137, width=0.293, height=0.160]
class: "bicycle", probability: 0.95385, bounds: [x=0.162, y=0.207, width=0.594, height=0.588]
class: "dog", probability: 0.93752, bounds: [x=0.168, y=0.350, width=0.274, height=0.593]

Our displayed image looks like this:

detected objects in image

The full source code for this example can be found in the following repo:
https://github.com/paulk-asert/groovy-data-science/subprojects/DeepLearningMxnet

Conclusion

We have seen various data science tasks solved with Groovy and numerous JVM libraries and platforms. Hopefully, you’ve also seen some of the benefits of using Groovy for your data science implementations, including its:

  • friendly Java-like syntax and flexibility of dynamic or static typing
  • metaprogramming capabilities that often simplify code
  • close alignment with Java that reduces learning curves and allows Groovy to piggy-back on the great work of the JVM developers
  • ability to use the many options for scaling that exist on the JVM

You should be confident that Groovy allows you to create simple solutions for simple problems but also scale using a variety of JVM technologies to solve even the biggest data science problems.

Author: Paul King

Dr Paul King is a JavaOne Rockstar who has been contributing to open source projects for over 30 years. He is an active committer on numerous projects including Groovy, Grails, GPars and Gradle. Paul speaks at international conferences, publishes in software magazines and journals, and is a co-author of Manning’s best-seller: Groovy in Action, 2nd Edition. Paul is a Distinguished Engineer and Head of the Groovy Practice at Object Computing, and is also VP Apache Groovy and Chair of the Apache Groovy PMC.
