JVM Advent

The JVM Programming Advent Calendar

Finding your presents using CodeQL

When I was a kid, I could not contain my excitement about Christmas: meeting my extended family and having a great dinner with lots of laughter and joy. There was a tradition for the kids to search for Christmas presents in the living room, be it behind the couch, between the Christmas tree branches, or way up on the shelf. Given how much I enjoyed the journey of finding the presents each year, my parents tried to make it a bit more interesting, sometimes by using wrapping paper that matched the wallpaper, sometimes by hiding a present inside the chimney. The game reached a point where the hiding places were so good that I could not find all the presents anymore, and one year even my parents forgot where they had put all the gifts. Long story short, I found the missing present eventually… 3 months later, tucked away in the bookshelf, wrapped in a book cover. Fast forward to today; let’s get some help from technology to find our presents this year.

The Living Room

This year, let’s turn the complexity up a notch and use a whole codebase as a searching ground for our presents. The good thing is that we can hide a lot more gifts in our codebase due to its inherent complexity. We have a lot of actors (our team) that can hide presents – either accidentally or even deliberately 🤫. For the sake of simplicity, let’s assume we’re hosting our code on GitHub, as this offers us an easy way to set up some of the other things we need in our journey to find all the presents.

The Presents

One of the first tasks we have is to define what a present is. It happened more than once when searching for gifts that I proudly claimed to have found something, whereas it was just some random box that happened to sit on a shelf (= false positive). For the sake of simplicity, let’s say that all objects implementing the ChristmasPresent interface are of relevance (true positive).

interface ChristmasPresent {}

The Search

Nowadays, we have plenty of tools at our hands, from our favorite IDE to a wide variety of static and dynamic analysis tools. In today’s episode of “Benny is looking for presents,” let’s use a technology called CodeQL. CodeQL is the code analysis engine developed by GitHub to automate security checks. You can analyze your code using CodeQL and display the results as code scanning alerts.
Let’s start with an easy query and refine it later on. The excellent part about CodeQL is that it treats your code as if it were data. That means your code is represented as a database, and you can easily query that database using CodeQL.
First, let’s find all places that instantiate a new object (e.g. new Foo()) and refine that going forward.

import java

from ClassInstanceExpr newExpressions
select newExpressions

The first thing to note is that we import the java libraries. CodeQL itself is language-agnostic and supports an incredible variety of languages, including Java, C/C++, C#, Python, JavaScript/TypeScript, Go, and Ruby. 

The above query finds and selects all ClassInstanceExpr (= an expression in the form of new Foo()). An excellent resource to find the right CodeQL type for a given Java syntax is the page on Abstract syntax tree classes for working with Java programs.

You can develop and run such queries in any editor with the CodeQL CLI, or with the Visual Studio Code Extension, which provides tight integration between CodeQL and your editor. You can also use the Query console if you just want to give it a quick try on some notable projects. Given the above query, let’s run it on a database and see the results:

On the right, we have our query, whereas on the left we have the results view, which shows all places that instantiate an object. You can also see that it took 0.025 seconds to find all 715 expressions matching our query.

Finding presents

Now that’s a lot of results to review, so let’s go back to our query and see if we can refine it.
First up, we only want to find instantiations that create an object of a class implementing our ChristmasPresent interface. For that, let’s first define a type to capture that information:

class PresentType extends RefType {
    PresentType() { this.hasQualifiedName("p", "ChristmasPresent") }
}

This CodeQL class matches any type (extends RefType) that has the qualified name ChristmasPresent in package p. If we just used this as our query, we’d find one result: our interface defined above. Let’s use this type now to refine our find-our-presents query:

class PresentType extends RefType {
  PresentType() { this.hasQualifiedName("p", "ChristmasPresent") }
}

from ClassInstanceExpr newExpression, PresentType presentType
where newExpression.getConstructedType().hasSupertype(presentType)
select newExpression, newExpression.toString()

The most essential part is that we restrict our query to expressions that construct a type with our interface as a supertype. We can’t just conjure our Christmas present type out of thin air, so we add it to our from clause. The from clause thus ranges over all new expressions and all ChristmasPresent types in the source code, while the where clause combines them into a helpful filter.
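To make the from/where/select shape more tangible, here is a rough Java analogy. It is only a sketch of the idea of "code as data", not how CodeQL works internally: the records TypeInfo and NewExpr, the class PresentQuery, and the sample rows (including the file locations) are made up for illustration.

```java
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical, heavily simplified stand-ins for rows in a code database.
record TypeInfo(String pkg, String name, Set<String> supertypes) {}

record NewExpr(String location, TypeInfo constructedType) {}

class PresentQuery {
    // Roughly mirrors:
    //   from ClassInstanceExpr e, PresentType t
    //   where e.getConstructedType().hasSupertype(t)
    //   select e
    static List<NewExpr> findPresents(List<NewExpr> allNewExprs, List<String> presentTypes) {
        return allNewExprs.stream()
            // keep an expression if at least one present type is among its supertypes
            .filter(e -> presentTypes.stream()
                .anyMatch(t -> e.constructedType().supertypes().contains(t)))
            .collect(Collectors.toList());
    }

    // Invented sample data: two presents and one unrelated object creation.
    static List<NewExpr> sampleData() {
        TypeInfo javaBook = new TypeInfo("p", "JavaBook", Set.of("p.ChristmasPresent"));
        TypeInfo giftWrapper = new TypeInfo("p", "GiftWrapper", Set.of("p.ChristmasPresent"));
        TypeInfo stringBuilder = new TypeInfo("java.lang", "StringBuilder", Set.of("java.lang.Object"));
        return List.of(
            new NewExpr("Main.java:10", javaBook),
            new NewExpr("SantasWrapperFactory.java:7", giftWrapper),
            new NewExpr("Main.java:12", stringBuilder));
    }
}
```

Calling PresentQuery.findPresents(PresentQuery.sampleData(), List.of("p.ChristmasPresent")) keeps the two present-creating expressions and drops the StringBuilder one, which is the same narrowing the where clause performs on the real database.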

If you wonder about the results of this query: our application has two classes that implement the ChristmasPresent interface, a JavaBook (and yes, you can fight me over using inheritance in this case) and a GiftWrapper. Their usage can be summarized as follows:

ChristmasPresent wrappedPresent = new SantasWrapperFactory().wrap(new JavaBook("Java Concurrency in Practice"));

Whereas GiftWrapper is a simple decorator around any ChristmasPresent:

public class GiftWrapper implements ChristmasPresent {

    private ChristmasPresent present;

    public GiftWrapper(ChristmasPresent present) {
        this.present = present;
    }

    public ChristmasPresent unwrap() {
        return present;
    }

    @Override
    public String toString() {
        return "📦";
    }
}
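To see the decorator in action, here is a small self-contained sketch. The article only shows the JavaBook constructor call and the call to SantasWrapperFactory.wrap, so the bodies of JavaBook (with a toString that returns the title) and SantasWrapperFactory below are assumptions; WrappingDemo is a hypothetical driver class.

```java
// Marker interface from the article.
interface ChristmasPresent {}

// Assumed shape of JavaBook: only its constructor call appears in the article.
class JavaBook implements ChristmasPresent {
    private final String title;

    JavaBook(String title) { this.title = title; }

    @Override
    public String toString() { return title; }
}

// The decorator from the article, repeated so this sketch compiles on its own.
class GiftWrapper implements ChristmasPresent {
    private final ChristmasPresent present;

    GiftWrapper(ChristmasPresent present) { this.present = present; }

    ChristmasPresent unwrap() { return present; }

    @Override
    public String toString() { return "📦"; }
}

// Assumed implementation: the article never shows the factory's body.
class SantasWrapperFactory {
    GiftWrapper wrap(ChristmasPresent present) { return new GiftWrapper(present); }
}

// Hypothetical driver showing what each toString() reveals.
class WrappingDemo {
    public static void main(String[] args) {
        GiftWrapper wrapped =
            new SantasWrapperFactory().wrap(new JavaBook("Java Concurrency in Practice"));
        System.out.println(wrapped);          // the wrapper hides the title
        System.out.println(wrapped.unwrap()); // unwrapping reveals it again
    }
}
```

Under these assumptions, the wrapped present is safe to print, while anything that calls unwrap() gets the original object back, title included.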

Using CodeQL, we can now easily find our presents in our codebase. But that’s merely following constructor calls to any implementation of our interface. Given we wrap our gifts, it’s actually not too bad if somebody finds them. They’re covered, so nobody knows what’s inside (and we deliberately don’t expose that fact in our toString method above).

Given there are a few days left until Christmas, let’s ensure that we never expose the contents of a Christmas gift during the development of our application. To do that, we need to consider 3 things. The first is what kind of information we don’t want to leak. For the sake of simplicity (and as JavaBook is the only present we have implemented for now), let’s treat the title of the book as the sensitive information.

new JavaBook("Java Concurrency in Practice")

The second part is what we consider as “leaking information.” For our use case, we want to ensure that at no point the title of the book leaves our application, be it as a response to a request or any other form of showing this information to a user of the system. And given the complexity of today’s software, we want to know not only whether this information is leaked but also, if it is, how the data traversed the system before being exposed.

😱

If you ever wonder how data breaches or other fatalities happen, a common problem is the complexity of systems combined with many interconnected things. Excellent examples can be found in the book “Logic of Failure” as well as in the fantastic talk “How to crash an airplane” by Nickolas Means.

Luckily, CodeQL offers all the features we need to implement such a query. Firstly, we need to define what information we’re looking for.

class BookCreation extends ClassInstanceExpr {
  BookCreation() { this.getConstructedType().hasQualifiedName("p", "JavaBook") }

  StringLiteral title() { result = this.getArgument(0).(StringLiteral) }
}

Using the class above, we can find all call sites that create a book and access the title through its title() predicate (for simplicity, we assume for now that the title is a regular string literal like "Booktitle"). We want to track the data from this “source” and see where it flows. To ensure that we never expose our book title, we can reuse existing sinks from the CodeQL standard library. In this case, the ExternalAPIDataNode sink covers things like writing the data out to respond to an HTTP request.

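// Libraries that provide TaintTracking::Configuration and ExternalAPIDataNode
import semmle.code.java.dataflow.TaintTracking
import semmle.code.java.security.ExternalAPIs

class BookTitleExposed extends TaintTracking::Configuration {
  BookTitleExposed() { this = "BookTitleExposed" }

  // Sources: the title literal passed to a JavaBook constructor
  override predicate isSource(DataFlow::Node source) {
    exists(BookCreation e | e.title() = source.asExpr())
  }

  // Sinks: any data handed to an external API
  override predicate isSink(DataFlow::Node sink) { sink instanceof ExternalAPIDataNode }
}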

The above Configuration decides which data we want to track to which destination, so let’s use it in our query. The query uses the above TaintTracking::Configuration to determine all possible data flows in the program that may carry the book title (or a substring of it) into an external API.

from BookTitleExposed config, DataFlow::PathNode source, DataFlow::PathNode sink
where config.hasFlowPath(source, sink)
select sink, 
       source, 
       sink,
       "Call to "
          + sink.getNode().(ExternalAPIDataNode).getMethodDescription()
          + " exposes book title "
          + source.getNode().asExpr().(StringLiteral).getLiteral()

Running this query on our application actually exposes an information leak.

Let’s look at the code in question:

@Override
protected void doGet(HttpServletRequest req, HttpServletResponse resp) throws IOException {
  JavaBook book = new JavaBook("Java Concurrency in Practice");
  GiftWrapper wrappedPresent = new SantasWrapperFactory().wrap(book);
  resp.getWriter().println(wrappedPresent.unwrap().toString());
}

Looking closely, we see that we properly wrap the present. But we accidentally unwrap it in the following line before writing it to an HTTP response writer. That’s certainly not what we wanted. Not only did we catch our mistake, but CodeQL also helped us to track the flow of data through the various steps, from the JavaBook constructor, through the GiftWrapper ending in the HTTP response writer.
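The idea of following data from a source to a sink can be illustrated with a toy model. The sketch below is not how CodeQL is implemented, and the real analysis derives its flow graph from the program itself. Here the class ToyTaintTracker, its methods, and the node labels are all hypothetical, and the edges are written by hand to resemble the doGet method above.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy illustration only: a hand-written flow graph searched breadth-first.
class ToyTaintTracker {
    // "Data at the key may flow to each of the values."
    private final Map<String, List<String>> flowsTo = new HashMap<>();

    void addFlow(String from, String to) {
        flowsTo.computeIfAbsent(from, k -> new ArrayList<>()).add(to);
    }

    // Returns the nodes on a path from source to sink, or an empty list if none exists.
    List<String> findPath(String source, String sink) {
        Map<String, String> cameFrom = new HashMap<>();
        Deque<String> queue = new ArrayDeque<>();
        cameFrom.put(source, null); // the source has no predecessor
        queue.add(source);
        while (!queue.isEmpty()) {
            String node = queue.remove();
            if (node.equals(sink)) {
                // Walk the predecessor links back to the source, then reverse.
                List<String> path = new ArrayList<>();
                for (String n = node; n != null; n = cameFrom.get(n)) {
                    path.add(n);
                }
                Collections.reverse(path);
                return path;
            }
            for (String next : flowsTo.getOrDefault(node, List.of())) {
                if (!cameFrom.containsKey(next)) {
                    cameFrom.put(next, node);
                    queue.add(next);
                }
            }
        }
        return List.of();
    }

    // Hand-written edges loosely modelled on the doGet method above.
    static ToyTaintTracker doGetExample() {
        ToyTaintTracker tracker = new ToyTaintTracker();
        tracker.addFlow("title literal", "book");
        tracker.addFlow("book", "wrappedPresent");
        tracker.addFlow("wrappedPresent", "unwrap()");
        tracker.addFlow("unwrap()", "toString()");
        tracker.addFlow("toString()", "println argument");
        return tracker;
    }
}
```

Asking ToyTaintTracker.doGetExample().findPath("title literal", "println argument") returns every step in between, which is the kind of source-to-sink path a path query reports alongside each alert.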

📦 it up

We barely scratched the surface of what CodeQL can do. But even with so little knowledge, we could already implement a pretty complex query tracking data flow through our program. I’ve uploaded the sample app and the full query to this GitHub repository. Depending on your use cases, you may already have a few queries in mind that are specific to your codebase and that you want to encode. Or you may just want to lean back and use the queries the team at GitHub and the community have written to find security vulnerabilities in your codebase. No matter which you choose, CodeQL integrates tightly into GitHub’s code scanning infrastructure to give you the best of both worlds, including feedback on pull requests. Be sure to enable code scanning on your GitHub repository today and help us secure the world’s software, together.

If you liked the post, I’d love to hear your thoughts on Twitter.

Author: Benjamin Muskalla

Benny (@bmuskalla) has been following his passion for building tools that improve developer productivity. He works on securing the open-source ecosystem at GitHub, specifically on GitHub’s CodeQL security analysis technology. In a previous life, he was an active committer on the world-class Eclipse IDE, built a JUnit Christmas Decoration Extension, and was a core committer on the Gradle Build Tool. TDD and API design are dear to his heart, as is working on open-source software.
