Sunday, December 9, 2012

Changes to String.substring in Java 7

It is common knowledge that Java optimizes the substring operation for the case where you generate a lot of substrings of the same source string. It does this by storing each string as a (value, offset, count) triple, where value is a character array that several strings can share. See an example below:

In the above diagram you see the strings "Hello" and "World!" derived from "Hello World!" and the way they are represented in the heap: there is one character array containing "Hello World!" and two references to it. This method of storage is advantageous in some cases, for example for a compiler which tokenizes source files. In other instances it may lead to an OutOfMemoryError: if you are routinely reading long strings and only keeping a small part of each, the above mechanism prevents the GC from collecting the original String buffer. Some even call it a bug. I wouldn't go so far, but it's certainly a leaky abstraction, because you were forced to do the following to ensure that a copy was made: new String(str.substring(5, 6)).
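
The defensive-copy idiom can be sketched as follows. This is a minimal illustration, assuming a hypothetical class SubstringCopyDemo and method extractKeys (neither name comes from the JDK or from this post). On a pre-7u6 JVM, the new String(...) wrapper is what lets each long line be garbage collected; on later versions it is harmless but redundant.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class; the names are illustrative only.
class SubstringCopyDemo {

    // Keeps only a short key from the start of each long line.
    // Before Java 7u6, line.substring(...) shared the whole char[] of the line,
    // so the wrapper new String(...) was needed to get a trimmed copy.
    static List<String> extractKeys(List<String> longLines, int keyLength) {
        List<String> keys = new ArrayList<String>();
        for (String line : longLines) {
            keys.add(new String(line.substring(0, keyLength)));
        }
        return keys;
    }

    public static void main(String[] args) {
        List<String> lines = new ArrayList<String>();
        lines.add("id-001 followed by a very long payload that we do not need to keep");
        lines.add("id-002 followed by another very long payload");
        System.out.println(extractKeys(lines, 6)); // prints [id-001, id-002]
    }
}
```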

This all changed in May of 2012 or Java 7u6. The pendulum has swung back and now full copies are made by default. What does this mean for you?

  • For most people it is probably just a nice piece of Java trivia
  • If you are writing parsers and such, you can no longer rely on the implicit sharing of the character array provided by String. You will need to implement a similar mechanism based on buffering and a custom implementation of CharSequence
  • If you were doing new String(str.substring(...)) to force a copy of the character buffer, you can stop as soon as you update to the latest Java 7 (and you need to do that quite soon since Java 6 is being EOLd as we speak).
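
For the parser case, a custom CharSequence that brings back the old (value, offset, count) sharing might look roughly like the sketch below. CharSlice is a hypothetical name, not a JDK class, and this is only one possible design: it deliberately omits equals() and hashCode(), which you would need before using slices as map keys.

```java
// Hypothetical class: a read-only view over a shared char[] buffer.
final class CharSlice implements CharSequence {
    private final char[] buffer; // shared between slices, never copied by subSequence
    private final int offset;
    private final int count;

    CharSlice(char[] buffer, int offset, int count) {
        if (offset < 0 || count < 0 || offset > buffer.length - count) {
            throw new IndexOutOfBoundsException("offset: " + offset + ", count: " + count);
        }
        this.buffer = buffer;
        this.offset = offset;
        this.count = count;
    }

    // Copies the characters once; every slice derived from the result shares them.
    static CharSlice of(String source) {
        return new CharSlice(source.toCharArray(), 0, source.length());
    }

    @Override
    public int length() {
        return count;
    }

    @Override
    public char charAt(int index) {
        if (index < 0 || index >= count) {
            throw new IndexOutOfBoundsException("index: " + index);
        }
        return buffer[offset + index];
    }

    // Constant time: only the offset and count change, the buffer is reused.
    @Override
    public CharSlice subSequence(int start, int end) {
        if (start < 0 || end > count || start > end) {
            throw new IndexOutOfBoundsException("start: " + start + ", end: " + end);
        }
        return new CharSlice(buffer, offset + start, end - start);
    }

    // Materializes a real String (with its own copy) only when one is needed.
    @Override
    public String toString() {
        return new String(buffer, offset, count);
    }

    public static void main(String[] args) {
        CharSlice text = CharSlice.of("Hello World!");
        CharSlice hello = text.subSequence(0, 5);
        CharSlice world = text.subSequence(6, 12);
        System.out.println(hello + " / " + world); // prints Hello / World!
    }
}
```

Note that this reintroduces the original trade-off on purpose: as long as any slice is reachable, the whole buffer stays in memory.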

Thankfully the development of Java is an open process and such information is at the fingertips of everyone!

A couple more references (since we don't say pointers in Java :-)) related to Strings:

  • If you are storing the same string over and over again (maybe you're parsing messages from a socket for example), you should read up on alternatives to String.intern() (and also consider reading Item 50 from the second edition of Effective Java: Avoid strings where other types are more appropriate)
  • Look into (and do benchmarks before using them!) options like UseCompressedStrings (which seems to have been removed), UseStringCache and StringCache
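
One commonly used alternative to String.intern() is an application-level pool built on a concurrent map. The sketch below assumes a hypothetical StringPool class (not a JDK or library API). It holds strong references and never evicts, so it only suits a bounded set of distinct values; as with the JVM options above, benchmark before adopting it.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical helper: a simple, unbounded, thread-safe canonicalizing pool.
final class StringPool {
    private final ConcurrentMap<String, String> pool = new ConcurrentHashMap<String, String>();

    // Returns the pooled instance equal to s, adding s to the pool if it is new.
    String canonical(String s) {
        String existing = pool.putIfAbsent(s, s);
        return existing != null ? existing : s;
    }

    int size() {
        return pool.size();
    }

    public static void main(String[] args) {
        StringPool pool = new StringPool();
        // new String(...) simulates equal strings arriving as separate objects,
        // for example when decoded from successive socket messages.
        String a = pool.canonical(new String("ORDER"));
        String b = pool.canonical(new String("ORDER"));
        System.out.println(a == b);      // prints true: both refer to one instance
        System.out.println(pool.size()); // prints 1
    }
}
```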

Hope I didn't string you along too much and you found this useful! Until next time
- Attila Balazs

Meta: this post is part of the Java Advent Calendar and is licensed under the Creative Commons 3.0 Attribution license. If you like it, please spread the word by sharing, tweeting, FB, G+ and so on! Want to write for the blog? We are looking for contributors to fill all 24 slots and would love to have your contribution! Contact Attila Balazs to contribute!

8 comments:

  1. Attila, you have missed one of the key advantages of the new system. In the previous system it was impossible for the compiler to perform escape analysis on strings whose underlying character arrays were shared. To be more specific, any such analysis would conclude that the string escaped. The new system allows EA to completely remove strings from the heap. This might not happen much, or even at all, in Java 7, but in Java 8 it should be more feasible and will start to be a huge performance gain.

    1. Yes, escape analysis is something which should give major performance improvements. However I'm not sure I understand how EA interacts with the new substring behavior. If I have a method:

      void method1(String str) {
        String s1 = str.substring(1, 10);
        ...
      }

      I think that EA should figure out that s1 is not used outside of method1, even if it references the character array of str (that is, references to external variables shouldn't prevent EA from functioning).

      On the other hand if we have

      String method2(String str) {
        return str.substring(1, 10);
      }

      EA will rightfully conclude that the substring can't be freed on return / allocated on the stack because it "escaped".

  2. Does this mean that at least some Java String objects would no longer live on the heap but on the stack, as an optimization?

    1. Hello,

      Yes, that is the general idea with escape analysis. My current understanding is that EA is a general optimization and is not specific to the String type (it should handle any object type).

  3. Attila,

    We make heavy use of String.substring() as we have a parser module. Performance seems to have degraded considerably due to this change. I am going to try a charAt() implementation to see if it helps improve the performance, but you mentioned buffering and a custom CharSequence implementation. Can you elaborate on this, please?

    Thanks,
    Vishal

  4. I know that this is late (almost one year late :-)), but I still post it in the hopes that it will be useful for somebody: I've done a writeup about the changes, including benchmarks with different approaches to parsing strings - http://jaxenter.com/the-state-of-string-in-java-49450.html

    The TL;DR version: .substring() with .intern() is fine for 99.99% of the cases.

    1. Very interesting work. It explains in detail the changes made to the String implementation. Thanks!
