Introduction
When the first technical solutions based on Artificial Intelligence began to appear, Python was the language and runtime platform of choice. That was no surprise: Python was already the choice of the great majority of data scientists for analysing data, running experiments and creating AI models, so it was only natural that AI solutions were also built on Python and its rich ecosystem for data-intensive applications.
However, with the rise of Generative AI, organisations worldwide are reconsidering that approach, given that the vast majority of GenAI-based solutions simply leverage existing Large Language Models consumed “as a service” via web APIs. For that, Python no longer has the same lead over the alternatives. We must balance other key aspects of enterprise-grade solutions, such as resilience, scalability and observability, as well as leveraging the skills that already exist in the organisation. That means that other languages and platforms can play a leading role in delivering and running GenAI-based solutions: platforms such as Node.js, Go, and, of course, Java.
In Java we already have multiple valid approaches to consider. In this article I will cover Langchain4j, one of the most popular choices in the Java ecosystem for building and running AI solutions at scale.
Why Langchain4j? Its key outstanding aspects are:
- Framework-agnostic: As a library it imposes no constraints on how you design and build your solutions, so it can be easily integrated with any existing solution, whether Spring-based, Jakarta-based, Quarkus-based, Micronaut-based, or with no framework at all.
- Simple yet powerful API: Based on well-known patterns such as the Builder pattern, and with simple API constructs, so its learning curve is gentle and rewarding.
- Provider flexibility: It works with both cloud-based “as a service” models such as OpenAI or Google Vertex AI, and with local/owned models via Ollama.
NOTE: The following examples are based on Langchain4j 0.36.2.
The first Langchain4j program
To demonstrate these concepts, let’s look at a “hello world” Langchain4j program.
The main interface that we need to learn about is ChatLanguageModel (from package dev.langchain4j.model.chat). This interface provides the simple API we need to send messages to an LLM and get its response. To instantiate a specific model and interact with it, we need the implementation that corresponds to its provider:
- For OpenAI, we leverage OpenAiChatModel from package dev.langchain4j.model.openai.
- For Vertex, we leverage VertexAiGeminiChatModel from package dev.langchain4j.model.vertexai.
- For Ollama, we leverage OllamaChatModel from package dev.langchain4j.model.ollama.
As well as others. Every implementation of ChatLanguageModel comes with its own builder, so any provider-specific configuration setting can be added. Let’s see three brief examples of how this looks once we put it together:
OpenAI Hello World
import dev.langchain4j.model.chat.ChatLanguageModel;
import dev.langchain4j.model.openai.OpenAiChatModel;
class OpenAIHelloWorld {
void main() {
// OpenAI model
ChatLanguageModel model = OpenAiChatModel.builder()
.apiKey(System.getenv("OPENAI_API_KEY"))
.modelName("gpt-4o")
.build();
// the first prompt
String message = "Hello world!";
System.out.println("\n>>> " + message);
String answer = model.generate(message);
System.out.println(answer);
}
}
As can be seen above, to connect with OpenAI services you must provide your own API key. You can also explicitly set the model that you want to use.
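The builder also accepts common tuning options. As a minimal sketch (the exact set of options depends on the provider, and the values below are just illustrative), you could lower the temperature, cap the response size, extend the timeout and enable request logging:
// requires import java.time.Duration
// illustrative tuning: deterministic answers, capped output, longer timeout, request logging
ChatLanguageModel tunedModel = OpenAiChatModel.builder()
    .apiKey(System.getenv("OPENAI_API_KEY"))
    .modelName("gpt-4o")
    .temperature(0.0)
    .maxTokens(512)
    .timeout(Duration.ofSeconds(60))
    .logRequests(true)
    .build();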
Vertex Hello World
import dev.langchain4j.model.chat.ChatLanguageModel;
import dev.langchain4j.model.vertexai.VertexAiGeminiChatModel;
class VertexAIHelloWorld {
void main() {
// Vertex AI model
ChatLanguageModel model = VertexAiGeminiChatModel.builder()
.project(System.getenv("VERTEXAI_PROJECT_ID"))
.location("us-central1")
.modelName("gemini-2.5-flash")
.build();
// the first prompt
String message = "Hello world!";
System.out.println("\n>>> " + message);
String answer = model.generate(message);
System.out.println(answer);
}
}
The pattern is similar to the previous one, but the settings that must be provided are different: the Vertex AI project id, the cloud region and the model name.
Ollama Hello World
import dev.langchain4j.model.chat.ChatLanguageModel;
import dev.langchain4j.model.ollama.OllamaChatModel;
class OllamaGptOssHelloWorld {
void main() {
// gpt-oss:20b model running locally with Ollama
ChatLanguageModel model = OllamaChatModel.builder()
.baseUrl("http://localhost:11434")
.modelName("gpt-oss:20b")
.build();
// the first prompt
String message = "Hello world!";
System.out.println("\n>>> " + message);
String answer = model.generate(message);
System.out.println(answer);
}
}
Again, the pattern is similar. In this case we are running Ollama on the local computer and using the gpt-oss:20b model, which is quite capable and can run on many personal computers.
Managing the context (a.k.a. short-term chat memory)
While the previous examples work, they lack a critical feature that any GenAI solution would need. The first important concept that we need to understand is how to manage the context of the conversation with the LLMs, also known as the short-term memory.
In essence, what we must do is track the whole conversation with the LLM (a.k.a. “the chat”). After every interaction, the question and answer pair is saved so it can be sent with the next request payload; models typically expect it under a specific history entry in the JSON request body. Fortunately, Langchain4j deals with those details and we just focus on keeping track of the conversation. The simplest way to do that is to use an in-memory store, as we can see in the following example:
import static dev.langchain4j.data.message.UserMessage.userMessage;
import dev.langchain4j.data.message.AiMessage;
import dev.langchain4j.memory.ChatMemory;
import dev.langchain4j.memory.chat.MessageWindowChatMemory;
import dev.langchain4j.model.chat.ChatLanguageModel;
import dev.langchain4j.model.ollama.OllamaChatModel;
import java.time.Duration;
// baseUrl and modelName hold the same values as in the previous Ollama example
ChatLanguageModel model = OllamaChatModel.builder()
.baseUrl(baseUrl)
.modelName(modelName)
.timeout(Duration.ofSeconds(300))
.temperature(0.0)
.build();
// define context window
ChatMemory chatMemory = MessageWindowChatMemory.withMaxMessages(10);
// initial prompt with name and what I'm doing
String message = "Hello world! My name is Jorge and I'm writing this for Java Advent 2025.";
chatMemory.add(userMessage(message));
AiMessage answer = model.generate(chatMemory.messages()).content();
System.out.println(answer.text());
chatMemory.add(answer);
// ask for the name
message = "What is my name?";
chatMemory.add(userMessage(message));
answer = model.generate(chatMemory.messages()).content();
System.out.println(answer.text());
chatMemory.add(answer);
As can be seen in the example, the whole chat memory is passed to the model. As we keep adding every message and answer into the memory, the LLM will leverage the whole conversation (up to the limits of the memory or its own internal context window, whichever comes first) to come up with the best possible answer.
The static function userMessage from class dev.langchain4j.data.message.UserMessage helps keep the history up to date by wrapping the user prompt in the right message type.
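Since the add-generate-add sequence repeats on every turn, it can be handy to wrap it in a small helper. This is just a sketch on top of the code above; the ask method is a hypothetical name, not a Langchain4j API:
// hypothetical helper: sends the memory plus the new prompt and records the answer
static String ask(ChatLanguageModel model, ChatMemory chatMemory, String message) {
    chatMemory.add(userMessage(message));
    AiMessage answer = model.generate(chatMemory.messages()).content();
    chatMemory.add(answer);
    return answer.text();
}
With it, every turn of the conversation becomes a single call, e.g. System.out.println(ask(model, chatMemory, "What is my name?")).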
Enriching answers with Retrieval Augmented Generation (RAG)
No matter how big the LLM we use for our solutions is, it lacks something key: the business-specific data. Facts and figures, business processes, knowledge bases… every bit and piece of internal information of the organisation that is therefore not part of the public data sets used to train LLMs.
If we want to create really useful GenAI-based solutions, we need them to be aware of the potential user context: what users need and what they must know.
To augment the LLM answer, the RAG pattern enriches the context with pieces of information that are connected to the user’s problem or request. This is done by querying a vector search database or a graph database and obtaining the documents or document portions that seem to be related to the user’s prompt.
RAG can be seen as a form of long-term memory: we prepare the knowledge bases or graphs before an agent is first released to the public, and it is suitable for continuous improvement as knowledge base sources are not read-only. The RAG pattern also plays well with continuous feedback: users report on our agents’ performance (e.g., correctness, completeness, relevance of results, etc.) and that feedback leads to refining prompts and the content in the knowledge bases.
RAG is therefore a two-part process:
- The existing know-how is processed: parsing, tokenization, vectorization and storage in the KB store. This process can be run periodically or even continuously if needed.
- When the user asks for something, the prompt is used to search for the relevant information in the KB store, the best-ranked results are used to augment the prompt, and the whole set of data is sent to the LLM to get the final response.
It is important to note that the relevant pieces of information go into the context and are therefore subject to context limits, so we must balance the quantity of data that is retrieved and ranked: we cannot simply add every piece of the KB into the context.
RAG can be seen graphically in this diagram:
Implementing RAG with Langchain4j
In Langchain4j we can implement both parts of the pattern:
- Ingest documents containing the organisation’s knowledge to build up the knowledge base. For simple use cases, this KB can even be maintained in memory (e.g., for a bunch of PDF documents), which is pretty convenient for building many specialised agents with minimal dependencies (and investment).
- Access the knowledge base when users ask for something to get the best possible results.
Let’s see how this works in practice.
Building the Knowledge Base
To build the knowledge base with Langchain4j we need the following abstractions:
- EmbeddingModel from package dev.langchain4j.model.embedding: This is responsible for converting text into embeddings, that is, numerical representations (vectors) of pieces of text (tokens). In the example below, which is also suitable for simple use cases, we will leverage the popular MiniLM-L6-V2 model.
- EmbeddingStore from package dev.langchain4j.store.embedding: This is responsible for abstracting the actual store, e.g. a vector search database. In the example below, which is also suitable for simple use cases, we will leverage an in-memory store.
- DocumentSplitter from package dev.langchain4j.data.document: This is responsible for chunking the know-how documents. To parse documents in binary formats into text, in the example we leverage the popular Apache Tika library.
- EmbeddingStoreIngestor from package dev.langchain4j.store.embedding: This is responsible for ingesting every parsed document into the embedding store, using the provided document splitter and embedding model.
A simple example with Langchain4j looks like this:
// an embedding model good for simple documents
EmbeddingModel embModel = new AllMiniLmL6V2EmbeddingModel();
// an in-memory embedding store
EmbeddingStore<TextSegment> embStore = new InMemoryEmbeddingStore<>();
// load a PDF file from the classpath
Path path = Path.of(ClassLoader.getSystemResource("acme-know-how.pdf").toURI());
Document document = FileSystemDocumentLoader.loadDocument(path, new ApacheTikaDocumentParser());
DocumentSplitter splitter = DocumentSplitters.recursive(256, 0);
// ingest the document into the embedding store
EmbeddingStoreIngestor ingestor = EmbeddingStoreIngestor.builder()
.documentSplitter(splitter)
.embeddingModel(embModel)
.embeddingStore(embStore)
.build();
ingestor.ingest(document);
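When the knowledge base is made of more than one file, the same ingestor can process a whole folder. A minimal sketch, assuming the know-how documents live in a local docs/kb directory (a hypothetical path):
// load every document found in the folder (Apache Tika parses the binary formats)
List<Document> documents = FileSystemDocumentLoader.loadDocuments(
        Path.of("docs/kb"), new ApacheTikaDocumentParser());
// reuse the same ingestor configuration to index all of them
ingestor.ingest(documents);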
Augmenting the Responses
No matter whether the knowledge base is created at runtime or is a persistent, enterprise-grade vector search database, the abstractions needed to augment the response during retrieval are the same:
- ContentRetriever from package dev.langchain4j.rag.content.retriever: This is responsible for abstracting the embedding store and model that will be used to look for the relevant data in the KB.
- AiServices from package dev.langchain4j.service: This is a very convenient abstraction to create AI agents combining a given chat model and chat memory (as seen in the previous examples) with the content retriever.
The retrieval example with Langchain4j is quite straightforward:
// define the content retriever connecting everything together
ContentRetriever retriever = EmbeddingStoreContentRetriever.builder()
.embeddingModel(embModel)
.embeddingStore(embStore)
.maxResults(1)
.minScore(0.8)
.build();
// llama3:8b model running locally with Ollama
ChatLanguageModel chatModel = OllamaChatModel.builder()
.baseUrl("http://localhost:11434")
.modelName("llama3:8b")
.build();
// define context window
ChatMemory chatMemory = MessageWindowChatMemory.withMaxMessages(100);
Agent agent = AiServices.builder(Agent.class)
.chatLanguageModel(chatModel)
.chatMemory(chatMemory)
.contentRetriever(retriever)
.build();
String message1 = "Could you summarize in 50 words the main concepts about the Java Platform?";
String answer1 = agent.answer(message1);
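Because the agent keeps the chat memory, follow-up questions can build on previous turns. For example (the follow-up prompt below is just illustrative):
// follow-up question answered with both the chat memory and the retrieved content
String message2 = "And how do those concepts relate to the JVM?";
String answer2 = agent.answer(message2);
System.out.println(answer2);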
The interface Agent is a simple abstraction of our agent and its system prompt:
interface Agent {
@SystemMessage("""
You are an expert in information technologies
and software engineering.
""")
String answer(String inputMessage);
}
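AiServices interfaces can go beyond a single free-form parameter: user prompt templates can be declared with the @UserMessage and @V annotations from package dev.langchain4j.service. The following sketch is illustrative (the SummarizingAgent interface and its template are not part of the article’s example):
interface SummarizingAgent {
    @SystemMessage("""
            You are an expert in information technologies
            and software engineering.
            """)
    @UserMessage("Summarize in {{wordCount}} words the main concepts about {{topic}}.")
    String summarize(@V("wordCount") int wordCount, @V("topic") String topic);
}
An instance built with AiServices.builder(SummarizingAgent.class) would then be called as summarize(50, "the Java Platform").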
Conclusions
The integration of Generative AI into enterprise software is no longer solely the domain of Python scripts or expensive cloud APIs. As we have seen, Langchain4j offers a production-grade library for building agentic solutions at scale:
- Decoupling: Langchain4j acts as a robust anti-corruption layer. By coding against interfaces like ChatLanguageModel and EmbeddingModel, applications can remain, to a certain degree, agnostic to the underlying provider.
- Simplicity: The AiServices API brings the familiarity of aspect-oriented programming (similar to Spring Data) to AI. Complex orchestration involving RAG retrieval, history management, and prompt engineering is abstracted behind clean Java interfaces and annotations.
- Local Inference Viability: With the optimization of models (quantization) and the efficiency of modern hardware, running capable small-sized or medium-sized models augmented with the organization know-how on your own hardware is not just possible but practical for development cycles, CI/CD pipelines, privacy-sensitive edge deployments, and cost-effective deployments.
Knowing more
If you want to know more and explore Langchain4j in depth, the following resources will be helpful:
- I created a step-by-step workshop with lots of examples here: https://github.com/deors/workshop-langchain4j
- The Langchain4j project tutorials: https://docs.langchain4j.dev/category/tutorials/

