Implementing a RAG system in Elixir
I'm working on a project to render a data visualization tree of my family. This project will be part of this blog, and it will have the classic drawer where people can ask questions about me or my relatives.
It goes without saying that Python is the default language when it comes to data science and ML. However, I wanted to implement LLM-related things using the Elixir language. This blog post explores my journey while doing so, as well as the challenges I faced.
As a recap, these are the steps required in a RAG system:
- Define the knowledge base format.
- Split the content into chunks.
- Define an embedding model.
- Define a vector database to use.
- Define a chat model.
Defining the knowledge base format
Originally I thought about using Markdown to organize the information. Each Markdown file would consolidate the information about one person in my family. However, I decided to use JSON because Markdown is more limited in terms of metadata and structured information. Since I know the structure, and since initially I'm the one who'll produce all the content, JSON felt right.
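As an illustration, a person entry could look like the following (the exact field names are hypothetical; the real files hold a bio, birth date, hobbies, and relationships, as described later in this post):

{
  "name": "Person A",
  "bio": "Software engineer based in San Francisco.",
  "birth_date": "1980-01-15",
  "hobbies": ["hiking", "chess"],
  "relationships": [
    {"type": "spouse", "name": "Jane Doe"},
    {"type": "child", "name": "Alice"},
    {"type": "child", "name": "Bob"}
  ]
}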
Splitting content into chunks
The whole idea of using RAG is that you can grab some content, insert it into a vector database, and later query it using natural language. There will be a mechanism that finds the most closely related context in your vector database, based on your input.
When we think about chunking, we have a few options:
- Using a naive text splitter
- Using a library that splits the text for you
- Using a domain-specific alternative (explained below)
Usually, you define a few parameters for chunking:
- length: Defines the size of each chunk, in characters
- overlap: Defines how many characters will overlap between each chunk
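To make these parameters concrete, here is a minimal sketch of the naive character-based approach, with made-up default values (this is the option I end up rejecting below):

defmodule NaiveChunker do
  # Splits text into chunks of `length` characters, where consecutive
  # chunks share `overlap` characters. Purely illustrative.
  def split(text, length \\ 500, overlap \\ 50) do
    text
    |> String.graphemes()
    |> Enum.chunk_every(length, length - overlap)
    |> Enum.map(&Enum.join/1)
  end
end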
Proper chunking is not about randomly splitting strings. You want each chunk to keep the most relevant context together, so that when you query for certain information, the appropriate context comes back and can be used to answer questions about it.
With that in mind, we immediately reject the first naive split option.
Second, one can find libraries with more sophisticated mechanisms to split the content semantically. In Elixir, I found two libraries:
- Chunx. At the time of writing this post, it had gone eight months without any changes, which gives me the impression that it's not actively maintained.
- Text Chunker. Has more options and looks like a good choice based on the features it has.
Finally, after chatting with my friend Yugo Sakamoto, he suggested that every chunk in my app could simply be a stringified JSON map. A JSON map in my application represents a person, with information about their biography as well as their hobbies, birth date, and relationships with other members of the family. So that fits.
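In practice, the chunking step then boils down to one Jason.encode! call per person. A minimal sketch, with a made-up module name:

# Each person map becomes exactly one chunk: its stringified JSON form.
defmodule FamilyRag.Chunks do
  def from_people(people) when is_list(people) do
    Enum.map(people, &Jason.encode!/1)
  end
end

# FamilyRag.Chunks.from_people([%{"bio" => "Person A bio.", "birth_date" => "1980-01-15"}])
# => ["{\"bio\":\"Person A bio.\",\"birth_date\":\"1980-01-15\"}"]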
Defining an embedding model
I chose sentence-transformers/all-MiniLM-L6-v2 as an embedding model for several reasons:
- Size: At 80MB, it's lightweight enough for local development and testing
- Performance: Despite its small size, it captures semantic information effectively
- Dimensions: Produces 384-dimensional vectors, which are manageable for in-memory storage
- Community adoption: Well-tested and widely used in the community
In this step, the inputs are the text chunks and the outputs are 384-dimensional vectors.
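With Bumblebee, loading the model and building an embedding serving looks roughly like this (a sketch; the pooling and normalization options shown are the usual ones for sentence-transformer models and may need adjusting):

{:ok, model_info} = Bumblebee.load_model({:hf, "sentence-transformers/all-MiniLM-L6-v2"})
{:ok, tokenizer} = Bumblebee.load_tokenizer({:hf, "sentence-transformers/all-MiniLM-L6-v2"})

# Mean-pool the hidden state and L2-normalize it to get a sentence embedding.
serving =
  Bumblebee.Text.text_embedding(model_info, tokenizer,
    output_pool: :mean_pooling,
    output_attribute: :hidden_state,
    embedding_processor: :l2_norm
  )

%{embedding: embedding} = Nx.Serving.run(serving, "a stringified JSON chunk")
Nx.shape(embedding)
# => {384}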
Defining a vector database
I decided to go for an in-memory vector database to get to an MVP point faster.
The trade-offs of this approach:
Pros:
- No external dependencies or setup required
- Fast retrieval for small to medium datasets
- Perfect for prototyping and MVP development
Cons:
- Data doesn't persist between application restarts
- Limited scalability compared to dedicated vector databases like Pinecone or Weaviate
- All vectors must fit in memory
I implemented cosine similarity in Elixir using Nx tensors; Nx provides efficient vector operations, similar to NumPy in Python.
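A minimal sketch of that similarity computation (the module name is made up; the real implementation lives in the repo linked below):

defmodule FamilyRag.Similarity do
  import Nx.Defn

  # Cosine similarity between a query embedding (shape {dim}) and the
  # matrix of stored chunk embeddings (shape {n, dim}). Returns {n} scores.
  defn cosine(query, embeddings) do
    dots = Nx.dot(embeddings, query)
    query_norm = Nx.sqrt(Nx.sum(query * query))
    chunk_norms = Nx.sqrt(Nx.sum(embeddings * embeddings, axes: [1]))
    dots / (chunk_norms * query_norm)
  end
end

# Top-k retrieval is then just an argsort over the scores:
# scores = FamilyRag.Similarity.cosine(query_vec, chunk_vecs)
# top_3  = scores |> Nx.argsort(direction: :desc) |> Nx.slice([0], [3])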
Selecting the chat model
Finally, we need a chat model to which we feed the top-K results from the similarity search. The idea is that a GPT-like model takes that context and extracts the relevant pieces based on the user's question.
I chose google/gemma-2b for several reasons:
- Size: At 2B parameters, it's small enough to run locally on most machines
- Performance: Provides good text generation quality for its size
- Licensing: Permissive license allows for commercial use
- Bumblebee Support: Well-supported in the Elixir ecosystem through Bumblebee, since it's available via Hugging Face
For example, the following snippet shows what the vector search output looks like. I was interested in the top 3 results:
1. (Score: 0.138) {"bio":"Person A bio.","birth_date":"1980-01-15" ...}
--------------------------------------------------
2. (Score: 0.127) {"bio":"Person B bio.","birth_date":"1985-11-10" ...}
--------------------------------------------------
3. (Score: 0.106) {"bio":"Person C bio.","birth_date":"1950-03-09" ...}
When we feed this information to the chat model, a model like GPT-4 or, in my case, google/gemma-2b will be able to extract the relevant information and answer your question using natural language.
When somebody asks "Tell me about person A", the model will respond something like:
Person A is a software engineer. He is married to Jane Doe and has two children, Alice and Bob. He lives in San Francisco.
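Wiring the chat model up with Bumblebee looks roughly like this (a sketch: google/gemma-2b is a gated model on Hugging Face, so loading it requires an access token, and the prompt format plus the top_chunks and question variables are my own simplification of the real code):

{:ok, model_info} = Bumblebee.load_model({:hf, "google/gemma-2b"})
{:ok, tokenizer} = Bumblebee.load_tokenizer({:hf, "google/gemma-2b"})
{:ok, generation_config} = Bumblebee.load_generation_config({:hf, "google/gemma-2b"})

generation_config = Bumblebee.configure(generation_config, max_new_tokens: 256)
serving = Bumblebee.Text.generation(model_info, tokenizer, generation_config)

# `top_chunks` are the stringified JSON results of the similarity search,
# `question` is the user's query.
prompt = """
Answer the question using only the context below.

Context:
#{Enum.join(top_chunks, "\n")}

Question: #{question}
Answer:
"""

%{results: [%{text: answer}]} = Nx.Serving.run(serving, prompt)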
Implementation
The implementation combines several Elixir libraries to create a functional RAG system:
Key Components
JSON Processing: Using Jason for parsing family member data and converting it into searchable chunks.
Nx and Bumblebee: Used for loading and running the embedding model locally. Bumblebee provides a high-level API for transformer models, while Nx handles the tensor operations.
GenServer and Supervisor: The RAG system runs as a GenServer process, maintaining the vector database in memory and handling concurrent queries efficiently, while keeping the system fault-tolerant. Also, I added a specific RAG supervisor to handle failures gracefully.
Basic Pipeline
# Simplified flow
query -> embed_query -> similarity_search -> top_k_results -> chat_model -> response
Challenges Faced
- Model Loading: Initial model loading time can be significant. Using a GenServer allows for lazy loading and model caching. Moreover, it keeps the application fault-tolerant, following Elixir's principles.
- Resources: There are not many resources online about LLM implementations in the Elixir ecosystem, so I often found myself needing to understand the basic principles first and then translating a Python implementation.
- Inference time: Inference was taking too long. Here I'm considering both the similarity search and the chat model's interpretation of the top-K results, the final outcome of the RAG process.
The implementation
Check out the full implementation here
The first implementation took a while, until I understood which pieces were required to make this work. Once I had a working version, it became a slow, iterative process where I started analyzing how each part was performing. Aside from the AI core, I wanted to ensure I could have this API integrated into a deployed environment at low cost. Finally, I wanted to follow Elixir's fault-tolerance principles.
Once I had a decent version #1 working, I decided to focus on the architecture, so I created the following modules:
- A RAG supervisor
- A RAG GenServer
- A RAG Behavior to instruct the RAG Clients what functions to expose (see the sketch after this list)
- A File Processor to focus on reading and chunking the documents
- An In-Memory-Vector-Store module to handle all the embeddings and the cosine similarity
- A RAG Client (Bumblebee and later Ollama) to orchestrate all that logic
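For the behaviour item, a trimmed-down sketch of the contract could look like this (the callback names are illustrative, not necessarily the ones in the repo):

defmodule FamilyRag.Client do
  # Contract that every RAG client (Bumblebee-backed, and later Ollama-backed)
  # has to implement.
  @callback load_models() :: :ok | {:error, term()}
  @callback embed(text :: String.t()) :: {:ok, Nx.Tensor.t()} | {:error, term()}
  @callback answer(question :: String.t(), context :: [String.t()]) ::
              {:ok, String.t()} | {:error, term()}
end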
Out of those six items I want to focus on items 1 and 2. Elixir has a beautiful mechanism called the Supervisor. Supervisors orchestrate processes and let you define a strategy for handling unexpected errors. In this case, I have a whole API that can't go down just because I couldn't load the model. So I created an optimistic behavior, where we can start the API in a degraded mode. The Supervisor, alongside a GenServer, allows me to respond to client requests by asking them to retry later. This is very useful in many systems, as a single component cannot bring down your whole application. One could also have a fallback mechanism if the main RAG system is not working; for example, keeping some cached information and attempting to respond to the request based on the latest known information.
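A trimmed-down sketch of that idea (module names and details are simplified compared to the repo): the supervisor restarts the RAG server if it crashes, and while the models aren't loaded yet the GenServer answers with a retry hint instead of failing the whole API.

defmodule FamilyRag.Supervisor do
  use Supervisor

  def start_link(opts), do: Supervisor.start_link(__MODULE__, opts, name: __MODULE__)

  @impl true
  def init(_opts) do
    # If the RAG server crashes (e.g. model loading fails), restart only it,
    # not the rest of the application.
    Supervisor.init([FamilyRag.Server], strategy: :one_for_one)
  end
end

defmodule FamilyRag.Server do
  use GenServer

  def start_link(opts), do: GenServer.start_link(__MODULE__, opts, name: __MODULE__)

  @impl true
  def init(_opts) do
    # Load the models lazily so the API can boot in a degraded mode.
    send(self(), :load_models)
    {:ok, %{ready?: false}}
  end

  @impl true
  def handle_info(:load_models, state) do
    # ...load the embedding and chat models here...
    {:noreply, %{state | ready?: true}}
  end

  @impl true
  def handle_call({:ask, _question}, _from, %{ready?: false} = state) do
    # Degraded mode: tell the client to retry later instead of failing hard.
    {:reply, {:error, :warming_up}, state}
  end

  def handle_call({:ask, question}, _from, state) do
    {:reply, {:ok, run_rag_pipeline(question, state)}, state}
  end

  # Placeholder for the embed -> search -> generate pipeline described above.
  defp run_rag_pipeline(_question, _state), do: "..."
end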
The rest of the items are well-known software engineering decisions that focus more on the architecture and on exposing abstractions than on how the code itself is implemented.
Summary
Implementing RAG in Elixir proved to be both challenging and rewarding. While Python dominates the ML space with mature libraries, Elixir's Nx and Bumblebee ecosystem provides a solid foundation for building ML applications with the added benefits of fault tolerance and concurrency that the BEAM VM offers.
Key takeaways from this project:
- Architecture matters: Decisions about chunking strategy, model selection, and data storage significantly impact the system's effectiveness
- Elixir viability: The Elixir ML ecosystem, while younger than Python's, is capable of handling real-world RAG implementations
- Performance considerations: Local model inference requires careful memory management and optimization
- Elixir ecosystem: Although there are not as many resources as in the Python ecosystem, Elixir's ecosystem still proves to be solid and mature enough to let one implement a variety of systems, including LLM-powered ones.
- Concept first!: Understand the concept first, and then it'll be easier to map the steps required to implement a system like this one in Elixir.
To me, the most valuable lesson was that the architectural decisions—how to structure the data, which models to choose, and how to integrate with existing systems—were far more critical than the actual LLM code, reinforcing my belief that system design knowledge remains a vital human skill, even in an AI-driven world.