RAG: Source Ingesting and Model

Continued from https://snippets.therealvan.com/2026/04/27/rags-to-riches/

The source ingesting subsystem functions as follows:

a list of sources (folders/subdirectories) and file name masks define the source documents to ingest.
- absolute path to the subdirectory.
- includes file mask (e.g. *.txt) — defaults to equivalent of “**/*.*“, indicates files to process in the path.
- excludes file mask — defaults to null/None, indicates files to skip/ignore processing.
a process (or processes, preferably asynchronous) to scan sources, find the documents, and process them:
- extract text content from document files.
- chunk text content into pieces that can be processed, with overlap to avoid losing context at chunk boundaries.
- calculate embedding vectors for chunks.
- store embedding vectors along with metadata into vector store (PostgreSQL).

Since I want to avoid having to redefine sources repeatedly, I added a Django model Source to capture a document source and persist it. These Source objects then can be persistent. Of course, the usual CRUD operations are needed.

The processing of Source objects is where a chunk of coding is required. The logic includes scanning the file system and filter out documents, extract text content from the documents, and chunk with overlap. If I look hard enough, I may find some libraries that do this with various file types (e.g. text, PDF, ePUB, Word). However, I’ve decided to roll my own, citing the age-old “not invented here” principle.

Asynchronous Ingestion Framework

Since I want to eventually set up a fleet of asynchronous independent processes, I will use the Celery library for its asynchronous task feature. Unfortunately, doing so will complicate the system design a bit since now I have to consider a message queue to use for its “broker” and a store for its “[result] backend.” From experience I know that Celery works fine without a result backend (with some limitations). If a result backend is used, it must be regularly maintained so that obsolete results are removed.

This is the first dilemma: should I add a result backend? And if so, which option should I use?

If I do want a backend, from the Celery doc, both Redis and RabbitMQ are good brokers but not ideal for backends. For reliable (durable) result backends, PostgreSQL or MySQL is recommended. The Django app already uses PostgreSQL for its vector storage, so I’d go with PostgreSQL instead of adding a whole separate RDBMS just for Celery backend. This means SQLAlchemy is needed to bridge Celery to PostgreSQL.

Use PostgreSQL as the Backend Connected with SQLAlchemy

Pros	Cons
Persistent task results are handled by Celery.	Periodic pruning of results needed.
Task state and intermediate status support available via `Task.update_state()`.	SQLAlchemy is required to use PostgreSQL for backend (see above), but this means our Django app has 2 ORMs: Django’s own and SQLAlchemy.

What if I rolled my own asynchronous task status library that works with Django models?

Implement Custom Task Status Library with Django

Pros	Cons
– Persistent task results and intermediate status are supported by Django models. – Django models work with all the different RDBMSes that Django works with.	Periodic pruning of results needed.
No need for a dedicated Celery backend.	Custom logic around the Django models needs to be written, and it will be a duplication of logic/code in Celery’s task state.

This is a difficult decision. Fortunately, I work for myself, so I don’t mind spending some cycles rolling my own asynchronous task status library. If I were a salaried employee with a schedule, I would probably use Celery’s and live with having SQLAlchemy work alongside Django’s ORM and deal with the configuration.

So, the vasync-request library was born. This is a library that sits on top of Celery (one that doesn’t have a result backend) to track the state and intermediate status of Celery tasks using Django.

Embedding Vector Storage

“Oh. Just use PostgreSQL’s vector extension.”

Does that sound like something a manager would say?

There are some metadata to consider linking to an embedding vector for a chunk of text. For one thing, we need the chunk of text itself so that we can recall and return for a query. It’s of no use to to present a vector of floats to a user. In fact, the collection of properties we’d need to store are:

text chunk (with overlap text) — the text that produced the embedding vector.
source — where was the text extracted from? This is useful for reference, audit, and debugging. “Cite the source” is a common request nowadays, particularly when dealing with a potentially hallucinating entity. For now, we’ll use the absolute path to the document on the host. Remember, this is a personal RAG system running on potentially a laptop.
- What happens if we move the document around after ingestion?
sequence — relative ordering of the text chunk in the source document’s content. Again, this is useful for reference, audit, and debugging.
embedding parameters — these are used to disambiguate different embeddings of the same source. They allow multiple approaches to calculations simultaneously to allow comparisons. The vector size also is a technical requirement when working with PostgreSQL.¹
- model — the model used for calculating the embedding vector.
- vector size — the length of the embedding vector.

This comes down to two models:

EmbeddingSource

This describes the data and its source:

Field	Type	Description
source	varchar(300)	path to the document file on host
sequence	int	relative sequence of this chunk in the source
content	text	the text chunk

Unique together: source, sequence

NOTE that a strict 3NF design probably requires breaking that down further, something like two models SourceLocation (source and sequence) and SourceContent (foreign key to SourceLocation and content). However, for this project I’m trying to keep the complexity down. Arguably this is a subjective decision.

EmbeddedContent

This contains the embedding vector itself and its embedding parameters:

Field	Type	Description
source	FK to EmbeddingSource	source of the embedding
model	varchar(100)	model used to calculate embedding
dimension	int	dimension of the embedding vector
embedding	vector	the embedding vector
active	boolean	used to optionally deactivate the entry
created_at	datetime
updated_at	datetime

Unique together: source, model, dimension

Again, a strict 3NF design will require refactoring out model & dimension into something like EmbeddingParameter. This purity, however, will mean we need to always join the two models when querying. KISS?

The main query we expect from these models is:

“Given an embedding, a model, and a dimension, give me all the similar embeddings and their associated text chunk, source, and sequence. Order the results by similarity distance in ascending order (best matches first).”

So why have two tables instead of just one?

Separation of concerns: EmbeddingSource deals with the data source, and EmbeddedContent deals with the embedding vector and the embedding parameters. For example, if a document is moved to a new location, we can simply update EmbeddingSource without touching EmbeddedContent (and its updated_at).
This design is optimized to querying the EmbeddedContent model, applying filtering and ordering. When needed (e.g. after obtaining the top N matches), pull in the joined EmbeddingSource for the metadata. The latter table, of course, includes a potential giant blob for content that we don’t want to load in unless necessary.

Next Time: Ingestion Flow

I ran the schema through Copilot and had a nice little session discussing the design. One thing I missed was a robust ingestion flow handling failures and partially ingested sources.

I already mentioned I will use Celery for the asynchronous ingestion process. A Celery task is identified by a UUID and can be used to identify an ingestion “run.” Beyond the state of [pending, in-progress, succeeded, failed], it’d be nice to record for each run, which documents were processed and what their individual ingestion states are.

And that’s what the next post will focus on.

If a vector is compared (as in a WHERE clause of a SELECT) with a vector field whose vector value was written with a different size, PostgreSQL will return an error for the mismatched vector sizes. Therefore, if we plan to support multiple vector sizes simultaneously, we need to add a field to indicate the vector size so that our query can filter for only records that have the same vector size. Copilot thinks there is a length() vector function available on a vector field for filtering purposes, but I didn’t find any mention of it from pgvector’s documentation. ↩︎