Get all documents from ChromaDb using Python and langchain

Quick Fix: Call get() on the Chroma vector store to fetch everything it holds.
The method returns a single dictionary of parallel lists, not a list of per-document dictionaries.
The dictionary contains the following keys:

ids: The IDs of the documents.
embeddings: The embedding vectors. These are returned only if you pass include=["embeddings"]; otherwise the value is None.
documents: The text of each document.
metadatas: The metadata stored with each document.

The Problem:

You have a collection of documents stored in a MongoDB database. You want to load these documents into a Chroma vector store using LangChain for further processing. After storing the data, you need to retrieve all the documents together with their IDs, embeddings, and metadata, so that you can write them back to MongoDB and run topic categorization with BERTopic. The challenge is to write Python code that extracts everything from the Chroma vector store through LangChain.

The Solutions:

Solution 1: Get documents from ChromaDb using Python and langchain

To retrieve all documents from a ChromaDb vector storage created using Langchain, you can use the following code:

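```python
from langchain.vectorstores import Chroma

# "db" is the persist directory used when the store was created
db = Chroma(persist_directory="db")
# Embeddings are left out of the result unless requested via include
documents = db.get(include=["embeddings", "documents", "metadatas"])
```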

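The `get()` method returns one dictionary whose values are parallel lists, so entry `i` of each list describes the same document:

`ids`: The unique identifiers of the documents
`embeddings`: The embedding vectors (`None` unless requested with `include`)
`documents`: The original document texts
`metadatas`: The metadata dictionaries stored with each document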
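Because the result is column-oriented, writing it back to MongoDB usually means reshaping it into one record per document first. The following is a minimal sketch of that step. The helper name `to_records` and the output field names (`_id`, `text`, `metadata`, `embedding`) are assumptions chosen for illustration, and the sample dictionary only imitates the shape of a `get()` result.

```python
from typing import Any


def to_records(result: dict[str, Any]) -> list[dict[str, Any]]:
    """Reshape a column-oriented get() result into one dict per document."""
    ids = result["ids"]

    def column(key: str):
        # A field that was not requested via `include` may be None or absent.
        values = result.get(key)
        return values if values is not None else [None] * len(ids)

    records = []
    for doc_id, text, meta, emb in zip(
        ids, column("documents"), column("metadatas"), column("embeddings")
    ):
        records.append(
            {
                "_id": doc_id,  # hypothetical choice: reuse the Chroma ID as the MongoDB key
                "text": text,
                "metadata": meta or {},
                # Plain floats keep the record serializable even if the
                # embedding arrives as a NumPy array.
                "embedding": None if emb is None else [float(x) for x in emb],
            }
        )
    return records


# Hand-written sample that mimics the shape of a get() result.
sample = {
    "ids": ["a", "b"],
    "documents": ["first text", "second text"],
    "metadatas": [{"source": "mongo"}, None],
    "embeddings": [[0.1, 0.2], [0.3, 0.4]],
}
records = to_records(sample)
```

The resulting list can be passed to a bulk insert, and the `text` and `embedding` columns can be pulled back out of it when fitting a topic model on precomputed embeddings.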

Solution 2: Get all documents from ChromaDB using langchain

Once the ChromaDB database is initialized, create a client using the DB directory:

import chromadb
from chromadb.config import Settings
client = chromadb.Client(Settings(is_persistent=True, persist_directory="<PERSIST_DIR_NAME>"))

Then, get the specified collection:

collection = client.get_collection("collection_name")

Finally, retrieve all the data:

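data = collection.get(include=["embeddings", "documents", "metadatas"])

The returned dictionary contains the IDs, the document texts, the metadata (including each document's source, if one was stored), and the embeddings. Without the include argument, the embeddings are not returned.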
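For a large collection, pulling every embedding in a single call can use a lot of memory. The sketch below pages through the collection with the `limit` and `offset` arguments of `get()` and the collection's `count()` method. The helper name `iter_batches` and the default batch size are assumptions, and the sketch also assumes that the collection is not modified while it is being read, so that pages do not overlap.

```python
def iter_batches(collection, batch_size=1000):
    """Yield get() results in pages of at most `batch_size` documents."""
    total = collection.count()
    for offset in range(0, total, batch_size):
        # Each page has the same dictionary shape as a full get() result.
        yield collection.get(
            limit=batch_size,
            offset=offset,
            include=["embeddings", "documents", "metadatas"],
        )


# Usage, given a `collection` obtained from client.get_collection(...):
# for page in iter_batches(collection, batch_size=500):
#     print(len(page["ids"]))
```

Each yielded page can be processed and written out before the next one is fetched, which keeps only one batch in memory at a time.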

Q&A

How do I get all documents I’ve just stored in the Chroma database?

Call `db.get()`. It returns a dictionary with the IDs, document texts, and metadata. To get the embeddings as well, call `db.get(include=["embeddings", "documents", "metadatas"])`.

How to create a client separately using the DB persist directory?

Create a `chromadb` client that points at the persist directory, as shown in Solution 2, then call `client.get_collection()` with the name of your collection.
