How to convert LangChain documents back to strings? – LangChain

by
Maya Patel
chatgpt-api langchain llama-cpp-python openai-api

Quick Fix: To convert LangChain documents back to strings, replace create_documents with split_text in the code. This ensures that each chunk from the RecursiveCharacterTextSplitter is an item in the texts list.

The Problem:

You’ve developed a splitter function in Python using the LangChain library to dissect a set of Python files. However, you need to convert the separated documents back into their original Python code. You’re unsure of the approach for this conversion.

The Solutions:

Solution 1: Use `split_text` method to convert documents back to strings

To convert the LangChain documents back to strings, replace the `create_documents` method with the `split_text` method. Here’s the modified code:
```
texts = text_splitter.split_text(contents)
```
The `create_documents` method returns a list of `Document` objects, each with two attributes: `page_content` (a string) and `metadata` (a dictionary). The `split_text` method, on the other hand, returns each chunk from the `RecursiveCharacterTextSplitter` directly as an item in the `texts` list. The result is a plain list of strings, which can then be processed as needed.
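To make the difference concrete, here is a minimal sketch that runs the same splitter both ways. It assumes the `langchain-text-splitters` package is installed; the helper name `split_both_ways` and the `chunk_size` value are made up for illustration.

```python
def split_both_ways(contents, chunk_size=200):
    """Split one string with split_text and with create_documents."""
    # Imported inside the helper so a missing package only fails at call time.
    # Assumes the langchain-text-splitters package; older releases expose the
    # same class as langchain.text_splitter.RecursiveCharacterTextSplitter.
    from langchain_text_splitters import RecursiveCharacterTextSplitter

    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size, chunk_overlap=0
    )

    texts = text_splitter.split_text(contents)         # list of str
    docs = text_splitter.create_documents([contents])  # list of Document
    return texts, docs
```

With both results in hand, `[doc.page_content for doc in docs]` should give the same list as `texts`, since each `Document` wraps one chunk.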

Solution 2: Convert Document Back to Dictionary

The LangChain documentation says little about the ‘Document’ type, but it is a Pydantic model, so you can convert a ‘Document’ to a dictionary using the ‘doc.dict()’ method. On releases built on Pydantic v2, ‘doc.model_dump()’ is the non-deprecated equivalent.

The code demonstrates this conversion:

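```
from langchain_core.documents import Document

# Create a 'Document' instance (illustrative values)
document = Document(page_content="print('hello')", metadata={"source": "example.py"})

# Convert the 'Document' to a dictionary
document_dict = document.dict()
```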

The ‘document_dict’ variable now contains a dictionary representation of the original ‘Document’. This conversion allows you to access the contents of the ‘Document’ in a more structured and familiar format.
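
The dictionary still has to be reduced to a string. A minimal sketch, assuming ‘document_dict’ has ‘page_content’ and ‘metadata’ keys (the literal below is a hand-written stand-in with made-up values, and real output may carry extra keys depending on the LangChain version):

```python
# Hand-written stand-in for the result of document.dict(); values are illustrative.
document_dict = {
    "page_content": "def add(a, b):\n    return a + b\n",
    "metadata": {"source": "math_utils.py"},
}

# The original text of the chunk lives under the "page_content" key.
code_string = document_dict["page_content"]
print(code_string)
```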

Solution 3: Extracting LangChain Documents to String

To extract the contents of individual LangChain documents and convert them back into a string, you can use the following steps:

  1. Extract the Page Content:

    Access the page_content attribute of the LangChain document. This attribute contains the raw text content of the document.

  2. Convert to String:

    The page_content attribute is already a Python string, so no conversion is normally needed. Calling str() on it is harmless but redundant.

  3. Process Multiple Documents:

    If you have multiple LangChain documents, you can use a loop to extract the page_content and convert each document to a string. Store the resulting strings in a list or other appropriate data structure.

Here’s an example of how you can extract the contents of a single LangChain document and convert it to a string:

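```
import json

# Older releases expose the same class as langchain.schema.Document
from langchain_core.documents import Document

# Load a LangChain document from disk.
# Assumption: document.json holds an object with a "page_content" key
# and, optionally, a "metadata" key.
with open("document.json", encoding="utf-8") as f:
    document = Document(**json.load(f))

# Extract the page content (already a str)
page_content = document.page_content

# str() is a no-op here; kept only to mirror step 2
string_text = str(page_content)

# Print the extracted string
print(string_text)
```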

To process multiple documents, you can use a loop:

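```
import json

from langchain_core.documents import Document

# Load LangChain documents from disk.
# Assumption: each file holds an object with a "page_content" key
# and, optionally, a "metadata" key.
documents = []
for i in range(5):
    with open(f"document{i}.json", encoding="utf-8") as f:
        documents.append(Document(**json.load(f)))

# Extract the page content (a str) from each document
string_texts = [document.page_content for document in documents]

# Print the extracted strings
for string_text in string_texts:
    print(string_text)
```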

This will extract the page_content from each document, convert it to a string, and store the resulting strings in the string_texts list. You can then use this list to further process the extracted text as needed.
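
If the goal is to get back the source of each file rather than a list of separate chunks, the strings can be joined per file. The sketch below is one way to do that, under several assumptions: each document's metadata carries a source key naming its file (something you would have to supply yourself, for example through the metadatas argument of create_documents), the documents are in their original order, and the splitter was configured without chunk overlap. The helper name rebuild_sources is made up, and the fallback Document class is only a stand-in so the snippet runs without LangChain installed.

```python
from collections import defaultdict
from dataclasses import dataclass, field

try:
    from langchain_core.documents import Document
except ImportError:
    # Stand-in with the same two attributes as LangChain's Document.
    @dataclass
    class Document:
        page_content: str
        metadata: dict = field(default_factory=dict)


def rebuild_sources(documents, key="source", sep=""):
    """Join chunk strings per file, keeping the order the chunks arrive in."""
    chunks_by_file = defaultdict(list)
    for doc in documents:
        # Documents without the key are grouped under the empty string.
        chunks_by_file[doc.metadata.get(key, "")].append(doc.page_content)
    return {name: sep.join(chunks) for name, chunks in chunks_by_file.items()}


# Hand-made chunks standing in for splitter output.
documents = [
    Document(page_content="import os\n", metadata={"source": "a.py"}),
    Document(page_content="x = 1\n", metadata={"source": "b.py"}),
    Document(page_content="print(os.getcwd())\n", metadata={"source": "a.py"}),
]

rebuilt = rebuild_sources(documents)
print(rebuilt["a.py"])
```

Even with no overlap, the splitter may trim whitespace at chunk boundaries, so the joined text can differ slightly from the original file. Pass a different sep (for example a newline) if that matches how your chunks were cut.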

Q&A

How to convert a Document created with the RecursiveCharacterTextSplitter back to a string?

Read the Document's ‘page_content’ attribute directly; it is already a string. Alternatively, use the .dict() method to convert the Document to a dictionary, then take the ‘page_content’ key.

Can we use the create_documents method to split the text?

Yes, but the ‘create_documents’ method returns a list of Document objects rather than a list of plain strings. Use ‘split_text’ instead if you want the chunks as strings.

How to extract string from a list of langchain docs?

Read the ‘page_content’ attribute of each Document object, for example with a list comprehension.
