Quick Fix: To get plain strings instead of LangChain `Document` objects, replace `create_documents` with `split_text` in the code. `split_text` takes a single string and returns each chunk from the `RecursiveCharacterTextSplitter` as an item in the `texts` list.

The Problem:

You’ve written a splitter function in Python with the LangChain library that breaks a set of Python files into chunks. You now need to turn the resulting documents back into plain Python source text, and you’re unsure how to do this conversion.

The Solutions:

Solution 1: Use the `split_text` method to get strings instead of documents

To get strings rather than LangChain documents, replace the `create_documents` call with `split_text`. Here’s the modified code:
# `contents` must be a single string; the result is a list of strings
texts = text_splitter.split_text(contents)
The `create_documents` method takes a list of strings and returns a list of `Document` objects, each with two attributes: `page_content` (a string) and `metadata` (a dictionary). The `split_text` method, on the other hand, takes a single string and puts each chunk from the `RecursiveCharacterTextSplitter` directly into the `texts` list as a plain string. The result is a list of strings, which can then be processed as needed.
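If the rest of your code already holds `Document` objects, the same list of strings can be read from each one's `page_content` attribute. The following is a minimal sketch, assuming `langchain_core` is installed; a small stand-in class is defined otherwise so the snippet still runs. The sample chunks and the `source` metadata value are invented for illustration.

```python
from dataclasses import dataclass, field

try:
    from langchain_core.documents import Document
except ImportError:
    # Assumption: stand-in with the same two fields, used only when
    # LangChain is not installed.
    @dataclass
    class Document:
        page_content: str
        metadata: dict = field(default_factory=dict)

# Illustrative chunks, shaped like the output of create_documents
docs = [
    Document(page_content="def add(a, b):\n    return a + b",
             metadata={"source": "math_utils.py"}),
    Document(page_content="def sub(a, b):\n    return a - b",
             metadata={"source": "math_utils.py"}),
]

# Each Document exposes its text as the page_content attribute
texts = [doc.page_content for doc in docs]
```

Here `texts` has the same shape as the return value of `split_text`: a plain list of strings.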

Solution 2: Convert the Document to a Dictionary

The LangChain documentation says little about the `Document` type. However, `Document` is a Pydantic model, so you can convert it to a dictionary using the `doc.dict()` method (with Pydantic v2, `doc.model_dump()` is the non-deprecated equivalent).

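The code below demonstrates this conversion:

from langchain_core.documents import Document  # older releases: from langchain.schema import Document

document = Document(...)  # Create a `Document` instance (supply page_content and metadata)

# Convert the `Document` to a dictionary
document_dict = document.dict()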

The `document_dict` variable now contains a dictionary representation of the `Document`, with the text under the `page_content` key and the metadata under the `metadata` key. This lets you work with the contents of the `Document` as an ordinary dictionary.
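Because the serialization method depends on the installed Pydantic version, a version-independent alternative is to copy the two fields by hand. This is only a sketch: `document_to_dict` is a hypothetical helper, not part of LangChain, and the stand-in class is used only when `langchain_core` is not installed.

```python
from dataclasses import dataclass, field

try:
    from langchain_core.documents import Document
except ImportError:
    # Assumption: stand-in with the same two fields, used only when
    # LangChain is not installed.
    @dataclass
    class Document:
        page_content: str
        metadata: dict = field(default_factory=dict)


def document_to_dict(doc):
    """Hypothetical helper: copy the text and metadata into a plain dict."""
    return {"page_content": doc.page_content, "metadata": dict(doc.metadata)}


# Illustrative document; the content and source name are made up
document = Document(page_content="print('hello')",
                    metadata={"source": "hello.py"})

document_dict = document_to_dict(document)

# The chunk text is then available under the page_content key
chunk_text = document_dict["page_content"]
```

Unlike `.dict()`, this copies only the text and metadata and ignores any other fields a given LangChain version may add.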

How to convert a Document created with the RecursiveCharacterTextSplitter back to a string?

Read the `page_content` attribute of the `Document` directly (`doc.page_content`). Alternatively, use the `.dict()` method to convert the `Document` to a dictionary, then take the `page_content` key to get the string.

Can we use the create_documents method to split the text?

Yes, but the `create_documents` method takes a list of strings and returns a list of `Document` objects, not a list of plain string chunks. Use `split_text` instead if you want the chunks as strings.

How to extract strings from a list of LangChain docs?

Read the `page_content` attribute of each `Document` object, for example `[doc.page_content for doc in docs]`.
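To get from individual chunks back to something close to the original Python files, the extracted strings can be grouped by file and joined. The sketch below makes several assumptions: each `Document` carries a `source` key in its metadata (which `create_documents` only sets if you pass `metadatas`), the chunks are still in their original order, and `rebuild_sources` is a hypothetical helper name. The result matches the original text exactly only if the splitter used no chunk overlap and the chosen joiner matches the separator the chunks were split on.

```python
from collections import defaultdict
from dataclasses import dataclass, field

try:
    from langchain_core.documents import Document
except ImportError:
    # Assumption: stand-in with the same two fields, used only when
    # LangChain is not installed.
    @dataclass
    class Document:
        page_content: str
        metadata: dict = field(default_factory=dict)


def rebuild_sources(docs, joiner="\n"):
    """Hypothetical helper: group chunks by metadata['source'] and join them."""
    grouped = defaultdict(list)
    for doc in docs:
        # Chunks without a source are collected under a placeholder key
        grouped[doc.metadata.get("source", "<unknown>")].append(doc.page_content)
    return {source: joiner.join(chunks) for source, chunks in grouped.items()}


# Illustrative chunks from two made-up files
docs = [
    Document(page_content="import os", metadata={"source": "a.py"}),
    Document(page_content="print(os.getcwd())", metadata={"source": "a.py"}),
    Document(page_content="x = 1", metadata={"source": "b.py"}),
]

sources = rebuild_sources(docs)
```

`sources` maps each file name to its rejoined text, so `sources["a.py"]` holds both chunks of `a.py` separated by a newline.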

