[Solved] LangChain: Querying a document and getting structured output using Pydantic with ChatGPT not working well

by Ali Hasan
langchain large-language-model llama-cpp-python openai-api python-3.x

Quick Fix: To get ChatGPT to respect the Pydantic output structure, wrap the chain execution in a try/except block, delete the embeddings after each run, spell out formatting instructions in the prompt, add a None value to the Enum, split the document into larger chunks, and use the gpt-4 model for improved accuracy.

The Problem:

A developer is using LangChain to query a document and extract structured outputs using Pydantic. However, they are facing challenges:

  1. ChatGPT sometimes doesn’t format the extracted date according to the required ‘datetime.date’ format.

  2. Pydantic's Enum field is not working as expected. The document says 'Lastname' instead of 'Surname', and ChatGPT doesn't map it to one of the Enum values, so validation raises exceptions.

  3. The developer is unsure about how to correctly use chains due to the complexity of LangChain’s documentation.

Overall, the developer is struggling to get ChatGPT to respect the defined output structure, leading to errors and confusion.

Can you assist in resolving these issues and providing guidance on using LangChain effectively?
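
For reference, the setup in question looks roughly like the sketch below: a Pydantic model with a 'datetime.date' field and an Enum field, wrapped in LangChain's PydanticOutputParser. The class and field names (PersonRecord, name_type, birth_date) are placeholders, since the original code is not reproduced here.

```python
from datetime import date
from enum import Enum

from langchain.output_parsers import PydanticOutputParser
from pydantic import BaseModel, Field


class NameEnum(str, Enum):
    # The two values the document is expected to distinguish.
    NAME = "Name"
    SURNAME = "Surname"


class PersonRecord(BaseModel):
    # Placeholder fields illustrating the structure the LLM must fill in.
    person: str = Field(description="The person mentioned in the document")
    name_type: NameEnum = Field(description="Whether the value is a name or a surname")
    birth_date: date = Field(description="Date of birth, formatted as YYYY-MM-DD")


# Turns the model into formatting instructions for the prompt and validates
# the LLM's answer against the schema afterwards.
parser = PydanticOutputParser(pydantic_object=PersonRecord)
```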

The Solutions:

Solution 1: Handle Errors, Clean Vectorstore, Provide Formatting Tips, and Use Enum with None

To resolve the issue, a few modifications were made to the code:

  1. try/except Block: A try/except block was added around the chain execution code to catch errors and prevent the program from stopping.

  2. Cleaning Vectorstore: The vectorstore was not being cleared between runs, so embeddings from previous documents were leaking into new queries. Deleting the vectorstore after each run ensures every document starts from a clean store.

  3. Formatting Tips: Explicit instructions were provided to the LLM to ensure proper formatting of the results. A "Tips" message was included in the prompt, specifically instructing the LLM to format dates in the YYYY-MM-DD format.

  4. Enum with None: To cover cases where the LLM cannot find the requested information, the NameEnum class was extended with a None value (sketched below). This lets the field be set to None when the LLM cannot determine whether the value is a name or a surname.

  5. Larger Splits and gpt-4: The document splits were increased from 200 to 500 characters to improve the accuracy of the extraction, and gpt-4 was used instead of gpt-3.5-turbo for better performance (an end-to-end sketch follows the summary below).
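
Points 3 and 4 above touch the prompt and the schema. The sketch below shows one way to express them, reusing the placeholder NameEnum from the earlier sketch; the exact wording of the original "Tips" message is not given in the post, so the template text here is illustrative.

```python
from enum import Enum

from langchain.prompts import PromptTemplate


class NameEnum(str, Enum):
    NAME = "Name"
    SURNAME = "Surname"
    NONE = "None"  # lets the model answer "None" instead of guessing


# Illustrative prompt: the format instructions come from
# parser.get_format_instructions() (see the parser in the earlier sketch).
PROMPT_TEMPLATE = """Answer the query using only the context below.

{context}

{format_instructions}

Tips: format every date as YYYY-MM-DD. Treat 'Lastname' and 'Surname' as the
same field. If a value cannot be found in the context, answer with "None".

Query: {query}
"""

prompt = PromptTemplate(
    template=PROMPT_TEMPLATE,
    input_variables=["context", "query", "format_instructions"],
)
```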

By implementing these modifications, formatting errors were eliminated, the program was able to handle errors gracefully without crashing, and the accuracy of data extraction was significantly improved.
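
Putting the pieces together, an end-to-end run could look like the sketch below. It assumes a Chroma vectorstore and a RetrievalQA chain, neither of which is confirmed in the post, so treat the document loading, chain type, and query string as placeholders.

```python
from langchain.chains import RetrievalQA
from langchain.chat_models import ChatOpenAI
from langchain.embeddings import OpenAIEmbeddings
from langchain.schema import Document
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import Chroma

# Stand-in for the document loaded by the original code.
docs = [Document(page_content="... full text of the source document ...")]

# Larger splits: 500 characters instead of 200 (point 5).
splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
chunks = splitter.split_documents(docs)

# A fresh vectorstore for this document only.
vectorstore = Chroma.from_documents(chunks, OpenAIEmbeddings())

# gpt-4 instead of gpt-3.5-turbo (point 5).
llm = ChatOpenAI(model_name="gpt-4", temperature=0)

# In the real setup, the prompt carrying the format instructions from the
# previous sketch is wired into the chain; the default prompt is used here
# to keep the example short.
chain = RetrievalQA.from_chain_type(llm=llm, retriever=vectorstore.as_retriever())

# try/except so one malformed answer does not crash the whole run (point 1).
try:
    answer = chain.run("Extract the person, whether it is a name or a surname, "
                       "and the date of birth from the document.")
    print(answer)
except Exception as exc:  # e.g. output-parsing or Pydantic validation errors
    print(f"Extraction failed for this document: {exc}")
finally:
    # Delete the embeddings so the next document starts from a clean store (point 2).
    vectorstore.delete_collection()
```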

Q&A

How can I fix ChatGPT not respecting the format from the Pydantic structure?

Add a try/except block around the chain call, clean the vectorstore on each run, give more explicit formatting instructions in the prompt, and add a None value to the Enum class.

What are some additional tips to improve the accuracy of the data extraction?

Increase the size of the splits, use gpt-4 as the model, and make sure to handle unexpected errors.

How can I handle the situation when ChatGPT cannot find the requested information?

Modify the Pydantic class to account for a None value, indicating that the information was not found.
