Valérian de Thézan de Gaussan · Data Engineering for process-heavy organizations

Unstructured data ➡ Structured data

One of the big use cases of LLMs. But how?


JSON is a format that’s commonly used to structure data in a clear, easy-to-read way on the web.

However, transforming text into JSON with an LLM can be challenging because of potential errors and inconsistencies in the output. Since an LLM is “just” a “next word prediction machine”, it can start out writing valid JSON and then drift into writing a poem.

🤏 Rules of thumb to output valid JSON:

1️⃣ Set the AI’s “temperature” to 0 (or 0.1, but not much higher). This setting controls how predictable the AI’s responses are. A lower temperature means less randomness, which increases the likelihood of getting a precise, structured output that passes JSON validation.

2️⃣ Begin your prompt with a part of the JSON structure already written. This gives the AI a clear starting point and helps guide its response in the format you want.

3️⃣ Finally, use a JSON parser to check the AI’s output. If there are errors, use the feedback from the parser to refine the prompt and ask the AI again. Repeat this process until the JSON is error-free.
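
The first two rules can be sketched in a few lines of Python. This is only an illustration: `complete` is a hypothetical callable standing in for whatever LLM client you use (it takes a prompt and a temperature and returns the generated text), and `build_prompt` and `text_to_json` are names made up for this example. It assumes the model either continues right after the prefix or returns a full JSON object.

```python
import json
from typing import Any, Callable


def build_prompt(text: str, prefix: str) -> str:
    """End the prompt with the first characters of the expected JSON (rule 2)."""
    return (
        "Here is some unstructured text:\n\n"
        f"{text}\n\n"
        f"Please convert it into JSON format, starting like this: {prefix}"
    )


def text_to_json(
    text: str,
    complete: Callable[[str, float], str],  # hypothetical LLM call: (prompt, temperature) -> text
    prefix: str = '{"client_name": "',
    temperature: float = 0.0,  # rule 1: keep randomness as low as possible
) -> Any:
    raw = complete(build_prompt(text, prefix), temperature).strip()
    # Assumption: if the model did not repeat the opening brace, it continued
    # after the prefix, so the prefix has to be glued back on before parsing.
    candidate = raw if raw.startswith("{") else prefix + raw
    return json.loads(candidate)  # raises json.JSONDecodeError on invalid JSON
```

Passing the LLM call in as a parameter keeps the sketch independent of any specific provider, and makes it easy to test with a fake function.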

A simple example:

"""
Here is information extracted from an invoice:

{invoice_data}

Please convert this information into JSON format, starting like this: {"client_name": "
"""
Then, pass the result to a JSON parser. If parsing fails, feed the parser’s error back into the prompt instead of simply retrying, since the LLM may recognize the mistake it made and fix it.
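
Here is a minimal sketch of that feedback loop, using Python’s standard `json` module. As before, `complete` is a hypothetical stand-in for your LLM call (assumed here to take only the prompt, with the temperature already configured), and `json_with_retries`, the retry wording, and the limit of three attempts are arbitrary choices for illustration.

```python
import json
from typing import Any, Callable


def json_with_retries(
    prompt: str,
    complete: Callable[[str], str],  # hypothetical LLM call: prompt -> generated text
    max_attempts: int = 3,
) -> Any:
    current_prompt = prompt
    for _ in range(max_attempts):
        output = complete(current_prompt)
        try:
            return json.loads(output)
        except json.JSONDecodeError as err:
            # Instead of blindly retrying, show the model its own answer
            # together with the parser's error message.
            current_prompt = (
                f"{prompt}\n\n"
                f"Your previous answer was:\n{output}\n\n"
                f"It is not valid JSON. The parser reported: {err}\n"
                "Please return only the corrected JSON."
            )
    raise ValueError(f"No valid JSON after {max_attempts} attempts")
```

Capping the number of attempts avoids looping forever (and paying for it) when the model keeps producing broken output.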

In the image below, I even managed to turn Hipster Ipsum, a nonsensical filler text, into a structured blog article, with added categories and tags.

unstructured-to-structured.png