We increasingly rely on AI as assistants for coding, video production, and photography. While AI’s capabilities in areas like code, image, and video generation are well-explored, we may be underestimating the potential of popular large language models (LLMs) to enhance the quality of the products we develop in other domains. Specifically, utilizing LLMs to model production data represents a significant, often overlooked, area of opportunity.
Modeling production data
Let’s first tackle why we need to model production data at all. Most of us are familiar with it: production data is customer data such as names, addresses, emails, and billing information. For any business, production data is a high-value asset, and a great deal of care goes into building the software that touches it.
Change controlled environments
Often the environments that hold production data are change controlled, which means we are not at liberty to execute scripts and tooling without prior approval and authorization. This is because these environments may include personally identifiable information, which is protected by privacy regulations such as HIPAA and GDPR. A business cannot test its software and tooling against change-controlled or production environments, because doing so could negatively impact the customer experience.
Sandbox environments
A good alternative is to create a sandbox environment. Sandbox environments can usually be created quickly with modern tooling and cloud providers like Azure, AWS, or GCP, and they typically serve as the dev, test, or stage environments in the deployment pipeline. Although a new environment can closely simulate the production environment’s configuration and settings, it will lack any data.
Also, there are instances where we might need to run analytics and monitoring tooling on production data. Production data is, by definition, far larger than the data in our sandbox environment, so in this situation we need to model a volume of data that matches production scale.
These scenarios show that we need to model not only realistic production data but also a large quantity of it.
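One way to reach production-like volume is to let an LLM produce a small, high-quality seed set and then scale it programmatically. The sketch below is a minimal illustration of that idea; the seed name lists and the `scale_rows` helper are hypothetical, not part of any particular library.

```python
import random
import uuid

# Hypothetical seed values, in the shape an LLM-generated first draft might take.
SEED_FIRST = ["James", "Maria", "Robert", "Linda"]
SEED_LAST = ["Smith", "Garcia", "Johnson", "Williams"]

def scale_rows(target_count, seed=42):
    """Recombine a small seed set into target_count rows with unique IDs."""
    rng = random.Random(seed)  # fixed seed so the generated data is reproducible
    return [
        {
            "Patient_ID": str(uuid.uuid4()),
            "First_Name": rng.choice(SEED_FIRST),
            "Last_Name": rng.choice(SEED_LAST),
        }
        for _ in range(target_count)
    ]

rows = scale_rows(1000)
print(len(rows))
```

Recombining seed fields keeps the data realistic while the row count scales to whatever size the production workload demands.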
Generating data with LLMs
Now that we have explored the need for modeling data, let’s look at an example where we generate simulated customer data for a hospital. In the following exercise, I gave Gemini this prompt:
“Create an example csv file of data with 10 rows with basic customer information for a hospital”
Based on the previous prompt I received the following output:
Patient_ID,First_Name,Last_Name,DOB,Gender,Phone_Number,Email,Blood_Type,Last_Visit_Date,Insurance_Provider
P1001,James,Smith,1975-05-12,Male,555-0101,jsmith75@email.com,A+,2023-11-15,BlueCross
P1002,Maria,Garcia,1988-08-23,Female,555-0102,mgarcia88@email.com,O-,2024-01-10,Aetna
P1003,Robert,Johnson,1962-12-01,Male,555-0103,rob.j@email.com,B+,2023-09-22,Medicare
P1004,Linda,Williams,1995-03-14,Female,555-0104,lwilliams95@email.com,AB-,2024-02-05,UnitedHealth
P1005,Michael,Brown,1950-07-30,Male,555-0105,mbrown50@email.com,O+,2023-12-12,Medicare
P1006,Elizabeth,Jones,1982-11-19,Female,555-0106,ejones82@email.com,A-,2024-03-01,Cigna
P1007,David,Miller,2001-01-25,Male,555-0107,dmiller01@email.com,O+,2023-10-30,Kaiser
P1008,Barbara,Davis,1945-09-08,Female,555-0108,bdavis45@email.com,B-,2024-01-20,Humana
P1009,William,Wilson,1990-06-17,Male,555-0109,wwilson90@email.com,A+,2023-08-14,BlueCross
P1010,Susan,Taylor,1979-04-05,Female,555-0110,staylor79@email.com,AB+,2024-02-28,Aetna
As the example above shows, this data is a solid first draft of what we might need in a sandbox environment.
I will now refine this first draft, removing the medical information and converting the identifier to a UUID, using the following prompts:
Remove the Last_Visit_Date and Blood_Type columns from the csv
Remove the Insurance_Provider column from the csv
Convert the Patient_ID to a UUID format
The refined data now resembles a customer information table in a database:
Patient_ID,First_Name,Last_Name,DOB,Gender,Phone_Number,Email
44e36bad-2c81-4410-bf75-d9072651ac4b,James,Smith,1975-05-12,Male,555-0101,jsmith75@email.com
c7d84b81-6d39-4bc6-b695-b0854e2201da,Maria,Garcia,1988-08-23,Female,555-0102,mgarcia88@email.com
f5d50c7b-e4ad-4763-af2a-9913abcc80ad,Robert,Johnson,1962-12-01,Male,555-0103,rob.j@email.com
9c355fd4-37f8-4522-8013-8f8855171b6f,Linda,Williams,1995-03-14,Female,555-0104,lwilliams95@email.com
6b6868e6-15d4-411a-9e2c-8cc56b5b27cf,Michael,Brown,1950-07-30,Male,555-0105,mbrown50@email.com
8eccf25e-5ecc-4cca-b679-68dc929429a5,Elizabeth,Jones,1982-11-19,Female,555-0106,ejones82@email.com
a241c4c0-c071-49d2-976d-0f251a4e0abe,David,Miller,2001-01-25,Male,555-0107,dmiller01@email.com
01b6f80b-dc3c-4e66-a9af-ed89947f515b,Barbara,Davis,1945-09-08,Female,555-0108,bdavis45@email.com
b842ab80-614f-4c0c-95a7-944099ee0c2a,William,Wilson,1990-06-17,Male,555-0109,wwilson90@email.com
9f02891c-06ba-4f80-91c2-6077ca918bad,Susan,Taylor,1979-04-05,Female,555-0110,staylor79@email.com
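The same refinement can also be done deterministically in a few lines of code, which is useful when you want the transformation to be repeatable without another round-trip to the LLM. The sketch below inlines two rows of the first draft for self-containment; the `DROP` set and the use of `uuid5` for stable IDs are my assumptions, not part of the original workflow.

```python
import csv
import io
import uuid

# Two rows of the LLM's first draft, inlined to keep the example self-contained.
RAW = """\
Patient_ID,First_Name,Last_Name,DOB,Gender,Phone_Number,Email,Blood_Type,Last_Visit_Date,Insurance_Provider
P1001,James,Smith,1975-05-12,Male,555-0101,jsmith75@email.com,A+,2023-11-15,BlueCross
P1002,Maria,Garcia,1988-08-23,Female,555-0102,mgarcia88@email.com,O-,2024-01-10,Aetna
"""

DROP = {"Blood_Type", "Last_Visit_Date", "Insurance_Provider"}

cleaned = []
for row in csv.DictReader(io.StringIO(RAW)):
    row = {k: v for k, v in row.items() if k not in DROP}
    # uuid5 derives a stable UUID from the original ID, so re-runs agree.
    row["Patient_ID"] = str(uuid.uuid5(uuid.NAMESPACE_DNS, row["Patient_ID"]))
    cleaned.append(row)

print(sorted(cleaned[0]))
```

Deriving the UUID from the original ID (rather than generating a random one) means related tables generated later can still be joined on a consistent key.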
Building on the previously generated data, how can we generate the additional data we need? We will now use the following prompt to generate a second dataset containing the patients’ hospital visit information:
“Using the data in final_hospital_data.csv generate a single csv file that will contain their last medical visit information like patient_id, visit date, visit time, insurance provider, visit reason, doctor visited.”
We received the following output from the LLM:
Patient_ID,Visit_Date,Visit_Time,Insurance_Provider,Visit_Reason,Doctor_Visited
44e36bad-2c81-4410-bf75-d9072651ac4b,2023-10-11,9:15,Humana,Hypertension Follow-up,Dr. Aris
c7d84b81-6d39-4bc6-b695-b0854e2201da,2024-05-15,8:45,BlueCross,Post-Op Review,Dr. Foster
f5d50c7b-e4ad-4763-af2a-9913abcc80ad,2023-06-04,10:30,Cigna,Hypertension Follow-up,Dr. Dhillon
9c355fd4-37f8-4522-8013-8f8855171b6f,2023-07-15,14:30,Aetna,Migraine,Dr. Evans
6b6868e6-15d4-411a-9e2c-8cc56b5b27cf,2023-05-10,14:15,Medicare,Flu Symptoms,Dr. Bennett
8eccf25e-5ecc-4cca-b679-68dc929429a5,2024-04-21,13:15,Kaiser,Migraine,Dr. Aris
a241c4c0-c071-49d2-976d-0f251a4e0abe,2023-07-18,8:15,BlueCross,Diabetes Consultation,Dr. Evans
01b6f80b-dc3c-4e66-a9af-ed89947f515b,2023-04-10,13:45,Cigna,Flu Symptoms,Dr. Chen
b842ab80-614f-4c0c-95a7-944099ee0c2a,2023-08-24,14:45,Medicare,Flu Symptoms,Dr. Foster
9f02891c-06ba-4f80-91c2-6077ca918bad,2024-08-19,9:15,Cigna,Joint Pain,Dr. Aris
Together, the two datasets model a hospital’s customer information and customer visit tables.
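Because the visit table references patients by ID, it is worth checking that every visit row points at a known patient before loading the data anywhere. A minimal referential-integrity check might look like this; the two inlined CSV snippets are abbreviated copies of the generated data.

```python
import csv
import io

# Abbreviated copies of the two generated datasets.
PATIENTS = """\
Patient_ID,First_Name
44e36bad-2c81-4410-bf75-d9072651ac4b,James
c7d84b81-6d39-4bc6-b695-b0854e2201da,Maria
"""

VISITS = """\
Patient_ID,Visit_Reason
44e36bad-2c81-4410-bf75-d9072651ac4b,Hypertension Follow-up
c7d84b81-6d39-4bc6-b695-b0854e2201da,Post-Op Review
"""

patient_ids = {r["Patient_ID"] for r in csv.DictReader(io.StringIO(PATIENTS))}
# Any visit whose Patient_ID is not in the patient table is an orphan row.
orphans = [r for r in csv.DictReader(io.StringIO(VISITS))
           if r["Patient_ID"] not in patient_ids]
print(len(orphans))
```

LLMs occasionally drift on IDs across prompts, so a cheap check like this catches broken foreign keys before they reach a database.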
Integrating generated data
The collected data can now be readily integrated into either a SQL or NoSQL database. LLMs can be employed to generate the necessary insert queries for a SQL database, allowing this data to be fed into a local, development, or test environment. For the sake of clarity, I limited the example above to ten rows, but generating a larger dataset is straightforward.
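As a sketch of that integration step, the snippet below turns CSV rows into SQL INSERT statements. The `to_inserts` helper and the `patients` table name are hypothetical; in real code you would prefer parameterized queries over string-built SQL.

```python
import csv
import io

DATA = """\
Patient_ID,First_Name,Last_Name
44e36bad-2c81-4410-bf75-d9072651ac4b,James,Smith
c7d84b81-6d39-4bc6-b695-b0854e2201da,Maria,Garcia
"""

def to_inserts(csv_text, table):
    """Build one INSERT statement per CSV row (sketch; not injection-safe)."""
    stmts = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        cols = ", ".join(row)
        # Double single quotes to escape them inside SQL string literals.
        vals = ", ".join("'" + v.replace("'", "''") + "'" for v in row.values())
        stmts.append(f"INSERT INTO {table} ({cols}) VALUES ({vals});")
    return stmts

for stmt in to_inserts(DATA, "patients"):
    print(stmt)
```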
Furthermore, it might be necessary to convert the data from its current CSV format into other structures, such as JSON or XML, to facilitate easier integration. While LLMs are capable of handling all these conversion tasks, we will omit them from this article to keep the focus narrow.
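For completeness, the CSV-to-JSON conversion mentioned above is also a few lines of standard-library Python, so it need not go through an LLM at all; the inlined data below is an abbreviated copy of the generated table.

```python
import csv
import io
import json

DATA = """\
Patient_ID,First_Name,Last_Name
44e36bad-2c81-4410-bf75-d9072651ac4b,James,Smith
c7d84b81-6d39-4bc6-b695-b0854e2201da,Maria,Garcia
"""

# Each CSV row becomes one JSON object keyed by the header names.
records = list(csv.DictReader(io.StringIO(DATA)))
as_json = json.dumps(records, indent=2)
print(as_json)
```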
LLMs are a great tool for simulating data in our local or sandbox environments. With this generated data, we can test our software more thoroughly and have more confidence in the solutions we build.