We increasingly rely on AI as assistants for coding, video production, and photography. While AI’s capabilities in areas like code, image, and video generation are well-explored, we may be underestimating the potential of popular large language models (LLMs) to enhance the quality of the products we develop in other domains. Specifically, utilizing LLMs to model production data represents a significant, often overlooked, area of opportunity.
Modeling production data
Let’s first tackle why we need to model production data at all. Most of us are familiar with it: production data is customer data such as names, addresses, emails, and billing information. For any business, production data is a high-value asset, and a great deal of care goes into building the software that touches it.
Change controlled environments
Often the environments that hold production data are change controlled, which means we are not at liberty to execute scripts and tooling without prior approval and authorization. This is because these environments may include personally identifiable information, which is protected by privacy regulations such as HIPAA and GDPR. A business cannot test its software and tooling against change-controlled or production environments, because doing so could negatively impact the customer experience.
Sandbox environments
A good alternative is to create a sandbox environment. Sandbox environments can usually be created quickly with modern tooling and cloud providers like Azure, AWS, or GCP, and they typically serve as the dev, test, or stage environments in the deployment pipeline. Although a new environment can closely simulate the production environment’s configuration and settings, it will lack any data.
Also, there are instances where we might need to run analytics and monitoring tooling on production data. Production data is, by definition, far larger than the data in our sandbox environment, so in this situation we need to model a volume of data that matches production scale.
These scenarios show that we need to model not only realistic production data but also a large quantity of it.
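One way to reach production-like volume is to let an LLM produce a small, high-quality seed set and then scale it programmatically. The sketch below is a minimal illustration of that idea; the seed name lists and the `scale_rows` helper are hypothetical, not part of any particular library.

```python
import random
import uuid

# Hypothetical seed values, in the shape an LLM-generated first draft might take.
SEED_FIRST = ["James", "Maria", "Robert", "Linda"]
SEED_LAST = ["Smith", "Garcia", "Johnson", "Williams"]

def scale_rows(target_count, seed=42):
    """Recombine a small seed set into target_count rows with unique IDs."""
    rng = random.Random(seed)  # fixed seed so the generated data is reproducible
    return [
        {
            "Patient_ID": str(uuid.uuid4()),
            "First_Name": rng.choice(SEED_FIRST),
            "Last_Name": rng.choice(SEED_LAST),
        }
        for _ in range(target_count)
    ]

rows = scale_rows(1000)
print(len(rows))
```

Recombining seed fields keeps the data realistic while the row count scales to whatever size the production workload demands.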
Generating data with LLMs
Now that we have explored the need for modeling data, let’s look at an example where we generate simulated customer data for a hospital. In the following exercise, I gave Gemini this prompt:
“Create an example csv file of data with 10 rows with basic customer information for a hospital”
Based on the previous prompt I received the following output:
Patient_ID,First_Name,Last_Name,DOB,Gender,Phone_Number,Email,Blood_Type,Last_Visit_Date,Insurance_Provider
P1001,James,Smith,1975-05-12,Male,555-0101,jsmith75@email.com,A+,2023-11-15,BlueCross
P1002,Maria,Garcia,1988-08-23,Female,555-0102,mgarcia88@email.com,O-,2024-01-10,Aetna
P1003,Robert,Johnson,1962-12-01,Male,555-0103,rob.j@email.com,B+,2023-09-22,Medicare
P1004,Linda,Williams,1995-03-14,Female,555-0104,lwilliams95@email.com,AB-,2024-02-05,UnitedHealth
P1005,Michael,Brown,1950-07-30,Male,555-0105,mbrown50@email.com,O+,2023-12-12,Medicare
P1006,Elizabeth,Jones,1982-11-19,Female,555-0106,ejones82@email.com,A-,2024-03-01,Cigna
P1007,David,Miller,2001-01-25,Male,555-0107,dmiller01@email.com,O+,2023-10-30,Kaiser
P1008,Barbara,Davis,1945-09-08,Female,555-0108,bdavis45@email.com,B-,2024-01-20,Humana
P1009,William,Wilson,1990-06-17,Male,555-0109,wwilson90@email.com,A+,2023-08-14,BlueCross
P1010,Susan,Taylor,1979-04-05,Female,555-0110,staylor79@email.com,AB+,2024-02-28,Aetna
As the example above shows, this data is a solid first draft of what we might need in a sandbox environment.
I will now refine this first draft, removing the medical information and converting the identifier to a UUID, using the following prompts:
Remove the Last_Visit_Date and Blood_Type columns from the csv
Remove the Insurance_Provider column from the csv
Convert the Patient_ID to a UUID format
The refined data now resembles a customer information table in a database:
Patient_ID,First_Name,Last_Name,DOB,Gender,Phone_Number,Email
44e36bad-2c81-4410-bf75-d9072651ac4b,James,Smith,1975-05-12,Male,555-0101,jsmith75@email.com
c7d84b81-6d39-4bc6-b695-b0854e2201da,Maria,Garcia,1988-08-23,Female,555-0102,mgarcia88@email.com
f5d50c7b-e4ad-4763-af2a-9913abcc80ad,Robert,Johnson,1962-12-01,Male,555-0103,rob.j@email.com
9c355fd4-37f8-4522-8013-8f8855171b6f,Linda,Williams,1995-03-14,Female,555-0104,lwilliams95@email.com
6b6868e6-15d4-411a-9e2c-8cc56b5b27cf,Michael,Brown,1950-07-30,Male,555-0105,mbrown50@email.com
8eccf25e-5ecc-4cca-b679-68dc929429a5,Elizabeth,Jones,1982-11-19,Female,555-0106,ejones82@email.com
a241c4c0-c071-49d2-976d-0f251a4e0abe,David,Miller,2001-01-25,Male,555-0107,dmiller01@email.com
01b6f80b-dc3c-4e66-a9af-ed89947f515b,Barbara,Davis,1945-09-08,Female,555-0108,bdavis45@email.com
b842ab80-614f-4c0c-95a7-944099ee0c2a,William,Wilson,1990-06-17,Male,555-0109,wwilson90@email.com
9f02891c-06ba-4f80-91c2-6077ca918bad,Susan,Taylor,1979-04-05,Female,555-0110,staylor79@email.com
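The same refinement can also be done deterministically in a few lines of code, which is useful when you want the transformation to be repeatable without another round-trip to the LLM. The sketch below inlines two rows of the first draft for self-containment; the `DROP` set and the use of `uuid5` for stable IDs are my assumptions, not part of the original workflow.

```python
import csv
import io
import uuid

# Two rows of the LLM's first draft, inlined to keep the example self-contained.
RAW = """\
Patient_ID,First_Name,Last_Name,DOB,Gender,Phone_Number,Email,Blood_Type,Last_Visit_Date,Insurance_Provider
P1001,James,Smith,1975-05-12,Male,555-0101,jsmith75@email.com,A+,2023-11-15,BlueCross
P1002,Maria,Garcia,1988-08-23,Female,555-0102,mgarcia88@email.com,O-,2024-01-10,Aetna
"""

DROP = {"Blood_Type", "Last_Visit_Date", "Insurance_Provider"}

cleaned = []
for row in csv.DictReader(io.StringIO(RAW)):
    row = {k: v for k, v in row.items() if k not in DROP}
    # uuid5 derives a stable UUID from the original ID, so re-runs agree.
    row["Patient_ID"] = str(uuid.uuid5(uuid.NAMESPACE_DNS, row["Patient_ID"]))
    cleaned.append(row)

print(sorted(cleaned[0]))
```

Deriving the UUID from the original ID (rather than generating a random one) means related tables generated later can still be joined on a consistent key.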
Building on the previously generated data, how can we generate the additional data we need? We will now use the following prompt to generate a second dataset containing the patients’ hospital visit information:
“Using the data in final_hospital_data.csv generate a single csv file that will contain their last medical visit information like patient_id, visit date, visit time, insurance provider, visit reason, doctor visited.”
We received the following output from the LLM:
Patient_ID,Visit_Date,Visit_Time,Insurance_Provider,Visit_Reason,Doctor_Visited
44e36bad-2c81-4410-bf75-d9072651ac4b,2023-10-11,9:15,Humana,Hypertension Follow-up,Dr. Aris
c7d84b81-6d39-4bc6-b695-b0854e2201da,2024-05-15,8:45,BlueCross,Post-Op Review,Dr. Foster
f5d50c7b-e4ad-4763-af2a-9913abcc80ad,2023-06-04,10:30,Cigna,Hypertension Follow-up,Dr. Dhillon
9c355fd4-37f8-4522-8013-8f8855171b6f,2023-07-15,14:30,Aetna,Migraine,Dr. Evans
6b6868e6-15d4-411a-9e2c-8cc56b5b27cf,2023-05-10,14:15,Medicare,Flu Symptoms,Dr. Bennett
8eccf25e-5ecc-4cca-b679-68dc929429a5,2024-04-21,13:15,Kaiser,Migraine,Dr. Aris
a241c4c0-c071-49d2-976d-0f251a4e0abe,2023-07-18,8:15,BlueCross,Diabetes Consultation,Dr. Evans
01b6f80b-dc3c-4e66-a9af-ed89947f515b,2023-04-10,13:45,Cigna,Flu Symptoms,Dr. Chen
b842ab80-614f-4c0c-95a7-944099ee0c2a,2023-08-24,14:45,Medicare,Flu Symptoms,Dr. Foster
9f02891c-06ba-4f80-91c2-6077ca918bad,2024-08-19,9:15,Cigna,Joint Pain,Dr. Aris
Together, the two datasets model a hospital’s customer information and customer visit tables.
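Because the visit table references patients by ID, it is worth checking that every visit row points at a known patient before loading the data anywhere. A minimal referential-integrity check might look like this; the two inlined CSV snippets are abbreviated copies of the generated data.

```python
import csv
import io

# Abbreviated copies of the two generated datasets.
PATIENTS = """\
Patient_ID,First_Name
44e36bad-2c81-4410-bf75-d9072651ac4b,James
c7d84b81-6d39-4bc6-b695-b0854e2201da,Maria
"""

VISITS = """\
Patient_ID,Visit_Reason
44e36bad-2c81-4410-bf75-d9072651ac4b,Hypertension Follow-up
c7d84b81-6d39-4bc6-b695-b0854e2201da,Post-Op Review
"""

patient_ids = {r["Patient_ID"] for r in csv.DictReader(io.StringIO(PATIENTS))}
# Any visit whose Patient_ID is not in the patient table is an orphan row.
orphans = [r for r in csv.DictReader(io.StringIO(VISITS))
           if r["Patient_ID"] not in patient_ids]
print(len(orphans))
```

LLMs occasionally drift on IDs across prompts, so a cheap check like this catches broken foreign keys before they reach a database.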
Integrating generated data
The collected data can now be readily integrated into either a SQL or NoSQL database. LLMs can be employed to generate the necessary insert queries for a SQL database, allowing this data to be fed into a local, development, or test environment. For the sake of clarity, I limited the example above to ten rows, but generating a larger dataset is straightforward.
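As a sketch of that integration step, the snippet below turns CSV rows into SQL INSERT statements. The `to_inserts` helper and the `patients` table name are hypothetical; in real code you would prefer parameterized queries over string-built SQL.

```python
import csv
import io

DATA = """\
Patient_ID,First_Name,Last_Name
44e36bad-2c81-4410-bf75-d9072651ac4b,James,Smith
c7d84b81-6d39-4bc6-b695-b0854e2201da,Maria,Garcia
"""

def to_inserts(csv_text, table):
    """Build one INSERT statement per CSV row (sketch; not injection-safe)."""
    stmts = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        cols = ", ".join(row)
        # Double single quotes to escape them inside SQL string literals.
        vals = ", ".join("'" + v.replace("'", "''") + "'" for v in row.values())
        stmts.append(f"INSERT INTO {table} ({cols}) VALUES ({vals});")
    return stmts

for stmt in to_inserts(DATA, "patients"):
    print(stmt)
```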
Furthermore, it might be necessary to convert the data from its current CSV format into other structures, such as JSON or XML, to facilitate easier integration. While LLMs are capable of handling all these conversion tasks, we will omit them from this article to keep the focus narrow.
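For completeness, the CSV-to-JSON conversion mentioned above is also a few lines of standard-library Python, so it need not go through an LLM at all; the inlined data below is an abbreviated copy of the generated table.

```python
import csv
import io
import json

DATA = """\
Patient_ID,First_Name,Last_Name
44e36bad-2c81-4410-bf75-d9072651ac4b,James,Smith
c7d84b81-6d39-4bc6-b695-b0854e2201da,Maria,Garcia
"""

# Each CSV row becomes one JSON object keyed by the header names.
records = list(csv.DictReader(io.StringIO(DATA)))
as_json = json.dumps(records, indent=2)
print(as_json)
```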
LLMs are a great tool for simulating data in our local or sandbox environments. With this generated data, we can test our software more thoroughly and have more confidence in the solutions we build.