Dataconomy

Large language models for production data modeling

by Kunal Nain
December 17, 2025
in Artificial Intelligence

We increasingly rely on AI as assistants for coding, video production, and photography. While AI’s capabilities in areas like code, image, and video generation are well-explored, we may be underestimating the potential of popular large language models (LLMs) to enhance the quality of the products we develop in other domains. Specifically, utilizing LLMs to model production data represents a significant, often overlooked, area of opportunity.

Modeling production data

Let’s first consider why we need to model production data at all. Production data is customer data: names, addresses, emails, billing information, and so on. For any business, production data is a high-value asset, and a great deal of care goes into building the software that touches it.

Change controlled environments

Environments that hold production data are often change controlled, which means we are not at liberty to execute scripts and tooling without prior approval and authorization. These environments may contain personally identifiable information protected by privacy regulations such as HIPAA and GDPR. A business cannot test its software and tooling against change-controlled or production environments, because doing so could degrade the customer experience.


Sandbox environments

A good alternative is to create a sandbox environment. Sandbox environments can usually be spun up quickly with modern tooling on cloud providers such as Azure, AWS, or GCP, and they typically serve as the dev, test, or stage environments in the deployment pipeline. Although a new environment can closely mirror the production environment's configuration and settings, it will lack any data.

There are also cases where we need to run analytics and monitoring tooling against production data. Production data is, by definition, far larger than the data in a sandbox environment, so we need to model enough data to match its size.

These scenarios show a need to model not only high-quality production-like data but also a large quantity of it.
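To get a feel for the volume side of the problem, synthetic rows can also be generated programmatically once the shape of the data is known. Here is a minimal sketch using only the Python standard library; the name pools and column set are illustrative assumptions, not a real schema:

```python
import csv
import io
import random
import uuid

# Hypothetical value pools; real pools would mirror production distributions.
FIRST_NAMES = ["James", "Maria", "Robert", "Linda", "Michael"]
LAST_NAMES = ["Smith", "Garcia", "Johnson", "Williams", "Brown"]

def generate_patients(n, seed=42):
    """Generate n synthetic patient rows as dicts.

    The RNG is seeded so name/phone choices are reproducible;
    Patient_ID uses uuid4 and is random on every run.
    """
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        first = rng.choice(FIRST_NAMES)
        last = rng.choice(LAST_NAMES)
        rows.append({
            "Patient_ID": str(uuid.uuid4()),
            "First_Name": first,
            "Last_Name": last,
            "Phone_Number": f"555-{rng.randint(0, 9999):04d}",
            "Email": f"{first[0].lower()}{last.lower()}@example.com",
        })
    return rows

def to_csv(rows):
    """Serialize the generated rows to a CSV string."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=rows[0].keys())
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()

rows = generate_patients(10_000)
print(len(rows))  # 10000
```

A script like this scales trivially to millions of rows, but the values it draws are uniform and flat; the appeal of LLM-generated data, explored next, is that the values look plausibly realistic out of the box.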

Generating data with LLMs

Now that we have explored the need for modeling data, let's look at an example where we generate simulated customer data for a hospital. In the following exercise, I gave Gemini this prompt:

“Create an example csv file of data with 10 rows with basic customer information for a hospital”

Based on that prompt, I received the following output:

Patient_ID,First_Name,Last_Name,DOB,Gender,Phone_Number,Email,Blood_Type,Last_Visit_Date,Insurance_Provider
P1001,James,Smith,1975-05-12,Male,555-0101,jsmith75@email.com,A+,2023-11-15,BlueCross
P1002,Maria,Garcia,1988-08-23,Female,555-0102,mgarcia88@email.com,O-,2024-01-10,Aetna
P1003,Robert,Johnson,1962-12-01,Male,555-0103,rob.j@email.com,B+,2023-09-22,Medicare
P1004,Linda,Williams,1995-03-14,Female,555-0104,lwilliams95@email.com,AB-,2024-02-05,UnitedHealth
P1005,Michael,Brown,1950-07-30,Male,555-0105,mbrown50@email.com,O+,2023-12-12,Medicare
P1006,Elizabeth,Jones,1982-11-19,Female,555-0106,ejones82@email.com,A-,2024-03-01,Cigna
P1007,David,Miller,2001-01-25,Male,555-0107,dmiller01@email.com,O+,2023-10-30,Kaiser
P1008,Barbara,Davis,1945-09-08,Female,555-0108,bdavis45@email.com,B-,2024-01-20,Humana
P1009,William,Wilson,1990-06-17,Male,555-0109,wwilson90@email.com,A+,2023-08-14,BlueCross
P1010,Susan,Taylor,1979-04-05,Female,555-0110,staylor79@email.com,AB+,2024-02-28,Aetna

As we can see, this output is a very good first draft of what we might need in a sandbox environment.
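Before loading LLM output into a sandbox, it is worth sanity-checking the rows programmatically, since model output can occasionally be malformed. Here is a minimal validation sketch; the sample rows are abbreviated from the generated CSV, and the ID pattern and allowed blood types are assumptions for illustration:

```python
import csv
import io
import re
from datetime import datetime

# Two rows abbreviated from the LLM-generated CSV above.
SAMPLE = """Patient_ID,First_Name,Last_Name,DOB,Gender,Phone_Number,Email,Blood_Type,Last_Visit_Date,Insurance_Provider
P1001,James,Smith,1975-05-12,Male,555-0101,jsmith75@email.com,A+,2023-11-15,BlueCross
P1002,Maria,Garcia,1988-08-23,Female,555-0102,mgarcia88@email.com,O-,2024-01-10,Aetna
"""

VALID_BLOOD_TYPES = {"A+", "A-", "B+", "B-", "AB+", "AB-", "O+", "O-"}

def validate(csv_text):
    """Return a list of (row_number, problem) tuples; empty means clean."""
    problems = []
    for i, row in enumerate(csv.DictReader(io.StringIO(csv_text)), start=1):
        if not re.fullmatch(r"P\d{4}", row["Patient_ID"]):
            problems.append((i, "bad Patient_ID"))
        try:
            datetime.strptime(row["DOB"], "%Y-%m-%d")
        except ValueError:
            problems.append((i, "bad DOB"))
        if row["Blood_Type"] not in VALID_BLOOD_TYPES:
            problems.append((i, "bad Blood_Type"))
    return problems

print(validate(SAMPLE))  # []
```

A check like this catches the common failure modes of generated tabular data, such as an inconsistent date format or an invented category value, before they reach a database.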

I will now refine this first draft to remove medical information and add a UUID identifier, using the following prompts:

  1. Remove the Last_Visit_Date and Blood_Type columns from the csv

  2. Remove Insurance_Provider from the csv

  3. Convert the Patient_ID to a UUID format

The refined data now resembles a customer information table in a database.

Patient_ID,First_Name,Last_Name,DOB,Gender,Phone_Number,Email
44e36bad-2c81-4410-bf75-d9072651ac4b,James,Smith,1975-05-12,Male,555-0101,jsmith75@email.com
c7d84b81-6d39-4bc6-b695-b0854e2201da,Maria,Garcia,1988-08-23,Female,555-0102,mgarcia88@email.com
f5d50c7b-e4ad-4763-af2a-9913abcc80ad,Robert,Johnson,1962-12-01,Male,555-0103,rob.j@email.com
9c355fd4-37f8-4522-8013-8f8855171b6f,Linda,Williams,1995-03-14,Female,555-0104,lwilliams95@email.com
6b6868e6-15d4-411a-9e2c-8cc56b5b27cf,Michael,Brown,1950-07-30,Male,555-0105,mbrown50@email.com
8eccf25e-5ecc-4cca-b679-68dc929429a5,Elizabeth,Jones,1982-11-19,Female,555-0106,ejones82@email.com
a241c4c0-c071-49d2-976d-0f251a4e0abe,David,Miller,2001-01-25,Male,555-0107,dmiller01@email.com
01b6f80b-dc3c-4e66-a9af-ed89947f515b,Barbara,Davis,1945-09-08,Female,555-0108,bdavis45@email.com
b842ab80-614f-4c0c-95a7-944099ee0c2a,William,Wilson,1990-06-17,Male,555-0109,wwilson90@email.com
9f02891c-06ba-4f80-91c2-6077ca918bad,Susan,Taylor,1979-04-05,Female,555-0110,staylor79@email.com
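The same refinement could also be scripted locally instead of re-prompting the model, which is useful when the dataset is large. Here is a minimal sketch using the Python standard library; the sample rows are abbreviated from the first-draft CSV, and the column names are taken from it:

```python
import csv
import io
import uuid

# Two rows abbreviated from the first-draft CSV.
RAW = """Patient_ID,First_Name,Last_Name,DOB,Gender,Phone_Number,Email,Blood_Type,Last_Visit_Date,Insurance_Provider
P1001,James,Smith,1975-05-12,Male,555-0101,jsmith75@email.com,A+,2023-11-15,BlueCross
P1002,Maria,Garcia,1988-08-23,Female,555-0102,mgarcia88@email.com,O-,2024-01-10,Aetna
"""

# Medical columns to strip, mirroring the prompts above.
DROP = {"Blood_Type", "Last_Visit_Date", "Insurance_Provider"}

def refine(csv_text):
    """Drop medical columns and replace Patient_ID values with UUIDs."""
    reader = csv.DictReader(io.StringIO(csv_text))
    keep = [f for f in reader.fieldnames if f not in DROP]
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=keep)
    writer.writeheader()
    for row in reader:
        row["Patient_ID"] = str(uuid.uuid4())  # fresh random UUID per row
        writer.writerow({k: row[k] for k in keep})
    return out.getvalue()

print(refine(RAW).splitlines()[0])
# Patient_ID,First_Name,Last_Name,DOB,Gender,Phone_Number,Email
```

Scripting the transformation also guarantees the model cannot silently alter unrelated values while editing, which is a known risk when asking an LLM to rewrite a whole file.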

Building on the previously generated data, how can we generate more related data? We will now use the following prompt to generate a second dataset containing each patient's hospital visit information:

“Using the data in final_hospital_data.csv generate a single csv file that will contain their last medical visit information like patient_id, visit date, visit time, insurance provider, visit reason, doctor visited.”

And we received the following output from the LLM:

Patient_ID,Visit_Date,Visit_Time,Insurance_Provider,Visit_Reason,Doctor_Visited
44e36bad-2c81-4410-bf75-d9072651ac4b,2023-10-11,9:15,Humana,Hypertension Follow-up,Dr. Aris
c7d84b81-6d39-4bc6-b695-b0854e2201da,2024-05-15,8:45,BlueCross,Post-Op Review,Dr. Foster
f5d50c7b-e4ad-4763-af2a-9913abcc80ad,2023-06-04,10:30,Cigna,Hypertension Follow-up,Dr. Dhillon
9c355fd4-37f8-4522-8013-8f8855171b6f,2023-07-15,14:30,Aetna,Migraine,Dr. Evans
6b6868e6-15d4-411a-9e2c-8cc56b5b27cf,2023-05-10,14:15,Medicare,Flu Symptoms,Dr. Bennett
8eccf25e-5ecc-4cca-b679-68dc929429a5,2024-04-21,13:15,Kaiser,Migraine,Dr. Aris
a241c4c0-c071-49d2-976d-0f251a4e0abe,2023-07-18,8:15,BlueCross,Diabetes Consultation,Dr. Evans
01b6f80b-dc3c-4e66-a9af-ed89947f515b,2023-04-10,13:45,Cigna,Flu Symptoms,Dr. Chen
b842ab80-614f-4c0c-95a7-944099ee0c2a,2023-08-24,14:45,Medicare,Flu Symptoms,Dr. Foster
9f02891c-06ba-4f80-91c2-6077ca918bad,2024-08-19,9:15,Cigna,Joint Pain,Dr. Aris

Together, these datasets serve as example customer information and customer visit tables for a hospital.
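Before loading the two files into a sandbox database, it is worth confirming that every visit row references an existing patient, since an LLM is not guaranteed to preserve foreign keys across prompts. Here is a minimal referential-integrity check; the sample rows are abbreviated from the generated CSVs:

```python
import csv
import io

# Abbreviated rows from the two generated CSVs above.
PATIENTS = """Patient_ID,First_Name,Last_Name
44e36bad-2c81-4410-bf75-d9072651ac4b,James,Smith
c7d84b81-6d39-4bc6-b695-b0854e2201da,Maria,Garcia
"""

VISITS = """Patient_ID,Visit_Date,Visit_Reason
44e36bad-2c81-4410-bf75-d9072651ac4b,2023-10-11,Hypertension Follow-up
c7d84b81-6d39-4bc6-b695-b0854e2201da,2024-05-15,Post-Op Review
"""

def orphan_visits(patients_csv, visits_csv):
    """Return visit Patient_IDs that have no matching patient row."""
    patient_ids = {r["Patient_ID"]
                   for r in csv.DictReader(io.StringIO(patients_csv))}
    return [r["Patient_ID"]
            for r in csv.DictReader(io.StringIO(visits_csv))
            if r["Patient_ID"] not in patient_ids]

print(orphan_visits(PATIENTS, VISITS))  # []
```

An empty result means every visit joins cleanly to a patient, so the two tables can be loaded without foreign-key violations.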

Integrating generated data

The generated data can now be readily integrated into either a SQL or NoSQL database. LLMs can be employed to generate the necessary insert queries for a SQL database, allowing this data to be fed into a local, development, or test environment. For clarity, I limited the example above to ten rows, but generating a larger dataset is straightforward.
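As a sketch of what such insert queries look like, the statements can also be produced locally with a few lines of Python. The `patients` table name is an assumption, and the naive string quoting is for illustration only; in practice, parameterized queries through a database driver are the safer route:

```python
import csv
import io

# One row abbreviated from the refined patient CSV.
CSV_TEXT = """Patient_ID,First_Name,Last_Name
44e36bad-2c81-4410-bf75-d9072651ac4b,James,Smith
"""

def csv_to_inserts(csv_text, table):
    """Emit one INSERT statement per CSV row (values quoted naively)."""
    stmts = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        cols = ", ".join(row.keys())
        # Double single quotes to escape them; fine for trusted test data.
        vals = ", ".join("'" + v.replace("'", "''") + "'" for v in row.values())
        stmts.append(f"INSERT INTO {table} ({cols}) VALUES ({vals});")
    return stmts

print(csv_to_inserts(CSV_TEXT, "patients")[0])
# INSERT INTO patients (Patient_ID, First_Name, Last_Name) VALUES ('44e36bad-2c81-4410-bf75-d9072651ac4b', 'James', 'Smith');
```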

Furthermore, it might be necessary to convert the data from its current CSV format into other structures, such as JSON or XML, to facilitate easier integration. While LLMs are capable of handling all these conversion tasks, we will omit them from this article to keep the focus narrow.
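As an illustration of the CSV-to-JSON case, the conversion needs only the standard library; the sample rows are abbreviated:

```python
import csv
import io
import json

CSV_TEXT = """Patient_ID,First_Name
P1001,James
P1002,Maria
"""

def csv_to_json(csv_text):
    """Convert CSV rows to a JSON array of objects."""
    return json.dumps(list(csv.DictReader(io.StringIO(csv_text))), indent=2)

print(csv_to_json(CSV_TEXT))
```

The resulting JSON array can be loaded directly into a document store or used as a fixture file in tests.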

LLMs are a great tool for simulating data in our local or sandbox environments. With this generated data, we can test our software more thoroughly and have more confidence in the solutions we build.

Tags: trends
