Dataconomy

Large language models for production data modeling

by Kunal Nain
December 17, 2025
in Artificial Intelligence

We increasingly rely on AI assistants for coding, video production, and photography. While AI’s capabilities in code, image, and video generation are well explored, we may be underestimating the potential of popular large language models (LLMs) to improve the quality of the products we develop in other domains. In particular, using LLMs to model production data is a significant and often overlooked opportunity.

Modeling production data

Let’s first look at why we need to model production data. To recap, production data is customer data such as names, addresses, emails, and billing information. For any business, production data is a high-value asset, and a lot of care is taken in building software that will affect it.

Change controlled environments

Environments that hold production data are often change controlled, which means we are not at liberty to execute scripts and tooling without prior approval and authorization. This is because these environments might include personally identifiable information, which is protected by privacy regulations such as HIPAA and GDPR. A business cannot freely test its software and tooling in change-controlled or production environments, because doing so could harm the customer experience.

Sandbox environments

A good solution to this problem is to create a sandbox environment. Sandbox environments can usually be created quickly with modern tooling and cloud providers such as Azure, AWS, or GCP. They are usually the dev, test, or stage environments in the deployment pipeline. Although the new environment can closely simulate the production environment’s configuration and settings, it will lack any data.

There are also instances where we might need to run analytics and monitoring tooling on production data. Production data is typically much larger than the data in a sandbox environment, so in this situation we need to model enough data to match the size of our production data.

These scenarios show the need to model production data that is both high in quality and large in quantity.

Generating data with LLMs

Now that we have explored the need for modeling data, let’s look at an example where we generate simulated customer data for a hospital. In this exercise, I provided Gemini with the following prompt:

“Create an example csv file of data with 10 rows with basic customer information for a hospital”

In response to this prompt, I received the following output:

Patient_ID,First_Name,Last_Name,DOB,Gender,Phone_Number,Email,Blood_Type,Last_Visit_Date,Insurance_Provider

P1001,James,Smith,1975-05-12,Male,555-0101,[email protected],A+,2023-11-15,BlueCross

P1002,Maria,Garcia,1988-08-23,Female,555-0102,[email protected],O-,2024-01-10,Aetna

P1003,Robert,Johnson,1962-12-01,Male,555-0103,[email protected],B+,2023-09-22,Medicare

P1004,Linda,Williams,1995-03-14,Female,555-0104,[email protected],AB-,2024-02-05,UnitedHealth

P1005,Michael,Brown,1950-07-30,Male,555-0105,[email protected],O+,2023-12-12,Medicare

P1006,Elizabeth,Jones,1982-11-19,Female,555-0106,[email protected],A-,2024-03-01,Cigna

P1007,David,Miller,2001-01-25,Male,555-0107,[email protected],O+,2023-10-30,Kaiser

P1008,Barbara,Davis,1945-09-08,Female,555-0108,[email protected],B-,2024-01-20,Humana

P1009,William,Wilson,1990-06-17,Male,555-0109,[email protected],A+,2023-08-14,BlueCross

P1010,Susan,Taylor,1979-04-05,Female,555-0110,[email protected],AB+,2024-02-28,Aetna

As we can see in the example above this data is actually a very good first draft of what we might need in a sandbox environment.

I will now refine the first draft to remove medical information and add a UUID identifier, using the following prompts:

  1. Remove Last_Visit_Date, and Blood_Type column from the csv

  2. Remove Insurance_Provider from the csv

  3. Convert the Patient_ID to a UUID format

The generated data is an example of a customer information table in a database.

Patient_ID,First_Name,Last_Name,DOB,Gender,Phone_Number,Email

44e36bad-2c81-4410-bf75-d9072651ac4b,James,Smith,1975-05-12,Male,555-0101,[email protected]

c7d84b81-6d39-4bc6-b695-b0854e2201da,Maria,Garcia,1988-08-23,Female,555-0102,[email protected]

f5d50c7b-e4ad-4763-af2a-9913abcc80ad,Robert,Johnson,1962-12-01,Male,555-0103,[email protected]

9c355fd4-37f8-4522-8013-8f8855171b6f,Linda,Williams,1995-03-14,Female,555-0104,[email protected]

6b6868e6-15d4-411a-9e2c-8cc56b5b27cf,Michael,Brown,1950-07-30,Male,555-0105,[email protected]

8eccf25e-5ecc-4cca-b679-68dc929429a5,Elizabeth,Jones,1982-11-19,Female,555-0106,[email protected]

a241c4c0-c071-49d2-976d-0f251a4e0abe,David,Miller,2001-01-25,Male,555-0107,[email protected]

01b6f80b-dc3c-4e66-a9af-ed89947f515b,Barbara,Davis,1945-09-08,Female,555-0108,[email protected]

b842ab80-614f-4c0c-95a7-944099ee0c2a,William,Wilson,1990-06-17,Male,555-0109,[email protected]

9f02891c-06ba-4f80-91c2-6077ca918bad,Susan,Taylor,1979-04-05,Female,555-0110,[email protected]
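
Because LLM output can vary between runs, it may be worth checking the refined file before loading it anywhere. The following is a minimal sketch, not part of the original exercise, that checks the header, confirms the medical columns are gone, and verifies that every Patient_ID parses as a UUID. The function name validate_patients is hypothetical, and the e-mail values in the sample are placeholders, since the addresses are not shown in the tables above.

```python
import csv
import io
import uuid

# Two rows copied from the table above. The e-mail values are placeholders
# (an assumption), because the real addresses are not shown in the article.
SAMPLE_CSV = """Patient_ID,First_Name,Last_Name,DOB,Gender,Phone_Number,Email
44e36bad-2c81-4410-bf75-d9072651ac4b,James,Smith,1975-05-12,Male,555-0101,james.smith@example.com
c7d84b81-6d39-4bc6-b695-b0854e2201da,Maria,Garcia,1988-08-23,Female,555-0102,maria.garcia@example.com
"""

EXPECTED_COLUMNS = ["Patient_ID", "First_Name", "Last_Name", "DOB",
                    "Gender", "Phone_Number", "Email"]
# Columns the refinement prompts asked the LLM to remove.
REMOVED_COLUMNS = {"Blood_Type", "Last_Visit_Date", "Insurance_Provider"}


def validate_patients(csv_text):
    """Return a list of human-readable problems found in the CSV text."""
    problems = []
    reader = csv.DictReader(io.StringIO(csv_text))
    header = reader.fieldnames or []
    if header != EXPECTED_COLUMNS:
        problems.append(f"unexpected header: {header}")
    leftover = REMOVED_COLUMNS.intersection(header)
    if leftover:
        problems.append(f"medical columns still present: {sorted(leftover)}")

    seen = set()
    for line_no, row in enumerate(reader, start=2):
        raw_id = row.get("Patient_ID") or ""
        try:
            parsed = uuid.UUID(raw_id)
        except ValueError:
            problems.append(f"line {line_no}: invalid UUID {raw_id!r}")
            continue
        if parsed in seen:
            problems.append(f"line {line_no}: duplicate Patient_ID {raw_id}")
        seen.add(parsed)
    return problems


if __name__ == "__main__":
    print(validate_patients(SAMPLE_CSV) or "no problems found")
```

A check like this is cheap to rerun each time the prompts are adjusted or a larger file is generated.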

Building on the previously generated data, how can we generate the additional data we need? We will now use the following prompt to generate a second set of data containing the patients’ hospital visit information:

“Using the data in final_hospital_data.csv generate a single csv file that will contain their last medical visit information like patient_id, visit date, visit time, insurance provider, visit reason, doctor visited.”

We received the following output from the LLM:

Patient_ID,Visit_Date,Visit_Time,Insurance_Provider,Visit_Reason,Doctor_Visited

44e36bad-2c81-4410-bf75-d9072651ac4b,2023-10-11,9:15,Humana,Hypertension Follow-up,Dr. Aris

c7d84b81-6d39-4bc6-b695-b0854e2201da,2024-05-15,8:45,BlueCross,Post-Op Review,Dr. Foster

f5d50c7b-e4ad-4763-af2a-9913abcc80ad,2023-06-04,10:30,Cigna,Hypertension Follow-up,Dr. Dhillon

9c355fd4-37f8-4522-8013-8f8855171b6f,2023-07-15,14:30,Aetna,Migraine,Dr. Evans

6b6868e6-15d4-411a-9e2c-8cc56b5b27cf,2023-05-10,14:15,Medicare,Flu Symptoms,Dr. Bennett

8eccf25e-5ecc-4cca-b679-68dc929429a5,2024-04-21,13:15,Kaiser,Migraine,Dr. Aris

a241c4c0-c071-49d2-976d-0f251a4e0abe,2023-07-18,8:15,BlueCross,Diabetes Consultation,Dr. Evans

01b6f80b-dc3c-4e66-a9af-ed89947f515b,2023-04-10,13:45,Cigna,Flu Symptoms,Dr. Chen

b842ab80-614f-4c0c-95a7-944099ee0c2a,2023-08-24,14:45,Medicare,Flu Symptoms,Dr. Foster

9f02891c-06ba-4f80-91c2-6077ca918bad,2024-08-19,9:15,Cigna,Joint Pain,Dr. Aris

Together, the two generated datasets are an example of a customer information table and a customer visit table for a hospital.
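
Since the visit table is meant to reference the patient table, one useful check is that every Patient_ID in the visit data exists in the patient data. The sketch below is one way to do that under stated assumptions: the function name orphaned_visits is hypothetical, only two rows of each table are included, and the e-mail values are placeholders. Note also that the visit data was generated after the insurance and visit-date columns had been removed, so its values need not match the first draft (James Smith is listed with BlueCross in the first draft and with Humana in the visit table).

```python
import csv
import io

# Two rows from each generated table. E-mail values are placeholders
# (an assumption), because the real addresses are not shown in the article.
PATIENTS_CSV = """Patient_ID,First_Name,Last_Name,DOB,Gender,Phone_Number,Email
44e36bad-2c81-4410-bf75-d9072651ac4b,James,Smith,1975-05-12,Male,555-0101,james.smith@example.com
c7d84b81-6d39-4bc6-b695-b0854e2201da,Maria,Garcia,1988-08-23,Female,555-0102,maria.garcia@example.com
"""

VISITS_CSV = """Patient_ID,Visit_Date,Visit_Time,Insurance_Provider,Visit_Reason,Doctor_Visited
44e36bad-2c81-4410-bf75-d9072651ac4b,2023-10-11,9:15,Humana,Hypertension Follow-up,Dr. Aris
c7d84b81-6d39-4bc6-b695-b0854e2201da,2024-05-15,8:45,BlueCross,Post-Op Review,Dr. Foster
"""


def orphaned_visits(patients_text, visits_text):
    """Return visit rows whose Patient_ID is missing from the patient data."""
    patient_ids = {
        row["Patient_ID"]
        for row in csv.DictReader(io.StringIO(patients_text))
    }
    return [
        row
        for row in csv.DictReader(io.StringIO(visits_text))
        if row["Patient_ID"] not in patient_ids
    ]


if __name__ == "__main__":
    orphans = orphaned_visits(PATIENTS_CSV, VISITS_CSV)
    print(f"{len(orphans)} visit row(s) reference an unknown patient")
```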

Integrating generated data

The collected data can now be readily integrated into either a SQL or NoSQL database. LLMs can be employed to generate the necessary insert queries for a SQL database, allowing this data to be fed into a local, development, or test environment. For the sake of clarity, I limited the example above to ten rows, but generating a larger dataset is straightforward.
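
As an alternative to asking the LLM to write the insert queries, the CSV can be loaded with a short script. The following is a minimal sketch using Python's built-in sqlite3 module and an in-memory database. The table name patients, the column names, and the function load_patients are assumptions chosen for illustration, and the e-mail values are placeholders.

```python
import csv
import io
import sqlite3

# Two rows from the generated patient table. E-mail values are placeholders
# (an assumption), because the real addresses are not shown in the article.
PATIENTS_CSV = """Patient_ID,First_Name,Last_Name,DOB,Gender,Phone_Number,Email
44e36bad-2c81-4410-bf75-d9072651ac4b,James,Smith,1975-05-12,Male,555-0101,james.smith@example.com
c7d84b81-6d39-4bc6-b695-b0854e2201da,Maria,Garcia,1988-08-23,Female,555-0102,maria.garcia@example.com
"""

# Hypothetical schema for a sandbox database.
CREATE_TABLE = """
CREATE TABLE IF NOT EXISTS patients (
    patient_id   TEXT PRIMARY KEY,
    first_name   TEXT NOT NULL,
    last_name    TEXT NOT NULL,
    dob          TEXT NOT NULL,
    gender       TEXT,
    phone_number TEXT,
    email        TEXT
)
"""


def load_patients(conn, csv_text):
    """Insert every CSV row into the patients table; return the row count."""
    conn.execute(CREATE_TABLE)
    rows = [
        (r["Patient_ID"], r["First_Name"], r["Last_Name"], r["DOB"],
         r["Gender"], r["Phone_Number"], r["Email"])
        for r in csv.DictReader(io.StringIO(csv_text))
    ]
    # Parameterized inserts avoid quoting problems in names or e-mails.
    with conn:  # commits on success, rolls back on error
        conn.executemany(
            "INSERT INTO patients VALUES (?, ?, ?, ?, ?, ?, ?)", rows
        )
    return len(rows)


if __name__ == "__main__":
    connection = sqlite3.connect(":memory:")
    print(load_patients(connection, PATIENTS_CSV), "rows inserted")
```

The same approach should carry over to other SQL databases, although the connection setup and parameter placeholder style depend on the driver.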

Furthermore, it might be necessary to convert the data from its current CSV format into other structures, such as JSON or XML, to facilitate easier integration. While LLMs are capable of handling all these conversion tasks, we will omit them from this article to keep the focus narrow.
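
For readers who prefer to do the conversion locally rather than through the LLM, a CSV-to-JSON conversion is a few lines with Python's standard library. This is only a sketch: the function name csv_to_json is hypothetical, and the sample contains two rows from the visit table above.

```python
import csv
import io
import json

# Two rows copied from the generated visit table.
VISITS_CSV = """Patient_ID,Visit_Date,Visit_Time,Insurance_Provider,Visit_Reason,Doctor_Visited
44e36bad-2c81-4410-bf75-d9072651ac4b,2023-10-11,9:15,Humana,Hypertension Follow-up,Dr. Aris
c7d84b81-6d39-4bc6-b695-b0854e2201da,2024-05-15,8:45,BlueCross,Post-Op Review,Dr. Foster
"""


def csv_to_json(csv_text):
    """Convert CSV text into a JSON array with one object per row."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    return json.dumps(rows, indent=2)


if __name__ == "__main__":
    print(csv_to_json(VISITS_CSV))
```

Every value stays a string in this sketch; converting dates or times to richer types would be an extra step.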

LLMs are a great tool for simulating data for our local or sandbox environments. By using this generated data, we can test our software more thoroughly and have more confidence in the solutions that we build.


COPYRIGHT © DATACONOMY MEDIA GMBH, ALL RIGHTS RESERVED.
