On September 12, 2025, Amer S and Ryan McKenna of Google Research announced VaultGemma as the most capable language model trained from scratch with differential privacy. This 1-billion-parameter open model addresses privacy challenges in AI training by incorporating calibrated noise, and an accompanying research paper establishes scaling laws for the compute-privacy-utility trade-offs involved; the model weights are released on Hugging Face and Kaggle.
Differential privacy adds calibrated noise during training to prevent memorization of individual data points, ensuring that the model’s outputs remain statistically similar whether or not any single training example is included. This approach provides a mathematically rigorous framework for protecting user data in large language models. However, implementing differential privacy in language model training introduces specific challenges. The noise disrupts the traditional scaling laws, which describe how model performance improves with increases in model size, data volume, and computational resources. In particular, the noise reduces training stability, making it harder for the model to learn consistently without encountering issues such as sudden spikes in loss or complete divergence during optimization. To counteract this instability, practitioners must use significantly larger batch sizes, which in turn demand more computational power and memory, elevating the overall costs of training.
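A minimal sketch of the mechanism this paragraph describes, the per-example gradient clipping and Gaussian noise addition used in DP-SGD; the clip norm, noise multiplier, and NumPy-based shapes below are illustrative assumptions rather than VaultGemma’s actual training settings:

```python
import numpy as np

def dp_sgd_step(per_example_grads, clip_norm=1.0, noise_multiplier=1.0, rng=None):
    """One illustrative DP-SGD update: clip each example's gradient to a fixed norm,
    average the clipped gradients, and add Gaussian noise calibrated to the clip norm."""
    rng = rng or np.random.default_rng(0)
    clipped = [g * min(1.0, clip_norm / (np.linalg.norm(g) + 1e-12)) for g in per_example_grads]
    mean_grad = np.mean(clipped, axis=0)
    # The noise standard deviation scales as clip_norm * noise_multiplier / batch_size,
    # so larger batches shrink the effective noise seen by the optimizer.
    sigma = clip_norm * noise_multiplier / len(clipped)
    return mean_grad + rng.normal(0.0, sigma, size=mean_grad.shape)
```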
The research paper titled “Scaling Laws for Differentially Private Language Models,” developed in partnership with Google DeepMind, establishes equations that precisely model these compute-privacy-utility trade-offs for differentially private large language models. These equations capture the intricate relationships between the amount of computation, the privacy level achieved, and the resulting model utility, offering a predictive tool for optimizing training configurations. The paper’s development involved extensive analysis to quantify how differential privacy alters the dynamics of model training compared to non-private methods. By deriving these laws, the authors provide a foundation for designing efficient private models, enabling researchers to forecast performance without exhaustive experimentation.
Guided by the insights from these scaling laws, the team constructed VaultGemma as a 1-billion-parameter model based on the Gemma 2 architecture, trained entirely from scratch under differential privacy constraints. The model’s weights are now publicly available on platforms such as Hugging Face and Kaggle, accompanied by a detailed technical report that explains the training process, hyperparameters, and evaluation results. This release marks the largest such open model to date, allowing developers and researchers worldwide to access and build upon a production-quality differentially private language model. The Gemma series itself emphasizes responsibility and safety in AI development, which aligned well with the goals of incorporating privacy protections from the outset.
The experimental methodology in the research focused on quantifying the impacts of varying model sizes, batch sizes, and training iterations within the differential privacy framework. To manage the vast number of possible combinations, the authors made simplifying assumptions, centering their analysis on the noise-batch ratio. This ratio measures the relative scale of the privacy-induced noise against the batch size used in stochastic gradient descent. The assumption holds because the deliberately added privacy noise dominates any inherent randomness from data sampling, so the model’s learning effectiveness is primarily determined by this single metric. Through this lens, the methodology enabled systematic evaluation of how adjustments in these parameters affect overall performance.
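As a rough numerical illustration, assuming the noise-batch ratio is simply the noise standard deviation divided by the batch size (the paper’s exact definition may differ):

```python
def noise_batch_ratio(noise_std: float, batch_size: int) -> float:
    """Illustrative noise-batch ratio: scale of the privacy noise relative to the batch size."""
    return noise_std / batch_size

# Doubling the batch size at a fixed noise level halves the ratio:
print(noise_batch_ratio(1.0, 1024))  # 0.0009765625
print(noise_batch_ratio(1.0, 2048))  # 0.00048828125
```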
Comprehensive experiments evaluated model performance across diverse model sizes and noise-batch ratios, generating empirical data that, when combined with deterministic relationships between variables like compute budget and data budget, supports targeted queries. For example, the scaling laws can determine the optimal training setup to minimize loss given fixed compute, privacy, and data budgets. The predicted loss is modeled using the model size, number of iterations, and the noise-batch ratio, which simplifies the navigation of complex interactions among budgets. This structure provides a clear pathway for practitioners to balance resources effectively during private model training.
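A hedged sketch of how such a query might look in practice; predicted_loss is a hypothetical stand-in for the paper’s fitted equations, and the candidate grid, the 6-FLOPs-per-parameter-per-token approximation, and the sequence length are assumptions for illustration only:

```python
from itertools import product

def predicted_loss(model_size, iterations, noise_batch_ratio):
    """Hypothetical stand-in for the fitted scaling-law equations; only the shape
    of the query matters here, not this particular functional form."""
    return 2.0 + 1e3 / model_size ** 0.3 + 50.0 / iterations ** 0.3 + 5.0 * noise_batch_ratio

def optimal_setup(compute_flops, noise_multiplier, seq_len=1024, flops_per_param_token=6.0):
    """Grid-search a few (model_size, batch_size) pairs; the compute budget then
    fixes the token count and hence the number of iterations."""
    best = None
    for model_size, batch_size in product([1e8, 5e8, 1e9], [256, 1024, 4096]):
        tokens = compute_flops / (flops_per_param_token * model_size)
        iterations = tokens / (batch_size * seq_len)
        if iterations < 1:
            continue
        loss = predicted_loss(model_size, iterations, noise_multiplier / batch_size)
        if best is None or loss < best[0]:
            best = (loss, model_size, batch_size, iterations)
    return best

print(optimal_setup(1e21, noise_multiplier=1.0))
```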
From a privacy accounting perspective, the dynamics between the compute budget, privacy budget, and data budget reveal key interactions for a fixed model size and iteration count. Increasing the privacy budget, denoted by the parameter ε, reduces the noise level but yields diminishing returns if not paired with expansions in compute or data budgets. Specifically, without corresponding increases in floating-point operations (FLOPs) or tokens processed, the noise-batch ratio improves only marginally, limiting gains in utility. This synergy underscores the need for coordinated scaling: enhancing privacy alone does not sufficiently lower the effective noise unless supported by more computational resources or additional training data.
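A toy numerical illustration of this synergy, using made-up noise multipliers rather than real privacy accounting: relaxing ε alone nudges the noise-batch ratio only slightly, while pairing it with a larger batch (more compute) moves it substantially.

```python
# Hypothetical noise multipliers for a tighter and a looser privacy budget (not real accounting).
sigma_tight, sigma_relaxed = 1.2, 1.0
batch_small, batch_large = 1024, 4096

print(sigma_tight / batch_small)    # ~0.00117  baseline
print(sigma_relaxed / batch_small)  # ~0.00098  relaxing privacy alone: modest improvement
print(sigma_relaxed / batch_large)  # ~0.00024  relaxing privacy plus more compute: large improvement
```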
Visualizations in the research illustrate how optimal configurations shift with changing budgets. As privacy and compute constraints vary, the preferred allocation moves between larger model sizes, expanded batch sizes, or additional iterations. For instance, under tighter privacy budgets, prioritizing larger batches often proves more effective than scaling the model size, as it directly mitigates the noise impact. These plots detail the minimum achievable loss for various budget combinations, alongside breakdowns of hyperparameters such as iterations, batch size, and model dimensions. Such granularity helps identify not only the best setup but also ranges of viable alternatives that deliver comparable utility, offering flexibility in resource-constrained environments.
A central insight from the scaling laws is the recommendation to train smaller models with substantially larger batch sizes compared to non-private scenarios. This approach leverages the importance of oversized batches in stabilizing differentially private stochastic gradient descent (DP-SGD), a common optimization method in this domain. The insight applies broadly across different settings, though exact optima adjust based on specific privacy and data budgets. Understanding these trade-offs ensures efficient use of compute and privacy allocations, preventing wasteful configurations. The analysis also highlights flexibility in choices, where multiple model sizes can achieve similar losses when matched with appropriate iterations and batch adjustments.
To construct VaultGemma, the team applied the scaling laws to calculate the total FLOPs required for a compute-optimal 1-billion-parameter model derived from Gemma 2. They then distributed these FLOPs across batch size, iterations, and sequence length to maximize utility under privacy constraints. This allocation process involved iterative simulations using the predictive equations to test various distributions, ensuring the final setup aligned with the lowest projected loss. The resulting configuration balanced the need for noise mitigation through large batches with sufficient iterations to converge effectively, all while adhering to the target parameter count.
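One deterministic relationship that makes this allocation tractable, shown here under the common approximation of roughly 6 FLOPs per parameter per token (an assumption, not the technical report’s exact accounting): once model size, batch size, and sequence length are chosen, the FLOPs budget fixes the iteration count.

```python
def iterations_for_budget(total_flops, n_params, batch_size, seq_len, flops_per_param_token=6.0):
    """Given a FLOPs budget and chosen model size, batch size, and sequence length,
    the number of training iterations is fully determined."""
    flops_per_step = flops_per_param_token * n_params * batch_size * seq_len
    return total_flops / flops_per_step

# Hypothetical numbers: a 1e9-parameter model, batch 4096, 1024-token sequences, 1e22 FLOPs.
print(iterations_for_budget(1e22, 1e9, 4096, 1024))  # ~397,000 steps under these assumptions
```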
A notable challenge in bridging the scaling law research to actual training was handling Poisson sampling, a key element of DP-SGD that ensures robust privacy guarantees by randomizing data selection. Initially, the team loaded data in uniform batches, but this method offered suboptimal privacy protections due to higher effective noise. Switching to Poisson sampling improved guarantees but introduced variability: batches varied in size, and data processing required a randomized order. To resolve these issues, they adopted techniques from recent work on Scalable DP-SGD, which processes data in fixed-size batches by padding shorter ones or trimming longer ones. This adaptation preserves the privacy benefits of Poisson sampling without disrupting the training pipeline’s efficiency.
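A hedged sketch of the fixed-size-batch idea: Poisson-sample each example independently, then trim an oversized batch or pad a short one to a constant size. The sampling rate, padding convention, and index-based interface are illustrative assumptions, not the actual Scalable DP-SGD implementation.

```python
import numpy as np

def poisson_fixed_size_batch(dataset_indices, sample_rate, target_batch_size, rng, pad_index=-1):
    """Poisson-sample examples independently, then trim to (or pad up to) a fixed batch size.
    Padded slots (pad_index) would be masked out of the loss and gradient computation."""
    mask = rng.random(len(dataset_indices)) < sample_rate
    sampled = [idx for idx, keep in zip(dataset_indices, mask) if keep]
    rng.shuffle(sampled)
    if len(sampled) >= target_batch_size:
        return sampled[:target_batch_size]                                  # trim oversized batches
    return sampled + [pad_index] * (target_batch_size - len(sampled))       # pad short ones

rng = np.random.default_rng(0)
batch = poisson_fixed_size_batch(list(range(100_000)), sample_rate=0.01,
                                 target_batch_size=1024, rng=rng)
```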
The training of VaultGemma confirmed the accuracy of the scaling laws, with the final training loss aligning closely to predictions from the equations. This validation demonstrates the reliability of the framework for forecasting outcomes in private model development, providing a dependable guide for future efforts. The process involved monitoring loss curves throughout training to ensure stability, adjusting hyperparameters as needed within the predefined budget, and verifying that the noise-batch ratio remained optimal. Such close correspondence between theory and practice reinforces the laws’ utility in practical applications.
In performance evaluations, VaultGemma 1B trained with differential privacy is compared against the non-private Gemma 3 1B and the older GPT-2 1.5B model. These comparisons quantify the resource demands of privacy-preserving training, showing that current methods produce private models with utility on par with non-private architectures from approximately five years prior. The evaluations included perplexity metrics on held-out data, where VaultGemma’s scores reflect effective learning despite the added noise, highlighting progress in closing the utility gap through optimized scaling.
Downstream assessments on standard benchmarks further validate VaultGemma’s capabilities. HellaSwag results demonstrate commonsense inference, BoolQ indicates reliable answering of boolean questions, and PIQA shows competence in physical-commonsense predictions. SocialIQA evaluations reveal a solid grasp of social norms, TriviaQA confirms retention of factual knowledge, and the ARC-C and ARC-E results cover challenging and easier science reasoning questions, respectively. Including GPT-2 1.5B in these comparisons underscores that VaultGemma’s benchmark scores align with older non-private models of similar scale, illustrating the current state of private training.
VaultGemma provides a formal sequence-level differential privacy guarantee of ε ≤ 2.0 and δ ≤ 1.1 × 10⁻¹⁰ for sequences of 1024 tokens drawn from heterogeneous data sources. The training mixture mirrors that of Gemma 2, comprising documents of varying lengths preprocessed by splitting long ones into multiple sequences and packing short ones together. This sequence-level unit suits the data format, though user-level privacy would be preferable when data ties directly to individuals. In practice, this guarantee ensures that the model’s responses to queries remain statistically indistinguishable whether a particular sequence is included in training or not, effectively preventing the model from learning any isolated fact within a single sequence. However, facts appearing across multiple sequences can still be learned, allowing general knowledge acquisition without compromising individual privacy.
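For reference, the standard (ε, δ) guarantee this instantiates at the sequence level can be written as follows, with the neighboring datasets D and D′ differing in a single 1024-token sequence:

```latex
% Sequence-level (epsilon, delta)-differential privacy: for all measurable output sets S,
\Pr[\mathcal{M}(D) \in S] \;\le\; e^{\varepsilon}\,\Pr[\mathcal{M}(D') \in S] + \delta,
\qquad \varepsilon \le 2.0, \quad \delta \le 1.1 \times 10^{-10}.
```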
Complementing the theoretical guarantees, empirical tests assessed memorization risks by prompting VaultGemma with 50-token prefixes from training documents and checking for reproduction of the subsequent 50 tokens. The model exhibited no detectable memorization, generating unrelated continuations that did not match the original suffixes. This outcome verifies the practical effectiveness of differential privacy in suppressing verbatim recall, even for potentially sensitive training excerpts. The test protocol involved selecting diverse prefixes from various data sources to cover a broad sample, ensuring comprehensive coverage of potential vulnerabilities.
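A minimal sketch of such a prefix-continuation probe, written against the Hugging Face transformers API; the repository id, greedy decoding, and exact-match criterion are assumptions for illustration rather than the team’s actual evaluation harness.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "google/vaultgemma-1b"  # assumed repository id; check the actual release
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

def reproduces_suffix(token_ids, prefix_len=50, suffix_len=50):
    """Prompt with a 50-token training prefix and check whether greedy decoding
    reproduces the next 50 training tokens verbatim."""
    prefix = torch.tensor([token_ids[:prefix_len]])
    true_suffix = token_ids[prefix_len:prefix_len + suffix_len]
    out = model.generate(prefix, max_new_tokens=suffix_len, do_sample=False)
    generated = out[0, prefix_len:].tolist()
    return generated == true_suffix
```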
Acknowledgements for the project extend to the Gemma and Google Privacy teams, with specific thanks to Peter Kairouz, Brendan McMahan, and Dan Ramage for feedback on the announcement. Mark Simborg and Kimberly Schwede assisted with visualizations, while broader Google teams supported algorithm design, infrastructure, and production maintenance. Direct contributors, listed alphabetically, include Borja Balle, Zachary Charles, Christopher A. Choquette-Choo, Lynn Chua, Prem Eruvbetine, Badih Ghazi, Steve He, Yangsibo Huang, Armand Joulin, George Kaissis, Pritish Kamath, Ravi Kumar, Daogao Liu, Ruibo Liu, Pasin Manurangsi, Thomas Mesnard, Andreas Terzis, Tris Warkentin, Da Yu, and Chiyuan Zhang.