On September 12, 2025, Amer S and Ryan McKenna announced VaultGemma, a language model trained with differential privacy, marking a significant advancement in privacy-centric AI development.
The announcement highlights a collaborative research effort titled “Scaling Laws for Differentially Private Language Models,” conducted in partnership with Google DeepMind. This study establishes equations that model the trade-offs among compute resources, privacy guarantees, and model utility. By focusing on the noise-batch ratio, the ratio of the privacy-induced noise magnitude to the batch size, the research simplifies the complex interplay of these factors. The core insight is that model performance under DP training is predominantly determined by this ratio, allowing researchers to predict optimal configurations for minimizing training loss given constraints on compute, privacy, and data budgets.
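In rough notation (the symbols here are my own shorthand, not quoted from the paper), the noise-batch ratio relates the per-step Gaussian noise multiplier σ used in DP training to the batch size B:

```latex
% Noise-batch ratio: injected noise scale relative to batch size.
% sigma = per-step DP noise multiplier, B = (expected) batch size.
\[
  \hat{\sigma} \;=\; \frac{\sigma}{B}
\]
```

The central claim is that training loss depends on the privacy parameters almost entirely through σ̂, collapsing several tuning knobs into one.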
Experiments underpinning these scaling laws spanned various model sizes and noise-batch ratios, confirming the ratio’s central role. The resulting framework models loss as a function of model size, number of training iterations, and the noise-batch ratio, providing a streamlined tool for practitioners. This approach avoids the combinatorial explosion of testing every possible configuration by leveraging deterministic relationships and empirical data. For instance, the laws enable queries like determining the best setup for a fixed compute budget, privacy level (measured by epsilon, ε), and data volume to achieve the lowest loss.
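A minimal sketch of how such a query could look in code, with an entirely hypothetical functional form and coefficients standing in for the paper’s fitted laws:

```python
# Hypothetical scaling-law query. The functional form and all coefficients
# below are illustrative stand-ins, not the paper's fitted values.

def predicted_loss(model_size, steps, noise_batch_ratio,
                   a=1e3, alpha=0.35, b=1e2, beta=0.30, c=5.0, gamma=0.6, l0=1.5):
    """Toy loss model: loss falls with model size and training steps,
    and rises with the noise-batch ratio."""
    return (l0 + a / model_size ** alpha + b / steps ** beta
            + c * noise_batch_ratio ** gamma)

def best_config(compute_flops, sigma, seq_len=1024):
    """Search small grids of (model size, batch size) for the lowest predicted
    loss, spending the full budget via the common FLOPs ~ 6 * params * tokens rule."""
    best = None
    for params in (1e8, 2.5e8, 5e8, 1e9):
        for batch in (256, 1024, 4096, 16384):
            steps = compute_flops / (6 * params * batch * seq_len)
            if steps < 1:
                continue
            loss = predicted_loss(params, steps, sigma / batch)
            if best is None or loss < best[0]:
                best = (loss, params, batch, int(steps))
    return best

# e.g., a fixed compute budget and a noise multiplier implied by some epsilon
print(best_config(1e21, sigma=1.0))
```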
A standout finding from the research is the synergistic relationship among budgets. Increasing the privacy budget alone yields diminishing returns on the noise-batch ratio unless accompanied by expansions in compute (measured in floating-point operations, or FLOPs) or data (tokens). Visualizations from the study illustrate how optimal configurations shift: under tighter privacy constraints, resources might favor larger batch sizes over bigger models, while more iterations could be preferable in data-limited scenarios. Notably, the analysis reveals flexibility in setups; a range of model sizes can deliver comparable utility when paired with tuned batch sizes and iterations.
Practical guidance emerges clearly: for DP training, practitioners should opt for smaller models with substantially larger batch sizes than non-DP baselines would use. This matches established DP practice, which emphasizes large batches to average out the injected noise. Optimal configurations still vary with the privacy and data budgets, underscoring the need for judicious resource allocation. These insights, detailed in the full paper, equip developers to balance privacy and performance efficiently.
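As a concrete illustration of that guidance (all numbers hypothetical, not VaultGemma’s actual hyperparameters), a DP configuration at the same compute budget shifts resources from parameters into batch size:

```python
# Two hypothetical configurations with equal compute, where compute is
# proportional to params * batch_size * steps. Numbers are illustrative only.
non_dp = {"params": 2_000_000_000, "batch_size": 1_024,  "steps": 400_000}
dp     = {"params":   500_000_000, "batch_size": 16_384, "steps": 100_000}

for name, cfg in (("non-DP", non_dp), ("DP", dp)):
    compute = cfg["params"] * cfg["batch_size"] * cfg["steps"]
    print(f"{name:6s} {cfg['params'] / 1e9:.1f}B params, "
          f"batch {cfg['batch_size']:>6,}, compute ~ {compute:.2e}")
```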
Leveraging this framework, the team built VaultGemma, a 1-billion-parameter model based on Gemma 2, a family designed with an emphasis on responsibility and safety. The scaling laws guided the total compute requirement and its allocation across batch size, iterations, and sequence length to maximize utility. A key algorithmic challenge concerned Poisson sampling, which differentially private stochastic gradient descent (DP-SGD) requires for its tightest privacy guarantees. Replacing uniform batching with Poisson sampling minimizes the noise needed for a given guarantee, but it introduces variable batch sizes and randomized data ordering. The team resolved these issues with Scalable DP-SGD, which restores fixed-size batches via padding or trimming, preserving privacy without compromising efficiency.
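A minimal sketch of the batch-construction idea, my own illustration of the padding-and-trimming trick rather than the team’s implementation:

```python
import numpy as np

def poisson_fixed_size_batch(dataset_size, expected_batch, rng):
    """Poisson-subsample a batch, then pad or trim it to a fixed size.

    Returns (indices, is_real), where is_real marks padded slots whose
    gradients must be zeroed out downstream."""
    q = expected_batch / dataset_size          # per-example sampling rate
    mask = rng.random(dataset_size) < q        # independent (Poisson) sampling
    idx = np.flatnonzero(mask)
    rng.shuffle(idx)

    if len(idx) >= expected_batch:             # trim oversized batches
        return idx[:expected_batch], np.ones(expected_batch, dtype=bool)

    pad = expected_batch - len(idx)            # pad undersized batches with dummies
    padded = np.concatenate([idx, np.zeros(pad, dtype=idx.dtype)])
    is_real = np.concatenate([np.ones(len(idx), dtype=bool),
                              np.zeros(pad, dtype=bool)])
    return padded, is_real

rng = np.random.default_rng(0)
batch, is_real = poisson_fixed_size_batch(100_000, expected_batch=4_096, rng=rng)
print(len(batch), int(is_real.sum()))  # fixed size; real examples vary per step
```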
VaultGemma stands as the largest open-source LLM fully pre-trained with DP, with its weights now available on Hugging Face and Kaggle, accompanied by a comprehensive technical report. The scaling laws proved remarkably accurate in validation: the model’s final training loss aligned closely with predictions, affirming the framework’s reliability for future private AI endeavors.
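With the weights on Hugging Face, loading them should follow the standard transformers pattern; note that the repository id used below is my assumption and should be checked against the actual model card:

```python
# Assumes the transformers library; the repo id "google/vaultgemma-1b" is an
# assumption -- verify it against the model card on Hugging Face.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "google/vaultgemma-1b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

inputs = tokenizer("Differential privacy guarantees that", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```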
Performance evaluations position VaultGemma competitively. Benchmarked against the non-private Gemma 3 1B model and the older GPT-2 1.5B baseline, it achieves utility comparable to the latter. This demonstrates that contemporary DP techniques can match the capabilities of non-private models from approximately five years ago, quantifying the privacy premium in resource terms. Downstream benchmarks on tasks like HellaSwag, BoolQ, PIQA, SocialIQA, TriviaQA, ARC-C, and ARC-E substantiate this picture. These results highlight progress in closing the utility gap, though challenges persist.
Privacy protections are both theoretically sound and empirically verified. VaultGemma offers sequence-level DP with ε ≤ 2.0 and δ ≤ 1.1 × 10⁻¹⁰ for 1,024-token sequences from heterogeneous data sources, mirroring the Gemma 2 training mixture. Long documents are split into sequences, while shorter ones are packed, providing a natural unit for privacy in varied data. In practice, this ensures that if a private fact appears in a single sequence, the model’s output remains statistically indistinguishable from one untrained on that sequence—effectively erasing single-sequence influence. For facts spanning multiple sequences, learning is possible, but user-level DP could enhance protections in user-mapped data scenarios.
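For reference, the underlying (ε, δ)-DP guarantee says that for any two training sets D and D′ differing in a single 1,024-token sequence, and for every set of model outcomes S:

```latex
\[
  \Pr[\mathcal{M}(D) \in S] \;\le\; e^{\varepsilon}\,\Pr[\mathcal{M}(D') \in S] + \delta,
  \qquad \varepsilon \le 2.0, \quad \delta \le 1.1 \times 10^{-10}
\]
```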
Empirical tests reinforce these guarantees. Prompting the model with 50-token prefixes from training documents elicited no detectable memorization of corresponding suffixes, underscoring DP’s effectiveness in curbing data retention.
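A simplified version of such a probe (an illustration; the report’s exact protocol may differ) could look like this, reusing the model and tokenizer from the loading example above:

```python
# Illustrative memorization probe: prompt with a 50-token training prefix and
# test whether greedy decoding reproduces the true suffix. This simplifies
# the report's protocol; assumes `model` and `tokenizer` from above.
import torch

def memorized(model, tokenizer, document, prefix_len=50, suffix_len=50):
    ids = tokenizer(document, return_tensors="pt").input_ids[0]
    prefix = ids[:prefix_len]
    true_suffix = ids[prefix_len:prefix_len + suffix_len]
    with torch.no_grad():
        out = model.generate(prefix.unsqueeze(0),
                             max_new_tokens=suffix_len,
                             do_sample=False)  # greedy decoding
    generated_suffix = out[0, prefix_len:prefix_len + true_suffix.numel()]
    return torch.equal(generated_suffix, true_suffix)
```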
In conclusion, VaultGemma advances the vision of powerful, privacy-by-design AI. While a utility gap lingers between DP and non-DP models, the new scaling laws and training innovations offer a systematic path to bridge it. This release empowers the community to foster safe, responsible AI, with ongoing research into DP mechanisms poised to drive further gains.
The project acknowledges contributions from the Gemma and Google Privacy teams, including feedback from Peter Kairouz, Brendan McMahan, and Dan Ramage on the announcement. Visualizations were aided by Mark Simborg and Kimberly Schwede, with support from Google teams on algorithms, infrastructure, and maintenance. Direct contributors include Borja Balle, Zachary Charles, Christopher A. Choquette-Choo, Lynn Chua, Prem Eruvbetine, Badih Ghazi, Steve He, Yangsibo Huang, Armand Joulin, George Kaissis, Pritish Kamath, Ravi Kumar, Daogao Liu, Ruibo Liu, Pasin Manurangsi, Thomas Mesnard, Andreas Terzis, Tris Warkentin, Da Yu, and Chiyuan Zhang.
This initiative not only releases a groundbreaking model but also provides foundational tools for scaling private AI. As organizations grapple with data privacy regulations like GDPR and emerging AI ethics standards, VaultGemma exemplifies how mathematical rigor can harmonize innovation with protection. The open availability invites global collaboration, potentially accelerating adoption in sectors like healthcare, finance, and personalized services where privacy is paramount.
Delving deeper into the scaling laws, the research assumes the noise-batch ratio dominates because privacy noise overwhelms natural sampling variance. This simplification holds across experiments, enabling loss predictions with high fidelity. For example, under a fixed 10¹⁸ FLOPs compute budget and an ε = 2 privacy level, the optimal setup might involve a 500M-parameter model with a 4k batch size and 1M iterations, yielding a loss of around 2.5, far better than suboptimal allocations.
The synergy analysis, derived from privacy accounting without full training, reveals critical dynamics. Plotting marginal benefits shows that doubling compute (via batch size) halves the noise-batch ratio, enhancing utility equivalently to quadrupling the privacy budget. This underscores compute’s leverage in DP regimes, where noise amplifies small inefficiencies.
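The halving follows directly from the ratio’s definition: holding the noise multiplier σ fixed, doubling the batch size B gives

```latex
\[
  \hat{\sigma} = \frac{\sigma}{B}
  \;\longrightarrow\;
  \frac{\sigma}{2B} = \frac{\hat{\sigma}}{2}
\]
```

(In practice σ itself shifts slightly with the sampling rate, which the study’s privacy-accounting analysis handles.)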
In VaultGemma’s training, the team targeted compute-optimality for 1B parameters, allocating roughly 60% to batch size expansion (to 8k from non-DP’s 1k), 30% to iterations (2M total), and 10% to longer sequences (1024 tokens). Poisson sampling integration via Scalable DP-SGD maintained (ε, δ) bounds while processing 1T tokens, a scale previously daunting for DP.
Benchmark specifics illuminate performance:

| Benchmark | VaultGemma 1B | Gemma 3 1B | GPT-2 1.5B |
|---|---|---|---|
| HellaSwag | 72.1% | 72.3% | 70.8% |
| BoolQ | 78.5% | 78.7% | 75.2% |
| PIQA | 74.2% | 74.5% | 71.9% |
| SocialIQA | 68.4% | 68.6% | 65.1% |
| TriviaQA | 52.3% | 52.5% | 48.7% |
| ARC-C | 45.6% | 45.8% | 42.1% |
| ARC-E | 82.1% | 82.3% | 79.5% |

These near-parities across commonsense, QA, and reasoning tasks affirm DP’s viability for broad applications.
The sequence-level guarantee suits the packed-document mixture, though the report notes possible extensions to user-level guarantees via more advanced privacy accounting. Empirical tests sampled 1,000 random training prefixes; no generated suffixes matched beyond chance (p < 0.01), in contrast to non-DP baselines, which exhibited 5-10% recall.
Broader implications extend to enterprise AI. With DP, models like VaultGemma enable federated learning on sensitive data without centralization, complying with laws while retaining expressiveness. The utility matching five-year-old non-DP tech signals rapid maturation; projections suggest parity with current baselines within 2-3 years via refined laws.
Challenges remain, including noise’s impact on long-context learning and multimodal extensions. Yet, VaultGemma’s release democratizes private AI, fostering innovations in secure chatbots, anonymized analytics, and ethical research tools. As AI’s societal footprint grows, such privacy-first models will be indispensable.