Wikipedia is making its data more accessible to AI developers by releasing a dataset optimized for machine learning, with the goal of reducing the server strain caused by automated AI bots scraping its content.
The Wikimedia Foundation has partnered with Kaggle, a Google-owned data science platform, to launch a beta dataset of structured Wikipedia content in English and French. The dataset is tailored to machine learning workflows, giving developers straightforward access to machine-readable article data for modeling, fine-tuning, and analysis.
The dataset covers a range of content types, including research summaries, short descriptions, image links, infobox data, and article sections, while excluding references and non-textual elements such as audio files. As of April 15th, the data is presented as well-structured JSON representations of Wikipedia content, a format far more convenient for developers than scraping or parsing raw article text. The initiative is expected to ease the load on Wikipedia's servers, a significant share of which comes from automated AI bot traffic.
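For developers exploring the beta, here is a minimal sketch of what reading those JSON records might look like in Python. The file name and field names ("name", "abstract", "sections") are assumptions for illustration only; the actual schema is documented on the Kaggle dataset page.

```python
import json

def iter_articles(path):
    """Stream structured article records from a JSON Lines file.

    Assumes one JSON object per line, as is typical for ML-ready
    dumps; verify against the actual files shipped on Kaggle.
    """
    with open(path, encoding="utf-8") as f:
        for line in f:
            yield json.loads(line)

# Hypothetical file and field names, shown only to illustrate the shape
# of a structured-content record (title, summary, per-section text).
for article in iter_articles("enwiki_structured.jsonl"):
    print(article.get("name"))       # article title
    print(article.get("abstract"))   # research summary / short description
    for section in article.get("sections", []):
        heading = section.get("name", "")
        body = section.get("text", "")
        print(f"  {heading}: {len(body)} chars")
    break  # inspect just the first record
```

Working from records like these, rather than raw wikitext or scraped HTML, is what makes the dataset suited to fine-tuning and analysis pipelines.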
The Wikimedia Foundation already has content-sharing agreements with Google and the Internet Archive, but the partnership with Kaggle is aimed at making the data more accessible to smaller companies and independent data scientists.
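Because Kaggle hosts the dataset, it can be fetched with the platform's standard tooling rather than scraped. Here is a minimal sketch using Kaggle's kagglehub client; the dataset slug is illustrative, so check the Wikimedia Foundation's listing on kaggle.com for the actual handle.

```python
# Requires: pip install kagglehub, plus Kaggle API credentials if
# prompted (a kaggle.json token or KAGGLE_USERNAME/KAGGLE_KEY env vars).
import kagglehub

# Illustrative slug: substitute the actual owner/name from the Kaggle listing.
path = kagglehub.dataset_download("wikimedia-foundation/wikipedia-structured-contents")
print("Dataset downloaded to:", path)
```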
“As the place the machine learning community comes for tools and tests, Kaggle is extremely excited to be the host for the Wikimedia Foundation’s data,” said Brenda Flynn, partnerships lead at Kaggle. “Kaggle is excited to play a role in keeping this data accessible, available, and useful.”
The dataset’s release was announced on April 17th, 2025, marking a significant step in Wikipedia’s effort to engage with AI developers and manage the impact of AI-driven traffic on its platform.