OpenAI launches bot that will crawl the internet to educate GPT

Website owners will have to explicitly opt out if they do not want their data harvested

Tuesday 08 August 2023 15:48 BST

OpenAI has built a new bot that will crawl over the internet, gathering information to educate artificial intelligence systems.

Operators of websites will be forced to actively opt out, and block the bot, if they want to stop it taking data from their site.

Artificial intelligence systems such as OpenAI's ChatGPT rely on vast amounts of data to train their models and learn how to give the correct outputs. So far, much of that data has been taken freely from the web.

That has prompted numerous complaints from authors and other web users. Many have criticised OpenAI and others for taking personal information and copyrighted content to train their models, with that writing potentially informing or even being replicated in the system's answers.

Artificial intelligence companies have also faced criticism from others who claim that such crawlers are stretching their web infrastructure. Elon Musk, for instance, has said that the load from such bots has forced Twitter to place limits on how many posts users could see on the site.

OpenAI's existing ChatGPT 3.5 and 4 were trained on data gathered from the internet up to late 2021. There is no way for the owners of that data, or of the websites it was gathered from, to remove it from OpenAI's models.

Now OpenAI says that the new system, named 'GPTBot', will be crawling over data and writing on the web to gather more information to train future models.

It told website administrators that they should include instructions to the bot to stop it from crawling a website, if they did not want that information to be gathered. Administrators are able to include such information in a file called "robots.txt", which gives instructions to other crawlers such as those used by Google for its search results.
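
As an illustration of the opt-out described above, here is a minimal Python sketch of a robots.txt rule that blocks a crawler identifying itself as GPTBot, checked with Python's standard urllib.robotparser module. The user-agent token is taken from the bot's name as reported here; the example.com URL, the "Googlebot" comparison and the is_allowed helper are assumptions made purely for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents: block GPTBot from the whole site,
# leaving other crawlers unaffected.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# example.com is a placeholder domain, not a real site's policy.
gptbot_allowed = is_allowed(ROBOTS_TXT, "GPTBot", "https://example.com/article")
other_allowed = is_allowed(ROBOTS_TXT, "Googlebot", "https://example.com/article")
print(gptbot_allowed, other_allowed)
```

A rule like this is only a request: it works to the extent that the crawler reads robots.txt and chooses to honour it.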

OpenAI says pages crawled by the bot "may potentially be used to improve future models". It also says that the crawled data is filtered to "remove sources" that require a paywall, gather personally identifiable information or have text that violates its rules.

It suggested that letting the bot access sites "can help AI models become more accurate and improve their general capabilities and safety".
