Google and Bing now let you stop your content being used to train AI models

Posted by Edith MacLeod on 4 Oct, 2023
New controls allow you to block your web content from Google’s Bard, Vertex AI, and Bing Chat.

Image: Block content indexing. Credit: Mohamed Nohassi on Unsplash

One concern around the rapid rise of generative AI is the scraping of content to train AI models, without any compensation or consent.

OpenAI published details of how to block its GPTBot crawler on 7 August. A study by Originality.ai, updated in September, looked at the top 1,000 websites worldwide to see which were blocking GPTBot.

Its findings show that 25.9% of the top 1,000 websites are blocking GPTBot, including big names such as Amazon and Pinterest, and many large media and news publishers such as The New York Times, Reuters and CNN.

Chart: Top sites blocking access. Source: Originality.ai

Google and Bing have now both also given details of how to stop some of their AI products from using your content.

Google-Extended

In a blog post last week, Google announced a new control called Google-Extended, which lets you choose whether content on your website is used to help improve Bard and the Vertex AI generative APIs.

“Today we’re announcing Google-Extended, a new control that web publishers can use to manage whether their sites help improve Bard and Vertex AI generative APIs, including future generations of models that power those products. By using Google-Extended to control access to content on a site, a website administrator can choose whether to help these AI models become more accurate and capable over time.”

On Google’s page detailing Google crawlers and fetchers, Google-Extended is defined as:

"A standalone product token that web publishers can use to manage whether their sites help improve Bard and Vertex AI generative APIs, including future generations of models that power those products."

To opt out, add a rule for the Google-Extended user agent to your site's robots.txt file.
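As a minimal sketch of what such a rule could look like, the Python snippet below embeds a sample robots.txt that disallows the Google-Extended token for the whole site, then uses the standard library's urllib.robotparser to check its effect. The sample rules, the is_allowed helper and the example.com URL are illustrative assumptions rather than text from Google's post, so check Google's crawler documentation before deploying anything.

```python
from urllib.robotparser import RobotFileParser

# Sample robots.txt (an assumption for illustration): block the
# Google-Extended product token site-wide, allow everything else.
ROBOTS_TXT = """\
User-agent: Google-Extended
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "https://example.com/article"  # hypothetical URL
    for agent in ("Google-Extended", "Googlebot"):
        print(agent, is_allowed(ROBOTS_TXT, agent, page))
```

Because Google-Extended is a standalone product token, a rule like this is matched separately from Googlebot, which in this sample falls through to the wildcard group and stays allowed.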

Update 11 October: Search Engine Land has reported that Google-Extended does not apply to the AI answers and snapshots provided in Google's Search Generative Experience. This is because the AI is built into Search, not bolted on, so it's integral to how Search functions.

Bing NOCACHE and NOARCHIVE

Bing has also announced ways for publishers to control how their content is used in Bing Chat and in training Microsoft’s generative AI models. These build on two existing mechanisms, the NOCACHE and NOARCHIVE tags.

Here are the ways to use these controls, as outlined on Bing’s blog:

  • “No action is needed to remain in Bing Chat. Content without NOCACHE tag and without NOARCHIVE tag may be included in Bing Chat answers and will benefit from AI's ability to generate more helpful answers and to increase your ranking opportunities in Bing Chat; site content may be used in training our generative AI foundation models.     
  • Content with the NOCACHE tag may be included in Bing Chat answers. We will only display URL/Snippet/Title in the answer; Going forward, for content in our Bing Index that is labeled NOCACHE, only URLs, Titles and Snippets may be used in training Microsoft’s generative AI foundation models.     
  • Content tagged NOARCHIVE will not be included in Bing Chat answers, not be linked to in the answers. Going forward, for content in our Bing Index that is labeled NOARCHIVE, we will not use the content for training Microsoft’s generative AI foundation models.  
  • If content has both NOCACHE and NOARCHIVE tags, we will treat it as NOCACHE.”

Bing added that using the NOCACHE and NOARCHIVE tags would not affect your content appearing in Bing’s search results.
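The rules Bing lists can be summarised as a small decision table. The Python sketch below is one possible reading of the quoted text, not Bing's implementation. It assumes the tags are supplied as the content value of a robots meta tag, and the Treatment class, the bing_treatment function and the label strings are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Treatment:
    in_chat_answers: bool  # may the content appear in Bing Chat answers?
    chat_display: str      # what Bing Chat may show (illustrative label)
    training_use: str      # what may be used for training (illustrative label)


def bing_treatment(robots_content: str) -> Treatment:
    """Map a robots meta 'content' value to the treatment Bing describes.

    Assumes markup such as <meta name="robots" content="noarchive">.
    """
    directives = {d.strip().lower() for d in robots_content.split(",")}
    if "nocache" in directives:
        # NOCACHE alone, or NOCACHE with NOARCHIVE (treated as NOCACHE).
        return Treatment(True, "url/snippet/title", "url/snippet/title")
    if "noarchive" in directives:
        # Not included in answers, not linked, not used for training.
        return Treatment(False, "none", "none")
    # No tag: no action needed to remain in Bing Chat.
    return Treatment(True, "full content", "full content")


if __name__ == "__main__":
    for value in ("", "nocache", "noarchive", "nocache, noarchive"):
        print(repr(value), bing_treatment(value))
```

Checking NOCACHE first is what makes the combined case fall out naturally: a page carrying both tags gets the same result as a page carrying NOCACHE alone.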

Google accidentally indexed Bard shared chats

Separately, Google is to block shared chats with Bard from being indexed by Search. In response to a post on X (formerly Twitter) from a user who had noticed Google indexing shared Bard conversations, Google said this had not been the intention:

“Bard allows people to share chats, if they choose. We also don't intend for these shared chats to be indexed by Google Search. We're working on blocking them from being indexed now.”

Source: https://twitter.com/searchliaison/status/1706732827705065665?s=20
