Baidu Blocks Google and Bing from Scraping Content Amid Demand for Data Used on AI Projects
Baidu has stopped Bing and Google from using its content for artificial-intelligence initiatives. An update to Baidu Baike's robots.txt file blocks the Googlebot and Bingbot crawlers. The move fits a broader trend of protecting data used to train AI models.
Baidu Baike's robots.txt file has been updated to prevent Googlebot and Bingbot crawlers from indexing content on the Chinese platform.
Prior to the update, which was implemented around August 8 according to records from the Wayback Machine, Google and Bing had been granted partial access. The two search engines are now barred from crawling and indexing Baidu Baike's online repository, which contains nearly 30 million entries.
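Blocking of this kind is typically expressed as per-crawler `User-agent` groups in a site's robots.txt file. The sketch below is illustrative only, not Baidu Baike's actual file; it uses Python's standard `urllib.robotparser` module to show how a compliant crawler would interpret rules that disallow Googlebot and Bingbot while leaving other agents unaffected.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt rules illustrating the kind of per-crawler
# blocking described in the article (not Baidu Baike's real file).
rules = """
User-agent: Googlebot
Disallow: /

User-agent: Bingbot
Disallow: /

User-agent: *
Allow: /
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# The named crawlers are denied everything; others fall through to "*".
print(parser.can_fetch("Googlebot", "https://example.com/item/123"))     # False
print(parser.can_fetch("Bingbot", "https://example.com/item/123"))       # False
print(parser.can_fetch("SomeOtherBot", "https://example.com/item/123"))  # True
```

Note that robots.txt is advisory: it relies on crawlers choosing to honour it, which is why publishers increasingly pair such rules with licensing deals or legal terms.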
The decision reflects Baidu's determination to protect its online assets amid growing demand for the large volumes of data needed to train and develop artificial intelligence (AI) models and applications. It echoes a similar move by Reddit in July, when that platform blocked all search engines except Google from indexing its content.
In a related development, Microsoft had previously threatened to revoke access to its internet-search data, which was licensed to competitor search engines, if they continued to use it for their chatbots and other generative AI (GenAI) services. This emphasises the increasing significance of data preservation and control in the AI environment.
While the Chinese-language version of Wikipedia, with 1.43 million entries, remains accessible to search engine crawlers, the visibility of Baidu Baike's content on Google and Bing has been reduced by the robots.txt update. Older cached entries from the Wikipedia-style service nevertheless continue to appear in search results on the US platforms.
As of Friday, representatives from Baidu, Google and Microsoft had not responded to enquiries on the issue. The move comes as major AI developers worldwide race to strike partnerships with content publishers to secure high-quality data for their GenAI projects.
GenAI refers to algorithms and services, such as ChatGPT, that are used to produce content including audio, code, images, text, simulations and videos. OpenAI, for example, recently struck a deal with Time magazine giving it access to the publication's entire archive, which spans more than a century.
Source: SCMP