The data wall phenomenon has begun to emerge, and AI training data will soon be exhausted

The AI large model boom is in full swing, but beneath the prosperity lies a hidden concern.

High-quality textual data is the core ingredient that large models rely on for learning, and to a large extent it determines how capable those models can become. The scarcity of training data has therefore long been a lingering worry for model vendors.

Recently, The Economist published an article titled "AI firms will soon exhaust most of the internet's data," which immediately attracted widespread attention. The article cited predictions from the research firm Epoch AI, pointing out that the high-quality textual data available on the internet may be depleted by 2028. This phenomenon, referred to as the "data wall," could become the biggest factor slowing the progress of artificial intelligence.

01

How important is training data?

Coincidentally, a paper published in Nature on July 24th pointed out that training future generations of machine learning models with datasets generated by artificial intelligence (AI) may contaminate their outputs. This concept is known as "model collapse." The study shows that original content can devolve into irrelevant nonsense within a few generations, highlighting the importance of training AI models with reliable data.
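
The dynamic behind model collapse can be illustrated with a toy experiment. The following is a minimal sketch that uses a one-dimensional Gaussian in place of a language model, with arbitrary sample sizes and generation counts; it is not the setup of the Nature study. Each generation fits a distribution to data sampled from the previous generation's fit, so sampling noise and estimation error compound.

```python
import numpy as np

rng = np.random.default_rng(0)

# Generation 0 learns from "real" data drawn from the true distribution N(0, 1).
data = rng.normal(loc=0.0, scale=1.0, size=200)

for generation in range(15):
    # "Train" a trivially simple model: estimate the mean and std of the current data.
    mu, sigma = data.mean(), data.std()
    print(f"generation {generation:2d}: mean={mu:+.3f}  std={sigma:.3f}")
    # The next generation sees only synthetic samples from this fitted model,
    # so each round's estimation error is baked into all later rounds.
    data = rng.normal(loc=mu, scale=sigma, size=200)
```

Over many generations the fitted distribution tends to drift and narrow, losing the tails of the original data, which is the same qualitative failure the study describes for models trained on their own outputs.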

In short, there is growing concern that AI models may become less capable the more they are trained on AI-generated content.

To clarify what this prediction means for the development of AI large models, it is first necessary to understand how these models are trained and how they perform inference.

The working principle of AI large models can be likened to the way humans learn. Although a newborn baby cannot speak, it has already begun to sense its surroundings and to receive human language from them, including but not limited to adult conversations, musical melodies, and the sound of videos. This is akin to the initial training of an AI large model, which learns the patterns and rules of human natural language from massive amounts of input data.

As the input corpus becomes richer and its understanding of the surrounding environment deepens, the infant begins to imitate adults and produce its own sounds, attempting self-expression with simple vocabulary and short sentences. This is akin to a large model that, having been fed vast amounts of data, gradually acquires understanding and predictive capabilities, although at this stage those capabilities are still quite basic.

Next, as the infant grows, it keeps deepening its study of human language, understands the meaning of what it hears more accurately, and refines its output as its comprehension improves, so its expression becomes more fluent and precise. It begins to form self-awareness and to hold conversations with the outside world on its own. This is similar to a large model being updated and iterated from its initial version and starting to acquire certain reasoning and analytical abilities. The current applications of large models in areas such as basic question answering and speech recognition are much like an infant working through simple problems as it grows.

The impending depletion of high-quality textual data means, in terms of this analogy, that the infant's opportunities to learn from the outside world shrink sharply and its problem-solving ability stagnates. The iteration of large models follows the same principle: with limited input, a model's reasoning and analytical abilities cannot be effectively improved, and its performance across application scenarios is likewise hard to raise.

Admittedly, adjusting a large model's parameters can also yield some optimization, but at a stage when no vendor's underlying technology has opened up a decisive gap, training corpora remain the key factor in a model's performance and quality, and they bear directly on its accuracy. Moreover, compared with feeding a model more high-quality text, scaling up the parameter count is very costly and offers a poor cost-benefit ratio.

Wu Chao, director of the expert committee of the CITIC Think Tank and head of the CITIC Construction Investment Securities Research Institute, said at the 2023 World Artificial Intelligence Conference: "In the future, 20% of a model's quality will be determined by the algorithm and 80% by data quality. High-quality data will be the key to improving model performance."

02

The Thirst for Training Data

Why is there such a scarcity of high-quality textual data available for training?

Before answering this question, one should first have a concrete sense of how much data it takes to train a large model. Typically this depends on the scale of the model and the requirements of the project it serves, in particular the project's complexity, budget, and quality requirements, as well as the annotation needs of the specific task.

The innovative tech company Shaip has indicated that, as a rule of thumb, developing an efficient AI model requires an amount of training data roughly ten times the number of model parameters (also known as degrees of freedom). This "tenfold" rule is designed to limit variability and increase data diversity. To develop a high-quality model, that is, a deep learning system whose performance can match human capability, about 5,000 labeled images per category are needed, and exceptionally complex models require at least 100,000 labeled items.
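
As a back-of-the-envelope illustration of the tenfold guideline quoted above, the arithmetic can be written out directly. This is a minimal sketch: the 10x multiplier and the per-category figure are Shaip's empirical rules of thumb, and the model sizes in the example are hypothetical.

```python
def examples_for_model(num_parameters: int, multiplier: int = 10) -> int:
    """Rule-of-thumb dataset size: roughly `multiplier` training examples per parameter."""
    return num_parameters * multiplier


def images_for_classifier(num_categories: int, images_per_category: int = 5_000) -> int:
    """Rule-of-thumb for a human-level image classifier: ~5,000 labeled images per category."""
    return num_categories * images_per_category


if __name__ == "__main__":
    # Hypothetical figures, for a sense of scale only.
    print(f"{examples_for_model(7_000_000_000):,} examples for a 7-billion-parameter model")
    print(f"{images_for_classifier(100):,} labeled images for a 100-category classifier")
```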

The massive data requirement is only one side of the problem; the other is that data in many industries has been opened up to an extremely limited degree. Currently, a large amount of professional data in China remains relatively closed. Access to some important materials requires internal permissions, and even materials that are nominally public often cannot be taken out of specific venues; they can only be consulted on-site in databases and libraries, and transferring the underlying data is out of the question.

Beyond these blind spots, the high-quality industry data that can be accessed is also scattered across sources and poorly consolidated. Turning it into training data requires a significant amount of upfront data cleaning and integration, as sketched below.
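
A minimal sketch of what that upfront work can look like, assuming plain-text documents and using only exact deduplication (real pipelines also handle near-duplicates, language filtering, and format conversion; the length threshold below is an arbitrary assumption):

```python
import hashlib
import re
import unicodedata


def normalize(text: str) -> str:
    """Normalize unicode, collapse whitespace, and lowercase for comparison."""
    text = unicodedata.normalize("NFKC", text)
    return re.sub(r"\s+", " ", text).strip().lower()


def clean_and_deduplicate(documents: list[str], min_chars: int = 200) -> list[str]:
    """Drop near-empty documents and exact duplicates (after normalization)."""
    seen: set[str] = set()
    kept: list[str] = []
    for doc in documents:
        norm = normalize(doc)
        if len(norm) < min_chars:
            continue  # too short to be useful training text
        digest = hashlib.sha256(norm.encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # exact duplicate of a document already kept
        seen.add(digest)
        kept.append(doc)
    return kept
```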

If general industry data is already in this state, then for large models deployed in vertical scenarios, obtaining specialized data in specific fields is even harder. IBM CEO Arvind Krishna has pointed out that almost 80% of the work in AI projects involves collecting, cleaning, and preparing data. He also believes that companies abandon their AI initiatives because they cannot keep up with the cost, effort, and time required to collect valuable training data.

Access to professional data is limited, and on the other side, the security of personal user data has long been a whirlpool of controversy. In March last year, Microsoft pushed a large number of update messages to Windows 10 users through forced pop-ups. The page stated that "your data will be processed outside your country or region" and offered no "cancel" option; users could only click "Next," otherwise they could not reach the system desktop. The move sparked widespread concern among users about leakage of private data.

Microsoft subsequently responded that after users update to Windows 11, their data will be transferred out of China. The stated reason is that Microsoft's software registration center is in the United States, and the integration of ChatGPT into Bing search and the Edge browser also requires the support of US data centers, so data may be transferred outside of China.

The public's caution about personal data conflicts with the demands of AI large model applications, which in turn leaves vendors on the supply side oscillating between using user data to train and optimize AI services and protecting privacy to retain users. The copyright issues involved push the ethical controversies around artificial intelligence further into the spotlight.

Earlier, OpenAI received a cease-and-desist letter from Hollywood star Scarlett Johansson over a voice in its AI voice project that was suspected of infringement, and it immediately suspended use of that voice. Recently Meta, having learned from that episode, invested millions of dollars to negotiate voice-licensing deals with Hollywood stars for its own AI voice project, attempting to address the copyright problem for which the AI industry has been widely criticized. According to Caixin, because the two sides could not agree on the terms of use for the actors' voices, negotiations between Meta and some actors' representatives have broken off and restarted several times, with the representatives seeking stricter restrictions to protect their clients' rights and interests.

03

Seeking the Future Amid the Countdown

Faced with the impending depletion of training materials, many industry insiders are actively seeking alternative solutions, with the use of AI-generated synthetic data being the most hotly discussed option currently.

Alibaba Research Institute has pointed out that synthetic data addresses the difficulty of collecting certain kinds of real-world data, expands the diversity of training data, and can be used alongside real-world data to improve model security and reliability. More importantly, synthetic data can stand in for personal characteristic data, which helps protect user privacy and resolves compliance issues around data acquisition.

For vendors weighing efficiency, commercial prospects, and risk, this is indeed a reasonably good alternative. China has begun to encourage and guide the development of the synthetic data industry through measures such as planning data training bases and introducing supporting policies, in order to expand the supply of high-quality data as much as possible.

Some argue that AI-generated synthetic data has its value, but only if the data is strictly filtered or drawn from diversified sources. This not only raises the bar for vendors' data cleaning capabilities but also demands more of their data acquisition channels, which means that in seeking its future, AI will inevitably face even sharper copyright and ethical disputes. A rough picture of what that filtering step involves is sketched below.
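
One way to picture that filtering step is as a scoring-and-mixing stage applied before training. The sketch below is illustrative only: the quality_score callback and the synthetic_fraction cap are assumptions, not an established recipe.

```python
import random
from typing import Callable


def filter_synthetic(samples: list[str],
                     quality_score: Callable[[str], float],
                     threshold: float = 0.8) -> list[str]:
    """Keep only synthetic samples whose quality score clears the threshold."""
    return [s for s in samples if quality_score(s) >= threshold]


def build_training_mix(real: list[str], synthetic: list[str],
                       synthetic_fraction: float = 0.3, seed: int = 0) -> list[str]:
    """Blend real and (already filtered) synthetic text, capping the synthetic share."""
    rng = random.Random(seed)
    # Cap synthetic docs so they make up at most `synthetic_fraction` of the mix.
    max_synth = int(len(real) * synthetic_fraction / (1 - synthetic_fraction))
    mix = real + rng.sample(synthetic, min(max_synth, len(synthetic)))
    rng.shuffle(mix)
    return mix
```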

The recent news that the question "which is bigger, 9.11 or 9.9?" stumped many large AI models is still fresh in memory. That shaky grasp of arithmetic is a microcosm of the current large model industry. In the ideal industrial landscape, AI output and human creation would complement each other, but the alarm bell of the data crisis has already rung. Amid the friction between old and new business models and ideas, finding a way to support the development of new technologies has become a choice people cannot avoid making.