AI is setting off a great scramble for data
Not so long ago analysts were openly wondering whether artificial intelligence (AI) would be the death of Adobe, a maker of software for creative types. New tools like DALL-E 2 and Midjourney, which conjure up pictures from text, seemed set to render Adobe's image-editing offerings redundant. As recently as April, Seeking Alpha, a financial-news site, published an article headlined "Is AI the Adobe killer?" Far from it. Adobe has used its database of hundreds of millions of stock photos to build its own suite of AI tools, dubbed Firefly. Since its release in March the software has been used to create over 1bn images, says Dana Rao, a company executive. By avoiding mining the internet for images, as rivals did, Adobe has skirted the deepening dispute over copyright that now dogs the industry. The firm's share price has risen by 36% since Firefly was launched.
Adobe's triumph over the doomsters illustrates a wider point about the contest for dominance in the fast-developing market for AI tools. The supersize models powering the latest wave of so-called generative AI rely on oodles of data. Having already helped themselves to much of the internet, often without permission, AI firms are now seeking out new data sources to sustain the feeding frenzy. Meanwhile, companies with vast troves of the stuff are weighing up how best to profit from it. A data land grab is under way.
The two essential ingredients for an AI model are datasets, on which the system is trained, and processing power, through which the model detects relationships within and among those datasets. Those two ingredients are, to an extent, substitutes: a model can be improved either by ingesting more data or by adding more processing power. The latter, however, is becoming difficult owing to a shortage of specialist AI chips, leading model-builders to be doubly focused on seeking out data.
Demand for data is growing so fast that the stock of high-quality text available for training may be exhausted by 2026, reckons Epoch AI, a research outfit. The latest AI models from Google and Meta, two tech giants, are likely trained on over 1trn words. By comparison, the sum total of English words on Wikipedia, an online encyclopedia, is about 4bn.
It is not only the size of datasets that counts. The better the data, the better the model. Text-based models are ideally trained on long-form, well-written, factually accurate writing, notes Russell Kaplan of Scale AI, a data startup. Models fed this information are more likely to produce similarly high-quality output. Likewise, AI chatbots give better answers when asked to explain their working step by step, increasing demand for sources like textbooks. Specialised information sets are also prized, as they allow models to be fine-tuned for more niche applications. Microsoft's purchase of GitHub, a repository for software code, for $7.5bn in 2018 helped it develop a code-writing AI tool.
As demand for data grows, accessing it is getting trickier, with content creators now demanding compensation for material that has been ingested into AI models. A number of copyright-infringement cases have already been brought against model-builders in America. A group of authors, including Sarah Silverman, a comedian, are suing OpenAI, maker of ChatGPT, an AI chatbot, and Meta. A group of artists are similarly suing Stability AI, which builds text-to-image tools, and Midjourney.
The upshot has been a flurry of dealmaking as AI companies race to secure data sources. In July OpenAI inked a deal with Associated Press, a news agency, to access its archive of stories. It has also recently expanded an agreement with Shutterstock, a provider of stock photography, with which Meta has a deal, too.
On August 8th it was reported that Google was in discussions with Universal Music, a record label, to license artists' voices to feed a songwriting AI tool. Rumours swirl about AI labs approaching the BBC, Britain's public broadcaster. Another supposed target is JSTOR, a digital library of academic journals.
Holders of information are taking advantage of their greater bargaining power. Reddit, a discussion forum, and Stack Overflow, a question-and-answer site popular with coders, have increased the cost of access to their data. Both websites are particularly valuable because users upvote preferred answers, helping models know which are most relevant. Twitter (now known as X), a social-media site, has put in place measures to limit the ability of bots to scrape the site and now charges anyone who wishes to access its data. Elon Musk, its mercurial owner, is planning to build his own AI business using the data.
Expanding the frontier
As a consequence, model-builders are working hard to improve the quality of the inputs they already have. Many AI labs employ armies of data annotators to perform tasks such as labelling images and rating answers. Some of that work is complex; an advert for one such job seeks applicants with a master's degree or doctorate in life sciences. But much of it is mundane, and is being outsourced to places such as Kenya where labour is cheap.
AI firms are also gathering data through users' interactions with their tools. Many of these have a feedback mechanism, where users indicate which outputs are useful. Firefly's text-to-image generator allows users to pick from one of four options. Bard, Google's chatbot, proposes three answers. Users can give ChatGPT a thumbs-up or thumbs-down to its responses. That information can be fed back as an input into the underlying model, forming what Douwe Kiela, co-founder of Contextual AI, a startup, calls the "data flywheel".
A stronger signal still of the quality of a chatbot's answers is whether users copy the text and paste it elsewhere, he adds. That information helped Google rapidly improve its translation tool.
There is, however, one source of data that remains largely untapped: the information that exists within the walls of the tech firms' corporate customers. Many businesses possess, often unwittingly, vast amounts of useful data, from call-centre transcripts to customer spending records. Such information is especially valuable because it can be used to fine-tune models for specific business purposes, such as helping call-centre workers answer queries or analysts spot ways to boost sales.
Yet making use of that rich resource is not always straightforward. Roy Singh of Bain, a consultancy, notes that most firms have historically paid little attention to the types of vast but unstructured datasets that would prove most useful for training AI tools. Often these are spread across various systems, buried in company servers rather than in the cloud.
Unlocking that information would help companies customise AI tools to serve their needs better. Amazon and Microsoft, two tech giants, now offer tools to help companies improve management of their unstructured datasets, as does Google. Christian Kleinerman of Snowflake, a database firm, says that business is booming as clients look to "tear down data silos". Startups are piling in. In April Weaviate, an AI-focused database business, raised $50m at a valuation of $200m. Barely a week later PineCone, a rival, raised $100m at a $750m valuation. Earlier this month Neon, another database startup, raised an additional $46m in funding. The scramble for data is only just getting started.