The headline alone signals where the training-data conversation is heading: labs have hoovered up the open web, and simply scaling scraped or licensed volume no longer moves the needle on model quality. That's a tacit admission that curation, provenance, and synthetic or studio-produced data — the kind Shutterstock Studios is positioned to sell — carry a pricing premium over raw bulk corpora.
For data brokers, the implication is stark: buyers like OpenAI, Google, and Meta are shifting spend from per-token licensing toward bespoke, higher-margin data production.
More Data Isn't Enough to Solve AI's Data Scarcity Problem