MIT News wades into the synthetic-data debate at a moment when frontier labs are quietly stress-testing how much model-generated text can substitute for costly human-annotated corpora. If synthetic data holds up as a legitimate substitute, it puts a ceiling on what licensors like news publishers and forums can charge for their archives.
But if it degrades model quality or bakes in bias, buyers like OpenAI and Anthropic will still need real human data at a premium, which makes the synthetic-vs-real question a pricing question for the entire training-data market.
3 Questions: The pros and cons of synthetic data in AI
— MIT News