Balancing Innovation and Integrity: The Role of Synthetic Data in Market Research

Published on Apr 30, 2024 by Michael Lewis

Assessing the Impact of AI-Generated Data on Industry Standards and Client Decisions

Socratic is successfully using AI technology to improve the speed, accuracy and depth of learning available from qualitative research and open-end questions. Let’s take a look at a different application of AI, synthetic data, and the opportunities and pitfalls it presents for Market Research.

Synthetic data is artificial data created to mimic real-world observations. Typically used to train machine-learning models, it is becoming a hot topic in the Market Research industry. It offers multiple advantages, including speed, cost-effectiveness, compliance with data regulations and scalability. Generating synthetic data keeps getting easier, raising questions about whether it can replace or supplement real-world observations in Market Research.

Using Synthetic Data in Market Research

Let’s be clear – synthetic data is and always has been a part of Market Research.  Anytime an analyst clicks on “Replace Missing Data” as part of a correlation matrix or regression analysis, they are relying on the statistical package to create (admittedly crude) synthetic data.  Discrete Choice, Conjoint Analysis and other trade-off designs are essentially using a model trained on real-world data to predict responses to combinations those real-world respondents never evaluated.
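To make the "Replace Missing Data" point concrete, here is a minimal sketch of mean imputation, one common way a statistical package might fill gaps (the actual method depends on the package and its settings). The ratings are hypothetical values invented for illustration, not data from any study.

```python
from statistics import mean, stdev

# Hypothetical 1-10 satisfaction ratings; None marks a skipped question.
ratings = [8, 3, None, 9, 2, None, 7, 10, None, 4]

# Only the answers real respondents actually gave.
observed = [r for r in ratings if r is not None]
fill_value = mean(observed)

# Mean imputation: every gap receives the same invented value.
imputed = [fill_value if r is None else r for r in ratings]

print(f"observed: n={len(observed)}, mean={mean(observed):.2f}, sd={stdev(observed):.2f}")
print(f"imputed:  n={len(imputed)}, mean={mean(imputed):.2f}, sd={stdev(imputed):.2f}")
```

The mean is unchanged, but the sample size grows and the standard deviation shrinks, even though no new respondent was interviewed. That is why this kind of synthetic data is crude: it makes the dataset look larger and more consistent than the fieldwork supports.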

Advocates for expanding the use of synthetic data are already building use cases for scenario testing, simulations, model training and validation, and data augmentation.

However, synthetic data as either a substitute for or as a supplement to real-world data risks creating and masking data quality issues that can seriously affect the decisions our clients make based on our findings. 

Key risks in using synthetic data in individual research projects include:

  • Potential propagation of biases – despite ongoing attempts to “sanitize” databases, multiple commercial AI models continue to exhibit racial and gender biases;
  • Overfitting the model – providing a false impression of statistical reliability and precision in a model that can’t function effectively beyond the initial synthetic dataset; and
  • Ethical considerations – chief among them being how our business partners can accurately evaluate and use the findings of research that incorporates synthetic data.
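The false impression of statistical reliability and precision can be illustrated with a small simulation. This is a sketch under stated assumptions, not a description of any vendor's method: the "real" sample is itself simulated, the sample sizes (50 real, 450 synthetic) are arbitrary, and the synthetic respondents are drawn from a normal distribution fitted to the real sample.

```python
import random
from statistics import mean, stdev

random.seed(7)  # fixed seed so the sketch is repeatable

# Hypothetical "real" sample: purchase-intent scores from 50 respondents.
real = [random.gauss(6.0, 2.0) for _ in range(50)]

# Naive augmentation: generate 450 synthetic respondents from a normal
# distribution fitted to the real sample itself.
mu, sigma = mean(real), stdev(real)
synthetic = [random.gauss(mu, sigma) for _ in range(450)]
augmented = real + synthetic


def standard_error(sample):
    """Standard error of the mean, treating every row as independent."""
    return stdev(sample) / len(sample) ** 0.5


se_real = standard_error(real)
se_augmented = standard_error(augmented)

print(f"real only: n={len(real)}, mean={mean(real):.2f}, SE={se_real:.3f}")
print(f"augmented: n={len(augmented)}, mean={mean(augmented):.2f}, SE={se_augmented:.3f}")
```

The augmented dataset reports a much smaller standard error, yet every synthetic row was derived from the same 50 interviews. The apparent gain in precision carries no new information about the market, and any bias in the original sample is copied into the synthetic rows.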

Using synthetic data to supplement or “enhance” research data should be approached with caution.  Carefully examine the data underlying the creation of synthetic data for historical and cultural biases.  Wherever possible, verify findings based on synthetic data using real-world data.  And always disclose the use of synthetic data and ensure your clients are fully aware of the potential consequences.


Market Research exists to reduce the inherent risk in making large-scale business decisions.  As an industry, Market Research needs to carefully evaluate and control any attempts to greatly expand the use of synthetic data in ways that might increase those risks.  We must be transparent about the use of synthetic data and the potential issues it can cause.

While the greatest short-term risk may be fostering overconfidence in the findings of an individual project, that overconfidence could easily lead to the greatest long-term risk – damaging the confidence our clients have in Market Research overall.