Token 1.14: What is Synthetic Data and How to Work with it?
Will it eliminate the need for real data? Let's explore
Gartner predicts that by 2030, machine learning (ML) models will use synthetic data exclusively, eliminating the need for real data. The synthetic data market is also on a robust growth trajectory: Cognilytica forecasts it will expand from $110 million in 2021 to $1.15 billion by 2027.
What is synthetic data, and how do you work with it? What makes it so in demand?
In this Token, we will answer these questions and explore the following:
The origins of synthetic data
Why might you need it?
How to generate synthetic data?
Considerations in creating a realistic synthetic dataset
Techniques for evaluating synthetic data quality
Use cases
Can I switch to synthetic data completely?
Conclusion
Bonus: A few companies that generate synthetic data
The origins
You might have heard of the Caesar Cipher, a cryptographic technique used by Julius Caesar and one of the earliest and simplest methods for encrypting messages so that only authorized parties can decode them. Synthetic data served a related purpose at first: concealing personally identifiable information (PII) within datasets to protect privacy and confidentiality. As the name suggests, synthetic data is artificially generated to mimic real-world data, created using algorithms, simulations, or predefined rules. Today, synthetic data is primarily used in:
research
training machine learning algorithms
data analysis
testing software products
Why might you need synthetic data?
The growing popularity of foundation models (FMs), and specifically large language models (LLMs), has amplified the demand for synthetic data. As we covered in the previous Token, these ginormous models are data-hungry and require a huge volume of data for training. At some point, there might not be enough real data left to satisfy their appetite.
There are other reasons to use synthetic data, for FMs and conventional ML models alike:
Bias: Datasets in ML are seldom balanced or unbiased. For instance, if you run an online survey in remote parts of South Asia, most respondents will likely be male. A model trained on that dataset will inherit the bias. In such cases, synthetic data can supply responses for the minority classes, yielding a balanced dataset that is likely to reduce bias (see the oversampling sketch after this list).
Cost and Time: Imagine owning a bank and needing to test the functionality for registering new users. To test it with real users, you would have to wait days or even months to acquire 1,000 new customers. However, using synthetic data could allow you to conclude the test within hours or a few days.
Testing and Validation: In developing ML models, especially FMs, it's crucial to test them under various scenarios, many of which might be rare or difficult to capture in real-world data. Synthetic data can be tailored to simulate these rare conditions, allowing for thorough testing and validation of the models.
Innovation and Experimentation: Synthetic data allows for greater flexibility in model development. Researchers can create data with specific attributes or conditions that may not be readily available in real datasets, thus pushing the boundaries of what's possible in ML research and application.
Regulations and Privacy: Government regulations dictate that in sensitive sectors, such as medicine and healthcare, providers must keep information private. Hence, before using the data for research, it is important to mask any pieces of data that could link back to specific individuals (a minimal masking sketch also follows this list).
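To make the bias point concrete, here is a minimal sketch of synthetic minority-class oversampling, the idea behind techniques like SMOTE. It assumes a purely numeric feature matrix; the function name and toy data are our own illustration, not a specific library API:

```python
# Minimal SMOTE-style oversampling sketch: create new minority-class rows
# by interpolating between a real sample and one of its nearest neighbours.
import numpy as np

rng = np.random.default_rng(seed=0)

def oversample_minority(X_min: np.ndarray, n_new: int, k: int = 5) -> np.ndarray:
    """Generate n_new synthetic rows from minority samples X_min.

    Assumes len(X_min) > k and all features are numeric.
    """
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))                  # pick a random minority sample
        d = np.linalg.norm(X_min - X_min[i], axis=1)  # distances to all other samples
        # One of its k nearest neighbours (index 0 of argsort is the point itself).
        neighbour = X_min[np.argsort(d)[1 : k + 1][rng.integers(k)]]
        # A new point somewhere on the segment between the sample and its neighbour.
        synthetic.append(X_min[i] + rng.random() * (neighbour - X_min[i]))
    return np.array(synthetic)

# Toy example: 10 real minority samples with 3 numeric features each.
X_minority = rng.normal(size=(10, 3))
X_synthetic = oversample_minority(X_minority, n_new=20)
print(X_synthetic.shape)  # (20, 3), ready to append to the training set
```

In practice you would reach for a maintained implementation, such as the SMOTE variants in the imbalanced-learn library, which also handle categorical features and other edge cases.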
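And for the privacy point, a minimal sketch of rule-based PII masking. The regex patterns and the [EMAIL]/[PHONE] placeholder tags are illustrative assumptions; real scrubbing pipelines also need named-entity recognition to catch names and addresses:

```python
# Minimal PII-masking sketch: replace emails and phone numbers with neutral
# tags before records are shared for research. Not a production-grade scrubber.
import re

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

def mask_pii(text: str) -> str:
    """Replace obvious PII patterns with placeholder tags."""
    text = EMAIL.sub("[EMAIL]", text)
    text = PHONE.sub("[PHONE]", text)
    return text

record = "Contact Jane Roe at jane.roe@example.com or +1 (555) 010-7788."
print(mask_pii(record))  # Contact Jane Roe at [EMAIL] or [PHONE].
```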
Now that we know about the benefits of synthetic data, let’s talk about how we can generate it.
How to generate synthetic data?
There are many ways to generate synthetic data; the underlying task dictates the method. Here are some commonly used ones:
The following explanation is available to our Premium users only → please Upgrade to have full access to this and other articles
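As a taste of the general idea, one common family of approaches is to fit a statistical model to real data and sample new records from it. Below is a minimal sketch for numeric tabular data using a multivariate Gaussian; the "real" age/income columns are simulated stand-ins, purely our assumption for illustration:

```python
# Minimal fit-then-sample sketch: estimate a multivariate Gaussian from real
# numeric columns, then draw synthetic rows that preserve means and correlations.
import numpy as np

rng = np.random.default_rng(seed=42)

# Stand-in for real data: 1,000 rows of (age, income) with built-in correlation.
age = rng.normal(40, 10, size=1000)
income = 1200 * age + rng.normal(0, 8000, size=1000)
real = np.column_stack([age, income])

# Fit: estimate the mean vector and covariance matrix from the real data.
mu = real.mean(axis=0)
cov = np.cov(real, rowvar=False)

# Sample: draw as many synthetic rows as needed.
synthetic = rng.multivariate_normal(mu, cov, size=5000)

print(synthetic.mean(axis=0))                      # close to the real means
print(np.corrcoef(synthetic, rowvar=False)[0, 1])  # age-income correlation preserved
```

Real tabular data is rarely this well-behaved, which is why more expressive generators (GANs, diffusion models, LLMs) are used in practice, but the fit-then-sample pattern is the same.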
Please give us feedback.
Thank you for reading! Please feel free to share this with your friends and colleagues. In the next couple of weeks, we will be announcing our referral program 🤍
Previously in the FM/LLM series:
Token 1.1: From Task-Specific to Task-Centric ML: A Paradigm Shift
Token 1.5: From Chain-of-Thoughts to Skeleton-of-Thoughts, and everything in between
Token 1.6: Transformer and Diffusion-Based Foundation Models
Token 1.7: What Are Chain-of-Verification, Chain of Density, and Self-Refine?
Token 1.9: Open- vs Closed-Source AI Models: Which is the Better Choice for Your Business?
Token 1.10: Large vs Small in AI: The Language Model Size Dilemma
Token 1.13: Where to Get Data for Data-Hungry Foundation Models