Generating Synthetic Data: Next Steps and Putting It to Use
Reuven Shnaps, PhD, Earnix Chief Analytics Officer, and Chen Ben Gal, Industry Senior Data Scientist
January 10, 2024
Revisiting the Concept of Synthetic Data
In a previous blog post, we introduced the concept of Synthetic Data and discussed opportunities, challenges, common practices, and ethical considerations associated with its use.
In this follow-up blog post, we will focus on some of the technical aspects behind the generation of Synthetic Data, present detailed information on its effectiveness in mimicking the statistical properties of the original data, and showcase some interesting applications that go beyond replication of the original data.
As we mentioned in the previous blog post, simple anonymization of some of the personally identifiable data elements in a dataset may no longer be sufficient to ensure the safety and security of the data.
However, addressing the safety and security issue is only one prerequisite for successfully using Synthetic Data. To create meaningful Synthetic Data, generation algorithms must address a number of aspects, such as handling different data types, accounting for different statistical or empirical distributions, and dealing with unbalanced samples.
Furthermore, crafting synthetic tabular data to mirror real-world complexities, such as intricate feature dependencies, proves highly challenging. Accurately replicating relationships between features, whether correlations, non-linear associations, or complex interdependencies, requires sophisticated methodologies. The following figure illustrates inaccurately captured feature interplay in Synthetic Data, demonstrated by a correlation matrix.
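To see why feature interplay is easy to lose, consider a deliberately naive generator that resamples each feature independently. The toy data below (ages and premiums, with invented parameters) is purely illustrative: the marginals survive, but the correlation collapses.

```python
import math
import random
import statistics

random.seed(42)

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# "Original" data: premium depends on age, so the two are correlated.
n = 2000
age = [random.gauss(45, 12) for _ in range(n)]
premium = [600 + 8 * a + random.gauss(0, 40) for a in age]

# Naive "synthetic" data: each feature resampled independently.
# Marginal distributions are preserved, but the joint structure is lost.
syn_age = random.choices(age, k=n)
syn_premium = random.choices(premium, k=n)

r_orig = pearson(age, premium)
r_syn = pearson(syn_age, syn_premium)
gap = abs(r_orig - r_syn)
print(f"original r={r_orig:.2f}, synthetic r={r_syn:.2f}, gap={gap:.2f}")
```

Comparing the two correlation matrices cell by cell, as in the figure, is exactly this check generalized to all feature pairs.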
Different Methods for Generating Synthetic Data
Accurately representing feature distributions, transformations, and temporal dependencies adds layers of complexity to the use of Synthetic Data.
Striking a balance between mimicking real-world complexities and preventing overfitting in synthetic datasets is vital. Innovative methodologies, including advanced statistics, generative AI, and machine learning (ML)-based models, such as deep learning transformers and LLMs (Large Language Models), show promise in creating synthetic tabular data that captures real-world intricacies and in overcoming these challenges.
Generating Synthetic Data involves various methods, each with its distinct approach and nuances. Differential Privacy, Generative Adversarial Networks (GANs), and Transformers stand out as prominent techniques, with each offering unique solutions and challenges.
Differential Privacy prioritizes safeguarding privacy while maintaining statistical relevance. By introducing noise into the data, it shields sensitive information without significantly compromising overall statistical integrity. However, a drawback emerges in its treatment of individual features separately, potentially leading to inaccuracies in capturing dependencies and correlations among the features. This method excels in preserving privacy but may fall short in accurately representing intricate feature interplays within the data.
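The noise-injection idea can be sketched with the classic Laplace mechanism, shown here for a single statistic (a bounded mean) rather than a full dataset; the bounds, epsilon, and data are illustrative assumptions. Note that the noise is calibrated per statistic, which is exactly why per-feature treatment can miss cross-feature dependencies.

```python
import math
import random

random.seed(7)

def laplace_noise(scale):
    """Draw one sample from a Laplace(0, scale) distribution (inverse CDF)."""
    u = random.random() - 0.5
    sign = 1 if u >= 0 else -1
    return -scale * sign * math.log(1 - 2 * abs(u))

# Toy data: ages known to lie in [18, 90].
ages = [random.uniform(18, 90) for _ in range(1000)]
true_mean = sum(ages) / len(ages)

# Sensitivity of a bounded mean: changing one record moves it by at most this.
epsilon = 1.0                       # privacy budget (smaller = more private)
sensitivity = (90 - 18) / len(ages)

# Laplace mechanism: add noise scaled to sensitivity / epsilon.
private_mean = true_mean + laplace_noise(sensitivity / epsilon)
print(f"true mean {true_mean:.2f}, privately released mean {private_mean:.2f}")
```

With 1,000 records the noise scale is small, so utility is largely preserved for this one statistic; the cost shows up when many statistics, one per feature, are released independently.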
Generative Adversarial Networks (GANs)
Generative Adversarial Networks (GANs), on the other hand, take a different approach, comprising a generator and a discriminator model. The generator crafts synthetic samples resembling real data, while the discriminator assesses their authenticity. GANs excel at capturing feature interplay and dependencies, ensuring a more realistic representation of the original dataset. However, these models often demand substantial data for convergence, struggle with rare data segments, and require fine-tuning tailored to the dataset's nature.
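To make the adversarial dynamic concrete, here is a deliberately minimal one-dimensional sketch of the training loop with hand-derived gradients. Real tabular GANs (e.g., CTGAN) use neural networks and considerably more machinery; every number below is an illustrative assumption.

```python
import math
import random

random.seed(0)

# Real data ~ N(4, 1); generator g(z) = a*z + b with z ~ N(0, 1).
# Discriminator D(x) = sigmoid(w*x + c) scores how "real" a sample looks.
a, b = 1.0, 0.0   # generator parameters
w, c = 0.1, 0.0   # discriminator parameters
lr = 0.01

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

for step in range(5000):
    x_real = random.gauss(4, 1)
    z = random.gauss(0, 1)
    x_fake = a * z + b

    # Discriminator step: ascend log D(real) + log(1 - D(fake)).
    d_real, d_fake = sigmoid(w * x_real + c), sigmoid(w * x_fake + c)
    w += lr * ((1 - d_real) * x_real - d_fake * x_fake)
    c += lr * ((1 - d_real) - d_fake)

    # Generator step: ascend log D(fake) (the non-saturating loss).
    d_fake = sigmoid(w * x_fake + c)
    grad = (1 - d_fake) * w   # d log D(g(z)) / d g(z)
    a += lr * grad * z
    b += lr * grad

print(f"generator offset b = {b:.2f} (real data is centered at 4)")
```

Even this toy version exhibits the practical pain points named above: convergence depends on the learning rate and on how much data the loop sees, which is why real GANs need careful, dataset-specific tuning.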
Transformers, leveraging Large Language Model (LLM) architectures, have recently emerged as a novel approach, especially for unstructured data but also for handling tabular data. This technique treats rows of tabular data similarly to sentences, utilizing text models to generate new rows. The beauty of Transformers lies in their adaptability to tabular data without the need for extensive fine-tuning. However, while they eliminate the intricacies of fine-tuning, adapting text-based methodologies to tabular data might pose challenges in certain contexts.
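The row-as-sentence idea can be sketched as follows. This is a simplified illustration of serialization schemes described in the literature (such as GReaT); the helper functions and column names are hypothetical, and in practice a language model generates new "sentences" that are parsed back into rows.

```python
def row_to_sentence(row: dict) -> str:
    """Serialize a tabular row into a textual 'sentence' an LLM can model."""
    return ", ".join(f"{col} is {val}" for col, val in row.items())

def sentence_to_row(sentence: str) -> dict:
    """Parse a generated 'sentence' back into a tabular row."""
    row = {}
    for clause in sentence.split(", "):
        col, _, val = clause.partition(" is ")
        row[col] = val
    return row

original = {"Age": "42", "AnnualMiles": "12000", "Tenure": "5", "Renewed": "yes"}
encoded = row_to_sentence(original)
decoded = sentence_to_row(encoded)
print(encoded)
```

Because the model sees whole rows as text, it can pick up cross-column patterns for free; the flip side, as noted above, is that numeric precision and column constraints must survive a round trip through text.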
Effectiveness of Synthetic Data
Synthetic data generation opens new opportunities for insurance companies to collaborate with academic and research institutions, Insurtech companies, or software vendors. It allows carriers to share data that otherwise might be too sensitive to share, even with other departments within the organization.
Often the owners of the data within the organization have concerns about letting research and data science departments work directly on the original data. Synthetic Data can be effective in addressing these concerns. However, the level of investment and the method(s) for generating Synthetic Data will depend on the specific needs of the different parties.
Questions that arise include:
“How effective is the Synthetic Data being generated?”
“Can we hope to fully replace the original data with Synthetic Data?”
The short answer is, as always, that it depends on the specific usage of the generated data. In other words: do we need to develop very accurate predictions based on the data, do we want to explore different modeling approaches as part of the research, or do we need representative data for QA and testing? Our current research shows encouraging results in terms of the capabilities of the different Transformers-based algorithms we have tested to date.
The initial step in assessing Synthetic Data quality involves examining feature distributions and performing one-way analysis. While individual feature distributions offer a basic sanity check, they alone often do not fully account for multi-feature dependencies. The aim is for each feature's distribution in the Synthetic Data to resemble that of the original dataset, though it doesn't need to be an exact match due to the inherent stochastic nature of Synthetic Data generation.
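One simple way to quantify "resembles, but need not match exactly" is the two-sample Kolmogorov-Smirnov statistic: the largest gap between the empirical CDFs of a feature in the original and synthetic samples. The sketch below uses invented Gaussian data to show the statistic separating a well-matched marginal from a shifted one.

```python
import bisect
import random

random.seed(2)

def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic: max gap between empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    gap = 0.0
    for v in a + b:
        cdf_a = bisect.bisect_right(a, v) / len(a)
        cdf_b = bisect.bisect_right(b, v) / len(b)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap

original = [random.gauss(45, 12) for _ in range(1000)]
good_syn = [random.gauss(45, 12) for _ in range(1000)]  # matching marginal
bad_syn = [random.gauss(60, 12) for _ in range(1000)]   # shifted marginal

ks_good = ks_statistic(original, good_syn)
ks_bad = ks_statistic(original, bad_syn)
print(f"KS vs matching synthetic: {ks_good:.3f}, vs shifted synthetic: {ks_bad:.3f}")
```

A small statistic per feature is the sanity check; as noted, it says nothing about multi-feature dependencies, which the later checks address.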
Additionally, one-way analysis serves as a valuable tool for comprehending the interplay between the target feature and the explanatory variables. This analysis focuses on both specific graph values and the overarching trends observed within the data.
In our example, the dataset is an auto insurance policy renewal database with ~40K observations, where the target variable is a renewal indicator, and some of the explanatory variables are Age, Annual Miles, Tenure, and Car Age.
The following charts represent a one-way analysis of the relationship observed in the original data (for a test sample) between the renewal rate and key rating variables, and how well this relationship is captured by logistic regression models based on original and Synthetic Data samples (training samples).
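The one-way computation itself is straightforward: bucket a rating variable and average the renewal outcome per bucket, once for each sample. The sketch below simulates both samples from the same assumed age-renewal relationship purely to illustrate the comparison; the coefficients are not from the actual dataset.

```python
import random
from collections import defaultdict

random.seed(1)

def one_way(rows, band=10):
    """Average renewal rate by age band: the basic one-way view."""
    acc = defaultdict(lambda: [0, 0])
    for age, renewed in rows:
        bucket = int(age // band) * band
        acc[bucket][0] += renewed
        acc[bucket][1] += 1
    return {bucket: s / n for bucket, (s, n) in sorted(acc.items())}

def simulate(n):
    """Toy renewal process: renewal probability rises with age (assumed)."""
    rows = []
    for _ in range(n):
        age = random.uniform(20, 70)
        p = 0.5 + 0.006 * (age - 45)
        rows.append((age, 1 if random.random() < p else 0))
    return rows

orig = one_way(simulate(20000))
syn = one_way(simulate(20000))  # stand-in for a Synthetic Data sample
worst = max(abs(orig[bucket] - syn[bucket]) for bucket in orig)
print(f"largest band-level gap in renewal rate: {worst:.3f}")
```

In practice the comparison is done against model predictions as well as raw rates, but the band-by-band gap is the quantity the charts make visual.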
As we can see from the charts, both the original and Synthetic Data-based models perform similarly in mimicking the observed relationships for key features.
We can further evaluate the effectiveness of the Synthetic Data sample by comparing the feature importance derived from GBM (Gradient Boosting Machine) models trained on the original and Synthetic Data samples.
The chart below shows a similar ranking and magnitude of importance for the various features in explaining customer retention behavior. This chart, along with the previous ones, provides confidence in using the Synthetic Data sample to generate meaningful insights about key drivers of consumer retention behavior.
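"Similar ranking" can be made numeric with Spearman rank correlation between the two importance vectors. The importance values below are hypothetical stand-ins for GBM outputs, chosen only to show the computation.

```python
def spearman_rank_corr(a, b):
    """Spearman correlation between two importance vectors (no ties assumed)."""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i], reverse=True)
        r = [0] * len(v)
        for rank, i in enumerate(order):
            r[i] = rank
        return r

    ra, rb = ranks(a), ranks(b)
    n = len(a)
    d2 = sum((x - y) ** 2 for x, y in zip(ra, rb))
    return 1 - 6 * d2 / (n * (n * n - 1))

# Hypothetical importances from GBMs trained on original vs. synthetic samples.
features = ["Age", "AnnualMiles", "Tenure", "CarAge", "Premium"]
imp_original  = [0.35, 0.25, 0.20, 0.12, 0.08]
imp_synthetic = [0.33, 0.22, 0.24, 0.11, 0.10]
rho = spearman_rank_corr(imp_original, imp_synthetic)
print(f"rank agreement (Spearman rho): {rho:.2f}")
```

A rho near 1 means the two models agree on which drivers matter most, even if the exact magnitudes differ, which is the pattern the chart shows.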
Once the consistency between feature distributions, one-way analysis, and feature importance is confirmed across the original and Synthetic Data, the subsequent step involves tackling a more complex task: evaluating the price elasticity within the renewal model.
A reliable synthetic dataset should demonstrate comparable price elasticity in models derived from both the synthetic and original datasets, indicating the Synthetic Data's viability as a potential substitute for the original dataset.
In this chart we can see that the distribution of price elasticity implied by the Synthetic Data sample is fairly similar to the one derived from the original data sample.
This means that the predicted response of customers to price changes will be similar under both the original and Synthetic Data samples.
This is an important outcome, as it means we can potentially use the Synthetic Data not only to generate meaningful insights but also to translate these insights into actions.
This is further illustrated in the following chart, known in the Earnix lexicon as an Average Curve, or in ML jargon as a Partial Dependency Plot (PDP).
This chart shows the expected change in retention rates associated with changes in premiums relative to last period's premium. While the two curves are not identical, they are reasonably close, and at a minimum the synthetic sample can be used to run multiple simulations and optimization scenarios. This will allow the carrier to test the effectiveness and tradeoffs of different pricing strategies before finalizing one based on the original data.
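Computationally, an Average Curve / PDP holds the price variable at a fixed value, averages the model's predicted retention over the whole portfolio, and repeats across a grid of price values. The retention model and portfolio below are simple invented stand-ins to show the mechanics.

```python
import math
import random

random.seed(3)

def predict_renewal(price_ratio, tenure):
    """Stand-in retention model: renewal probability falls as price rises."""
    z = 1.5 - 3.0 * (price_ratio - 1.0) + 0.1 * tenure
    return 1 / (1 + math.exp(-z))

# Toy portfolio described only by tenure; real models use many features.
portfolio = [random.uniform(0, 10) for _ in range(1000)]

def partial_dependence(price_ratio):
    """Average prediction over the portfolio with price_ratio held fixed."""
    return sum(predict_renewal(price_ratio, t) for t in portfolio) / len(portfolio)

curve = {r: partial_dependence(r) for r in (0.9, 1.0, 1.1, 1.2)}
for r, p in curve.items():
    print(f"price ratio {r:.1f}: avg retention {p:.3f}")
```

Running this procedure twice, once with a model trained on original data and once on synthetic data, yields the two curves being compared in the chart.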
The implication is not just a potential for huge time savings and meeting deadlines, but also the ability to collaborate with research teams inside and outside the company that could greatly enhance the pricing strategy and its impact on business performance.
Applications of Synthetic Data Generation – Going Beyond Data Replication
The real value of Synthetic Data lies not only in serving as a possible replacement for the original dataset but also in augmenting and enriching it for various purposes.
Of particular interest is the ability to augment the original data and simulate data samples that are either beyond the current reach of the original data, or where there is very little data available.
For instance, suppose our current renewal model is trained on data covering car values of up to $30,000. In a scenario in which the company plans to expand into a new region where car values are higher, e.g., reaching up to $50,000, Synthetic Data can play a pivotal role.
By generating samples reflecting higher-valued cars and incorporating them into the original dataset, we can enhance the dataset's representation to include those additional cars. Subsequently, this augmented dataset can be utilized to train models that better encapsulate the characteristics of higher-valued cars (based on the assumption that feature relationships are sustained).
This is achieved via an iterative process that systematically refines and expands the existing dataset through repeated cycles of data synthesis, filtering, and continual retraining of synthetic models. This iterative process continues until predefined thresholds for data quantity and expansion objectives are achieved.
Additionally, a complementary technique for sampling from the expanded data pool was developed within this iterative process, to ensure that the data distribution extends beyond the original sample as smoothly as possible.
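The synthesize-filter-retrain loop can be sketched structurally as follows, using the car-value example from above. The "generator" here is a naive resample-with-noise stand-in rather than a real synthesis model, and all thresholds are illustrative assumptions; only the loop's shape mirrors the process described.

```python
import random

random.seed(5)

def train_generator(data):
    """Toy 'generator': resample observed values and perturb them."""
    def generate(k):
        return [random.choice(data) + random.gauss(0, 2000) for _ in range(k)]
    return generate

def keep(value, lo, hi):
    """Filter step: retain only samples in the target expansion band."""
    return lo <= value <= hi

# Original data covers car values up to $30K; we want the $30K-$50K band.
data = [random.uniform(5000, 30000) for _ in range(500)]
target_lo, target_hi, needed = 30000, 50000, 200

expanded = list(data)
while sum(keep(v, target_lo, target_hi) for v in expanded) < needed:
    generate = train_generator(expanded)   # retrain on the growing pool
    batch = generate(500)                  # synthesize
    expanded.extend(v for v in batch if keep(v, target_lo, target_hi))  # filter

high_band = sum(keep(v, target_lo, target_hi) for v in expanded)
print(f"samples in the $30K-$50K band: {high_band}")
```

Each cycle retrains on the pool including previously accepted samples, so the frontier of the distribution creeps outward gradually rather than jumping, which is the "smooth expansion" property the sampling technique is designed to preserve.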
Data enrichment techniques aid in label balancing, particularly when modeling rare occurrences such as fraud or specific types of car accidents. For instance, oversampling enables a more extensive representation of these infrequent events within the dataset, potentially enhancing modeling accuracy.
Additionally, this method extends beyond labeled features and addresses potential biases in various dataset segments. For example, when certain segments or regions are underrepresented, oversampling these groups through the generation of additional Synthetic Data helps mitigate biases across the dataset, contributing to a more balanced representation for machine learning models.
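In its simplest form, oversampling just resamples the minority class with replacement until the labels balance; synthetic generation replaces the duplicate-with-replacement step with genuinely new minority rows. The toy fraud data below is invented for illustration.

```python
import random

random.seed(11)

def oversample(rows, label_index=1):
    """Resample the minority (label 1) class until labels are balanced."""
    majority = [r for r in rows if r[label_index] == 0]
    minority = [r for r in rows if r[label_index] == 1]
    extra = random.choices(minority, k=len(majority) - len(minority))
    return rows + extra

# Toy fraud data: (feature, label) pairs with ~2% positives.
rows = [(random.random(), 1 if random.random() < 0.02 else 0)
        for _ in range(5000)]
balanced = oversample(rows)
pos = sum(lbl for _, lbl in balanced)
neg = len(balanced) - pos
print(f"positives: {pos}, negatives: {neg}")
```

The same resampling logic applies to underrepresented segments or regions rather than labels: filter on the segment, then top it up with additional (ideally synthetic, not duplicated) rows.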
In this blog, we have explored the technical foundations and practical implications of Synthetic Data generation.
Synthetic Data utilization requires assessing the balance between privacy and data utility, highlighting methods such as Differential Privacy, GANs, and Transformers, each with its own advantages and disadvantages.
Our current research indicates that Synthetic Data can effectively mirror “real” data, as illustrated in our examination of insurance carriers’ customer retention and price elasticity models.
The potential of Synthetic Data extends beyond replication of the original data, offering opportunities for data enrichment and scenario simulation, which can be instrumental in strategic decision-making. Synthetic Data stands as a promising frontier for secure, insightful, and compliant data analysis and business decision-making.