Data Makes the World Go ‘Round
According to an article in The Economist, the world's most valuable resource today is no longer oil, or gold, or diamonds, but data. Data has become the lifeblood of modern business, particularly with the rise of artificial intelligence (AI), which businesses now apply to all sorts of challenges.
Data is an invaluable asset in any business, but particularly in businesses such as insurance and banking, where the products are intangibles rather than physical goods. Much of these businesses’ revenue, profitability, and market share is driven by using data as a competitive weapon.
As Thomas Wilson, CEO of Allstate, put it, “We are a customer-focused data company.” Note no use of the words “insurance” or “financial services” in his definition.
In this blog post, we turn our attention to synthetic data.
While it has been in use in academia for 30 years, synthetic data has only recently found its way into mainstream commercial use. Gartner estimates that in 2021 only 1% of the data used for AI and analytics projects was synthetic, but predicts that by 2024 that figure will grow to an astonishing 60%.
In this post, we’ll look at what’s fueling the rise in synthetic data, and how you can put it to use. The generation and harnessing of synthetic data present opportunities, challenges, and some ethical considerations that you’ll need to keep in mind as you plan and execute on your business strategies and programs.
Swimming in a Sea of Data
We are surrounded by data 24/7/365. We’ve all used tabular/structured data and various analytical techniques to solve business problems for years, but more and more data is now unstructured and dispersed. With the rise of intelligent mobile devices, wearables, smart cars, smart homes, and IoT, just to name a few such diverse sources, we are now not just consumers of data, but generators of vast amounts of data as well. This “real” data is put to uses, mostly good and sometimes not so good, every day, particularly in dealing with us as consumers.
The Rise of Synthetic Data
What if you could produce infinite amounts of the world’s most valuable resource? What business leverage would that generate? Unlike producing more oil, gold, or diamonds, producing vast amounts of data can be accomplished relatively quickly and cheaply.
Data that’s produced, rather than gathered, is termed “synthetic data.” Synthetic data technology enables practitioners to digitally generate the data that they need, on demand, in whatever volume they require, tailored to their precise specifications. A bit like printing money, but legal.
While commercial discussions of synthetic data have accelerated recently, its origins go back 30 years.
In 1993, Donald Rubin, a Harvard University professor, sought to analyze data from the 1990 US census in a way that would protect citizens’ personally identifiable data. He was looking for a way to “mask” that data so that it would not be exposed to the outside world, while retaining the essence of the underlying data for analysis. This would ensure that the analysis drew the right conclusions, without exposing data such as Social Security numbers, home addresses, etc.
Rubin’s solution was ground-breaking. He had produced synthetic data, and in doing so added the term to our vocabulary. Over the years, his approach has been used extensively by statisticians, economists, and medical researchers.
The Growing Commercial Use of Synthetic Data
Techniques such as generative adversarial networks (GANs) and Transformers, along with LLM applications such as ChatGPT, have accelerated the production of synthetic data enormously. The thirst for business and competitive advantage, combined with this ability to generate synthetic data more readily, has allowed its use in solving multiple commercial problems:
Filling Data Gaps
Synthetic data is often used when actual data is limited or sensitive, serving as a crucial bridge between the aspirations of insurers and banks, the InsurTechs/FinTechs they employ (such as Earnix), and the data they need to grow and thrive.
With synthetic data, InsurTechs/FinTechs (many of which are startups looking for a competitive advantage) and their financial services clients can access the data they need to develop and test their products without compromising data security and customer privacy. As a result, the insurance sector has experienced a surge in new ideas and technologies that are transforming the industry.
One word of caution here: don’t go to extremes or think that synthetic data is a perfect substitute for “real” data. Relying too heavily on synthetic data can discourage efforts to improve the collection and curation of real data, potentially leading to long-term data quality issues.
Privacy and Security
Synthetic data preserves confidentiality, while offering the ability to perform critical analysis. In a world of widespread data availability (both legal and in such places as the dark web) those who store, handle, and manipulate data must go beyond simple anonymization. They need to use synthetic data techniques to make the subjects of that data comfortable that their personal data will not be made available for identity theft, fraud, and other nefarious uses.
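To make the idea of generating records rather than anonymizing them more concrete, here is a minimal sketch in Python. It is only an illustration under stated assumptions, not a description of any particular vendor's method: the field names (age, annual_premium, region), the sample values, and the helper functions fit_marginals and sample_synthetic are all hypothetical. The sketch summarizes each column of a small "real" dataset on its own and then draws new records from those summaries, so no original row is copied.

```python
import random
import statistics
from collections import Counter

def fit_marginals(records, numeric_fields, categorical_fields):
    """Summarize each column independently (hypothetical helper)."""
    model = {}
    for field in numeric_fields:
        values = [r[field] for r in records]
        # Keep only the mean and standard deviation of a numeric column.
        model[field] = ("numeric", statistics.mean(values), statistics.stdev(values))
    for field in categorical_fields:
        counts = Counter(r[field] for r in records)
        # Keep only the observed categories and how often each occurs.
        model[field] = ("categorical", list(counts.keys()), list(counts.values()))
    return model

def sample_synthetic(model, n, seed=0):
    """Draw n new records from the per-column summaries."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        row = {}
        for field, spec in model.items():
            if spec[0] == "numeric":
                _, mean, stdev = spec
                row[field] = rng.gauss(mean, stdev)
            else:
                _, categories, weights = spec
                row[field] = rng.choices(categories, weights=weights, k=1)[0]
        rows.append(row)
    return rows

# Invented example records; not real customer data.
real = [
    {"age": 34, "annual_premium": 820.0, "region": "north"},
    {"age": 51, "annual_premium": 1140.0, "region": "south"},
    {"age": 27, "annual_premium": 960.0, "region": "north"},
    {"age": 45, "annual_premium": 1010.0, "region": "east"},
]
model = fit_marginals(real, ["age", "annual_premium"], ["region"])
synthetic = sample_synthetic(model, n=1000)
print(synthetic[0])
```

Two caveats apply. Sampling each column independently discards the relationships between columns (for example, between age and premium), which more capable generators try to preserve. And a sketch like this offers no formal privacy guarantee on its own; whether synthetic output can be traced back to individuals still has to be assessed.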
Synthetic data supports more, and more accurate, modeling, simulation, and testing.
One of the significant advantages of synthetic data in risk assessment is its ability to account for extreme or rare events (think floods, hurricanes, droughts, wildfires, etc.). In the real world, certain events may occur infrequently, but have a significant impact when they do. With synthetic data, insurers can simulate these rare occurrences and assess their potential consequences, allowing them to develop more robust risk models without waiting for an actual adverse event to occur.
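As a rough illustration of simulating rare events, the sketch below generates many synthetic "years" of catastrophe losses and reads off a tail estimate. Everything in it is an assumption made for the example: the event rate of 0.2 per year, the Pareto severity with shape 2.5, the minimum loss of 1,000,000, and the function names simulate_annual_losses and quantile. A real risk model would calibrate these choices to actual exposure and peril data.

```python
import random

def simulate_annual_losses(n_years, event_rate, pareto_shape, min_loss, seed=0):
    """Simulate total catastrophe loss per year (illustrative assumptions only)."""
    rng = random.Random(seed)
    annual_losses = []
    for _ in range(n_years):
        total = 0.0
        # Events arrive as a Poisson process: exponential waiting times
        # are accumulated until the one-year window is used up.
        elapsed = rng.expovariate(event_rate)
        while elapsed < 1.0:
            # Heavy-tailed severity: paretovariate returns values >= 1.
            total += min_loss * rng.paretovariate(pareto_shape)
            elapsed += rng.expovariate(event_rate)
        annual_losses.append(total)
    return annual_losses

def quantile(values, q):
    """Simple empirical quantile based on sorted order."""
    ordered = sorted(values)
    index = min(int(q * len(ordered)), len(ordered) - 1)
    return ordered[index]

losses = simulate_annual_losses(
    n_years=10_000, event_rate=0.2, pareto_shape=2.5, min_loss=1_000_000
)
share_with_event = sum(1 for loss in losses if loss > 0) / len(losses)
print("Share of years with at least one event:", share_with_event)
print("Estimated 1-in-100-year annual loss:", quantile(losses, 0.99))
```

With ten thousand simulated years, the tail of the loss distribution contains many more observations than a few decades of history could provide, which is what lets a modeler study extreme outcomes without waiting for them to happen. The estimates are only as good as the assumed frequency and severity.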
Synthetic Data Use Cases
Beyond these broad categories, there are several more granular use cases that can benefit from the use of synthetic data:
Simulations, Stress Testing, and Data Expansion
By generating synthetic data for labeling, stress testing, and simulation, insurers can improve the performance of their predictive models, leading to better fraud detection, conversion models, and other analytical insights. One of the strongest use cases is the ability to create simulations and stress test models when they are applied to new situations. This changes the way we think about introducing new products, serving new or expanded segments, or entering territories where we have little to no data.
Reducing Bias through Feature Balancing
Synthetic data can help balance features like gender, residence, and other sensitive attributes, mitigating biases present in the original data.
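One simple way to picture feature balancing is to add synthetic records for under-represented groups until every group is the same size. The sketch below does this by interpolating between pairs of existing records in the smaller group, a simplified variant loosely inspired by SMOTE-style oversampling rather than the full algorithm (it does not use nearest neighbors). The residence and claim_cost fields, the sample values, and the function balance_by_interpolation are hypothetical.

```python
import random

def balance_by_interpolation(records, group_field, numeric_fields, seed=0):
    """Pad smaller groups with interpolated synthetic records (illustrative)."""
    rng = random.Random(seed)
    groups = {}
    for record in records:
        groups.setdefault(record[group_field], []).append(record)
    target = max(len(members) for members in groups.values())

    # Keep every original record and flag it as not synthetic.
    balanced = [dict(record, synthetic=False) for record in records]
    for value, members in groups.items():
        for _ in range(target - len(members)):
            a, b = rng.choice(members), rng.choice(members)
            t = rng.random()
            new_record = {group_field: value, "synthetic": True}
            for field in numeric_fields:
                # A point on the line segment between two real records.
                new_record[field] = a[field] + t * (b[field] - a[field])
            balanced.append(new_record)
    return balanced

# Invented records: six urban customers, two rural customers.
customers = [
    {"residence": "urban", "claim_cost": 1200.0},
    {"residence": "urban", "claim_cost": 900.0},
    {"residence": "urban", "claim_cost": 1500.0},
    {"residence": "urban", "claim_cost": 1100.0},
    {"residence": "urban", "claim_cost": 1300.0},
    {"residence": "urban", "claim_cost": 1000.0},
    {"residence": "rural", "claim_cost": 1800.0},
    {"residence": "rural", "claim_cost": 2200.0},
]
balanced = balance_by_interpolation(customers, "residence", ["claim_cost"])
rural_count = sum(1 for r in balanced if r["residence"] == "rural")
urban_count = sum(1 for r in balanced if r["residence"] == "urban")
print("urban:", urban_count, "rural:", rural_count)
```

Equalizing group sizes is only one ingredient in reducing bias. If the few real records in a small group are themselves unrepresentative, interpolating between them reproduces that problem, so balanced data should still be checked against what is known about the real population.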
Increasing Model Stability
Synthetic data generation increases the quantity of data available for model training, resulting in more stable and robust models.
Accelerating QA Testing
Synthetic data can be rapidly employed for quality assurance (QA) testing, without compromising data security or raising privacy concerns.
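For QA purposes, synthetic records often do not need to resemble real customers statistically; they only need to be well formed and repeatable. The sketch below generates made-up policy records from a fixed random seed so that a test run can be reproduced exactly. The record layout, the product list, the premium range, and the function make_test_policies are assumptions chosen for illustration.

```python
import random
from datetime import date, timedelta

# Hypothetical product lines used only for this example.
PRODUCTS = ["auto", "home", "life"]

def make_test_policies(n, seed=42):
    """Generate n fabricated policy records; the same seed gives the same data."""
    rng = random.Random(seed)
    start = date(2023, 1, 1)
    policies = []
    for i in range(n):
        effective = start + timedelta(days=rng.randrange(365))
        policies.append({
            "policy_id": f"TEST-{i:06d}",  # clearly marked as test data
            "product": rng.choice(PRODUCTS),
            "effective_date": effective.isoformat(),
            "expiry_date": (effective + timedelta(days=365)).isoformat(),
            "annual_premium": round(rng.uniform(300, 3000), 2),
        })
    return policies

policies = make_test_policies(100)
print(policies[0])
```

Because none of the values are derived from real customers, a dataset like this can be shared freely with testers and vendors, and the TEST- prefix makes it easy to spot if a record ever leaks into a production system.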
Saving Time in Evaluating Data Privacy Risks
Utilizing synthetic data can speed up the evaluation of data privacy risks before using actual data, saving significant time and effort.
In the past, data privacy and sensitivity have stood in the way of working with other industry players, vendors, academic institutions, and even other functions within the organization itself. Synthetic data can be used to overcome these limitations, leading to faster and more comprehensive solutions to common problems.
Like any powerful technology, synthetic data raises several ethical considerations that must be addressed.
For example, do stakeholders, especially policyholders, have a right to know how their data is being used and processed in the generation of synthetic data? How can insurers ensure that the synthetic datasets they use faithfully represent their customers without revealing personally identifiable information (PII)? And, even though the data is not identifiable, do its subjects (e.g., customers) need to provide permission to use synthetic data that has been derived from their “real” data?
Insurers wrestle with similar questions in their daily work, and need to add this set of questions specific to synthetic data to their ethics discussions. Otherwise they might risk losing credibility and trust among their constituencies. They need to define specific rules about the usage of synthetic data, and to ensure that employees understand and consistently abide by those rules.
Addressing these ethical considerations requires collaboration between data scientists, senior management, domain experts, legal professionals, and policymakers. It's important to establish guidelines, standards, and best practices for the responsible generation and use of synthetic data, to ensure that its benefits are maximized while potential harms are minimized.
Synthetic data’s use in commercial enterprises is in its infancy, but it holds the promise of helping to solve any number of critical problems in financial services. It opens opportunities for financial services companies like never before: the ability to generate endless amounts of data could dramatically change the industry and the level of collaboration between different parties. When combined with AI, machine learning (ML), and advanced analytics, it can help address such vexing and time-consuming issues as filling data gaps, protecting privacy and security, and detecting fraud, among others.
At the same time, insurers and bankers must take care to address ethical issues such as data privacy and transparency, consumer consent, and general openness, if synthetic data is to fulfill its potential with the greater good in mind.
To Learn More
This topic was the subject of my recent presentation at the Excelerate 2023 customer conference in London. You can view the presentation in its entirety here, and review all of the presentations from the conference here.