Target Encoding Applied to Insurance: Squeezing More Value from Your Data
Luba Orlovsky
December 11, 2023
In the dynamic world of insurance, predicting risk and demand for policies accurately is paramount. By applying machine learning (ML), insurers are now better equipped than ever to make these predictions with confidence as they seek to increase revenue, improve profitability, and manage risk.
What is Target Encoding?
One ML technique that can significantly improve your models and provide additional insights from your data is “target encoding.”
Target encoding is a method that replaces each category of a categorical variable with the mean of the target variable observed for that category. This technique is particularly useful when dealing with high-cardinality categorical features, a common scenario in insurance datasets. This is why at Earnix we prefer to use the term “high-cardinality encoding.”
Examples in the Insurance Realm
Let me give you two simplified examples to illustrate the idea.
Example 1:
Imagine an insurance company is trying to predict the likelihood of a car accident based on the car's make and model. Traditional encoding methods like one-hot encoding could result in a vast number of binary variables, making the model complex and computationally expensive.
Enter target or high-cardinality encoding. Instead of creating numerous binary variables, each car model is replaced with the average accident rate for that model, significantly simplifying the model without losing essential information.
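To make the idea concrete, here is a minimal sketch in plain Python. The data, the variable names, and the helper `target_encode` are made up for illustration and are not part of any Earnix product; in practice a data-frame library would usually do the same job in one line.

```python
from collections import defaultdict

def target_encode(categories, targets):
    """Map each category to the mean of the target over its rows."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for cat, y in zip(categories, targets):
        sums[cat] += y
        counts[cat] += 1
    return {cat: sums[cat] / counts[cat] for cat in sums}

# Hypothetical toy data: car model, and whether an accident occurred (1) or not (0).
car_models = ["A", "A", "B", "B", "B", "C"]
accident = [1, 0, 0, 0, 1, 1]

encoding = target_encode(car_models, accident)

# A single numeric column replaces what would be one binary column per car model.
encoded_feature = [encoding[m] for m in car_models]
```

Note that this naive version encodes each row with a mean that includes the row's own outcome, which is exactly the leakage risk discussed under the cautions.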
Example 2:
Similarly, in demand modelling, high-cardinality encoding can replace categorical variables like the policyholder's occupation or residential area with the average purchase probability for each category. The resulting demand predictions can support pricing for new customers, making the process more efficient and accurate.
Beyond that, much can be learned by inspecting the value calculated for each category. These values can highlight similarities between categories and let us group similar categories together for modelling purposes, either with clustering techniques or manually.
For example, you may find that two geographic areas behave almost identically even though they are located far from each other, and you can group them together in your model. In that case you do not have to use the target-encoded value in your model at all: you can still proceed with one-hot encoding, but with significantly fewer categories.
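The grouping idea can be sketched as follows. This is only one simple way to do it, assumed here for illustration: categories are sorted by their encoded value and a new group starts whenever the gap to the previous value exceeds a tolerance. The area names, rates, and the `tolerance` value are invented.

```python
def group_similar_categories(encoding, tolerance):
    """Assign a group id to each category; neighbours in sorted order
    share a group when their encoded values differ by at most `tolerance`."""
    ordered = sorted(encoding.items(), key=lambda kv: kv[1])
    groups = {}
    group_id = -1
    previous_value = None
    for cat, value in ordered:
        if previous_value is None or value - previous_value > tolerance:
            group_id += 1  # gap is too large: start a new group
        groups[cat] = group_id
        previous_value = value
    return groups

def one_hot(group_id, n_groups):
    """One-hot vector over the (much smaller) set of groups."""
    return [1 if i == group_id else 0 for i in range(n_groups)]

# Hypothetical target-encoded values (e.g. claim frequency) per geographic area.
area_rates = {"north": 0.051, "south": 0.049, "east": 0.120,
              "west": 0.118, "central": 0.200}

groups = group_similar_categories(area_rates, tolerance=0.01)
n_groups = len(set(groups.values()))

# Five areas collapse into three one-hot columns.
north_features = one_hot(groups["north"], n_groups)
```

Because each value is compared only with its nearest neighbour, a long chain of closely spaced categories can end up in a single group, so the result is worth checking by eye before it goes into a model.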
Cautions with Target Encoding
However, as with any powerful tool, target encoding must be used judiciously.
It's crucial to avoid data leakage from the validation or test sets into the training set. Techniques such as cross-validation, where each row's encoded value is computed only from data in other folds, or smoothing can be used to reduce this risk and keep the model robust. If your data has a time stamp, the order of the records should also be respected, so that future outcomes are never used to encode past observations.
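Both precautions can be sketched in a few lines of plain Python. The function names, the fold assignment by row index, and the `prior` fallback are assumptions made for this illustration; a real project would typically shuffle or stratify the folds.

```python
from collections import defaultdict

def out_of_fold_encode(categories, targets, n_folds=3):
    """Encode each row with category means computed from the other folds only."""
    if n_folds < 2:
        raise ValueError("n_folds must be at least 2")
    n = len(categories)
    encoded = [None] * n
    for fold in range(n_folds):
        sums, counts = defaultdict(float), defaultdict(int)
        total_sum, total_count = 0.0, 0
        # Collect statistics from rows outside the current fold.
        for i in range(n):
            if i % n_folds != fold:
                sums[categories[i]] += targets[i]
                counts[categories[i]] += 1
                total_sum += targets[i]
                total_count += 1
        fallback = total_sum / total_count  # mean of the other folds
        # Encode rows inside the current fold; unseen categories get the fallback.
        for i in range(n):
            if i % n_folds == fold:
                cat = categories[i]
                encoded[i] = sums[cat] / counts[cat] if counts[cat] else fallback
    return encoded

def time_ordered_encode(categories, targets, prior):
    """Encode each row using only earlier rows of the same category.
    Rows are assumed to be sorted by time stamp already."""
    sums, counts = defaultdict(float), defaultdict(int)
    encoded = []
    for cat, y in zip(categories, targets):
        encoded.append(sums[cat] / counts[cat] if counts[cat] else prior)
        sums[cat] += y  # the row's own outcome is added only after encoding it
        counts[cat] += 1
    return encoded

# Hypothetical toy data, already sorted by time.
cats = ["A", "A", "A", "B", "B", "B"]
ys = [1, 0, 1, 0, 0, 1]

oof = out_of_fold_encode(cats, ys, n_folds=3)
ordered = time_ordered_encode(cats, ys, prior=0.5)
```

In both functions a row's own outcome never contributes to its encoded value, which is the property that matters for avoiding leakage.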
Another thing to consider is the size of each category. If a category is represented by only a few instances, it would not be wise to trust its average value as much as that of a category with thousands of representatives. A Bayesian approach can help to solve this problem by taking a weighted average of the category's own mean and the overall mean of the whole population, with the category's weight growing with its size.
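One common way to express this weighting, shown here as a sketch rather than as the method Earnix uses, is to shrink each category mean towards the overall mean as if `prior_weight` extra observations at the overall mean had been added to every category. The parameter name `prior_weight`, its value, and the data are assumptions for the example.

```python
from collections import defaultdict

def smoothed_target_encode(categories, targets, prior_weight=10.0):
    """Shrink each category mean towards the global mean.

    encoded = (category_sum + prior_weight * global_mean)
              / (category_count + prior_weight)
    """
    global_mean = sum(targets) / len(targets)
    sums, counts = defaultdict(float), defaultdict(int)
    for cat, y in zip(categories, targets):
        sums[cat] += y
        counts[cat] += 1
    return {
        cat: (sums[cat] + prior_weight * global_mean) / (counts[cat] + prior_weight)
        for cat in sums
    }

# Hypothetical data: a rare category with 2 rows and a common one with 98 rows.
categories = ["rare"] * 2 + ["common"] * 98
targets = [1, 1] + [1] * 8 + [0] * 90  # overall mean is 0.10

smoothed = smoothed_target_encode(categories, targets, prior_weight=10.0)
```

With these numbers the rare category's raw mean of 1.0 is pulled down to 0.25, while the common category barely moves from its raw mean of about 0.082. The right value for `prior_weight` depends on the data and is usually tuned like any other hyperparameter.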
Productizing High-Cardinality Encoding
Now that we've laid out some of the reasons behind using high-cardinality encoding in your work, let's see how Earnix can help.
On the Earnix product front, high-cardinality encoding is part of the Earnix Model Accelerator suite, which offers integration with external machine learning models and provides insurers with new, advanced ML methodologies.
Summing Up
Target or high-cardinality encoding is proving to be an easy yet powerful tool in the insurance industry.
By simplifying complex categorical variables and enhancing predictive accuracy, it is helping insurers make more informed decisions, ultimately benefiting both the company and its policyholders.
Editor’s Note
You may also be interested in another recent blog post by Luba Orlovsky, on Automatic Generalized Linear Models (AGLM).
To learn more about Model Accelerator, we invite you to view the presentation by Yaron Lavie, former Earnix Vice President of Products, at the recent Earnix Excelerate 2023 conference in London. Model Accelerator is just one of the innovative product roadmap topics covered in his presentation.
You can find all the 2023 conference presentations here.