Many insurance companies are currently grappling with a new type of data: behavioral telematics data. After decades of estimating risk using “classical” variables, such as age or vehicle type, actuaries and analysts suddenly have to deal with highly detailed data about driving habits. This data comes in different degrees of resolution and can be difficult to process and analyze.
It’s a challenge and an opportunity at the same time. In response, many insurance companies rely on a single “telematics-based driving score,” either supplied by a telematics vendor or developed in-house, that aggregates the highly detailed telematics data. The companies then use this telematics score in their risk models. But is this the best way to utilize telematics data?
Well, not really.
This approach doesn’t provide the modeling flexibility needed to completely and accurately model risk, which is critical to improving the success of telematics-based programs.
Understanding the Shortcomings in Common Telematics Approaches
To understand why, let’s compare three approaches: a model using only classic variables, a classic + telematics score model, and an all-variables model.
We’ll use the publicly available data published recently by So, Boucher, and Valdez from the University of Connecticut and the Université du Québec à Montréal. This is a synthetic data set, based on real Canadian telematics data. The data set includes classic variables used by insurance companies, such as age, gender, and credit score, alongside telematics variables, such as the number of hard brakes and acceleration events, mileage, and percentage of trips done during the weekend.
Using this data, we will try to predict the probability that a driver has a claim. About 4% of the drivers in the data had at least one claim. We will start with some descriptive statistics to get a basic feel for the data.
Here are the conditional distributions of three of the classic variables: number of years without claims, car age, and credit score. In each figure, the yellow bars represent the distribution for customers who had claims, and the blue bars represent the distribution for customers who didn’t have any claims.
Insurance Telematics Variables
But telematics variables also deliver valuable predictive power. The three charts below present conditional distributions of miles driven per day, the number of relatively hard brakes per 100 miles, and the share of drives during afternoon rush hours. We can see that individuals who drive more miles per day, brake hard more often, and drive more during rush hours tend to have more claims. The data includes many more telematics variables, such as accelerations, turns, and the share of trips on different weekdays.
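For readers who want to reproduce this kind of chart, here is a minimal sketch of how a conditional distribution can be computed: drivers are split by claim status, and the share of drivers per bin is calculated within each group. The function name and the sample values are made up for illustration and are not taken from the data set.

```python
from collections import Counter

def conditional_distribution(values, claims, bin_width):
    """Share of drivers per bin, computed separately for each claim group."""
    counts = {0: Counter(), 1: Counter()}
    for value, claim in zip(values, claims):
        # Label each bin by its lower edge, e.g. 0, 20, 40 for bin_width=20.
        bin_start = int(value // bin_width) * bin_width
        counts[claim][bin_start] += 1
    return {
        group: {b: n / sum(ctr.values()) for b, n in sorted(ctr.items())}
        for group, ctr in counts.items()
    }

# Invented miles-per-day values, for illustration only.
miles_per_day = [5, 12, 18, 22, 35, 8, 41, 27, 33, 48]
had_claim = [0, 0, 0, 0, 1, 0, 1, 0, 0, 1]

dist = conditional_distribution(miles_per_day, had_claim, bin_width=20)
for group, shares in dist.items():
    print("claim" if group else "no claim", shares)
```

Because each group is normalized separately, the two sets of bars can be compared directly even though only a small share of drivers had a claim.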
Creating a Telematics Competition
Now let us do a little “horse race” between three alternatives:
- Classic model – use only classic variables to predict risk.
- Classic + score model – aggregate the telematics data into a telematics score, and use this score in combination with classic variables, as currently done by many insurance companies.
- All-variables model – use all the variables together in one model.
For all the alternatives, we will use the same modeling technique: an XGBoost model, with hyperparameter tuning via grid search, stratified k-fold cross-validation, and class weights designed to handle the imbalance in the outcome variable. Each alternative will be evaluated by the area under the ROC curve (AUC) on the test set.
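As a rough illustration of two of these ingredients, stratified folds and imbalance weights, here is a minimal pure-Python sketch. It is not the code used for the results in this post: the function names are hypothetical, and the 1,000-driver label vector is invented to mirror the roughly 4% claim rate. In practice one would more likely rely on library implementations, such as scikit-learn's StratifiedKFold and XGBoost's scale_pos_weight parameter.

```python
import random

def stratified_folds(labels, k=5, seed=0):
    """Split row indices into k folds that keep each class's share per fold."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        for j, i in enumerate(idx):
            folds[j % k].append(i)  # deal indices round-robin across folds
    return folds

def positive_class_weight(labels):
    """Ratio of negatives to positives, a common weight for the rare class."""
    n_pos = sum(labels)
    return (len(labels) - n_pos) / n_pos

# Invented labels: 40 drivers with a claim, 960 without (a 4% claim rate).
labels = [1] * 40 + [0] * 960

folds = stratified_folds(labels, k=5)
fold_rates = [sum(labels[i] for i in fold) / len(fold) for fold in folds]
weight = positive_class_weight(labels)

print(fold_rates)  # claim rate inside each fold
print(weight)      # weight applied to drivers with a claim
```

Stratification keeps the rare claim cases evenly spread across folds, so no validation fold ends up with too few claims to produce a meaningful AUC.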
For alternative #2, we will first use only the telematics variables to create a score using the same modeling technique, and then run another XGBoost grid-search using classic variables and the score calculated in the first stage.
Which approach is best?
Which of the three alternatives came out on top? And by how much?
The results are summarized in the following plot:
Using a telematics score increased the test-set AUC from 0.77 to 0.81 compared with using only classic variables, a gain of 0.04. Using all the variables together increased it to 0.90, a gain of 0.13 over the classic model. According to these results, using a telematics score is simply not as effective as using all the variables in one model.
Of course, this result is specific to our setting, and the numbers might differ for other data sets and other models. But a model that can use all the variables freely has at least as much information to work with as one limited to the classic variables plus an aggregated score based on the telematics variables. Once the model is allowed to use all the variables, it can fully exploit the interactions between classic and telematics variables, and according to the results here, those interactions matter.
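To see how an aggregated score can lose interaction information, consider a deliberately simplified toy example. Everything in it is an assumption made for illustration: the data is invented (one binary classic variable c and two binary telematics variables t1 and t2), the "models" are simple claim-rate lookup tables standing in for XGBoost, and AUC is computed on the same rows used for fitting. In this toy world, t1 drives risk for drivers with c = 0 and t2 drives risk for drivers with c = 1.

```python
from collections import defaultdict
from itertools import product

def auc(labels, scores):
    """Area under the ROC curve by pairwise comparison (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def fit_group_means(rows, key):
    """Toy 'model': the observed claim rate within each group of key(row)."""
    totals = defaultdict(lambda: [0, 0])
    for row in rows:
        cell = totals[key(row)]
        cell[0] += row["claim"]
        cell[1] += 1
    rates = {k: claims / n for k, (claims, n) in totals.items()}
    return lambda row: rates[key(row)]

# Invented data: 100 drivers per cell; 10 claims in risky cells, 2 otherwise.
rows = []
for c, t1, t2 in product((0, 1), repeat=3):
    risky = t1 if c == 0 else t2  # which telematics variable matters depends on c
    n_claims = 10 if risky else 2
    for i in range(100):
        rows.append({"c": c, "t1": t1, "t2": t2, "claim": int(i < n_claims)})
labels = [r["claim"] for r in rows]

# 1. Classic model: classic variable only.
classic = fit_group_means(rows, lambda r: r["c"])

# 2. Classic + score model: first compress telematics into one score, then combine.
stage1 = fit_group_means(rows, lambda r: (r["t1"], r["t2"]))
for r in rows:
    r["score"] = stage1(r)
classic_score = fit_group_means(rows, lambda r: (r["c"], r["score"]))

# 3. All-variables model: every variable available at once.
all_vars = fit_group_means(rows, lambda r: (r["c"], r["t1"], r["t2"]))

auc_classic = auc(labels, [classic(r) for r in rows])
auc_score = auc(labels, [classic_score(r) for r in rows])
auc_all = auc(labels, [all_vars(r) for r in rows])
print(round(auc_classic, 3), round(auc_score, 3), round(auc_all, 3))
```

In this toy setup the classic model cannot separate drivers at all, the classic + score model does better, and the all-variables model does best. The score assigns the same value to drivers with (t1 = 1, t2 = 0) and drivers with (t1 = 0, t2 = 1), so the second-stage model can no longer tell which of the two behaviors a driver exhibits, even though that is exactly what matters once the classic variable is known.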
The lesson learned from this example is simple: Insurance companies should use all the available telematics data in the models, and not only one aggregated score.
A Better Telematics Data Solution
However, it is not easy to fully integrate telematics data into production. To do that, insurance companies need a flexible platform that handles data and models for both classical and telematics features, and that can automatically aggregate high-frequency telematics data. The combined, fully integrated solution of Earnix Drive-It and Earnix’s pricing software, Price-It, addresses this specific need.
These integrated solutions provide a powerful, end-to-end platform that delivers a number of valuable benefits:
- Seamless integration between all applications and technology, eliminating manual efforts.
- A field-tested risk calculation solution based on more than five billion miles of driver data and more than one million drivers.
- Personalized offerings in real time to improve customer experience and retention.
The future of the insurance industry lies in gathering new data and utilizing it to its full potential, and a combination of classic and telematics data in all models is an essential part of this vision. Stay tuned for more articles on telematics insurance, or for more information, download our UBI eBook today.