Every year on my daughter’s birthday, I bake her a cake. Her favorite is chocolate. I found the recipe on the internet a few years ago, and it really made my life easier! You know, two eggs, 200 g of flour, 200 ml of milk… I just pull the recipe out and follow the instructions, no need to think too much about how to proceed. This year, it suddenly hit me that baking a cake with a well-known recipe is not too different from… using a GLM regression to make a prediction.
Yes, that’s right! If you think of it, you will see that there is a lot in common between those two processes. Let’s dive into this even more.
To make a cake, you need specific ingredients: eggs, milk, flour, etc. The same applies to a model: you need variables such as age, tenure, and price.
For the cake, you should know the quantities of the ingredients. If you put in 500 g of flour and only one egg, you will probably not get the best result. The same holds for the model: each variable has its own coefficient, beta, that defines the variable’s impact on the prediction.
In terms of appearance, you decide how the cake will look by choosing an ideal shape. It can be round, square, or an English loaf, or even some custom shape. When fitting a GLM, you select a link function to control the “shape” of the outcome – for example, identity or logit.
And in the end, this specific cake that you have just baked might be different from expected. Just like in model prediction you will always observe some differences between the prediction and the actual result.
Why GLM?
Generalized Linear Models (GLMs) are already widely used in the banking and insurance industries: regulators and professionals are familiar with them, they are relatively easy to explain, and they are very transparent, since a GLM can be written out as a formula. The formula is “self-explaining”: a change in the prediction of a GLM is directly explained by a change in the predictors (e.g., age or car model).
Let me explain what I mean by “self-explaining.” Forget for a moment about the cakes. Let us imagine an ice cream store owner who, after several years of observations, has figured out that the number of ice creams sold depends directly on the weather outside – specifically, on the temperature and on whether it is raining or not. With this relationship in hand and a good weather forecast, they can now foresee the demand for the next day and be ready to provide an additional 350 ice creams if tomorrow is 10 degrees warmer.
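The owner’s “formula” might look something like the sketch below. The coefficients here are made up for illustration (35 extra sales per degree, 120 fewer on rainy days); with an identity link, the prediction is just a direct linear combination, so every change in a predictor translates one-to-one into a change in the prediction:

```python
# Hypothetical identity-link GLM for ice cream demand.
# All coefficients are invented for illustration only.
BASE = 200           # expected sales at 0 degrees with no rain (hypothetical)
PER_DEGREE = 35      # extra sales per degree of temperature (hypothetical)
RAIN_EFFECT = -120   # effect of rain on sales (hypothetical)

def predicted_sales(temperature, is_raining):
    """Identity link: the prediction is a direct linear combination."""
    return BASE + PER_DEGREE * temperature + RAIN_EFFECT * int(is_raining)

# A dry day that is 10 degrees warmer means 10 * 35 = 350 extra ice creams:
print(predicted_sales(30, False) - predicted_sales(20, False))  # 350
```

This is exactly what makes the model “self-explaining”: anyone can read the effect of each variable straight off the formula.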
Why is it hard to build effective GLMs in Insurance & Banking?
If GLM is such a nice model, what is the catch? No catch really, except it can be very hard to build an effective GLM. Think again of our cakes. If you have a good recipe from your grandma that always works out perfectly and you do not need to change anything – great! But what if there is an allergic kid in the class, and you cannot use flour or eggs? Suddenly your recipe does not work, and you need a new one. And now it’s time to be creative. There are lots of questions to answer: How to replace those ingredients? What is the right proportion and combination? When do you add sugar? How long should you bake it and at what temperature? Think how many cakes are wasted before the ultimate recipe is found.
The same thing happens with your GLM. Let’s imagine there is a very nice prediction model that considers lots of factors and works just right. There is nothing to worry about until something changes, and then you need to fit a new model! And here comes the hard part. You need a new recipe for your model – the best variables, variable interactions, new binning and grouping of the variables. In other words, you are back to the most challenging and time-consuming stage – feature selection and engineering. Fitting an effective GLM is a combination of art and science, and often you need a skillful “artist” to build one.
Fortunately, machine learning comes to the rescue. The new modeling Earnix application, Automatic GLM (AGLM), automates the hardest modeling step by applying different machine learning techniques behind the scenes. It performs the hard work almost without any intervention; specifically, it carries out the following activities:
Initial variable ranking
Variables are initially ranked by fitting a boosting model. The model itself is not used; only its variable importance is exposed to the user, who can decide which variables to keep.
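The ranking idea can be sketched as follows (a minimal illustration with scikit-learn, not Earnix’s implementation): fit a boosting model purely to read off its variable importances, then let the user keep the top variables.

```python
# Sketch: rank variables by the importances of a throwaway boosting model.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))  # columns: age, tenure, price, noise (toy data)
y = 3.0 * X[:, 0] + 1.5 * X[:, 2] + rng.normal(scale=0.1, size=500)

booster = GradientBoostingRegressor(random_state=0).fit(X, y)

names = ["age", "tenure", "price", "noise"]
ranking = sorted(zip(names, booster.feature_importances_),
                 key=lambda pair: -pair[1])
print(ranking)  # "age" and "price" should dominate the ranking
```

The boosting model is then discarded; only the ranking informs which variables enter the GLM.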
Binning of continuous variables
Continuous variables, like age, are split into many fine bins, and the algorithm merges them through lasso regularization, so that the final bins are well suited to the model.
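One common way to realize this idea (a hedged sketch under my own assumptions, not necessarily AGLM’s exact method) is to encode the fine bins as cumulative “step” indicators and fit a lasso: every coefficient the L1 penalty drives to zero merges two adjacent bins, so only the bin boundaries that matter survive.

```python
# Sketch: merge fine bins of a continuous variable via lasso regularization.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(1)
age = rng.uniform(18, 80, size=2000)
# Toy target with a piecewise-constant age effect: jumps at 30 and 60.
y = np.where(age < 30, 1.0, np.where(age < 60, 2.0, 1.4))
y += rng.normal(scale=0.05, size=age.size)

edges = np.arange(20, 80, 2)                       # many candidate boundaries
steps = (age[:, None] >= edges[None, :]).astype(float)  # cumulative dummies
model = Lasso(alpha=0.01).fit(steps, y)

kept = edges[np.abs(model.coef_) > 1e-3]           # surviving boundaries
print(kept)  # only edges near the true jumps (30 and 60) should remain
```

Zeroed coefficients mean “no jump here,” so neighboring fine bins collapse into one; the handful of non-zero coefficients define the final binning.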
Finding interactions
The algorithm searches for two-way interactions by fitting and analyzing boosting models. This step is optional, and the underlying model can be tuned manually.
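A rough intuition for why boosting models can reveal interactions (again, a sketch, not the AGLM algorithm): depth-1 trees can only model additive effects, so if a depth-2 boosting model fits clearly better, some pairwise interaction must be at work.

```python
# Sketch: detect the presence of two-way interactions by comparing
# additive (depth-1) and pairwise (depth-2) boosting fits.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(2)
X = rng.normal(size=(1500, 3))
y = X[:, 0] + X[:, 1] + 2.0 * X[:, 0] * X[:, 2]   # interaction: cols 0 and 2

additive = GradientBoostingRegressor(max_depth=1, random_state=0).fit(X, y)
pairwise = GradientBoostingRegressor(max_depth=2, random_state=0).fit(X, y)

# A large gap in fit quality signals that interactions matter.
print(additive.score(X, y), pairwise.score(X, y))
```

Analyzing which variable pairs appear together within the depth-2 trees then points to the specific interactions worth adding to the GLM.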
Categorical variables with lots of categories
AGLM can deal with categorical variables with a large number of categories through target encoding. A user can choose to apply it when they know the number of categories is large, or go with one-hot encoding instead.
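The basic idea of target encoding is easy to show (a minimal sketch of the general technique; AGLM’s exact variant is not shown here): replace each category with the mean outcome observed for it, turning a high-cardinality column into a single numeric feature instead of hundreds of one-hot dummies.

```python
# Sketch: target-encode a high-cardinality categorical variable.
import pandas as pd

df = pd.DataFrame({
    "car_model": ["A", "A", "B", "B", "B", "C"],   # toy categories
    "claim":     [100, 140, 300, 320, 310, 80],    # toy outcome
})

# Mean outcome per category becomes the encoded value.
means = df.groupby("car_model")["claim"].mean()
df["car_model_te"] = df["car_model"].map(means)
print(df["car_model_te"].tolist())  # [120.0, 120.0, 310.0, 310.0, 310.0, 80.0]
```

In practice the encoding is computed out-of-fold (or with smoothing) to avoid leaking the target into the feature; the sketch above omits that for brevity.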
But wait, there is more: AGLM also provides interactive updating of coefficients, support for hierarchical tables, and various other important features. The end result is an effective, predictive, and transparent GLM that can be used as is or serve as a great basis for further work. AGLM makes the hardest part of fitting a great GLM simply enjoyable. A piece of cake indeed!
Building a Generalized Linear Model (GLM) isn’t always easy; it requires the right mixture of art and science to achieve an effective model. With Earnix’s Automatic GLM (AGLM) for Insurance & Banking, the hardest modeling steps can be automated with machine learning techniques: variable selection, binning of continuous variables, finding interactions, and handling categorical variables with many categories. This means AGLM now makes it possible to build a model with just a click, making it an ideal option for banking and insurance professionals.