Smart Grouping: When Different Categories Tell the Same Story
Yuval Ben Dror, Data Science Researcher, Earnix
July 24, 2025
Earnix Analytical and Technical Blog Series
How can insurers and banks tackle today’s toughest analytical challenges? At Earnix, we believe it starts with asking the right questions, challenging assumptions, and finding better ways forward.
In this blog series, we explore key issues in financial analytics, addressing complex problems, improving models, and staying competitive. Our first posts covered Model Analysis and Auto-XGBoost.
These technical posts are designed for professionals in actuarial science, data science, and analytics, with a focus on clarity and practical insights. The third topic of the series, which we will cover today, is Smart Grouping of categorical features for GLMs – a feature already implemented in our product as part of our Auto-GLM solution. Let’s get started.
Introduction: Categorical Features Make Life Harder
One of the most important challenges a model-maker must tackle when creating a model is how to treat categorical features. This is especially relevant for Generalized Linear Models (GLMs), but it’s also a non-trivial question for other types of ML models. Essentially, models are complex combinations of mathematical formulas, so they can’t directly use “words”.
The most common approach is to use One-Hot Encoding – splitting a categorical feature with n categories into n-1 boolean features, each set to 1 when an observation belongs to the category it represents and 0 otherwise. Commonly, we call them is_category_i for 1 ≤ i ≤ n-1 (leaving one category out as the reference level, absorbed by the intercept).
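In pandas, this n-1 encoding is one function call away – a minimal sketch, using a hypothetical three-city toy dataset:

```python
import pandas as pd

# Toy dataset: a categorical feature with three cities
df = pd.DataFrame({"city": ["A", "B", "C", "A", "B"]})

# One-hot encode with n-1 columns: the first category ("A") is dropped
# and serves as the reference level, absorbed by the intercept.
encoded = pd.get_dummies(df["city"], prefix="is_category", drop_first=True)

print(encoded.columns.tolist())  # ['is_category_B', 'is_category_C']
```

An observation from City A is represented by zeros in both columns, which is exactly what makes it the reference level.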
However, this approach can cause the number of covariates to increase significantly. This, in turn, could make the model more complex and lead to overfitting. Consider the following example:
We’re trying to predict Number of Claims based on City. Say the true average number of claims in both City A and City B is 0.05. Our dataset has a sample of n/2 observations from City A and n/2 observations from City B, and each city is given its own coefficient. The value of each coefficient will vary depending on the size of the sample and the actual distribution, but we can expect it to be relatively close to 0.05. Close, but not equal to it.
Now, let’s say βA = 0.04 and βB = 0.06. If we get a new observation from City A or B, we know the desired beta for it would be 0.05. However, our One-Hot Encoded model would assign it either 0.04 or 0.06. If we instead merge City A and City B together, we get n observations from the combined A-B category, so we’re likely to obtain a coefficient closer to 0.05.
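A quick simulation makes the intuition concrete – a minimal sketch with hypothetical sample sizes, simulating Poisson claim counts for two cities that share the same true rate:

```python
import numpy as np

rng = np.random.default_rng(0)
true_rate = 0.05
n = 2000  # hypothetical total sample size; n/2 observations per city

# Both cities share the same true claim rate of 0.05
claims_a = rng.poisson(true_rate, n // 2)
claims_b = rng.poisson(true_rate, n // 2)

# Per-city estimates: what separate one-hot coefficients would recover
beta_a = claims_a.mean()
beta_b = claims_b.mean()

# Pooled estimate after merging the two cities into one category
beta_merged = np.concatenate([claims_a, claims_b]).mean()

# The merged estimate is fit on twice the data, so its standard error
# shrinks by a factor of sqrt(2): on average it sits closer to 0.05.
print(beta_a, beta_b, beta_merged)
```

With equal sample sizes, the merged coefficient is exactly the average of the two per-city estimates, so it can never be the more extreme of the two.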
Theoretically, we know that the manipulation of merging categories could lead to a better model. But how can we tell how to apply this manipulation in practice? How do we select which categories should be merged with each other?
Classic Regularization: Not Good Enough
To answer this question, we need to introduce one of the key features of Auto-GLM – Variable Fusion. As part of the modeling process, AGLM splits numeric features into bins, and then penalizes differences between adjacent bins, leading to a merger of bins. For categorical features, the classic implementation of AGLM simply uses one-hot encoding, which means categories can only be merged with the reference level. This is problematic for multiple reasons - consider the following example:
Continuing our previous example, we’re trying to predict Number of Claims based on City. There are 26 different cities represented by the letters of the alphabet, and for some cities the number of observations is relatively small. City A, the capital city, is the largest, so we use it as the reference level. However, City A is also the city in which the number of claims is highest. Therefore, the lasso can only merge other cities with the rather extreme case of City A. Selecting a different city as the reference, with a claim rate closer to the mean, could improve the validity of the shrinking, but it would still only allow cities to be merged with the reference city.
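The fusion penalty on adjacent bins can be written down in a few lines. This is a schematic of the penalty term itself, not Earnix’s implementation – the function name and toy coefficients are illustrative:

```python
import numpy as np

def fusion_penalty(beta, lam):
    """L1 penalty on differences between adjacent bin coefficients.

    When the optimizer drives one of these differences to exactly zero,
    the two adjacent bins share a single coefficient -- they are merged.
    """
    return lam * np.sum(np.abs(np.diff(beta)))

# Coefficients for five ordered bins; bins 2 and 3 are already merged,
# so their difference contributes nothing to the penalty.
beta = np.array([0.10, 0.25, 0.25, 0.40, 0.60])
print(fusion_penalty(beta, lam=1.0))  # ~0.5: only three non-zero gaps
```

Because the penalty is on *adjacent* differences, it only makes sense once the categories have a meaningful order – which is exactly what plain one-hot encoding fails to provide, and what Smart Grouping supplies below.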
Our solution: Smart Grouping
Smart Grouping is a variation on AGLM which allows one to incorporate a clustering of categories in the final GLM. It is available as a feature inside AGLM which is a part of our Model Accelerator solution.
The main idea of the algorithm is to use a 2-step regularization scheme to first rank the categories in a regularized multivariate environment and subsequently merge them using variable fusion (with AGLM).
Returning to our previous example, suppose we’re predicting Number of Claims based on both City and Age. We split the age variable into bins based on percentile, and one-hot encode the city variable. We then fit a regularized GLM on all covariates, to obtain a ranking of the city coefficients in the multivariate environment – note that it’s important that this step is regularized, otherwise the ranking itself could be unreliable.
Once the ranking is obtained, we re-encode the city variable as an ordinal variable similar to the encoding of age and perform variable fusion to merge adjacent bins. Lastly, we translate the results back to a one-hot encoding of the city variable, in which multiple categories are grouped together.
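The first two steps can be sketched with standard scikit-learn tools. This is a minimal illustration under simplifying assumptions, not the Auto-GLM implementation: the data is simulated, the L2-regularized `PoissonRegressor` stands in for whatever regularized fit is used in practice, and the final fusion step is only indicated in a comment:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import PoissonRegressor

rng = np.random.default_rng(42)

# Hypothetical data: claim counts by city (8 cities) and age
n = 5000
cities = rng.choice(list("ABCDEFGH"), n)
age = rng.integers(18, 80, n)
city_rates = dict(zip("ABCDEFGH", [0.09, 0.05, 0.05, 0.03, 0.05, 0.07, 0.03, 0.05]))
mu = np.array([city_rates[c] for c in cities]) * (1 + 0.01 * (age - 40))
claims = rng.poisson(np.clip(mu, 0.001, None))

# Step 1: regularized multivariate fit to rank the city coefficients.
# Age is binned by percentile and encoded alongside the city dummies.
age_bins = pd.qcut(age, 5, labels=False)
X = pd.get_dummies(pd.DataFrame({"city": cities, "age_bin": age_bins}).astype(str))
model = PoissonRegressor(alpha=1.0).fit(X, claims)

city_cols = [c for c in X.columns if c.startswith("city_")]
coefs = pd.Series(model.coef_[[X.columns.get_loc(c) for c in city_cols]],
                  index=city_cols)

# Step 2: re-encode city as an ordinal variable following the coefficient
# ranking, so that similar cities become *adjacent* bins.
rank = coefs.rank().astype(int)
city_ordinal = np.array([rank["city_" + c] for c in cities])

# Step 3 (not shown): run variable fusion on this ordinal encoding to
# merge adjacent ranks, then map the groups back to a one-hot encoding.
print(rank.sort_values())
```

The key move is Step 2: once categories are ordered by their regularized coefficients, the adjacent-difference penalty from variable fusion can merge any pair of similar cities, not just pairs involving the reference level.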
For more information, check out our paper outlining the algorithm, which also includes a comparative analysis against other models.
Conclusion
This approach provides two main advantages over other methods that deal with categorical features:
Interpretability – the final model gives us a clear grouping of the categories, allowing us to understand the relationship they have with the target.
Multivariate Compatibility – the coefficient that we used to encode the categorical feature into a numeric feature takes into account the other variables in the model.
We also addressed the problem of overfitting by taking two extra steps – performing a regularized GLM to determine the initial ranking of the categories and using validation schemes to perform the merging of bins.
Smart Grouping helps merge categories into groups, resulting in sparser, more explainable, and better models. Fewer features, more credibility. It’s already fully implemented in our AGLM solution in Model Accelerator – so you can play with it right now!