Hierarchical Level Selector: Finding the Right Level of Detail for Smarter Data Preprocessing
Yuval Ben Dror, Data Science Researcher, Earnix & Eyal Bar Natan, Team Leader, Data Science, Earnix
October 19, 2025
Questions like “What car do you have?” or “Where do you live?” can be answered very broadly, or very specifically. But which level of detail is most useful for predictive modeling?
Earnix Analytical and Technical Blog Series
How can insurers and banks tackle today’s toughest analytical challenges? At Earnix, we believe it starts with asking the right questions, challenging assumptions, and finding better ways forward.
In this blog series, we explore key issues in financial analytics, addressing complex problems, improving models, and staying competitive. Our first few posts covered Model Analysis, Auto-XGBoost and Smart Grouping (an Auto-GLM feature).
These technical posts are designed for professionals in actuarial science, data science, and analytics, with a focus on clarity and practical insights. The fourth topic of the series, which we will cover today, is Hierarchical Level Selector – an innovative feature that is part of our new Preprocessing Hub lab. Let’s dive in!
Introduction: How specific should we get?
Let’s start with a simple illustration.
Imagine you’re trying to predict the price of a house based on its location. If the city’s neighborhoods are all quite similar, it might not matter which specific area the house is in. But in a city where neighborhoods vary widely, the exact location can make a big difference in price.
That raises a key question: when building a house-price prediction model, should we use City as a categorical variable? Or maybe we should use Postal Code? Maybe both? The tricky part is that the answer depends on the city - if it’s homogeneous, city-level data may be enough; if its neighborhoods differ widely, postal-code detail carries real signal.
Now, let’s return to insurance and banking modeling. One of the biggest challenges when preparing data for modeling is deciding which features to include. Too many features can add noise, but too few can miss important signals. The challenge becomes even more complex when your dataset includes hierarchical features, which are very common in insurance data tables, like Location (Region → City → Postal Code) or Car Model (Manufacturer Country → Car Type → Car Model).
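To make the idea concrete, here is what such a hierarchy looks like in a raw policy table. This is a toy sketch with invented column names and values, just to show the nesting property that makes a feature set hierarchical:

```python
import pandas as pd

# Hypothetical policy table with a Location hierarchy (Region -> City -> Postal Code).
policies = pd.DataFrame({
    "region":      ["North", "North", "South", "South"],
    "city":        ["Haifa", "Haifa", "Ashdod", "Beersheba"],
    "postal_code": ["31000", "31001", "77000", "84000"],
    "claim_cost":  [1200.0, 950.0, 1800.0, 400.0],
})

# Each level nests inside its parent: a postal code belongs to exactly one
# city, and a city to exactly one region.
assert policies.groupby("postal_code")["city"].nunique().max() == 1
assert policies.groupby("city")["region"].nunique().max() == 1
```

It is exactly this strict nesting that the techniques below exploit.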
The problem with using all levels, and our solution
As the example suggests, choosing just one level might not be ideal. But what about using all of them? That, too, can cause problems:
High time and memory cost – Using every level means more categorical variables, often with many unique values. This increases model size, training time, and memory usage.
Overfitting – More granular levels tend to have many rare categories, which can cause overfitting.
Multicollinearity – In GLMs, one-hot encoding all levels can lead to convergence errors, since higher-level variables may be linear combinations of lower-level ones.
Difficult interpretation – Coefficients can depend on each other, making interpretation unstable and inconsistent.
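The multicollinearity point can be demonstrated directly. Because each postal code belongs to exactly one city, every city dummy is the sum of its postal-code dummies, so one-hot encoding both levels produces a rank-deficient design matrix. A toy example with invented data:

```python
import numpy as np
import pandas as pd

# Toy hierarchy: each postal code belongs to exactly one city.
df = pd.DataFrame({
    "city":        ["A",  "A",  "B",  "B",  "B"],
    "postal_code": ["a1", "a2", "b1", "b2", "b3"],
})

# One-hot encode both levels, as a naive "use every level" approach would.
X = pd.get_dummies(df[["city", "postal_code"]]).astype(float)

# Each city dummy is exactly the sum of its postal-code dummies, so the
# columns are linearly dependent and a GLM design matrix built from them
# is rank-deficient.
assert (X["city_A"] == X["postal_code_a1"] + X["postal_code_a2"]).all()
assert np.linalg.matrix_rank(X.to_numpy()) < X.shape[1]
```

This is precisely the linear dependence that causes GLM convergence errors and unstable coefficients.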
So, what’s the alternative?
At Earnix, we developed the Hierarchical Level Selector, part of our Preprocessing Hub lab. This algorithm automatically collapses a hierarchical set of categorical features into a single, optimized column for modeling.
It evaluates each level in the hierarchy to decide whether its lower levels add meaningful information or can be merged into their parent. This approach mitigates the issues above—reducing overfitting, eliminating collinearity, improving interpretability, and significantly speeding up training.
How does it work?
The important thing to note here is that the algorithm is target-based: it works per model, so the same hierarchy can collapse differently depending on what is being predicted. We start by representing the categories as a hierarchical tree. Then, moving from the bottom up, we process each level:
For the current level, the data is filtered and split into training and validation sets.
A grid of category subsets is created based on credibility - a function of both the number of observations in each category and the variability of the target within it.
Using predictive performance on the validation set, we determine which categories remain separate and which should be merged into their parent level.
This process continues until we reach the root. Categories merged all the way up to the root are labeled as “Other.”
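The steps above can be sketched in code. This is a deliberately simplified illustration, not Earnix's implementation: the credibility grid is reduced to a minimum-observation threshold, predictive performance is plain mean-squared error of per-category target means, and only a single parent/child level is processed (the real algorithm repeats this up the tree and labels root-merged categories "Other"). All names are invented:

```python
import pandas as pd

def collapse_level(df, parent_col, child_col, target, min_count=30):
    """Decide, for one level of the hierarchy, which child categories stay
    separate and which merge into their parent. Simplified sketch:
    'credibility' is just a minimum-observation threshold, and performance
    is mean-squared error of per-category target means on a validation set."""
    train = df.sample(frac=0.7, random_state=0)
    valid = df.drop(train.index)

    parent_mean = train.groupby(parent_col)[target].mean()
    child_mean = train.groupby(child_col)[target].mean()
    child_count = train.groupby(child_col)[target].size()
    child_to_parent = train.drop_duplicates(child_col).set_index(child_col)[parent_col]

    keep = set()
    for child, parent in child_to_parent.items():
        mask = valid[child_col] == child
        if not mask.any():
            continue  # no validation data for this category
        y = valid.loc[mask, target]
        err_child = ((y - child_mean[child]) ** 2).mean()
        err_parent = ((y - parent_mean[parent]) ** 2).mean()
        # Keep the child separate only if it has enough observations
        # AND beats its parent on held-out error.
        if child_count[child] >= min_count and err_child < err_parent:
            keep.add(child)

    # Categories not kept collapse into their parent's label.
    return df[child_col].where(df[child_col].isin(keep), df[parent_col])
```

Applying this repeatedly, one level at a time from the leaves toward the root, yields the single collapsed column described above.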
Example:
Let’s take a simplified insurance example with the hierarchy Car Brand → Car Model, and assume we’re predicting claim cost.
For some brands (say, Toyota), most models might be merged into “Toyota” because they behave similarly in terms of claim cost, except for a few high-end or low-end outliers. For others (like BMW), each model might remain distinct due to large differences between them. If two smaller brands (Skoda and Seat, for example) show very similar behavior, they might both be merged into “Other.”
The result is a single new column (let’s call it Vehicle Definition) that captures the optimal level of granularity for predictive modeling. Instead of debating whether to use Brand, Model, or both, we can simply use this new derived column. It provides better performance, accuracy, and explainability.
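In this example, the derived column can be thought of as a lookup from the original hierarchy into the optimized categories. The specific models and groupings below are invented purely for illustration:

```python
# Hypothetical output of the selector: a mapping from (brand, model) to the
# single derived "vehicle_definition" category.
vehicle_definition = {
    ("Toyota", "Corolla"):      "Toyota",        # merged into its brand
    ("Toyota", "Land Cruiser"): "Land Cruiser",  # kept: behaves differently
    ("BMW", "320i"):            "320i",          # BMW models all kept distinct
    ("BMW", "X5"):              "X5",
    ("Skoda", "Octavia"):       "Other",         # merged all the way to the root
    ("Seat", "Leon"):           "Other",
}

print(vehicle_definition[("Toyota", "Corolla")])  # Toyota
```

Note how the granularity varies by branch: some brands keep model-level detail, others collapse entirely.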
Implementation in Earnix
In Earnix’s Preprocessing Hub lab, the final output of Hierarchical Level Selector is a formula that maps hierarchical values to a single optimized category.
The lab interface visualizes the result with a pie chart showing the proportion of categories retained at each level of the hierarchy. This provides transparency and helps users understand how their data was reduced.
Conclusion: Tackling the hierarchy challenge
The Hierarchical Level Selector offers several clear advantages over traditional approaches:
Dimensionality reduction: One column instead of many, with fewer total categories.
Data-driven optimization: Combines statistical and ML techniques to select the most meaningful levels for the current modeling task.
Explainability: In GLMs, coefficients become more stable and interpretable.
When you’re building a model that involves hierarchical categorical data, Hierarchical Level Selector can be an invaluable preprocessing step - making your models faster, cleaner, and easier to understand. Give it a try in the Preprocessing Hub!