Real World ML: The Hashing Trick: An Elegant Solution to Dynamic Category Encoding

Have you ever had an ML model perform poorly in production because it encountered a category it hadn't seen before?

Or perhaps your model treats new categories the same way it treats unpopular or unknown ones, leading to suboptimal performance?

These are common problems that arise when dealing with categorical features in real-world scenarios.

In this article, we'll explore a powerful technique called the hashing trick, which can help you tackle these challenges effectively.

The Problem of Changing Categories in Production

Imagine you're building a recommender system for a big e-commerce platform like Amazon.

One of the features you want to incorporate is the product brand.

But the number of brands can reach the hundreds of thousands.

To handle this, you might consider encoding each brand as a number or using one-hot encoding.

In production, however, your model crashes when it encounters a brand it hasn't seen before.

To mitigate this, you could create a catch-all category called "UNKNOWN" to handle unseen brands.

Now, your model doesn't crash anymore, but there is a new challenge.

If your model didn't see the "UNKNOWN" category in the training set, it may not recommend any products from the "UNKNOWN" brand, leading to complaints from sellers about their new brands not receiving traffic.

You might attempt to fix this by encoding only the top 99% most popular brands and categorizing the bottom 1% as "UNKNOWN."

This way, your model can at least handle "UNKNOWN" brands.
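The two workarounds above boil down to a vocabulary lookup with a fallback id. Here is a minimal sketch, assuming "top 99%" means the most frequent 99% of distinct brands in the training data. The names `build_vocab`, `encode`, and `keep_fraction` are hypothetical and not part of any library.

```python
from collections import Counter


def build_vocab(brands, keep_fraction=0.99):
    """Assign ids to the most frequent brands; id 0 is reserved for UNKNOWN."""
    counts = Counter(brands)
    n_keep = int(len(counts) * keep_fraction)  # number of distinct brands kept
    vocab = {"UNKNOWN": 0}
    for brand, _ in counts.most_common(n_keep):
        vocab[brand] = len(vocab)
    return vocab


def encode(brand, vocab):
    # Rare brands and never-seen brands both fall back to the same UNKNOWN id.
    return vocab.get(brand, vocab["UNKNOWN"])


# Toy data; keep_fraction is set low only so the cutoff is visible.
train_brands = ["Nike"] * 5 + ["Adidas"] * 3 + ["Puma"] + ["Tiny"]
vocab = build_vocab(train_brands, keep_fraction=0.5)

print(encode("Nike", vocab))      # a popular brand keeps its own id
print(encode("Puma", vocab))      # a rare brand is mapped to UNKNOWN
print(encode("BrandNew", vocab))  # an unseen brand is also mapped to UNKNOWN
```

Every brand outside the vocabulary, whether rare or brand new, lands on the same id, so the model cannot tell them apart.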

But this fix is short-lived: you soon notice that the click-through rate on product recommendations plummets.

New brands join your site regularly: some are new luxury brands, some are sketchy knockoffs, and others are established brands.

Unfortunately, your model treats all these new brands the same way it treats unpopular brands.

This scenario is common in various domains: predicting spam comments, analyzing new product types, identifying new website domains, or handling new user accounts.

In all these cases, the challenge of dynamically handling new categories without degrading model performance arises.

The Solution: The Hashing Trick

The hashing trick offers an elegant solution to this problem.

It involves using a hash function to generate a hashed value for each category, which then becomes the index of that category.

One potential issue with hash functions is collision, where two categories are assigned the same index.

These collisions are spread across the whole hash space, though: a new brand can share an index with any of the existing brands, rather than always being lumped in with the unpopular ones.

To mitigate the impact of collisions, you can choose a large hash space and use a hash function that distributes values evenly, such as MurmurHash.
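To get a feel for how large the hash space should be, here is a rough back-of-the-envelope sketch. It assumes the hash spreads categories uniformly, in which case a given category shares its bucket with at least one other with probability 1 - (1 - 1/m)^(n-1) for n categories and m buckets. The function name and the figure of 200,000 brands are illustrative assumptions.

```python
def collision_probability(n_categories: int, n_buckets: int) -> float:
    """Chance that a given category shares its bucket with at least one other,
    assuming a uniformly distributed hash."""
    return 1.0 - (1.0 - 1.0 / n_buckets) ** (n_categories - 1)


# Illustrative numbers only: 200,000 brands and a few candidate hash space sizes.
for bits in (16, 18, 20, 24):
    p = collision_probability(200_000, 2 ** bits)
    print(f"2**{bits} buckets: {p:.1%} of brands share a bucket")
```

The share of colliding brands falls quickly as the hash space grows, which is the trade-off to weigh against the memory cost of a wider feature vector or embedding table.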

This method can be particularly useful in continual learning settings where your model learns from incoming examples in production.

Benefits of the Hashing Trick

The hashing trick offers several benefits when dealing with categorical features in ML:

  • Handling new categories: New categories are automatically mapped to hashed indices. This removes the need for manual intervention or a full model retraining just to accommodate each new category.

  • Reducing memory usage: Hashed indices live in a fixed-size space, so you don't need a one-hot column per category or a stored vocabulary of the original values. This can significantly reduce memory usage when dealing with a large number of categories.

  • Computational efficiency: Computing a hash is cheap and needs no lookup table, and the resulting index can be used directly as the feature index.

  • Flexibility in continual learning: New categories can emerge without changing the encoding or the model's input size, which is useful when the model keeps learning from incoming examples.

Understanding the Hashing Trick

The hashing trick, also known as feature hashing, is a technique used to convert categorical data into numerical features.

This method leverages hash functions to map categories into a fixed number of indices in a hash table.

How It Works

  1. Hash Function: A hash function takes an input (in this case, a category) and returns a fixed-size value. The output appears random but is deterministic: the same category always produces the same value.

  2. Hash Space: Choose a hash space, which is the size of the hash table. The larger the hash space, the fewer the collisions.

  3. Index Assignment: For each category, apply the hash function to generate an index in the hash table. This index is used as the feature for that category.
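The three steps above can be sketched in a few lines of Python. This is a minimal illustration, not a production encoder: `hash_index` is a hypothetical helper, and SHA-256 from the standard library stands in for a faster non-cryptographic hash such as MurmurHash. Python's built-in `hash()` is avoided because string hashes are randomized per process by default, so indices would not be stable across runs.

```python
import hashlib


def hash_index(category: str, n_buckets: int) -> int:
    """Map a category to an index in [0, n_buckets)."""
    # Step 1: a deterministic hash of the category string.
    digest = hashlib.sha256(category.encode("utf-8")).digest()
    # Steps 2 and 3: reduce the hash value to the chosen hash space.
    return int.from_bytes(digest[:8], "big") % n_buckets


n_buckets = 2 ** 18  # hash space size, an assumed value for illustration
print(hash_index("Nike", n_buckets))
print(hash_index("SomeBrandLaunchedYesterday", n_buckets))  # no vocabulary needed
```

Because the function needs no stored vocabulary, a brand that appears for the first time in production gets a valid index immediately.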

Example

Consider a scenario where you have a list of product brands: ["Nike", "Adidas", "Puma", "Reebok", "NewBrand"].

Using a hash function, you convert these brands into indices within a hash space of size 10.

For instance, suppose the hash function maps:

  • "Nike" to 3

  • "Adidas" to 7

  • "Puma" to 5

  • "Reebok" to 2

  • "NewBrand" to 3

In this case, "Nike" and "NewBrand" share the same index due to a collision.

Despite collisions, this approach is effective because it allows the model to handle new categories dynamically.
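The indices listed above are illustrative; the actual values depend on the hash function you pick. The sketch below computes real indices for the same brands and reports which ones collide. It assumes a stable hash built from SHA-256, and `hash_index` and `group_by_index` are hypothetical helper names.

```python
import hashlib
from collections import defaultdict


def hash_index(category: str, n_buckets: int) -> int:
    # Stable illustrative hash: first 8 bytes of SHA-256, modulo n_buckets.
    digest = hashlib.sha256(category.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n_buckets


def group_by_index(categories, n_buckets):
    """Return {index: [categories hashed to that index]}."""
    groups = defaultdict(list)
    for category in categories:
        groups[hash_index(category, n_buckets)].append(category)
    return dict(groups)


brands = ["Nike", "Adidas", "Puma", "Reebok", "NewBrand"]
groups = group_by_index(brands, n_buckets=10)
collisions = {i: names for i, names in groups.items() if len(names) > 1}

print(groups)
print("collisions:", collisions)
```

With only 10 buckets, collisions among even a handful of brands are likely, which is why real systems use a much larger hash space.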

Implementation

Let's implement the hashing trick in Python using the FeatureHasher from sklearn.
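Here is a minimal sketch using `FeatureHasher` from `sklearn.feature_extraction`. The brand list comes from the example above; the hash space of 2**10 columns is an arbitrary choice for illustration, and a real system with hundreds of thousands of brands would likely need a much larger one.

```python
from sklearn.feature_extraction import FeatureHasher

brands = ["Nike", "Adidas", "Puma", "Reebok", "NewBrand"]

# input_type="string" expects each sample to be an iterable of strings,
# so every brand is wrapped in a one-element list.
hasher = FeatureHasher(n_features=2**10, input_type="string")

# FeatureHasher is stateless: there is no vocabulary to fit.
X = hasher.transform([[brand] for brand in brands])
print(X.shape)  # (5, 1024)

# Each row has exactly one non-zero entry. With the default
# alternate_sign=True, its value is +1 or -1.
for brand, column, value in zip(brands, X.indices, X.data):
    print(f"{brand} -> column {column}, value {value}")

# A brand that was never seen before is encoded the same way.
X_new = hasher.transform([["BrandNeverSeenBefore"]])
print(X_new.shape)  # (1, 1024)
```

The result is a SciPy sparse matrix that can be passed to most scikit-learn estimators. The sign alternation is scikit-learn's way of letting colliding features partially cancel out rather than always add up.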

Conclusion

The hashing trick provides a powerful and efficient solution to the challenge of dynamic category encoding in ML models.

Its ability to handle new categories without requiring extensive retraining makes it particularly valuable in production environments.

By choosing an appropriate hash function, configuring a suitable hash space, and scaling features effectively, you can leverage the hashing trick to improve your model's robustness and performance.

Incorporating this trick into your ML workflows ensures that your models remain adaptive and resilient, even as the underlying data evolves.

It is not only beneficial for recommender systems but also extends to various other applications, including spam detection, network security, and text classification.

By understanding and applying the hashing trick, you can address the limitations of traditional encoding methods and build more effective, scalable, and adaptive ML models.

In a world where data is constantly changing, the hashing trick stands out as a reliable and elegant solution to the problem of dynamic category encoding.
