In Part 1, I introduced you to the challenge of personalized recipe recommendations on Cookidoo and covered the Model Selection phase. This article touches on the Candidate Selection and Scoring & Ranking components of our recommender system. I will walk you through some of the challenges we faced, the choices we made, and the lessons we learned from building the first iteration of our recipe recommender system. At the end, I will share a few of the steps we have planned for future improvements to our recipe recommendation system for Cookidoo.
Note the difference between the Training Data Selection and the Candidate Selection phase. In the former, we exclude items that we don't want to influence our training (see Training Data Selection - Global vs. Local in Part 1). In the latter, we filter out items that we want to include in algorithm training but don't want to recommend to a specific user or at a specific time. Another reason for including a candidate selection component could be that your item catalog is too big to be scored fully. In this case, you can train candidate selection models that are cheaper to score to pre-select items. You often find this two-stage system described in blogs and papers from large platform companies (e.g., Pinterest, Instagram, Netflix). In our case, simple filtering rules to narrow down the set of candidate items are sufficient; there is no need to maintain another machine learning model.
The extent to which simple filtering rules and business logic are employed in recommender systems is often surprising to people new to the field. Blogs and papers also rarely describe their application's business logic (maybe because it requires deep domain knowledge or because rule-based systems are not "cool enough"). However, to give you an idea, here are some of the crucial filters and heuristics we employ:
- Filter recipes for different Recipe Stripes
- Exclude recipes the user already cooked
- Exclude recipes that are out of season
- Exclude recipes that don't fit the user's preferred locale, e.g., to prevent a bad user experience caused by a different measurement system
- Exclude recipes that we find "too boring" to recommend, e.g., Boiled Eggs or Chopped Onions
- Introduce some randomness to results to increase the variety of recommended recipes over time
Let's dive a bit deeper into the first three:
Filter recipes for different Recipe Stripes: We present recipes to the user in differently themed recipe stripes. In the screenshot above, you can see an example "For You" page for a user with recipes grouped thematically into different Recipe Stripes (due to constant improvements, localization, and ongoing A/B tests, your Cookidoo might look different). For now, the selection of candidate recipes for these stripes follows filtering rules on recipe metadata like categories, ratings, and tags. These rules range from simple filters on categories, like the filter on the desserts category that produces the stripe "Süßes für dich" (in English "Desserts for you"), to more complex business logic, as for the stripe "Easy Alltagsrezepte für dich" (in English "Everyday recipes for you"). The latter requires a more extensive system of heuristics to classify simple recipes for daily cooking sessions.
Exclude recipes the user already cooked: We filter out recipes the user has already cooked because we want our recipe recommendations to inspire users to try something new. Ideally, the user goes to our recommendations to discover recipes that are to their taste, fit their time budget, and at the same time are new to them - something they wouldn't have searched for on their own. [1]
Exclude recipes that are out of season: We also filter out recipes based on the current time of the year. Cooking is highly seasonal due to user preferences and ingredient availability. For example, users in Germany only cook meals including asparagus between April and June (compare The Rhythm of Food - Asparagus), and recipes like self-made ice cream, cold soups, and cocktails are mostly cooked during the summer. Without excluding these recipes from the recommendation candidates, our model could recommend recipes that might fit the user's taste but not the season and ingredient availability.
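To make the idea of rule-based candidate selection more concrete, here is a minimal sketch of how such filters could be chained. It is not our production code: the `Recipe` fields, the `select_candidates` function, and the toy data are all hypothetical and only illustrate the stripe, already-cooked, seasonality, and locale filters described above.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Recipe:
    # All fields are hypothetical; real recipe metadata is richer.
    recipe_id: str
    locale: str
    category: str
    # Months (1-12) in which the recipe counts as "in season".
    season_months: frozenset = frozenset(range(1, 13))


def select_candidates(recipes, cooked_ids, user_locale, today, category=None):
    """Apply simple rule-based filters and return the remaining recipes."""
    candidates = []
    for recipe in recipes:
        if category is not None and recipe.category != category:
            continue  # not part of this recipe stripe
        if recipe.recipe_id in cooked_ids:
            continue  # the user already cooked it
        if today.month not in recipe.season_months:
            continue  # out of season
        if recipe.locale != user_locale:
            continue  # wrong locale, e.g. a different measurement system
        candidates.append(recipe)
    return candidates


# Toy catalog (made-up recipes).
recipes = [
    Recipe("asparagus-soup", "de-DE", "soups", frozenset({4, 5, 6})),
    Recipe("apple-crumble", "de-DE", "desserts"),
    Recipe("chocolate-mousse", "de-DE", "desserts"),
    Recipe("pancakes", "en-US", "desserts"),
]

dessert_stripe = select_candidates(
    recipes,
    cooked_ids={"chocolate-mousse"},
    user_locale="de-DE",
    today=date(2022, 11, 15),
    category="desserts",
)
print([r.recipe_id for r in dessert_stripe])  # ['apple-crumble']
```

Keeping each rule as a separate, cheap check makes it easy to add or remove business logic without touching the trained model.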
Scoring & Ranking
The challenges of the scoring and ranking phase are primarily on the engineering side. To get a ranked list of top recipes per user, we use the trained model to score candidate items from the previous phase and rank them accordingly. Scoring is done by taking the dot product of the user embedding with the recipe embedding of each candidate recipe and subsequently ranking the candidates by the resulting score. Implemented naively, executing this step in real time whenever a user logs into Cookidoo can take prohibitively long. We considered multiple options to tackle this challenge:
- Using a database like ElasticSearch or an approximate nearest neighbours search index such as Annoy or Faiss to retrieve (approximate) scores fast enough for live scoring
- Further narrowing down the number of candidates to score in the Candidate Selection step by building a second, faster to score model to pre-select relevant items for live scoring
- Avoiding live scoring by pre-computing a set of recommendations per user and storing them in a database for quick access
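For reference, the naive scoring step that all of these options try to speed up or avoid can be sketched in a few lines of NumPy. The function name `top_k_recipes` and the toy embeddings are assumptions for illustration, not our actual model output.

```python
import numpy as np


def top_k_recipes(user_embedding, recipe_embeddings, recipe_ids, k):
    """Score candidates by dot product and return the k best (id, score) pairs."""
    scores = recipe_embeddings @ user_embedding  # one score per candidate
    order = np.argsort(-scores)[:k]              # indices by descending score
    return [(recipe_ids[i], float(scores[i])) for i in order]


# Toy data: 4 candidate recipes with 3-dimensional embeddings (made-up values).
recipe_ids = ["r1", "r2", "r3", "r4"]
recipe_embeddings = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.5, 0.5, 0.0],
    [0.0, 0.0, 1.0],
])
user_embedding = np.array([2.0, 1.0, 0.0])

ranked = top_k_recipes(user_embedding, recipe_embeddings, recipe_ids, k=2)
print(ranked)  # [('r1', 2.0), ('r3', 1.5)]
```

The cost of this step grows with the number of candidates times the embedding dimension per user, which is why doing it at request time for a large catalog becomes expensive.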
The option of pre-computing recommendations doesn't provide the highest flexibility but was the most reasonable to implement from a cost-benefit point of view. Since most of our users cook only once a day, a higher update frequency of our recommendations wouldn't benefit users. [2]
It also comes with several other advantages:
- We don't need to worry about speed at inference time at all because all we do is a simple database lookup
- We can easily pre-compute different sets of recommendations per user and run A/B tests on these
The implementation is straightforward; we have a daily job on AWS Batch that computes the users' top recommendations and saves them in DynamoDB. Whenever a user logs into Cookidoo we get an API request and look up the user's recommendations.
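The pattern of a batch job plus lookup can be sketched as follows. This is only an illustration: a plain Python dictionary stands in for the DynamoDB table, and the function names `precompute_recommendations` and `get_recommendations` as well as the toy data are hypothetical.

```python
import numpy as np


def precompute_recommendations(user_embeddings, recipe_embeddings, recipe_ids, n):
    """Batch job: compute the top-n recipe ids for every user."""
    store = {}  # stand-in for a key-value table keyed by user id
    for user_id, user_embedding in user_embeddings.items():
        scores = recipe_embeddings @ user_embedding
        top = np.argsort(-scores)[:n]
        store[user_id] = [recipe_ids[i] for i in top]
    return store


def get_recommendations(store, user_id, fallback=()):
    """Request time: a plain key lookup, no model scoring involved."""
    return store.get(user_id, list(fallback))


# Toy data (made-up values).
recipe_ids = ["pasta", "cake", "curry"]
recipe_embeddings = np.array([[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]])
user_embeddings = {
    "alice": np.array([1.0, 0.0]),
    "bob": np.array([0.0, 1.0]),
}

store = precompute_recommendations(user_embeddings, recipe_embeddings, recipe_ids, n=2)
print(get_recommendations(store, "alice"))                      # ['pasta', 'curry']
print(get_recommendations(store, "carol", fallback=["pasta"]))  # ['pasta']
```

Because the request path is only a lookup, serving latency is independent of model size, and storing several recommendation sets per user (for example, one per A/B test variant) only means writing additional keys in the batch job.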
The way we create scores from the embeddings of our matrix factorization algorithm is a bit non-standard. We were faced with the challenge that recommendations didn't vary much over time, i.e., even when users cooked new recipes, their recommendations barely changed in response to these new interactions.
To tackle this problem, we decided to ignore the user embeddings learned by the model and build new user embeddings by averaging the recipe embeddings of the user's last (randomly sampled) n cooking interactions. This way, user embeddings reflect the user's latest behaviour instead of including the entire history of their cooking interactions. At the same time, we can still train recipe embeddings on the complete set of historical data. Additionally, this approach requires a lower training frequency as the new user embeddings can be updated without retraining or partially training the model.
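A minimal sketch of this idea, under simplifying assumptions: the function below averages the embeddings of the user's n most recent cooked recipes and does not reproduce the random sampling mentioned above. The name `recent_user_embedding`, the data layout, and the toy values are hypothetical.

```python
import numpy as np


def recent_user_embedding(cooked_recipe_ids, recipe_embeddings, n):
    """Average the embeddings of the user's n most recent cooked recipes.

    cooked_recipe_ids is ordered oldest to newest; recipe_embeddings maps
    recipe id -> embedding vector (both are assumed data layouts).
    """
    if n <= 0:
        raise ValueError("n must be positive")
    recent = cooked_recipe_ids[-n:]
    if not recent:
        raise ValueError("user has no cooking interactions")
    return np.mean([recipe_embeddings[r] for r in recent], axis=0)


# Toy recipe embeddings as they might come out of matrix factorization.
recipe_embeddings = {
    "stew": np.array([1.0, 0.0]),
    "salad": np.array([0.0, 1.0]),
    "smoothie": np.array([0.0, 3.0]),
}
history = ["stew", "stew", "salad", "smoothie"]  # oldest to newest

user_embedding = recent_user_embedding(history, recipe_embeddings, n=2)
print(user_embedding)
```

With n=2, the two older "stew" interactions no longer influence the user embedding, so the resulting scores follow the user's latest cooking behaviour. The recipe embeddings themselves stay untouched, which is why no retraining is needed to refresh a user embedding.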
This change in the scoring logic was easy to implement, and we have seen exciting improvements in A/B tests and later in production for all our users. [3]
Where We Want To Go Next
Future steps for our team revolve around improving recommendations for Cookidoo users and increasing our online testing capacity. Building the first iteration of our production recommender system was a considerable engineering challenge. We put much effort into integrating our end-to-end system with the larger platform and building up an excellent CI/CD pipeline to enable fast iteration. With this firm basis, we can focus on improving our algorithm. We'd also like to keep increasing our iteration speed from idea to user testing, because we think fast iteration is key to building products that users love (see Optimizing for iteration speed).
On the algorithm side, we have lots of challenges and ideas in our backlog; a few examples:
- Improve the diversity of recommendations to help the user enjoy the full range of available recipes on Cookidoo
- Test more themes for our Recipe Stripes and personalize site composition per user
- We started to experiment with Deep Learning architectures such as Two Tower Networks or Deep & Cross Networks, but it is too early to talk about improvements
- Calibrated Recommendations: A user's niche tastes are essential and shouldn't be crowded out by our recommender system. We are thinking about implementing something along the lines of the Calibrated Recommendations paper by Harald Steck (if you have experience with calibrated recommendations in production, I would love to hear your thoughts).
- Healthy Food Recommendations: How can Cookidoo support users with healthy eating goals? What role does the recommender system play in these goals? How can we make adjustments towards healthy eating without being too paternalistic about it?
I hope you learned something new from our experience with building a recipe recommender system for Cookidoo. Of course, there are many topics I haven't touched on, especially more engineering-heavy issues around cloud architecture, deployment, and ensuring GDPR compliance. Maybe we'll continue this series with more blog posts in the future and follow up with our learnings from architecting, deploying, and testing recommender systems.
I am the one putting this into writing, but all that I've written about was implemented with a fantastic team: Bora Kiliclar, Stephan Geuter, Andreas Hausmann, Carsten Böhm, Michael Gehring, Felix Althammer, Stefan Schaub, Leon Luithlen
I thoroughly enjoy tackling all kinds of challenges with the incredible team at Vorwerk Elektrowerke GmbH & Co. KG and Alexander Thamm GmbH. If you also find these challenges exciting and want to work with us, have a look at the open positions at Vorwerk and open positions at Alexander Thamm.
[1] In the recipe recommender literature, this last part is called serendipity. We don't measure or explicitly optimize for serendipity, but filtering out already cooked recipes is an easy step in that direction.
[2] Real-time machine learning can be a hard problem to solve. Chip Huyen and Eugene Yan wrote great articles about the problem: Machine learning is going real-time, Real-time machine learning: challenges and solutions, Real-time Machine Learning For Recommendations
[3] Of course, there are other ways to ensure recency, e.g., decaying interaction scores before training or training a separate sequence-based recommender model like GRU4Rec.