In case you missed the hyperlink, a bare-bones Flask interface for interacting with the API is here. The codebase (assuming you’ve reached out by now!) is here.
Since this is my first end-to-end data science workflow, I’ll structure the sections accordingly, Kaggle style.
Data Collection
Using the database for aiFood, I first loaded all the recipes into a DataFrame to extract the set of all ingredients.
As mentioned in aiFood’s synopsis, I explored many options but was ultimately able to retrieve macro/micronutrient and serving-size data from Nutritionix’s API for all 6,000+ ingredients.
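A rough sketch of what that collection step could look like (the export filename, column names, and the exact Nutritionix response fields here are assumptions on my part; check the current API docs):

```python
import pandas as pd
import requests

# Load the aiFood recipe export (filename and column names are assumed).
recipes = pd.read_json("recipes.json")
ingredients = sorted({ing for meal in recipes["ingredients"] for ing in meal})

# Query Nutritionix's natural-language endpoint for each ingredient.
NUTRITIONIX_URL = "https://trackapi.nutritionix.com/v2/natural/nutrients"
HEADERS = {"x-app-id": "<APP_ID>", "x-app-key": "<APP_KEY>"}

rows = []
for name in ingredients:
    resp = requests.post(NUTRITIONIX_URL, headers=HEADERS, json={"query": name})
    food = resp.json()["foods"][0]
    rows.append({
        "ingredient": name,
        "serving_weight_grams": food["serving_weight_grams"],
        "nf_calories": food["nf_calories"],
        "nf_protein": food["nf_protein"],
        # ...remaining macro/micronutrient fields from the response
    })

nutrition = pd.DataFrame(rows)
```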
Data Processing
To process the ingredients and extract meaningful features, I dropped NaN values and standardized each ingredient’s macro- and micronutrients to a 100-gram basis.
Because ingredients that appear in fewer than three meals are (a) likely to be outliers and (b) not likely to be user-friendly (considering this is a dataset of different cuisines), I dropped all ingredients belonging to fewer than three meals.
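A minimal pandas sketch of this step, reusing the `nutrition` and `recipes` frames from above (column names are illustrative):

```python
# Drop incomplete rows and put every nutrient on a 100 g basis.
nutrition = nutrition.dropna()
nutrient_cols = [c for c in nutrition.columns if c.startswith("nf_")]
nutrition[nutrient_cols] = nutrition[nutrient_cols].mul(
    100.0 / nutrition["serving_weight_grams"], axis=0
)

# Count how many meals each ingredient appears in and drop rare ingredients.
meal_counts = recipes["ingredients"].explode().value_counts()
nutrition["n_meals"] = nutrition["ingredient"].map(meal_counts)
nutrition = nutrition[nutrition["n_meals"] >= 3]
```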
The number of meals an ingredient appears in follows a power-law distribution across ingredients. This may be emblematic of how accessible those ingredients are, so it may be wise to weight the likelihood of an ingredient appearing by that popularity.
The distribution of the meal sizes was very jagged (which will come into play later).
Modeling
As I talked about in aiFood, the heuristic behind Spectral Clustering was very naive: a healthy, tasty meal not only includes compatible ingredients but needs to be considered as a whole, and compatibility between pairs of ingredients doesn’t extend to all the ingredients in a meal.
I realized I needed to engineer features on meals rather than ingredients, and thus my intuition and a StackOverflow post naturally led me to Kernel Density Estimation.
Kernel Density
To model the smooth nature of meal quality, I went with the default Gaussian kernel. I then ran a grid search over 100 bandwidth values spaced logarithmically from 10^-10 to 10^10 to find the best bandwidth (i.e., the standard deviation of the Gaussian) in that range. As mentioned in the Data Processing step, the irregular distribution of meal sizes made it hard to train one generalized model on all the training data, so instead I trained a separate Kernel Density estimator for each meal size from 1 to 24 ingredients. I cached these bandwidths so they can be used in the Flask app.
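A sketch of how that per-meal-size bandwidth search might look; here `meal_features` is an assumed dict mapping meal size to an (n_meals, n_features) nutrient matrix, and 3-fold cross-validation is my stand-in scoring scheme, not necessarily the one actually used:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KernelDensity

# One KDE per meal size; bandwidths are searched on a log scale.
bandwidths = np.logspace(-10, 10, 100)
best_bandwidth = {}
for k, X in meal_features.items():          # meal_features: size -> feature matrix
    grid = GridSearchCV(KernelDensity(kernel="gaussian"),
                        {"bandwidth": bandwidths}, cv=3)
    grid.fit(X)                             # scores by held-out log-likelihood
    best_bandwidth[k] = grid.best_params_["bandwidth"]

# best_bandwidth is what gets cached (e.g. pickled) for the Flask app.
```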
sklearn’s KernelDensity.sample method allows me to randomly draw new points from the density space. This naturally leads to a better meal-generation algorithm: from a sampled point I can retrieve the dimensions of a meal and its suitable ingredient profiles, then find actual ingredients to fill them in. This is also a scalable solution, as the algorithm improves as the database grows.
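Roughly, the generation step could look like the sketch below, assuming meals of size k are encoded as k stacked per-ingredient nutrient vectors and that each sampled profile is snapped to the nearest real ingredient (both are my assumptions):

```python
from sklearn.neighbors import KernelDensity, NearestNeighbors

k = 4  # requested meal size (example)
kde = KernelDensity(kernel="gaussian", bandwidth=best_bandwidth[k])
kde.fit(meal_features[k])

# Sample one synthetic meal and split it into k per-ingredient profiles
# (this layout of the feature vector is an assumption).
profiles = kde.sample(1).reshape(k, -1)

# Fill each slot with the real ingredient closest in nutrient space.
nn = NearestNeighbors(n_neighbors=1).fit(nutrition[nutrient_cols].values)
_, idx = nn.kneighbors(profiles)
generated_meal = nutrition.iloc[idx.ravel()]["ingredient"].tolist()
```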
Personalized recommendations could be generated from a compounding model of the macro & micronutrient distribution of meals that the user likes. This enriches the nutritional quality, since the heuristic I now adopt is that knowledge of the
[calories, protein, fat, carb, saturated fat, cholesterol, sodium, dietary fiber, sugar, potassium, p]
vector for each constituent ingredient and for the meal as a whole is, taken together, a strong indicator of how likely a health-minded user is to like (save) the list. (These nutrients are used because they are what the Nutritionix API provides for free.)
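One way such a preference model could be sketched, reusing KDE as elsewhere in this post (`liked_meal_vectors` and `candidate_meal_vectors` are hypothetical arrays of the 11-dimensional nutrient vectors above, and the bandwidth is arbitrary here):

```python
from sklearn.neighbors import KernelDensity

# Fit a density over the nutrient vectors of meals the user has saved,
# then rank candidate meals by likelihood under that density.
user_kde = KernelDensity(kernel="gaussian", bandwidth=1.0).fit(liked_meal_vectors)
scores = user_kde.score_samples(candidate_meal_vectors)   # log-likelihoods
ranking = scores.argsort()[::-1]                           # best candidates first
```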
This raises the question: given that a user likes to plan his or her meals around a certain distribution of nutrients, how can I best recommend new meals? It seemed too wasteful to use such a distribution only once and then have to generate a new one, so it made sense to cluster ingredients hierarchically. This led me to Agglomerative Clustering.
Agglomerative Clustering
Ingredients were clustered hierarchically such that the variance within each cluster is minimized at every merge step. This intuitively made sense because a min-variance cluster holds ingredients that “play the same role” in a meal (i.e., side, condiment, protein source, etc.), which makes it easy to substitute one ingredient for another and generate a completely new list without recycling ingredients.
Ingredients were also clustered hierarchically to allow generation of a meal of any input length (start at the top of the hierarchy, descend until there are input-length-many clusters, then select a representative from each).
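A sketch of both ideas with scikit-learn’s AgglomerativeClustering (Ward linkage is the min-variance criterion; picking a random representative per cluster is purely for illustration):

```python
from sklearn.cluster import AgglomerativeClustering

# Ward linkage merges, at each step, the pair of clusters whose union
# increases within-cluster variance the least.
meal_length = 5  # example input length
ward = AgglomerativeClustering(n_clusters=meal_length, linkage="ward")
labels = ward.fit_predict(nutrition[nutrient_cols].values)

# One representative ingredient per cluster -> a candidate meal.
candidate = (nutrition.assign(cluster=labels)
                      .groupby("cluster")
                      .sample(n=1)["ingredient"]
                      .tolist())
```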
That said…
Ingredients in different clusters may offer nutritional variety but not necessarily nutritional compatibility. Thus, I reintroduced affinity strictly between different clusters (i.e., weighting two ingredients’ affinity in proportion to the number of meals they were both part of), then ran Spectral Clustering on them in the same way as before.
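Sketching that re-clustering step (the `incidence` meal-by-ingredient 0/1 matrix and the cluster count are assumptions for illustration):

```python
import numpy as np
from sklearn.cluster import SpectralClustering

# incidence: assumed (n_meals, n_ingredients) 0/1 matrix built from the
# recipe table; its Gram matrix counts how many meals each pair shares.
affinity = incidence.T @ incidence
np.fill_diagonal(affinity, 0)            # ignore self-co-occurrence

spectral = SpectralClustering(n_clusters=10, affinity="precomputed")
co_labels = spectral.fit_predict(affinity)
```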
Data Visualization
I haven’t worked on this much, but I’ve studied the documentation for bokeh, a Python visualization package.