I have a table in a database with about 1600 rows containing an ingredient name column with another column with all the recipe ids that ingredient is in. This second column also includes data about the number of ingredients in that recipe and a number representing the recipe popularity. So for example ;
butter <r>3897,10,200</r><r>15600,12,90</r><r>29354,7,10</r>
chopped tomatoes <r>14130,10,200</r><r>387,10,150</r><r>20888,20,50</r>
vegetable stock <r>2455,17,120</r><r>3500,12,45</r><r>3588,5,5</r>
basil <r>14130,10,200</r><r>3777,14,70</r><r>377,19,0</r>
olive oil <r>14130,10,200</r><r>387,10,150</r><r>497,7,150</r>
The r tag indicates each recipe, within those tags is recipe id, number of ingredients in recipe, popularity value. There are more than 30,000 recipes in the database, so the example above is quite a simplified one.
Given an input of a list of ingredient names & a maximum ingredients per recipe integer value in this format ;
Input ingredients: chopped tomatoes,basil,olive oil
Maximum ingredients/recipe : 15
I would like some python code that would quickly return:
* A list of unique recipe id's from the database those ingredients are in
* The list should exclude recipe id's with more than the input maximum ingredients value
* The list should be sorted by the the recipe popularity value, then by number of ingredients in each recipe.
* The list should be multidimensional, so for each recipe id, these values should also be included ; number of ingredient matches to input string, recipe popularity value, total number of ingredients in each recipe.
So for the example input values above, I would get a list like this:
[[14130, 3, 10, 200], [387, 2, 10, 150], [497, 1, 7,150]]
Any suggestions most welcome, thank you in advance.
Hi,
assuming you use a RDBMS, the table design is pretty weird. You should have at least three tables: one for ingredients, one for receipes and one respesenting the many-to-many relation between ingredient and receipe combined with holding the needed amount / weight of the ingredient for the given receipe. Then it will be way easier to query the database using SQL.
Regards, noisefloor
Hi
Firstly, sorry if it didn't sound like I had tried to solve the problem myself & this is for a personal learning project as a newbie.
noisefloor I did initially try something like that with 3 tables as you suggested, but I think with a large amount of data it was quite slow using a SQL query.
I do already have something similar working in a simpler way where only the recipe id's sorted by matches to ingredients in input string are returned see underneath. The second database table column is simpler for this with just the recipe ids for each ingredient:
UsersIngredientList= UsersIngredientList.split(",")
recipeidlist=()
counter=0
while counter<len(UsersIngredientList):
cursor.execute("select recipe_ids from ingredient where display_name = '"+UsersIngredientList[counter]+"'")
recipeidlist = sum((recipeidlist, cursor.fetchone()), ())
counter+=1
cursor.close()
recipeidlist = ''.join(recipeidlist)
recipeidlist=recipeidlist.replace("</id>","")
recipeidlist = list(recipeidlist.split("<id>"))
recipeidlist.pop(0)
runtime = round(time.time() - start_time, 2)
print ("Getting list of recipe ID's from database for users ingredients took ", runtime," seconds to run")
start_time = time.time()
CleanedandSortedRecipeIDList = Counter(recipeidlist).most_common()
runtime = round(time.time() - start_time, 2)
print ("Sorting the recipe ID list by frequency then removing duplicates took ", runtime," seconds to run")
start_time = time.time()
if len(CleanedandSortedRecipeIDList)>500:
CleanedandSortedRecipeIDList= CleanedandSortedRecipeIDList[0:500]
runtime = round(time.time() - start_time, 2)
print ("Limiting recipe ID list to 500 items took ", runtime," seconds to run")
start_time = time.time()
counter = 0
CleanedandSortedRecipeIDListString=""
while counter<len(CleanedandSortedRecipeIDList):
CleanedandSortedRecipeIDListString= CleanedandSortedRecipeIDListString+CleanedandSortedRecipeIDList[counter][0]+","
counter = counter+1
CleanedandSortedRecipeIDListString = CleanedandSortedRecipeIDListString[:-1]
runtime = round(time.time() - start_time, 2)
print ("Building a string from all the recipe ID's for the SQL statement took ", runtime," seconds to run")
(Feb-11-2023, 08:22 PM)Timbo03 Wrote: [ -> ]I did initially try something like that with 3 tables as you suggested, but I think with a large amount of data it was quite slow using a SQL query
I disagree with you. Relational databases are specifically designed to store, combine, select and extract data. I would be surprised if you can do this faster with an interpreted language like Python.
Maybe I'll give that another try then but still not very confident about that solution tbh.
Hi,
Quote:I did initially try something like that with 3 tables as you suggested, but I think with a large amount of data it was quite slow using a SQL query.
That's hard to imagine - assuming the tables, relations and indexes were set-up correctly. Do you still have the
CREATE TABLE
statements you used?
Except this: 1600 lines is not a lot of data at all. Database are geared towards hundered thounds or millions of data base rows. Also for in-memory processing and possible manually iterating of the data 1600 lines are not a lot. What can make it slow is using slow algorithms or a poor set-up of the DB tables.
Regards, noisefloor
Hi,
Firstly thank you again for your help, it's much appreciated.
These are the create statements for the two tables in the database ;
CREATE TABLE ingredient
(
display_name
varchar(100) CHARACTER SET utf8 COLLATE utf8_unicode_ci DEFAULT NULL,
recipe_ids
mediumtext CHARACTER SET utf8 COLLATE utf8_unicode_ci,
id
int NOT NULL AUTO_INCREMENT,
PRIMARY KEY (id
)
) ENGINE=InnoDB AUTO_INCREMENT=1726 DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;
CREATE TABLE recipe
(
id
int NOT NULL AUTO_INCREMENT,
name
varchar(200) CHARACTER SET utf8 COLLATE utf8_unicode_ci DEFAULT NULL,
different_ingredients
int DEFAULT NULL,
picture_url
varchar(300) CHARACTER SET utf8 COLLATE utf8_unicode_ci DEFAULT NULL,
rating
float DEFAULT NULL,
rating_quantity
int DEFAULT NULL,
method
varchar(10000) CHARACTER SET utf8 COLLATE utf8_unicode_ci DEFAULT NULL,
servings
varchar(100) CHARACTER SET utf8 COLLATE utf8_unicode_ci DEFAULT NULL,
PRIMARY KEY (id
)
) ENGINE=InnoDB AUTO_INCREMENT=43276 DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;
~30,000 recipes does mean there would be potentially a lot of rows in a third recipe_ingredient table. E.g. say each recipe on average has 8 different ingredients, that's 240,000 rows. That might not sound like much, but I am also interested in making the API perform well under load/multiple users. I have a load testing tool for this so it is easy for me to test different strategies out.
Hi,
you certainly do not use the capabilities of a rational database, which are relations. Or, to say it the other way round: ypur database design is no good.
You should have at least three tables: one for ingredients, one for receipe and (put a good name here) holding in each a row an ingredeint, the correspding receipt and the quantity of the ingredient needed for the receip. This is what is called a "many-to-many relationship" in SQL. There are plenty of websites out there explaing how that works.
I'm not sure to understand correctly what your rating system should look like, but you may need more tables to implement a proper rating system.
I'm not sure what "method" is for, but if it is soming like "pan frying, cooking, stewing, ...", this would be another table with a one-to-many relationship (each receipe has exactly one method.
I also would suggest to unify all receips to a fixed number of servings (e.g. two) and calculate the ingredients for more or less servings based on that. That is eaier to implement.
Regards, noisefloor