Posts: 168
Threads: 42
Joined: May 2019
Aug-29-2023, 07:52 PM
(This post was last modified: Sep-01-2023, 01:11 PM by cubangt.)
I have the logic below and it returns results, but when writing to the CSV, each data value is written to its own row instead of being grouped into columns and rows.
import requests
from bs4 import BeautifulSoup
import pandas as pd
productdetails = []
headers = {
    'User-Agent' : 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36'
}
df2 = pd.DataFrame(columns=['Description', 'Price1', 'Price2'],index=range(1))
for x in range(1,180):
    response = requests.get(f'https://www.site.com/c/mens/mens-footwear?&page_{x}', verify=False, headers=headers)
    soup = BeautifulSoup(response.content,'lxml')
    element_list = soup.find_all('div',class_='product-content')
    for element in element_list:
        for link in element.find_all('a', class_='product-card-simple-title'):
            productdetails.append("Description: " + link.get_text().strip())
            for price in element.find_all('span',class_='sr-only'):
                productdetails.append("Price1: " + price.get_text().strip().replace('\n', '').replace(' ','').replace('dollars','.').replace('cents',''))
                if len(element.find_all('span',class_='sr-only')) == 2:
                    productdetails.append("Price2: " + price.get_text().strip().replace('\n', '').replace(' ','').replace('dollars','.').replace('cents',''))
print(productdetails)
I'm currently working on checking whether there is more than one price, so that I can add the second price to the dataframe.
CSV data currently:
0,Description: Brooks Men's Adrenaline GTS 23 Running Shoes
1,Price1: 139.99
2,Description: Nike Men's Revolution 6 Next Nature Running Shoes
3,Price1: 22.37
4,Price2: 44.97
Expected CSV data:
Description, Price1, Price2
Brooks Men's Adrenaline GTS 23 Running Shoes, 139.99,
Nike Men's Revolution 6 Next Nature Running Shoes,22.37,44.97
Posts: 168
Threads: 42
Joined: May 2019
How can I correct my code to capture the dataset as one row per result?
For the life of me I can't figure out why it's being captured the way it is now. I have written other results to CSV and didn't do anything special to get one row per set of results.
Posts: 6,250
Threads: 16
Joined: Feb 2020
Sep-01-2023, 12:46 PM
(This post was last modified: Sep-01-2023, 12:46 PM by deanhystad.)
What do you see if you print productdetails? That should show you the error.
Why are you using pandas? I don't see where you get anything from putting productdetails in a dataframe. Just write the strings to a file.
Posts: 168
Threads: 42
Joined: May 2019
Here is what the console shows when I print productdetails. This is just a portion of the results it printed.
"Description: Justin Men's Rugged Bay Gaucho EH Steel Toe Wellington Work Boots", 'Price1: 249.99', "Description: Justin Men's Rugged Bay Gaucho EH Wellington Work Boots", 'Price1: 239.99', "Description: Ariat Men's Cascade Steel Toe Lace Up Work Boots", 'Price1: 169.99', "Description: Ariat Men's Terrain H2O Lace Up Work Boots", 'Price1: 139.99', "Description: Ariat Men's Rambler Western Soft Toe Boots", 'Price1: 179.99', "Description: Ariat Men's Sport Outfitter Western Boots", 'Price1: 179.99', "Description: Ariat Men's Heritage Roper Western Boots", 'Price1: 159.99', "Description: Ariat Men's Heritage Crepe Western Boots", 'Price1: 199.99', "Description: Ariat Men's Quickdraw Western Boots", 'Price1: 224.99', "Description: Ariat Men's Tycoon Western Boots", 'Price1: 244.99', "Description: Ariat Men's Mesteno Western Boots", 'Price1: 219.99', 'Description: Softspikes Pins Cleat Kit', 'Price1: 24.99', "Description: Sperry Men's Brewster Duck Boots", 'Price1: 109.99', "Description: Thorogood Shoes Men's American Heritage 6 in Wedge Lace Up Work Boots", 'Price1: 215.', "Description: Thorogood Shoes Men's American Heritage 6 in Moc Toe Wedge Lace Up Work Boots", 'Price1: 215.', "Description: Justin Men's Hybred Turq EH Steel Toe Wellington Work Boots", 'Price1: 214.99', "Description: Dexter Men's Pro AM II Bowling Shoes", 'Price1: 54.99', "Description: Chippewa Boots Men's Bay Apache EH Steel Toe Lace Up Work Boots", "Description: Chippewa Boots Men's Briar Insulated EH Steel Toe Lace Up Work Boots", "Description: Chippewa Boots Men's Insulated Logger Lace Up Work Boots", 'Price1: 219.99', "Description: Chippewa Boots Men's Insulated EH Steel Toe Lace Up Work Boots", "Description: Chippewa Boots Men's EH Steel Toe Lace Up Work Boots", "Description: Chippewa Boots Men's Bay Apache Utility EH Composite Toe Lace Up Work Boots", "Description: Chippewa Boots Men's Heavy Duty Tough Bark Utility EH Lace Up Work Boots", "Description: Chippewa Boots Men's Engineer EH Steel 
Toe Wellington Work Boots", 'Price1: 265.99', "Description: Danner Men's Duty Tanicus Tactical Boots", "Description: Tony Lama Men's Suntan Century Americana Western Boots", 'Price1: 254.99', "Description: Tony Lama Men's Worn Goat Americana Western Boots", 'Price1: 254.99', "Description: Tony Lama Men's Pecan Bison Americana Western Boots", 'Price1: 254.99', "Description: Tony Lama Men's Stallion Americana Western Boots", 'Price1: 254.99', "Description: Chippewa Boots Men's Rugged Outdoor Snake Boots", 'Price1: 299.99']
Posts: 6,250
Threads: 16
Joined: Feb 2020
Sep-01-2023, 01:03 PM
(This post was last modified: Sep-01-2023, 03:41 PM by deanhystad.)
productdetails can be one of three things:
1: A list of strings, where each string is a row in the CSV file.
2: A list of lists, where each list contains the information for one item.
3: A list of dictionaries, where each dictionary is the information for one item.
1. Make strings that are CSV-formatted rows:
productdetails = [
    "Justin Men's Rugged Bay Gaucho EH Steel Toe Wellington Work Boots|249.99|249.99",
    "Ariat Men's Cascade Steel Toe Lace Up Work Boots|169.99|171.99"
]
There is a problem with your prices. Unless all products have the same number of prices, the resulting CSV will be ragged (not the same number of columns in each row). Very few tools can read a ragged CSV file because it is not tabular. I suggest getting all the prices, sorting them, and only including the low and high price.
Since each string in productdetails is already a CSV-formatted string, you don't need to do any conversion. Open the file, write a header line, and then write all the product details.
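A minimal sketch of that "open the file, write a header, write the rows" step, using only plain file I/O. The helper name write_rows and the header text "Description|Low|High" are illustrative assumptions, not something from your code:

```python
def write_rows(path, header, rows):
    """Write a header line, then each pre-formatted row on its own line."""
    # newline="" keeps the "\n" we write from being translated by the OS layer.
    with open(path, "w", encoding="utf-8", newline="") as f:
        f.write(header + "\n")
        for row in rows:
            f.write(row + "\n")


# Each string is already a complete "|"-separated row (approach 1).
productdetails = [
    "Justin Men's Rugged Bay Gaucho EH Steel Toe Wellington Work Boots|249.99|249.99",
    "Ariat Men's Cascade Steel Toe Lace Up Work Boots|169.99|171.99",
]
```

Calling write_rows('ASO.csv', 'Description|Low|High', productdetails) would then produce a three-line file. Note that this only works if the separator never appears inside a description, which is one reason to pick "|" over ",".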
2. productdetails is a list of lists:
productdetails = [
    ["Justin Men's Rugged Bay Gaucho EH Steel Toe Wellington Work Boots", 249.99, 249.99],
    ["Ariat Men's Cascade Steel Toe Lace Up Work Boots", 169.99, 171.99]
]
To write the CSV file you would first construct a dataframe:
df = pd.DataFrame(productdetails, columns=["Description", "Low", "High"])
df.to_csv('ASO.csv', sep="|", index=False)
3. productdetails is a list of dictionaries:
productdetails = [
    {"Description": "Justin Men's Rugged Bay Gaucho EH Steel Toe Wellington Work Boots", "Low": 249.99, "High": 249.99},
    {"Description": "Ariat Men's Cascade Steel Toe Lace Up Work Boots", "Low": 169.99, "High": 171.99}
]
To write the CSV file you would first construct a dataframe:
df = pd.DataFrame(productdetails)
df.to_csv('ASO.csv', sep="|", index=False)
What you are doing is an odd mix of approaches 1 and 3. productdetails is a list of strings, but you format the strings to look like dictionary entries. Formatting a string to look like a dictionary does not turn it into a dictionary. Plus, there is nothing that groups your descriptions with their prices.
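To illustrate what "grouping" means here, this is a rough sketch that turns a flat list of "Description: ..." / "Price1: ..." strings into one dictionary per product. The function name group_records is made up for the example, and it assumes every product starts with a Description entry, which matches the printed output above but is not guaranteed:

```python
def group_records(flat):
    """Group 'Key: value' strings into one dict per product.

    A new record starts at every 'Description' entry; price entries that
    follow are attached to the most recent record.
    """
    records = []
    for item in flat:
        key, _, value = item.partition(": ")
        if key == "Description":
            records.append({"Description": value, "Price1": None, "Price2": None})
        elif records and key in ("Price1", "Price2"):
            records[-1][key] = value
    return records


flat = [
    "Description: Brooks Men's Adrenaline GTS 23 Running Shoes",
    "Price1: 139.99",
    "Description: Nike Men's Revolution 6 Next Nature Running Shoes",
    "Price1: 22.37",
    "Price2: 44.97",
]
records = group_records(flat)
```

The result is a list of dictionaries (approach 3), so pd.DataFrame(records) would give one row per product, with None where a second price is missing. Building the dictionaries directly while scraping is cleaner than repairing the flat list afterwards.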
Posts: 168
Threads: 42
Joined: May 2019
I updated the original post to reflect my latest attempt at fixing this.
Since I found that it returns two prices for some products but not all, I want to capture both prices if there are two, and only one if there is only one.
If there is something obvious, I'm not seeing it.
The only thing I can think of is that I need to change how I capture the data into the list. Maybe use a multi-dimensional list?
Posts: 6,250
Threads: 16
Joined: Feb 2020
Sep-01-2023, 03:40 PM
(This post was last modified: Sep-01-2023, 03:40 PM by deanhystad.)
Quote: Since I found that it returns two prices for some products but not all, I want to capture both prices if there are two, and only one if there is only one.
You cannot do this. Every item must have the same number of prices, either one or two. You could make the second price None, but you need the same number of prices for each item so that each row has the same number of columns.
Another option is to put all the prices in one column. This makes price a string that looks like it holds multiple price values. That is the only way you could have an undefined number of prices. If you are only going to have two, use None for the second price when needed.
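Both options can be sketched in a few lines. The helper names pad_prices and join_prices, and the ";" separator, are illustrative choices and not part of any library:

```python
def pad_prices(prices, width=2):
    """Return exactly `width` prices: extra ones are dropped, gaps become None."""
    prices = list(prices)[:width]
    return prices + [None] * (width - len(prices))


def join_prices(prices, sep=";"):
    """Put any number of prices into a single string column."""
    return sep.join(str(p) for p in prices)


one_price = pad_prices(["139.99"])
two_prices = pad_prices(["22.37", "44.97"])
single_column = join_prices(["22.37", "44.97"])
```

With pad_prices every row ends up with the same two price columns, so the CSV stays tabular. With join_prices the file stays tabular too, but whoever reads it has to split the string again.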
I think you are using the wrong approach to this problem. I have done no web scraping myself, but I don't think Beautiful Soup is the correct tool. Instead of using BS4, I think most examples use JSON.
productdetails = []
for x in range(1,180):
    response = requests.get(f'https://www.site.com/c/mens/mens-footwear?&page_{x}', verify=False, headers=headers)
    data = response.json()
    for product in data["product-content"]:
        prices = list(product["not-sure-what-goes-here"])
        prices = (prices + [None, None])[:2]  # Pad/slice prices to length 2
        productdetails.append({
            "Description": product["product-card-simple-title"],
            "Price 1": prices[0],
            "Price 2": prices[1]})
This code is completely untested, but I would at least look into using JSON to get lists and dictionaries instead of using BS4 to parse HTML.
Posts: 168
Threads: 42
Joined: May 2019
So I got closer, and the results are a lot cleaner, but when validating that the correct prices are being captured, it seems price2 is being set to the same value as price1. I'm not sure why it's not picking up the second element when there are two prices. If I add the index, it fails with the error below; if I remove the index, I get the nice list below, just with the wrong prices.
Error:
price2 = price[1].get_text().strip().replace('\n', '').replace(' ','').replace('dollars','.').replace('cents','')
File ~\Miniconda3\lib\site-packages\bs4\element.py:1573 in __getitem__
return self.attrs[key]
KeyError: 1
27 Nike Men's Vapor Edge Pro 360 2 Football Cleats 124.99 None
28 Nike Men's React Infinity 3 Running Shoes 159.99 159.99
29 adidas 3-Stripe Crew Socks 3 Pack 13.99 13.99
30 ASICS Men's Jolt 3 Running Shoes 44.99 44.99
31 Sof Sole Team Performance Adults' Baseball Soc... 9.99 None
32 Nike Adult Force Trout 8 Keystone Mid RM Baseb... 49.99 49.99
33 Nike Men's Air Max Systm Shoes 99.99 99.99
34 Crocs™ Adults' Classic Clogs 49.99 49.99
35 adidas Men's Freak Spark Football Cleats 49.99 49.99
36 ASICS Men's GEL-VENTURE 8 Trail Running Shoes 49.99 49.99
37 Wolverine Men's Potomac 2 EH Steel Toe Lace Up... 99.99 99.99
38 Sof Sole Team Men's Performance Football Socks... 9.99 None
39 adidas Men's Galaxy 6 Running Shoes 49.99 49.99
40 Under Armour Men's Blur Smoke 2.0 MC Football ... 49.99 49.99
41 ASICS Men's GT-1000 11 Running Shoes 99.99 99.99
42 Nike Youth Vapor Edge Shark 2 Football Cleats 99.99 99.99
43 adidas Rivalry Over The Calf Socks 2 Pack 11.99 None
44 adidas Men’s adizero Spark Football Cleats 69.99 69.99
45 Under Armour Men's Charged Assert 10 Running S... 69.99 69.99
Posts: 6,250
Threads: 16
Joined: Feb 2020
Sep-02-2023, 01:03 AM
(This post was last modified: Sep-02-2023, 01:03 AM by deanhystad.)
I was looking at your code in the first post. Your indentation is wrong. Not syntactically wrong, but logically wrong. This indentation makes it look like the prices are part of the description:
element_list = soup.find_all('div',class_='product-content')
for element in element_list:
    for link in element.find_all('a', class_='product-card-simple-title'):
        productdetails.append("Description: " + link.get_text().strip())
        for price in element.find_all('span',class_='sr-only'):
            productdetails.append("Price1: " + price.get_text().strip().replace('\n', '').replace(' ','').replace('dollars','.').replace('cents',''))
            if len(element.find_all('span',class_='sr-only')) == 2:
                productdetails.append("Price2: " + price.get_text().strip().replace('\n', '').replace(' ','').replace('dollars','.').replace('cents',''))
This is not true. The description and prices are both parts of the element. I would write it like this:
for product in soup.find_all('div', class_='product-content'):
    name = product.find('a', class_='product-card-simple-title').get_text().strip()
    price = [
        p.get_text().strip().replace('\n', '').replace(' ', '').replace('dollars', '.').replace('cents', '')
        for p in product.find_all('span', class_='sr-only', limit=2)
    ]
    while len(price) < 2:  # Pad with None so every row has two price columns
        price.append(None)
    productdetails.append([name, price[0], price[1]])
This indentation shows that the description and prices are attributes of the product (which is a much better variable name than element).
Even if there is more than one product description, you only use one. Why find_all when you only want to find one? And if you are only going to use 2 prices, why return more than 2?
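Here is a small self-contained sketch of the difference between find and find_all. It also shows where the KeyError: 1 in the traceback above comes from: indexing a single Tag looks up an HTML attribute, while indexing the list returned by find_all selects an element. The HTML snippet is made up for the example (the real page markup is an assumption), and it uses the built-in html.parser so lxml is not required:

```python
from bs4 import BeautifulSoup

# Hypothetical markup; the structure of the real page is an assumption.
html = """
<div class="product-content">
  <a class="product-card-simple-title">Example Shoe</a>
  <span class="sr-only">22 dollars 37 cents</span>
  <span class="sr-only">44 dollars 97 cents</span>
</div>
"""


def clean(tag):
    """Turn '22 dollars 37 cents' into '22.37'."""
    text = tag.get_text().strip()
    return text.replace('\n', '').replace(' ', '').replace('dollars', '.').replace('cents', '')


soup = BeautifulSoup(html, "html.parser")
product = soup.find("div", class_="product-content")

# find() returns one Tag (or None); find_all() returns a list of Tags.
name = product.find("a", class_="product-card-simple-title").get_text().strip()
spans = product.find_all("span", class_="sr-only", limit=2)
prices = [clean(span) for span in spans]  # index the list: spans[0], spans[1]

# Indexing a Tag is an attribute lookup, so an integer key fails.
tag_index_error = None
try:
    spans[0][1]
except KeyError as err:
    tag_index_error = err
```

So if price is the loop variable from "for price in ...find_all(...)", then price[1] asks that one span for an attribute named 1. Index the find_all result instead of the individual Tag.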
Posts: 168
Threads: 42
Joined: May 2019
Definitely looks cleaner, but now my dataframe comes back empty. No errors, just empty. And I do see the issue with the indent.
soup = BeautifulSoup(response.content,'lxml')
for product in soup.find_all('div',class_='product-content'):
    name = product.find('a', class_='product-card-simple-title').get_text().strip()
    price = [
        p.get_text().strip().replace('\n', '').replace(' ', '').replace('dollars', '.').replace('cents', '')
        for p in product.find_all('span', class_='sr-only', limit=2)
    ]
    if len(price)<2:
        price.append[None]
    productdetails.append([name, price[0], price[1]])
    #products.append({'Description:':descrip,'Price1:':price1,'Price2:':price2})
df = pd.DataFrame(productdetails)
df.to_csv('ASO.csv',index=False)
print(df)
Console output:
Empty DataFrame
Columns: []
Index: []