Capturing BS4 values into DF and writing to CSV

cubangt · (This post was last modified: Sep-01-2023, 01:11 PM by cubangt.)

I have the logic below, which does return results, but when writing to the CSV each value is written to its own row instead of being grouped into rows and columns.

import requests
from bs4 import BeautifulSoup
import pandas as pd

productdetails = []

headers = {
    'User-Agent' : 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36'
}

df2 = pd.DataFrame(columns=['Description', 'Price1', 'Price2'],index=range(1))

for x in range(1,180):
    response = requests.get(f'https://www.site.com/c/mens/mens-footwear?&page_{x}', verify=False, headers=headers)
    soup = BeautifulSoup(response.content,'lxml')

    element_list = soup.find_all('div',class_='product-content')
    for element in element_list:
        for link in element.find_all('a', class_='product-card-simple-title'):
            productdetails.append("Description: " + link.get_text().strip())
            for price in element.find_all('span',class_='sr-only'):
                productdetails.append("Price1: " + price.get_text().strip().replace('\n', '').replace(' ','').replace('dollars','.').replace('cents',''))
                if len(element.find_all('span',class_='sr-only')) == 2:
                    productdetails.append("Price2: " + price.get_text().strip().replace('\n', '').replace(' ','').replace('dollars','.').replace('cents',''))

print(productdetails)

I'm currently working on checking whether there is more than one price, so that I can add the second price to the dataframe.

CSV data currently:

0,Description: Brooks Men's Adrenaline GTS 23 Running Shoes
1,Price1: 139.99
2,Description: Nike Men's Revolution 6 Next Nature Running Shoes
3,Price1: 22.37
4,Price2: 44.97

Expected CSV data:

Description, Price1, Price2
Brooks Men's Adrenaline GTS 23 Running Shoes, 139.99,
Nike Men's Revolution 6 Next Nature Running Shoes,22.37,44.97

cubangt · Aug-31-2023, 02:57 PM

How can I correct my code to capture the dataset as one row per result?

For the life of me I can't figure out why it's being captured the way it is now. I have written other results to CSV and didn't do anything special to get one row per set of results.

**deanhystad** · (This post was last modified: Sep-01-2023, 12:46 PM by deanhystad.)

What do you see if you print productdetails? That should show you the error.

Why are you using pandas? I don't see where you get anything from putting productdetails in a dataframe. Just write the strings to a file.

cubangt · Sep-01-2023, 12:53 PM

Here is what the console shows when I print productdetails. This is just a portion of what it printed.

"Description: Justin Men's Rugged Bay Gaucho EH Steel Toe Wellington Work Boots", 'Price1: 249.99', "Description: Justin Men's Rugged Bay Gaucho EH Wellington Work Boots", 'Price1: 239.99', "Description: Ariat Men's Cascade Steel Toe Lace Up Work Boots", 'Price1: 169.99', "Description: Ariat Men's Terrain H2O Lace Up Work Boots", 'Price1: 139.99', "Description: Ariat Men's Rambler Western Soft Toe Boots", 'Price1: 179.99', "Description: Ariat Men's Sport Outfitter Western Boots", 'Price1: 179.99', "Description: Ariat Men's Heritage Roper Western Boots", 'Price1: 159.99', "Description: Ariat Men's Heritage Crepe Western Boots", 'Price1: 199.99', "Description: Ariat Men's Quickdraw Western Boots", 'Price1: 224.99', "Description: Ariat Men's Tycoon Western Boots", 'Price1: 244.99', "Description: Ariat Men's Mesteno Western Boots", 'Price1: 219.99', 'Description: Softspikes Pins Cleat Kit', 'Price1: 24.99', "Description: Sperry Men's Brewster Duck Boots", 'Price1: 109.99', "Description: Thorogood Shoes Men's American Heritage 6 in Wedge Lace Up Work Boots", 'Price1: 215.', "Description: Thorogood Shoes Men's American Heritage 6 in Moc Toe Wedge Lace Up Work Boots", 'Price1: 215.', "Description: Justin Men's Hybred Turq EH Steel Toe Wellington Work Boots", 'Price1: 214.99', "Description: Dexter Men's Pro AM II Bowling Shoes", 'Price1: 54.99', "Description: Chippewa Boots Men's Bay Apache EH Steel Toe Lace Up Work Boots", "Description: Chippewa Boots Men's Briar Insulated EH Steel Toe Lace Up Work Boots", "Description: Chippewa Boots Men's Insulated Logger Lace Up Work Boots", 'Price1: 219.99', "Description: Chippewa Boots Men's Insulated EH Steel Toe Lace Up Work Boots", "Description: Chippewa Boots Men's EH Steel Toe Lace Up Work Boots", "Description: Chippewa Boots Men's Bay Apache Utility EH Composite Toe Lace Up Work Boots", "Description: Chippewa Boots Men's Heavy Duty Tough Bark Utility EH Lace Up Work Boots", "Description: Chippewa Boots Men's Engineer EH Steel 
Toe Wellington Work Boots", 'Price1: 265.99', "Description: Danner Men's Duty Tanicus Tactical Boots", "Description: Tony Lama Men's Suntan Century Americana Western Boots", 'Price1: 254.99', "Description: Tony Lama Men's Worn Goat Americana Western Boots", 'Price1: 254.99', "Description: Tony Lama Men's Pecan Bison Americana Western Boots", 'Price1: 254.99', "Description: Tony Lama Men's Stallion Americana Western Boots", 'Price1: 254.99', "Description: Chippewa Boots Men's Rugged Outdoor Snake Boots", 'Price1: 299.99']

**deanhystad** · (This post was last modified: Sep-01-2023, 03:41 PM by deanhystad.)

productdetails can be one of 3 things:
1: A list of strings, where each string is a row in the CSV file.
2: A list of lists, where each list contains the information for one item.
3: A list of dictionaries, where each dictionary is the information for one item.

1. Make strings that are CSV formatted rows

productdetails = [
    "Justin Men's Rugged Bay Gaucho EH Steel Toe Wellington Work Boots|249.99|249.99",
    "Ariat Men's Cascade Steel Toe Lace Up Work Boots|169.99|171.99"
]

There is a problem with your prices. Unless all products have the same number of prices, the resulting CSV will be ragged (not same number of columns for each row). Very few tools can read a ragged CSV file because it is not tabular. I suggest getting all the prices, sorting, and only including the low and high price.

Since each string in productdetails is a CSV format string, you don't need to do any conversion. Open the file, write a header line, and then all the product details.
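A rough sketch of that, using only the standard library. The `write_rows` helper, the header text, and the output file name are made up for illustration; the two rows are the sample strings from above.

```python
import os
import tempfile

# Each entry is already one delimited row: description|low|high
productdetails = [
    "Justin Men's Rugged Bay Gaucho EH Steel Toe Wellington Work Boots|249.99|249.99",
    "Ariat Men's Cascade Steel Toe Lace Up Work Boots|169.99|171.99",
]

def write_rows(path, rows, header="Description|Low|High"):
    """Write a header line followed by one pre-formatted row per line."""
    with open(path, "w", encoding="utf-8") as f:
        f.write(header + "\n")
        for row in rows:
            f.write(row + "\n")

# Hypothetical output location; any writable path would do.
csv_path = os.path.join(tempfile.gettempdir(), "ASO_strings.csv")
write_rows(csv_path, productdetails)
```

This only works if no description contains the separator character, which is one reason to pick "|" over ",".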

2. productdetails is list of lists:

productdetails = [
    ["Justin Men's Rugged Bay Gaucho EH Steel Toe Wellington Work Boots", 249.99, 249.99],
    ["Ariat Men's Cascade Steel Toe Lace Up Work Boots", 169.99, 171.99]
]

To write the CSV file you would first construct a dataframe:

df = pd.DataFrame(productdetails, columns=("Description", "Low", "High"))
df.to_csv('ASO.csv', sep="|", index=False)

3. productdetails is a list of dictionaries.

productdetails = [
    {"Description": "Justin Men's Rugged Bay Gaucho EH Steel Toe Wellington Work Boots", "Low": 249.99, "High": 249.99},
    {"Description": "Ariat Men's Cascade Steel Toe Lace Up Work Boots", "Low": 169.99, "High": 171.99}
]

To write the CSV file you would first construct a dataframe:

df = pd.DataFrame(productdetails)
df.to_csv('ASO.csv', sep="|", index=False)

What you are doing is an odd mix of approaches 1 and 3. productdetails is a list of strings, but you format the strings to look like dictionary entries. Formatting a string to make it look like a dictionary does not turn it into a dictionary. Plus you have the problem that there is nothing that groups your descriptions and prices.
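To make the grouping point concrete, here is a sketch that turns a flat list like yours into one dictionary per product and writes it with the standard `csv` module. The `group_rows` helper is hypothetical, and the input is the sample data from the first post, not real scraper output. Building the dictionaries directly while scraping would be cleaner than repairing the flat list afterwards.

```python
import csv
import io

# Flat list in the shape shown in the first post
flat = [
    "Description: Brooks Men's Adrenaline GTS 23 Running Shoes",
    "Price1: 139.99",
    "Description: Nike Men's Revolution 6 Next Nature Running Shoes",
    "Price1: 22.37",
    "Price2: 44.97",
]

def group_rows(items):
    """Start a new dict at every Description entry and attach the prices that follow it."""
    rows = []
    for item in items:
        key, _, value = item.partition(": ")
        if key == "Description":
            rows.append({"Description": value, "Price1": None, "Price2": None})
        elif rows and key in ("Price1", "Price2"):
            rows[-1][key] = value
    return rows

rows = group_rows(flat)

# Write to an in-memory buffer here; a real file opened with newline="" works the same way.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["Description", "Price1", "Price2"])
writer.writeheader()
writer.writerows(rows)  # None is written as an empty field
print(buffer.getvalue())
```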

cubangt · Sep-01-2023, 01:14 PM

I updated the original post to reflect my latest attempt at fixing this.

Since I found that it returns 2 prices for some products but not all, I want to capture both prices if there are 2, and only one if there is only one.

If there is something obvious, I'm not seeing it.

The only thing I can think of is that I need to change how I capture the values into the list. Maybe use a multi-dimensional list?

**deanhystad** · (This post was last modified: Sep-01-2023, 03:40 PM by deanhystad.)

Quote:Since i found that it was returning 2 prices for some and not all, i want to capture both prices if there are 2 and only one if only one..

You cannot do this. Everything must have 2 prices or 1 price. You could make the second price None, but you need the same number of prices for each item so each row will have the same number of columns.

Another option is to put all the prices in one column. This makes price a string that holds multiple price values, and it is the only way to store an undefined number of prices. If you are only going to have 2, use None for the second price when needed.
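Both options are only a few lines each. A minimal sketch, where `pad_prices` and `join_prices` are names I made up and the ";" separator is an arbitrary choice:

```python
def pad_prices(prices, width=2):
    """Return exactly `width` prices: extras are dropped, missing ones become None."""
    trimmed = list(prices)[:width]
    return trimmed + [None] * (width - len(trimmed))

def join_prices(prices, sep=";"):
    """Alternative: keep every price, packed into a single string column."""
    return sep.join(str(p) for p in prices)

print(pad_prices(["139.99"]))
print(pad_prices(["22.37", "44.97", "50.00"]))
print(join_prices(["22.37", "44.97"]))
```

Padding keeps every row the same width, so the CSV stays tabular. Joining keeps all the prices, but whoever reads the file has to split that column again.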

I think you are using the wrong approach to this problem. I have done no web-scraping myself, but I don't think beautiful soup is the correct tool. Instead of using BS4 I think most examples use json.

productdetails = []
for x in range(1,180):
    response = requests.get(f'https://www.site.com/c/mens/mens-footwear?&page_{x}', verify=False, headers=headers)
    data = response.json()
    for product in data["product-content"]:
        prices = list(product["not-sure-what-goes-here"])
        if len(prices) == 1:  # Pad/slice prices to length 2
            prices.append(None)
        else:
            prices = prices[:2]
        productdetails.append({
            "Description": product["product-card-simple-title"],
            "Price 1": prices[0],
            "Price 2": prices[1]})

This code is completely untested, but I would at least look into using json to get lists and dictionaries instead of using BS4 to parse html.

cubangt · Sep-01-2023, 05:03 PM

So I got closer, and the results are a lot cleaner, but when validating that the correct prices are captured, it seems price2 is being set to the same value as price1. I'm not sure why it's not picking up the second element when there are 2 prices. If I add the index, it fails with the error below; if I remove the index, I get the nice list below, just with the wrong prices.

Error:    price2 = price[1].get_text().strip().replace('\n', '').replace(' ','').replace('dollars','.').replace('cents','')

  File ~\Miniconda3\lib\site-packages\bs4\element.py:1573 in __getitem__
    return self.attrs[key]

KeyError: 1

27    Nike Men's Vapor Edge Pro 360 2 Football Cleats  124.99    None
28          Nike Men's React Infinity 3 Running Shoes  159.99  159.99
29                  adidas 3-Stripe Crew Socks 3 Pack   13.99   13.99
30                   ASICS Men's Jolt 3 Running Shoes   44.99   44.99
31  Sof Sole Team Performance Adults' Baseball Soc...    9.99    None
32  Nike Adult Force Trout 8 Keystone Mid RM Baseb...   49.99   49.99
33                     Nike Men's Air Max Systm Shoes   99.99   99.99
34                       Crocs™ Adults' Classic Clogs   49.99   49.99
35           adidas Men's Freak Spark Football Cleats   49.99   49.99
36      ASICS Men's GEL-VENTURE 8 Trail Running Shoes   49.99   49.99
37  Wolverine Men's Potomac 2 EH Steel Toe Lace Up...   99.99   99.99
38  Sof Sole Team Men's Performance Football Socks...    9.99    None
39                adidas Men's Galaxy 6 Running Shoes   49.99   49.99
40  Under Armour Men's Blur Smoke 2.0 MC Football ...   49.99   49.99
41               ASICS Men's GT-1000 11 Running Shoes   99.99   99.99
42      Nike Youth Vapor Edge Shark 2 Football Cleats   99.99   99.99
43          adidas Rivalry Over The Calf Socks 2 Pack   11.99    None
44         adidas Men’s adizero Spark Football Cleats   69.99   69.99
45  Under Armour Men's Charged Assert 10 Running S...   69.99   69.99

**deanhystad** · (This post was last modified: Sep-02-2023, 01:03 AM by deanhystad.)

I was looking at your code in the first post. Your indentation is wrong. Not syntactically wrong, but logically wrong. This indentation makes it look like the prices are part of the description:

element_list = soup.find_all('div',class_='product-content')
for element in element_list:
    for link in element.find_all('a', class_='product-card-simple-title'):
        productdetails.append("Description: " + link.get_text().strip())
        for price in element.find_all('span',class_='sr-only'):
            productdetails.append("Price1: " + price.get_text().strip().replace('\n', '').replace(' ','').replace('dollars','.').replace('cents',''))
            if len(element.find_all('span',class_='sr-only')) == 2:
                productdetails.append("Price2: " + price.get_text().strip().replace('\n', '').replace(' ','').replace('dollars','.').replace('cents',''))

This is not true. The description and prices are both parts of the element. I would write like this:

for product in soup.find_all('div', class_='product-content'):
    name = product.find('a', class_='product-card-simple-title').get_text().strip()
    price = [
        p.get_text().strip().replace('\n', '').replace(' ', '').replace('dollars', '.').replace('cents', '')
        for p in product.find_all('span', class_='sr-only', limit=2)
    ]
    while len(price) < 2:  # pad to exactly two entries, even if no price was found
        price.append(None)
    productdetails.append([name, price[0], price[1]])

This indentation shows that the description and prices are attributes of the product (which is a much better variable name than element).

Even if there is more than one product description, you only use one. Why find_all when you only want to find one? And if you are only going to use 2 prices, why return more than 2?
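The long replace chain could also move into a small helper so it is written only once. This is a sketch, and it assumes the sr-only text looks roughly like "139 dollars 99 cents", which I am only inferring from your replace calls. `clean_price` is a made-up name, and stripping the trailing dot is an extra step that your code does not do.

```python
def clean_price(text):
    """Turn screen-reader text such as '139 dollars 99 cents' into '139.99'."""
    cleaned = (
        text.strip()
        .replace("\n", "")
        .replace(" ", "")
        .replace("dollars", ".")
        .replace("cents", "")
    )
    # Prices with no cents part come out as '215.'; drop the dangling dot.
    return cleaned.rstrip(".")

print(clean_price("139 dollars\n 99 cents"))
print(clean_price(" 215 dollars "))
```

The list comprehension would then call `clean_price(p.get_text())` for each span.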

cubangt · Sep-02-2023, 02:25 AM

Definitely looks cleaner, but now my dataframe comes back empty. No errors, just empty. And I do see the issue with the indentation.

    soup = BeautifulSoup(response.content,'lxml')

    for product in soup.find_all('div',class_='product-content'):
        name = product.find('a', class_='product-card-simple-title').get_text().strip()
        price = [
            p.get_text().strip().replace('\n', '').replace(' ', '').replace('dollars', '.').replace('cents', '')
            for p in product.find_all('span', class_='sr-only', limit=2)
        ]
        if len(price)<2:
            price.append[None]
        productdetails.append([name, price[0], price[1]])
        #products.append({'Description:':descrip,'Price1:':price1,'Price2:':price2})

df = pd.DataFrame(productdetails)
df.to_csv('ASO.csv',index=False)
print(df)

Console output:

Empty DataFrame
Columns: []
Index: []
