Python Forum
Capturing BS4 values into DF and writing to CSV
#11
Now I get this error:

Error:
productdetails.append([name, price[0], price[1]])
IndexError: list index out of range
#12
Well, thank you for all the help. I got the following code working. It's returning the data needed BUT I'm having mixed results every time I run it.

import requests
import pandas as pd
from bs4 import BeautifulSoup

productdetails = []

for x in range(1, 186):
    # Fetch one page of the men's footwear listing and parse it
    response = requests.get(f'https://www.site.com/c/mens/mens-footwear?&page_{x}')
    soup = BeautifulSoup(response.content, 'lxml')

    products = soup.find_all('div', class_='product-content')

    for product in products:
        cart = ''
        prices = ''
        for link in product.find_all('a', class_='product-card-simple-title'):
            name = link.get_text().strip()

        for price in product.find_all('span', class_='sr-only'):
            prices = [
                p.get_text().strip().replace('\n', '').replace(' ', '').replace('dollars', '.').replace('cents', '')
                for p in product.find_all('span', class_='sr-only', limit=2)
            ]
        for c in product.find_all('strong', class_='css-1c9g7b0'):
            cart = c.get_text().strip()

        # Items marked "see in cart" have no visible price
        if len(prices) == 0 and len(cart) > 0:
            productdetails.append([name, None, cart])
        else:
            productdetails.append([name, prices, cart])

df = pd.DataFrame(productdetails, columns=['Description', 'Prices', 'See In Cart'])

df.to_csv('ASOout.csv', index=False)
By mixed results I mean that I can run it now and get a total of 115 results, then run it again a few minutes later and get 161 results. But when you look at the site and the number of pages, it should return just over 4,000 records from 187 pages.

Since there are no errors returned, how can I troubleshoot why it's returning a different number of records each run instead of the entire listing?
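One way to narrow it down is to log, for each page, the HTTP status, the final URL, and how many product blocks were actually parsed. A minimal sketch reusing the URL and class names from the code above (the logging itself is just an illustration):

import requests
from bs4 import BeautifulSoup

# Log per-page status and parsed-product counts to spot pages that come
# back empty or redirected (URL and class names are from the code above).
for x in range(1, 186):
    response = requests.get(f'https://www.site.com/c/mens/mens-footwear?&page_{x}')
    soup = BeautifulSoup(response.content, 'lxml')
    products = soup.find_all('div', class_='product-content')
    print(f'page {x}: status={response.status_code}, '
          f'final_url={response.url}, products={len(products)}')

Pages that report status 200 but zero products are the ones worth inspecting first.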
#13
This code is assigning a blank string to cart if no cart is found.
cart = ''
for c in product.find_all('strong', class_='css-1c9g7b0'):
    cart = c.get_text().strip()
It might not have been your intention, but that is what the code accomplishes. I would try to make the purpose more obvious.
cart = product.find('strong', class_='css-1c9g7b0')
cart = cart.get_text().strip() if cart else ''
If you are doing this for cart, why don't you do the same for name?

You are doing something similar for price, but this is really confusing.
for price in product.find_all('span', class_='sr-only'):
    prices = [
        p.get_text().strip().replace('\n', '').replace(' ', '').replace('dollars', '.').replace('cents', '')
        for p in product.find_all('span', class_='sr-only', limit=2)
    ]
This says "for every price I find, I am going to rebuild the list of all the prices." If there were 100 prices you would remake the list of 100 prices 100 times. This accomplishes the same thing without needlessly repeating the same search on the same product:
prices = [
    p.get_text().strip().replace('\n', '').replace(' ', '').replace('dollars', '.').replace('cents', '')
    for p in product.find_all('span', class_='sr-only', limit=2)
]
productdetails.append([name, prices if prices else None, cart])
#14
I was clearing the variable because I was getting both prices and cart values, and that's not valid; if the item has "see in cart", then there is no price to get.
The above code works and provides the expected results. The only thing that's not working is that it should be scraping 187 pages and returning around 4,000 results, but I'm getting a different number each run and it's always under 200 records.

I made the recommended changes and now I'm only getting 97 records. It's all valid data, just incomplete.

There has to be something wrong in the request. Should I check for a response from the page load before trying to read it?
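Checking the response before parsing is cheap to add. A minimal sketch of that idea, using a hypothetical fetch_page helper wrapped around the same URL pattern as above (the retry count and delay are arbitrary):

import time
import requests
from bs4 import BeautifulSoup

def fetch_page(page_number, retries=3, delay=5):
    """Fetch one listing page, retrying when the response is not a 200 OK."""
    url = f'https://www.site.com/c/mens/mens-footwear?&page_{page_number}'
    for _ in range(retries):
        response = requests.get(url)
        if response.status_code == 200:
            return BeautifulSoup(response.content, 'lxml')
        time.sleep(delay)  # back off before retrying
    return None  # give up on this page after exhausting the retries

soup = fetch_page(1)
if soup is None:
    print('page 1 never returned a 200 response')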
#15
I've added a 2 and 3 second sleep between requests and I still only get around 200 records...
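For what it's worth, a fixed sleep is easy for rate limiting to recognize; randomizing the delay between requests is a common tweak (a sketch only, the bounds are arbitrary):

import random
import time

# Sleep a random interval between page requests instead of a fixed one.
time.sleep(random.uniform(2.0, 5.0))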
#16
Found the problem: apparently the scraping is triggering their CAPTCHA, getting sent to a verification page, and then being redirected back to the page.

I have increased the sleep between pages to 10 seconds to see if that helps.
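One way to confirm which pages are being intercepted is to check whether the request was redirected and whether the expected product blocks are present at all. A rough sketch (the redirect and empty-page heuristics are assumptions, since the actual verification page markup isn't shown in this thread):

import requests
from bs4 import BeautifulSoup

def looks_like_captcha(page_number):
    """Heuristic: the request was redirected, or the product blocks are missing."""
    response = requests.get(
        f'https://www.site.com/c/mens/mens-footwear?&page_{page_number}')
    soup = BeautifulSoup(response.content, 'lxml')
    products = soup.find_all('div', class_='product-content')
    redirected = len(response.history) > 0
    return redirected or len(products) == 0

if looks_like_captcha(5):
    print('page 5 was probably intercepted by the verification page')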
#17
Yeah, so increasing it to 10 seconds didn't help. It is looping all 187 pages, but only parsing 5 or 6 pages, and the rest are all returning the CAPTCHA pages/links.
#18
Maybe it is time to abuse some other website.
#19
So the biggest issue I found, and I'm not sure how I didn't catch it sooner, is that I wasn't passing headers with the request. I added the headers and changed the delay down to 6 seconds, and it processed 64 pages before hitting the CAPTCHA pages. I read the robots.txt and there is nothing specific about the pages I'm looking at, and no time between calls to the pages is specified. So tonight I'll increase the time between requests to 10 seconds or higher to see if I can get more successful results.

Once confirmed, I'll post the working code and a snippet of the results in the CSV.

Thank you again for the suggestions and examples.
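For anyone landing here later, "passing headers" here means sending a browser-like User-Agent (and similar headers) on every request, which a requests.Session makes easy. A minimal sketch (the header values and the 10-second delay are illustrative, not the exact ones used above):

import time
import requests
from bs4 import BeautifulSoup

# Reuse one session so the headers (and any cookies) persist across requests.
session = requests.Session()
session.headers.update({
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)',  # example browser-like value
    'Accept-Language': 'en-US,en;q=0.9',
})

for x in range(1, 188):
    response = session.get(f'https://www.site.com/c/mens/mens-footwear?&page_{x}')
    soup = BeautifulSoup(response.content, 'lxml')
    # ... parse the product cards as in the earlier code ...
    time.sleep(10)  # pause between pages, per the plan above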