Python Forum
Scraping with BeautifulSoup
#1
Hi,

I am trying to scrape the website "https://www.newegg.com/Product/ProductList.aspx?Submit=ENE&N=-1&IsNodeId=1&Description=GTX&bop=And&Page=1&PageSize=36&order=BESTMATCH".

I want to scrape each product's name, its price, and its image link.

I have had some success, with one problem: the name, price, and image end up spread across every cell, so the output is badly formatted.

Can someone help me amend the code so that the name goes in the name column, the price in the price column, and the image in the image column?

from urllib.request import urlopen
from bs4 import BeautifulSoup

#page_url = "https://www.newegg.com/Product/ProductList.aspx?Submit=ENE&N=-1&IsNodeId=1&Description=GTX&bop=And&Page=1&PageSize=36&order=BESTMATCH"
#html = urlopen(page_url)
#bs0bj = BeautifulSoup(html, "html.parser")

#page_details = bs0bj.find_all("div", {"class":"item-container"})

f = open("Scrapedetails.csv", "w")
Headers = "Item_Name, Price, Image\n"
f.write(Headers)

#for i in page_details:
#    Item_Name = i.find("a", {"class":"item-title"})
#    Price = i.find("li", {"class":"price-current"})
#    Image = i.find("img")
#    Name_item = Item_Name.get_text()
#    Prin = Price.get_text()
#    imgf = Image["src"]# to get the key src 
#    f.write("{}".format(Name_item)+ ",{}".format(Prin)+ ",{}".format(imgf))
#f.close()

for page in range(1,15):
    page_url = "https://www.newegg.com/Product/ProductList.aspx?Submit=ENE&N=-1&IsNodeId=1&Description=GTX&bop=And&Page={}&PageSize=36&order=BESTMATCH".format(page)
    html = urlopen(page_url)
    bs0bj = BeautifulSoup(html, "html.parser")
    page_details = bs0bj.find_all("div", {"class":"item-container"})
    for i in page_details:
        Item_Name = i.find("a", {"class":"item-title"})
        Price = i.find("li", {"class":"price-current"})
        Image = i.find("img")
        Name_item = Item_Name.get_text()
        Prin = Price.get_text()
        imgf = Image["src"]# to get the key src 
        f.write("{}".format(Name_item)+ ",{}".format(Prin)+ ",{}".format(imgf)+ "\n")
f.close()
I am attaching the Excel file too. Also, what are the newer ways to save data to CSV? Can someone help me with that, with code too?
#2
It looks like there are newlines somewhere in the strings you are writing, and they are messing up the CSV file. Find the newlines and remove them before writing to the file. If they come before the first character or after the last one, you can use str.strip() to remove them.
#3
Nope, no success with strip, sir. I cannot even find the newline. I tried everything but had no success, and I am not sure how to solve it.
#4
To detect any hidden symbols:
  • load the HTML page into Notepad++
  • select View-->Show Symbol-->Show All Characters
The EOL and other characters will be highlighted.
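The same check can be done from Python without an editor, because repr() prints a string with its control characters escaped. This is only a minimal sketch: the sample string is made up for illustration and merely resembles what get_text() might return.

```python
# Hypothetical sample resembling text returned by get_text()
name = "\r\n  MSI GeForce GT 730\nVideo Card \n"

# repr() shows newlines, carriage returns and tabs as escape sequences
print(repr(name))

# Collect the index of every line-break character in the string
breaks = [(i, ch) for i, ch in enumerate(name) if ch in "\r\n"]
print(breaks)
```

Running repr() on each value just before f.write() should show which field carries the unwanted characters.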
#5
It looks like the newline is within the string, not at the beginning or the end.
Quote:
//images10.newegg.com/NeweggImage/ProductImageCompressAll300/A85V_1_20170906967475116.jpg

Refurbished: MSI GeForce GT 730 DirectX 12 N730K-2GD5LP/OC 2GB 64-Bit GDDR5 PCI Express 2.0 x16 HDCP Ready V$
deo Card
There is no comma in here, so this appears to be one element.
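This would also explain why str.strip() made no difference: it only trims whitespace at the two ends of a string. A minimal sketch, using a made-up string rather than the actual scraped value:

```python
# Hypothetical sample: line breaks around and inside the text
raw = "\n  Refurbished: MSI GeForce GT 730\r\nVideo Card \n"

stripped = raw.strip()

# Leading and trailing whitespace is gone, the interior "\r\n" is not
print(repr(stripped))
print("\n" in stripped)
```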

Also, I just noticed after running your program over and over that it triggered a captcha for me, causing your script to fail.
#6
(Sep-07-2017, 11:15 AM)metulburr Wrote: it looks like the newline is within the string, not at the beginning or the end.

I even tried putting commas in the f.write call at the bottom, but the result was the same. What should I do? How can I solve it?
#7
Adding commas is not going to change it, because the newlines in the content are passed to your CSV file, creating new rows.

You could split the strings on whitespace (which includes newlines) to "remove" them and then join the pieces back together before writing to the file:
>>> ''.join('text\ntest'.split())
'texttest'
Or replace the newlines in the string:
>>> "line 1\nline 2\n...".replace('\n', '')
'line 1line 2...'
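Note that both snippets above glue together the words on either side of the newline. When the words should stay separated, a common variant is to join on a single space instead. A minimal sketch; clean_text is a hypothetical helper name, not something from the code in this thread.

```python
def clean_text(text):
    """Collapse every run of whitespace (spaces, tabs, newlines) into one space."""
    return " ".join(text.split())


print(clean_text("line 1\nline 2\n..."))
print(clean_text("  MSI GeForce\r\n  GT 730 \n"))
```

Because split() with no arguments also discards leading and trailing whitespace, no separate strip() call is needed.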
#8
You have to be more exact to get clean data before you loop and write.
In this example you get a clean price out,
and https: is added so that the image links will work.
from urllib.request import urlopen
from bs4 import BeautifulSoup

page_url = "https://www.newegg.com/Product/ProductList.aspx?Submit=ENE&N=-1&IsNodeId=1&Description=GTX&bop=And&Page=1&PageSize=36&order=BESTMATCH"
html = urlopen(page_url)
bs0bj = BeautifulSoup(html, "html.parser")
page_details = bs0bj.find_all("div", {"class":"item-container"})
for i in page_details:
    Item_Name = i.find("a", {"class":"item-title"})
    Price = i.find("li", {"class":"price-current"})
    Image = i.find("img")
    Name_item = Item_Name.get_text()
    imgf = Image["src"]

    # Fix
    #print(Name_item)
    print(Price.find('strong').text)
    #print('https:{}'.format(imgf))
Output:
179 479 589 579 479 559 469 489 ...
Printing the image links gives:
Output:
https://images10.newegg.com/ProductImageCompressAll300/14-487-292-06.jpg https://images10.newegg.com/NeweggImage/ProductImageCompressAll300/14-487-321-S99.jpg https://images10.newegg.com/NeweggImage/ProductImageCompressAll300/14-487-319-S99.jpg https://images10.newegg.com/NeweggImage/ProductImageCompressAll300/14-487-318-S99.jpg .........
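Prefixing https: by hand works as long as every src value is scheme-relative (starts with //). A slightly more general option, sketched here, is urllib.parse.urljoin from the standard library, which also copes with src values that are already absolute or that are site-relative paths. BASE_URL and absolute_url are assumptions for illustration, and the last two sample paths are made up.

```python
from urllib.parse import urljoin

# Assumed base page against which relative links are resolved
BASE_URL = "https://www.newegg.com/"


def absolute_url(src):
    """Turn an image src into a full URL, whatever form it arrives in."""
    return urljoin(BASE_URL, src)


# Scheme-relative src, as seen in this thread
print(absolute_url("//images10.newegg.com/NeweggImage/ProductImageCompressAll300/14-487-321-S99.jpg"))
# Hypothetical src values in other forms
print(absolute_url("https://images10.newegg.com/a.jpg"))
print(absolute_url("/img/a.jpg"))
```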
#9
Alright, I got it solved, guys: just replace "," with "|" in the product name while writing.

Below is the code:

from urllib.request import urlopen
from bs4 import BeautifulSoup

f = open("Scrapedetails.csv", "w")
Headers = "Item_Name, Price, Image\n"
f.write(Headers)

for page in range(1,15):
    page_url = "https://www.newegg.com/Product/ProductList.aspx?Submit=ENE&N=-1&IsNodeId=1&Description=GTX&bop=And&Page={}&PageSize=36&order=BESTMATCH".format(page)
    html = urlopen(page_url)
    bs0bj = BeautifulSoup(html, "html.parser")
    page_details = bs0bj.find_all("div", {"class":"item-container"})
    for i in page_details:
        Item_Name = i.find("a", {"class":"item-title"})
        Price = i.find("li", {"class":"price-current"}).find('strong')
        Image = i.find("img")
        Name_item = Item_Name.get_text().strip()
        prin = Price.get_text()
        imgf = Image["src"]# to get the key src 
        

        print(Name_item)
        print(prin)
        print('https:{}'.format(imgf))
        f.write("{}".format(Name_item).replace(",", "|")+ ",{}".format(prin)+ ",https:{}".format(imgf)+ "\n")
f.close()
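An alternative to replacing commas by hand is the standard library's csv module, which quotes any field containing a comma, a quote character, or a line break so that it stays in one cell. This is a minimal sketch rather than the code used above: write_rows is a hypothetical helper, and the rows list is made-up data standing in for the scraped name, price, and image values.

```python
import csv
import io

# Made-up rows standing in for (name, price, image URL) scraped per item
rows = [
    ("Card A, 2GB, GDDR5", "179", "https://example.com/a.jpg"),
    ("Card B\nRefurbished", "479", "https://example.com/b.jpg"),
]


def write_rows(fileobj, rows):
    """Write a header and the data rows; csv.writer handles the quoting."""
    writer = csv.writer(fileobj)
    writer.writerow(["Item_Name", "Price", "Image"])
    writer.writerows(rows)


# In-memory demonstration so the sketch runs without touching the disk
buffer = io.StringIO(newline="")
write_rows(buffer, rows)
print(buffer.getvalue())

# Reading the text back recovers the original fields, commas and all
parsed = list(csv.reader(io.StringIO(buffer.getvalue(), newline="")))
print(parsed[1][0])
```

To write to disk instead, open the file with open("Scrapedetails.csv", "w", newline="", encoding="utf-8") and pass it to write_rows; the csv documentation recommends newline="" for files handed to csv.writer.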
Thank you so much, everybody who helped with this code. This is a great platform for all Python lovers who want to become great programmers.

I am also attaching the end result.

Attached Files

.csv   Scrapedetails.csv (Size: 7.11 KB / Downloads: 330)