Scraping with BeautifulSoup - Printable Version

Scraping with BeautifulSoup - Prince_Bhatia - Sep-06-2017

Hi,

I am trying to scrape the website "https://www.newegg.com/Product/ProductList.aspx?Submit=ENE&N=-1&IsNodeId=1&Description=GTX&bop=And&Page=1&PageSize=36&order=BESTMATCH"

I want to scrape each product's name, its price, and its image link.

I have had partial success, but with one problem: the name, price, and image end up scattered across cells in the CSV instead of each landing in its own column.

Can someone help me amend the code so that the name goes in the name column, the price in the price column, and the image in the image column?

from urllib.request import urlopen
from bs4 import BeautifulSoup

#page_url = "https://www.newegg.com/Product/ProductList.aspx?Submit=ENE&N=-1&IsNodeId=1&Description=GTX&bop=And&Page=1&PageSize=36&order=BESTMATCH"
#html = urlopen(page_url)
#bs0bj = BeautifulSoup(html, "html.parser")

#page_details = bs0bj.find_all("div", {"class":"item-container"})

f = open("Scrapedetails.csv", "w")
Headers = "Item_Name, Price, Image\n"
f.write(Headers)

#for i in page_details:
#    Item_Name = i.find("a", {"class":"item-title"})
#    Price = i.find("li", {"class":"price-current"})
#    Image = i.find("img")
#    Name_item = Item_Name.get_text()
#    Prin = Price.get_text()
#    imgf = Image["src"]# to get the key src 
#    f.write("{}".format(Name_item)+ ",{}".format(Prin)+ ",{}".format(imgf))
#f.close()

for page in range(1,15):
    page_url = "https://www.newegg.com/Product/ProductList.aspx?Submit=ENE&N=-1&IsNodeId=1&Description=GTX&bop=And&Page={}&PageSize=36&order=BESTMATCH".format(page)
    html = urlopen(page_url)
    bs0bj = BeautifulSoup(html, "html.parser")
    page_details = bs0bj.find_all("div", {"class":"item-container"})
    for i in page_details:
        Item_Name = i.find("a", {"class":"item-title"})
        Price = i.find("li", {"class":"price-current"})
        Image = i.find("img")
        Name_item = Item_Name.get_text()
        Prin = Price.get_text()
        imgf = Image["src"]# to get the key src 
        f.write("{}".format(Name_item)+ ",{}".format(Prin)+ ",{}".format(imgf)+ "\n")
f.close()

I am attaching the Excel file too. Also, what are the newer ways to save data to CSV? Can someone help me with code for that as well?

RE: Scraping with BeautifulSoup - metulburr - Sep-06-2017

It looks like there are newlines somewhere in the strings you are writing, and they are messing up the CSV file. Find the newlines and remove them before writing to the file. If they are before the first character or after the last one, you can use str.strip() to remove them.

RE: Scraping with BeautifulSoup - Prince_Bhatia - Sep-07-2017

Nope, no success with strip, sir. I was not even able to find the newline. I tried everything but had no success, and I am not sure how to solve it.

RE: Scraping with BeautifulSoup - Larz60+ - Sep-07-2017

To detect any hidden symbols:

Load the HTML page into Notepad++.
Select View --> Show Symbol --> Show All Characters.

The EOL and other characters will be highlighted.
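
The same check can be done from Python itself, since repr() prints control characters as visible escape sequences. This is only a minimal sketch: the sample string below is made up to mimic a scraped title, and in the real script the value would come from get_text().

```python
# Hypothetical scraped text; the real value would come from get_text().
scraped = "\r\n  MSI GeForce GT 730\nVideo Card  \r\n"

# repr() shows control characters such as \r and \n as escape sequences.
visible = repr(scraped)
print(visible)

# Count how many newline characters are hiding in the string.
newline_count = scraped.count("\n")
print(newline_count)
```

If the count is above zero after calling strip(), the remaining newlines are inside the string rather than at its ends.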

RE: Scraping with BeautifulSoup - metulburr - Sep-07-2017

It looks like the newline is within the string, not at the beginning or the end.

Quote:

//images10.newegg.com/NeweggImage/ProductImageCompressAll300/A85V_1_20170906967475116.jpg

Refurbished: MSI GeForce GT 730 DirectX 12 N730K-2GD5LP/OC 2GB 64-Bit GDDR5 PCI Express 2.0 x16 HDCP Ready V$
deo Card

There is no comma in here, so this appears to be one element.

Also, I just noticed after running your program over and over that it triggered a captcha for me, causing your script to fail.

RE: Scraping with BeautifulSoup - Prince_Bhatia - Sep-07-2017

I even tried adding commas to the f.write call at the bottom, but the result was the same. What should I do? How do I solve it?

RE: Scraping with BeautifulSoup - metulburr - Sep-07-2017

Adding commas is not going to change it, because the newlines in the content are being passed to your CSV file and are creating new rows.

You could split the strings by newlines to "remove" them and then join the pieces back together before writing to the file:

>>> ''.join('text\ntest'.split())
'texttest'

Or replace the newlines in the string:

>>> "line 1\nline 2\n...".replace('\n', '')
'line 1line 2...'
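
One caveat with the split/join approach: split() with no arguments splits on every kind of whitespace, so joining with an empty string also glues separate words together. A variation that joins with a single space keeps the words apart. This is a sketch only; the helper name collapse_whitespace and the sample text are made up for illustration.

```python
def collapse_whitespace(text):
    """Replace every run of whitespace (spaces, tabs, newlines) with one space."""
    # str.split() with no arguments splits on any whitespace and drops empty pieces.
    return " ".join(text.split())


# Hypothetical product text with newlines and extra spaces inside it.
cleaned = collapse_whitespace("  HDCP Ready\r\n  Video   Card \n")
print(cleaned)
```

In the scraping loop this could be applied to the result of get_text() before the value is written to the file.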

RE: Scraping with BeautifulSoup - snippsat - Sep-07-2017

You have to be more exact to get clean data before you loop and write.
In this example you get a clean price out,
and https: is added so the image links will work.

from urllib.request import urlopen
from bs4 import BeautifulSoup

page_url = "https://www.newegg.com/Product/ProductList.aspx?Submit=ENE&N=-1&IsNodeId=1&Description=GTX&bop=And&Page=1&PageSize=36&order=BESTMATCH"
html = urlopen(page_url)
bs0bj = BeautifulSoup(html, "html.parser")
page_details = bs0bj.find_all("div", {"class":"item-container"})
for i in page_details:
    Item_Name = i.find("a", {"class":"item-title"})
    Price = i.find("li", {"class":"price-current"})
    Image = i.find("img")
    Name_item = Item_Name.get_text()
    imgf = Image["src"]

    # Fix
    #print(Name_item)
    print(Price.find('strong').text)
    #print('https:{}'.format(imgf))

Output:
179
479
589
579
479
559
469
489
...

Printing the image link gives:

Output:
https://images10.newegg.com/ProductImageCompressAll300/14-487-292-06.jpg
https://images10.newegg.com/NeweggImage/ProductImageCompressAll300/14-487-321-S99.jpg
https://images10.newegg.com/NeweggImage/ProductImageCompressAll300/14-487-319-S99.jpg
https://images10.newegg.com/NeweggImage/ProductImageCompressAll300/14-487-318-S99.jpg
.........
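
The src values on this page start with "//", which means "use the same scheme as the page". Prefixing "https:" by hand works for those; as an alternative sketch, urljoin from the standard library resolves such links against the page URL and leaves links that are already absolute unchanged. The base_url below is a shortened form of the page URL, used here only for illustration.

```python
from urllib.parse import urljoin

# Shortened page URL, used as the base for resolving relative links.
base_url = "https://www.newegg.com/Product/ProductList.aspx"

# A protocol-relative src, as it appears in the scraped img tags.
src = "//images10.newegg.com/NeweggImage/ProductImageCompressAll300/14-487-321-S99.jpg"

# urljoin copies the scheme from base_url when the link starts with "//".
full_url = urljoin(base_url, src)
print(full_url)

# A link that already has a scheme is returned as-is.
unchanged = urljoin(base_url, "https://images10.newegg.com/ProductImageCompressAll300/14-487-292-06.jpg")
print(unchanged)
```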

RE: Scraping with BeautifulSoup - Prince_Bhatia - Sep-07-2017

Alright, I got it solved, guys. The fix was to replace "," with "|" in the product name when writing it, so that commas inside the name no longer split it across columns.

Below is the code:

from urllib.request import urlopen
from bs4 import BeautifulSoup

f = open("Scrapedetails.csv", "w")
Headers = "Item_Name, Price, Image\n"
f.write(Headers)

for page in range(1,15):
    page_url = "https://www.newegg.com/Product/ProductList.aspx?Submit=ENE&N=-1&IsNodeId=1&Description=GTX&bop=And&Page={}&PageSize=36&order=BESTMATCH".format(page)
    html = urlopen(page_url)
    bs0bj = BeautifulSoup(html, "html.parser")
    page_details = bs0bj.find_all("div", {"class":"item-container"})
    for i in page_details:
        Item_Name = i.find("a", {"class":"item-title"})
        Price = i.find("li", {"class":"price-current"}).find('strong')
        Image = i.find("img")
        Name_item = Item_Name.get_text().strip()
        prin = Price.get_text()
        imgf = Image["src"]# to get the key src 
        

        print(Name_item)
        print(prin)
        print('https:{}'.format(imgf))
        f.write("{}".format(Name_item).replace(",", "|")+ ",{}".format(prin)+ ",https:{}".format(imgf)+ "\n")
f.close()

Thank you so much, everybody who helped with this code. This is the very best platform for all Python lovers who want to become great programmers.

I am also attaching the end result.
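
On the earlier question about other ways to save data to CSV: replacing commas with "|" works, but Python's built-in csv module can do the quoting itself, so commas and even newlines inside a field stay in one cell. The following is a minimal sketch, not the code used in this thread. The row data and the example.com link are made up, and an in-memory buffer stands in for the real file so the round trip can be checked.

```python
import csv
import io

# Hypothetical scraped rows; the name contains both a comma and a newline.
rows = [
    ("MSI GeForce GTX 1060, 6GB\nVideo Card", "179", "https://example.com/a.jpg"),
]

# Stand-in for open("Scrapedetails.csv", "w", newline="") in a real script.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["Item_Name", "Price", "Image"])
writer.writerows(rows)  # fields with commas or newlines are quoted automatically

# Read the data back to confirm each value stayed in its own column.
buffer.seek(0)
parsed = list(csv.reader(buffer))
print(parsed)
```

When writing to a real file, the csv documentation recommends opening it with newline="" so that line endings are not translated a second time.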