Scraping with BeautifulSoup
Python Forum (https://python-forum.io), Forum: Web Scraping & Web Development (https://python-forum.io/forum-13.html), Thread: Scraping with BeautifulSoup (/thread-4755.html)

Scraping with BeautifulSoup - Prince_Bhatia - Sep-06-2017
Hi, I am trying to scrape the website "https://www.newegg.com/Product/ProductList.aspx?Submit=ENE&N=-1&IsNodeId=1&Description=GTX&bop=And&Page=1&PageSize=36&order=BESTMATCH".
I want to scrape each product's name, its price and its image link. It partly works, but with one problem: the name, price and image end up scattered across the cells, so the formatting is very poor. Can someone help me amend the code so that the name lands in the name column, the price in the price column and the image in the image column?

    from urllib.request import urlopen
    from bs4 import BeautifulSoup

    #page_url = "https://www.newegg.com/Product/ProductList.aspx?Submit=ENE&N=-1&IsNodeId=1&Description=GTX&bop=And&Page=1&PageSize=36&order=BESTMATCH"
    #html = urlopen(page_url)
    #bs0bj = BeautifulSoup(html, "html.parser")
    #page_details = bs0bj.find_all("div", {"class":"item-container"})
    f = open("Scrapedetails.csv", "w")
    Headers = "Item_Name, Price, Image\n"
    f.write(Headers)
    #for i in page_details:
    #    Item_Name = i.find("a", {"class":"item-title"})
    #    Price = i.find("li", {"class":"price-current"})
    #    Image = i.find("img")
    #    Name_item = Item_Name.get_text()
    #    Prin = Price.get_text()
    #    imgf = Image["src"]# to get the key src
    #    f.write("{}".format(Name_item)+ ",{}".format(Prin)+ ",{}".format(imgf))
    #f.close()
    for page in range(1,15):
        page_url = "https://www.newegg.com/Product/ProductList.aspx?Submit=ENE&N=-1&IsNodeId=1&Description=GTX&bop=And&Page={}&PageSize=36&order=BESTMATCH".format(page)
        html = urlopen(page_url)
        bs0bj = BeautifulSoup(html, "html.parser")
        page_details = bs0bj.find_all("div", {"class":"item-container"})
        for i in page_details:
            Item_Name = i.find("a", {"class":"item-title"})
            Price = i.find("li", {"class":"price-current"})
            Image = i.find("img")
            Name_item = Item_Name.get_text()
            Prin = Price.get_text()
            imgf = Image["src"]# to get the key src
            f.write("{}".format(Name_item)+ ",{}".format(Prin)+ ",{}".format(imgf)+ "\n")
    f.close()

I am attaching the Excel file too.
Also, what are the newer ways to save data to CSV? Can someone help me with that, with code too?

RE: Scraping with BeautifulSoup - metulburr - Sep-06-2017
It looks like there are newlines somewhere in the strings you are writing, and they are messing up the CSV file. Find the newlines and remove them before writing to the file. If they come before the first character or after the last one, you can use str.strip() to remove them.

RE: Scraping with BeautifulSoup - Prince_Bhatia - Sep-07-2017
Nope, no success with strip, sir. I cannot even find the newline. I tried everything but got no success, and I am not sure how to solve it.

RE: Scraping with BeautifulSoup - Larz60+ - Sep-07-2017
To detect any symbols:
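
The snippet that belonged here is missing from this printable version. As a stand-in, and not Larz60+'s original code, here is a minimal sketch of one common way to make hidden characters visible: print the string with repr(), then list every whitespace character that is not a plain space. The sample value `raw` is made up for illustration; in the scraper it would be the result of get_text().

```python
# Hypothetical scraped value with a newline hidden in the middle of the text.
raw = "\r\n  MSI GeForce GT 730,\n 2GB Video Card \n"

# repr() shows escape sequences such as \n and \r instead of rendering them.
print(repr(raw))

# strip() only trims the two ends, so the inner "\n" survives.
stripped = raw.strip()
print(repr(stripped))

# Collect (index, character) for every whitespace character other than a space.
hidden = [(i, repr(ch)) for i, ch in enumerate(stripped)
          if ch.isspace() and ch != " "]
print(hidden)
```

This also shows why str.strip() alone may not help: it never touches characters in the middle of the string.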

RE: Scraping with BeautifulSoup - metulburr - Sep-07-2017
It looks like the newline is within the string, not at the beginning or the end.
Quote: There is no comma in here, so this appears to be one element.
    //images10.newegg.com/NeweggImage/ProductImageCompressAll300/A85V_1_20170906967475116.jpg Refurbished: MSI GeForce GT 730 DirectX 12 N730K-2GD5LP/OC 2GB 64-Bit GDDR5 PCI Express 2.0 x16 HDCP Ready V$ deo Card
Also, I just noticed after running your program over and over that it triggered a captcha for me, causing your script to fail.

RE: Scraping with BeautifulSoup - Prince_Bhatia - Sep-07-2017
(Sep-07-2017, 11:15 AM)metulburr Wrote: it looks like the newline is within the string, not at the beginning or the end.
I even tried putting commas into the f.write at the bottom, but the result was the same. What should I do? How do I solve it?

RE: Scraping with BeautifulSoup - metulburr - Sep-07-2017
Adding commas is not going to change it, because the newlines in the content are being passed to your CSV file and creating new rows. You could split the strings by newlines to "remove" them and then join them back together before writing to the file:
    >>> ''.join('text\ntest'.split())
    'texttest'
or replace the newlines in the string:
    >>> "line 1\nline 2\n...".replace('\n', '')
    'line 1line 2...'

RE: Scraping with BeautifulSoup - snippsat - Sep-07-2017
You have to be more exact to get clean data before you loop and write.
In this example you get a clean price out, and https: is added so the image links will work.

    from urllib.request import urlopen
    from bs4 import BeautifulSoup

    page_url = "https://www.newegg.com/Product/ProductList.aspx?Submit=ENE&N=-1&IsNodeId=1&Description=GTX&bop=And&Page=1&PageSize=36&order=BESTMATCH"
    html = urlopen(page_url)
    bs0bj = BeautifulSoup(html, "html.parser")
    page_details = bs0bj.find_all("div", {"class":"item-container"})
    for i in page_details:
        Item_Name = i.find("a", {"class":"item-title"})
        Price = i.find("li", {"class":"price-current"})
        Image = i.find("img")
        Name_item = Item_Name.get_text()
        imgf = Image["src"]
        # Fix
        #print(Name_item)
        print(Price.find('strong').text)
        #print('https:{}'.format(imgf))

Print image will be:

RE: Scraping with BeautifulSoup - Prince_Bhatia - Sep-07-2017
Alright, I got it solved, guys. Just replace "," with "|" in the product name while writing. Below is the code:

    from urllib.request import urlopen
    from bs4 import BeautifulSoup

    f = open("Scrapedetails.csv", "w")
    Headers = "Item_Name, Price, Image\n"
    f.write(Headers)
    for page in range(1,15):
        page_url = "https://www.newegg.com/Product/ProductList.aspx?Submit=ENE&N=-1&IsNodeId=1&Description=GTX&bop=And&Page={}&PageSize=36&order=BESTMATCH".format(page)
        html = urlopen(page_url)
        bs0bj = BeautifulSoup(html, "html.parser")
        page_details = bs0bj.find_all("div", {"class":"item-container"})
        for i in page_details:
            Item_Name = i.find("a", {"class":"item-title"})
            Price = i.find("li", {"class":"price-current"}).find('strong')
            Image = i.find("img")
            Name_item = Item_Name.get_text().strip()
            prin = Price.get_text()
            imgf = Image["src"]# to get the key src
            print(Name_item)
            print(prin)
            print('https:{}'.format(imgf))
            f.write("{}".format(Name_item).replace(",", "|")+ ",{}".format(prin)+ ",https:{}".format(imgf)+ "\n")
    f.close()

Thank you so much, everybody who helped with this code. This is the very best platform for all Python lovers who want to become great programmers. I am also attaching the end result.
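
The earlier question about other ways to save data to CSV was not answered in the thread. One option, offered here as a sketch and not as anything a poster proposed, is the standard library csv module: csv.writer quotes fields that contain commas or newlines, so the product name does not need "," swapped for "|". The helper `clean`, the sample `rows` and the example.com image URLs are hypothetical stand-ins for the scraped values, and io.StringIO stands in for the real output file. The helper uses the split/join idea suggested earlier in the thread, but joins on a space so that words stay separated.

```python
import csv
import io


def clean(text):
    """Collapse every run of whitespace, newlines included, into one space."""
    return " ".join(text.split())


# Hypothetical rows standing in for the scraped name, price and image src.
rows = [
    ("MSI GeForce GT 730,\n 2GB Video Card", "59", "//images.example.com/a.jpg"),
    ("EVGA GeForce GTX 1060", "249", "//images.example.com/b.jpg"),
]

# In the scraper this would be something like:
#   open("Scrapedetails.csv", "w", newline="", encoding="utf-8")
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["Item_Name", "Price", "Image"])
for name, price, image in rows:
    writer.writerow([clean(name), clean(price), "https:" + image])

print(buffer.getvalue())

# Reading the text back yields exactly three fields per row,
# with the comma inside the first product name preserved.
parsed = list(csv.reader(io.StringIO(buffer.getvalue())))
```

When writing to a real file, the csv documentation recommends opening it with newline="" so the writer controls the line endings itself.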