Learning WebScraping

I am very new to web scraping, and I am learning it online.
The website I am trying to scrape is http://econpy.pythonanywhere.com/ex/001.html
I have written code that will scrape it:
from bs4 import BeautifulSoup
from urllib.request import urlopen

url = urlopen("http://econpy.pythonanywhere.com/ex/001.html")

def getTitle():
    global url
    bs0bj = BeautifulSoup(url, "html.parser")
    for i in bs0bj.find_all(title="buyer-name"):
        print(i.get_text())
getTitle()


#def getTitle():
#    global url
#    bs0bj = BeautifulSoup(url, "html.parser")
#    for i in bs0bj.find_all(title="buyer-info"):
#        print(i.get_text())
#getTitle()

def getPrice():
    global url
    bs0bj = BeautifulSoup(url, "html.parser")
    for i in bs0bj.find_all("span", {"class":"item-price"}):
        print(i.get_text())
getPrice()
Now I have a few questions (please uncomment the code blocks above to follow along):
Q1: When I run this code using buyer-info it prints the price; how do I also get the data from the next pages?
Q2: Why doesn't it print the price when run individually (just the buyer name)?
Q3: How do I write this data into a CSV file?
(Aug-25-2017, 11:57 AM)Prince_Bhatia Wrote: Q1: When I run this code using buyer-info it prints the price; how do I also get the data from the next pages?
>>> url = "http://econpy.pythonanywhere.com/ex/00{}.html"
>>> for page in range(1,4):
...     print(url.format(page))
...     
http://econpy.pythonanywhere.com/ex/001.html
http://econpy.pythonanywhere.com/ex/002.html
http://econpy.pythonanywhere.com/ex/003.html
Prince_Bhatia Wrote: Q2: Why doesn't it print the price when run individually (just the buyer name)?
It prints the price for me when I run it.
Prince_Bhatia Wrote: Q3: How do I write this data into a CSV file?
You have to think about how to separate the data.
I can show an example where I use zip() on name and price.
Then item[0] and item[1] can be written together to a CSV file.
from bs4 import BeautifulSoup
import requests

def name_price(url):
    soup = BeautifulSoup(url, "html.parser")
    # zip() pairs each buyer name with the matching price
    for item in zip(soup.find_all(title="buyer-name"), soup.find_all("span", {"class":"item-price"})):
        print(item[0].text, item[1].text)

if __name__ == '__main__':
    url = 'http://econpy.pythonanywhere.com/ex/001.html'
    url = requests.get(url).content  # download the page content once
    name_price(url)
Output:
Carson Busses $29.95
Earl E. Byrd $8.37
Patty Cakes $15.26
Derri Anne Connecticut $19.2
.............
Edit:
You see that global is gone, and that url is given as an argument.
urllib is gone; I use Requests instead.
If you install lxml (pip install lxml), change this line to:
soup = BeautifulSoup(url, 'lxml')
Then you are using lxml, which is a faster parser.
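For the CSV step itself, here is a minimal sketch, assuming the standard csv module and the same zip() pairing as above (the file name buyers.csv is just an example):
from bs4 import BeautifulSoup
import csv
import requests

def save_csv(url, fname):
    soup = BeautifulSoup(requests.get(url).content, "html.parser")
    with open(fname, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["Name", "Price"])  # header row
        # write each zipped (name, price) pair as one CSV row
        for item in zip(soup.find_all(title="buyer-name"),
                        soup.find_all("span", {"class": "item-price"})):
            writer.writerow([item[0].text, item[1].text])

if __name__ == '__main__':
    save_csv('http://econpy.pythonanywhere.com/ex/001.html', 'buyers.csv')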
Hi,

Thank you for your answer. When I used that quoted code it prints only the links, but what if I want the content inside those pages, which is the same as on the 1st page?

How can I print the data inside these pages using this code:

Quote:for page in range(1,4):
...     print(url.format(page))

Since I am new to web scraping I don't have much familiarity with lxml, so I am going through the Python libraries one by one.
(Aug-28-2017, 06:47 AM)Prince_Bhatia Wrote: Hi,

Thank you for your answer. When I used that quoted code it prints only the links, but what if I want the content inside those pages?

How can I print the data inside these pages using this code:
Quote:for page in range(1,4):
...     print(url.format(page))
He was showing you how to loop the pages to get the links for all of them. You still have to load each link with BeautifulSoup.
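A minimal sketch of that, reusing the name_price() function from the earlier post and assuming the same three pages as the range(1,4) loop:
from bs4 import BeautifulSoup
import requests

def name_price(html):
    soup = BeautifulSoup(html, "html.parser")
    for name, price in zip(soup.find_all(title="buyer-name"),
                           soup.find_all("span", {"class": "item-price"})):
        print(name.text, price.text)

url = "http://econpy.pythonanywhere.com/ex/00{}.html"
for page in range(1, 4):
    html = requests.get(url.format(page)).content  # load each page...
    name_price(html)                               # ...then parse it with BeautifulSoup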
Hi,

I have written this code, but it is not working, for the same website:
from bs4 import BeautifulSoup
from urllib.request import urlopen

page_url = "http://econpy.pythonanywhere.com/ex/001.html"
new_file = "graphics_cards.csv"
f = open(new_file, "w")
Headers = "Header1, Header2\n"
f.write(Headers)


html = urlopen(page_url)
soup = BeautifulSoup(html, "html.parser")
buyer_info = soup.find_all("div", {"title":"buyer-info"})
for i in buyer_info:
    Header1 = soup.find_all("div", {"title":"buyer_name"})
    Header2 = soup.find_all("spam", {"class":"item-price"})
    print("Header1" + Header1)
    print("Header2"+ Header2)
    f.write(Header1 + Header2+"\n")
f.close()
       
But it is giving an error. Without adding any additional code, how do I make it work?
Please give your error next time.

The error I get is from your attempt to concatenate a string object with a bs4.element.ResultSet (i.e. a list).

If you want to inject the content for printing, use the format method like this:
    print("Header1 {}".format(Header1))
    print("Header2 {}".format(Header2))
    f.write('{} {}\n'.format(Header1, Header2))
Note: you have typos in your search criteria: buyer_name should be buyer-name, and spam should be span.
(Aug-28-2017, 12:22 PM)Prince_Bhatia Wrote: But it is giving an error. Without adding any additional code, how do I make it work?
You cannot use soup.find_all() inside the loop.
You have to use i.find() in "for i in buyer_info:", as buyer_info already holds all the info.
You have to call .text before you can write anything.
Never write anything to a file before you have done a test print() of the output.
An example getting the name:
from bs4 import BeautifulSoup
from urllib.request import urlopen
 
page_url = "http://econpy.pythonanywhere.com/ex/001.html"
new_file = "graphics_cards.csv"
#f = open(new_file, "w")
Headers = "Header1, Header2\n"
#f.write(Headers) 
 
html = urlopen(page_url)
soup = BeautifulSoup(html, "html.parser")
buyer_info = soup.find_all("div", {"title":"buyer-info"})

# Using i and find() with .text
for i in buyer_info:
    print(i.find('div', {"title":"buyer-name"}).text)
Output:
Carson Busses
Earl E. Byrd
Patty Cakes
Derri Anne Connecticut
.........
Thank you so much for all the help, but now I am back where I started: how do I write this to CSV or Excel?

I know this is painful, but since the majority of people have questions about how to scrape multiple pages and how to write the results to Excel, one simple example can help everyone who browses this forum for help.

Can you please amend this code, using the above libraries, to write the data to Excel and to scrape the next pages as well? You could make a real difference.
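One possible way to put those pieces together, as a sketch: loop the pages and write every row to a single CSV file (which Excel can open). The page range and file name are assumptions:
from bs4 import BeautifulSoup
import csv
import requests

url = "http://econpy.pythonanywhere.com/ex/00{}.html"
with open("buyers.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["Name", "Price"])
    for page in range(1, 4):  # assuming three pages, as in the loop above
        soup = BeautifulSoup(requests.get(url.format(page)).content, "html.parser")
        for info in soup.find_all("div", {"title": "buyer-info"}):
            name = info.find("div", {"title": "buyer-name"}).text
            price = info.find("span", {"class": "item-price"}).text
            writer.writerow([name, price])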
Alright, I got this far:

Quote:from bs4 import BeautifulSoup
from urllib.request import urlopen

page_url = "http://econpy.pythonanywhere.com/ex/001.html"
new_file = "Mynew.csv"
f = open(new_file, "w")
Headers = "Header1, Header2\n"
f.write(Headers)


html = urlopen(page_url)
soup = BeautifulSoup(html, "html.parser")
buyer_info = soup.find_all("div", {"title":"buyer-info"})
for i in buyer_info:
    Header1 = i.find("div", {"title":"buyer-name"})
    Header2 = i.find("span", {"class":"item-price"})
    salmon = print(Header1.get_text())
    salam = print(Header2.get_text())
    f.write("{}".format(salmon) + "{}".format(salam))
f.close()
Now it throws no error, but the output contains only Header1, Header2, and None values.
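The None values come from the two assignments: print() always returns None, so salmon and salam are None. A minimal fix, sketched here, is to write the .get_text() values directly, with a comma and newline so each pair becomes one CSV row:
from bs4 import BeautifulSoup
from urllib.request import urlopen

page_url = "http://econpy.pythonanywhere.com/ex/001.html"
f = open("Mynew.csv", "w")
f.write("Header1, Header2\n")

soup = BeautifulSoup(urlopen(page_url), "html.parser")
for i in soup.find_all("div", {"title": "buyer-info"}):
    Header1 = i.find("div", {"title": "buyer-name"})
    Header2 = i.find("span", {"class": "item-price"})
    print(Header1.get_text(), Header2.get_text())  # test print first
    # write the text itself, not the return value of print()
    f.write("{},{}\n".format(Header1.get_text(), Header2.get_text()))
f.close()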