Learning WebScraping
#1
I am very new to web scraping and am learning it online.
The website I am trying to scrape is http://econpy.pythonanywhere.com/ex/001.html.
I have written this code to scrape it:

from bs4 import BeautifulSoup
from urllib.request import urlopen

url = urlopen("http://econpy.pythonanywhere.com/ex/001.html")

def getTitle():
    global url
    bs0bj = BeautifulSoup(url, "html.parser")
    for i in bs0bj.find_all(title="buyer-name"):
        print(i.get_text())
getTitle()


#def getTitle():
#    global url
#    bs0bj = BeautifulSoup(url, "html.parser")
#    for i in bs0bj.find_all(title="buyer-info"):
#        print(i.get_text())
#getTitle()

def getPrice():
    global url
    bs0bj = BeautifulSoup(url, "html.parser")
    for i in bs0bj.find_all("span", {"class":"item-price"}):
        print(i.get_text())
getPrice()
Now I have a few questions (please uncomment the commented code to try them):
Q1: When I run this code using buyer-info it prints the price too; how do I get the data from the next pages as well?
Q2: Why doesn't it print the price when run individually (just the buyer name)?
Q3: How do I write this data to a CSV file?
#2
I'd suggest watching these two tutorials: https://python-forum.io/Thread-Web-Scraping-part-1
and https://python-forum.io/Thread-Web-scraping-part-2
#3
(Aug-25-2017, 11:57 AM)Prince_Bhatia Wrote: Q1: When I run this code using buyer-info it prints the price too; how do I get the data from the next pages as well?
>>> url = "http://econpy.pythonanywhere.com/ex/00{}.html"
>>> for page in range(1,4):
...     print(url.format(page))
...     
http://econpy.pythonanywhere.com/ex/001.html
http://econpy.pythonanywhere.com/ex/002.html
http://econpy.pythonanywhere.com/ex/003.html
Prince_Bhatia Wrote: Q2: Why doesn't it print the price when run individually (just the buyer name)?
It prints the price for me when I run it.
Prince_Bhatia Wrote: Q3: How do I write this data to a CSV file?
You have to think about how to separate the data.
I can show an example where I use zip() on the names and prices.
Then item[0] and item[1] can be written together to a CSV file.
from bs4 import BeautifulSoup
import requests

def name_price(url):
    soup = BeautifulSoup(url, "html.parser")
    for item in zip(soup.find_all(title="buyer-name"), soup.find_all("span", {"class":"item-price"})):
        print(item[0].text, item[1].text)

if __name__ == '__main__':
    url = 'http://econpy.pythonanywhere.com/ex/001.html'
    url = requests.get(url).content
    name_price(url)
Output:
Carson Busses $29.95
Earl E. Byrd $8.37
Patty Cakes $15.26
Derri Anne Connecticut $19.2
.............
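
To actually write it to a CSV file, a minimal sketch using the csv module from the standard library (the file name buyers.csv and the Name/Price headers are just examples):
import csv
from bs4 import BeautifulSoup
import requests

def name_price(url):
    soup = BeautifulSoup(url, "html.parser")
    # Pair each buyer name with the matching price
    return zip(soup.find_all(title="buyer-name"),
               soup.find_all("span", {"class": "item-price"}))

url = requests.get('http://econpy.pythonanywhere.com/ex/001.html').content
with open('buyers.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['Name', 'Price'])  # header row
    for name, price in name_price(url):
        writer.writerow([name.text, price.text])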
Edit:
You see that global is gone and that url is passed in as an argument.
urllib is gone; I use Requests instead.
If you install lxml (pip install lxml), change this line to:
soup = BeautifulSoup(url, 'lxml')
Then you are using lxml, which is a faster parser.
#4
Hi,

Thank you for your answer. When I used this snippet it prints only the links, but what if I want the content inside those pages, which is the same as on the first page?

How can I print the data inside these pages using this code?
Quote:for page in range(1,4):
...     print(url.format(page))

Since I am new to web scraping I don't have much familiarity with lxml yet; I am going through the Python libraries one by one.
#5
(Aug-28-2017, 06:47 AM)Prince_Bhatia Wrote: Hi,

Thank you for your answer. When I used this snippet it prints only the links, but what if I want the content inside those pages, which is the same as on the first page?

How can I print the data inside these pages using this code?
Quote:for page in range(1,4):
...     print(url.format(page))
He was showing you how to loop over the page numbers to build the URL for every page. You still have to load each of those URLs with BeautifulSoup.
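
A minimal sketch of that loop, reusing the URL pattern and the name/price lookups from the posts above (everything else is the same Requests + BeautifulSoup pattern):
from bs4 import BeautifulSoup
import requests

url = "http://econpy.pythonanywhere.com/ex/00{}.html"
for page in range(1, 4):
    # Fetch and parse each page in turn
    html = requests.get(url.format(page)).content
    soup = BeautifulSoup(html, "html.parser")
    for name, price in zip(soup.find_all(title="buyer-name"),
                           soup.find_all("span", {"class": "item-price"})):
        print(name.text, price.text)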
#6
Hi,

I have written this code, but it is not working, for the same website:
from bs4 import BeautifulSoup
from urllib.request import urlopen

page_url = "http://econpy.pythonanywhere.com/ex/001.html"
new_file = "graphics_cards.csv"
f = open(new_file, "w")
Headers = "Header1, Header2\n"
f.write(Headers)


html = urlopen(page_url)
soup = BeautifulSoup(html, "html.parser")
buyer_info = soup.find_all("div", {"title":"buyer-info"})
for i in buyer_info:
    Header1 = soup.find_all("div", {"title":"buyer_name"})
    Header2 = soup.find_all("spam", {"class":"item-price"})
    print("Header1" + Header1)
    print("Header2"+ Header2)
    f.write(Header1 + Header2+"\n")
f.close()
       
But it is giving an error. Without adding any additional code, how do I make it work?
#7
Please give your error next time.

The error I get is from your attempt to concatenate a string object with a bs4.element.ResultSet (essentially a list).

If you want to inject the content for printing, use the format method like this:
    print("Header1 {}".format(Header1))
    print("Header2 {}".format(Header2))
    f.write('{} {}\n'.format(Header1, Header2))
Note: you have a typo in your search criteria, buyer_name (it should be buyer-name).
#8
(Aug-28-2017, 12:22 PM)Prince_Bhatia Wrote: But it is giving an error. Without adding any additional code, how do I make it work?
You cannot use soup.find_all() inside the loop.
You have to use i.find() on each item in buyer_info, since each buyer-info div already contains all the info.
You have to call .text before you can write anything.
Never write anything to a file before you have done a test print() of the output.
Example getting the name:
from bs4 import BeautifulSoup
from urllib.request import urlopen
 
page_url = "http://econpy.pythonanywhere.com/ex/001.html"
new_file = "graphics_cards.csv"
#f = open(new_file, "w")
Headers = "Header1, Header2\n"
#f.write(Headers) 
 
html = urlopen(page_url)
soup = BeautifulSoup(html, "html.parser")
buyer_info = soup.find_all("div", {"title":"buyer-info"})

# Using i and find() with .text
for i in buyer_info:
    print(i.find('div', {"title":"buyer-name"}).text)
Output:
Carson Busses
Earl E. Byrd
Patty Cakes
Derri Anne Connecticut
.........
#9
Thank you so much for all the help, but now I am back where I started: how do I write this to a CSV or Excel file?

I know this is painful, but since most people here have questions about how to scrape multiple pages and how to write the results to a file, one simple example could help everyone who browses this forum for help.

Can you please amend this code, using the above libraries, so it writes the data to a file and scrapes the next pages as well? You could make a real difference.
#10
Alright, I have got this far:

Quote:from bs4 import BeautifulSoup
from urllib.request import urlopen

page_url = "http://econpy.pythonanywhere.com/ex/001.html"
new_file = "Mynew.csv"
f = open(new_file, "w")
Headers = "Header1, Header2\n"
f.write(Headers)


html = urlopen(page_url)
soup = BeautifulSoup(html, "html.parser")
buyer_info = soup.find_all("div", {"title":"buyer-info"})
for i in buyer_info:
    Header1 = i.find("div", {"title":"buyer-name"})
    Header2 = i.find("span", {"class":"item-price"})
    salmon = print(Header1.get_text())
    salam = print(Header2.get_text())
    f.write("{}".format(salmon) + "{}".format(salam))
f.close()
Now it throws no error, but the file contains only Header1, Header2 and None values.
Please don't mind the indentation; it got mangled when pasting.
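
Edit: the None values come from assigning the return value of print(), which is always None. A minimal sketch of the corrected loop, assuming the rest of the script stays as quoted above:
for i in buyer_info:
    name = i.find("div", {"title": "buyer-name"}).get_text()
    price = i.find("span", {"class": "item-price"}).get_text()
    print(name, price)  # test print first, as suggested above
    f.write("{},{}\n".format(name, price))
f.close()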

