Python Forum
First web scraping project using bs4, review my code
#1
Hello people, I just finished my first web scraping project and I'd like opinions on the code: what I did right, what I did wrong, and what can be improved. (The code works as I expected.) I chose https://www.immobiliare.it/affitto-case/bologna/ to get data about housing prices in a random city.

The aim was to build a pandas dataframe with the basic info about each house: price, rooms, surface, bathrooms and floor. These fields are standard for every listing, so they were good candidates.

I wrote four functions:

1) connect: it takes a web address, calls requests.get and returns a BeautifulSoup object

2) get_pages: it connects to the main page, looks for the last page number and returns a list of the page addresses (for example, in the city of Bologna there are 675 houses, 28 pages, each page with 25 houses)

3) create_df: the scraping was tricky because of the structure of the html and the fact that many houses have missing data. The best I could come up with was getting a list for every house, something like [€ 700, 2, locali, 40, m, 2, 1, bagni, 4, piano] (€700, 2 rooms, 40 square meters, 1 bathroom, 4th floor), and writing an if-else chain to retrieve the data I needed. It processes one house at a time, fills the columns of the dataframe and returns the dataframe

4) collect: it goes through every page, creating that page's dataframe and appending it to the result on each iteration.
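
As a minimal sketch of the label-based lookup described in point 3, assuming each offer has already been reduced to a list of strings like the example above. The names LABELS, value_before and parse_offer are hypothetical and are not part of the posted code; this is one possible way to express the if-else chain as data, not the only one.

```python
# Column name -> label string as it appears in the example token list.
LABELS = {"Rooms": "locali", "Surface": "m", "Bathrooms": "bagni", "Floor": "piano"}


def value_before(tokens, label):
    """Return the token just before `label`, or None if the label is missing."""
    if label in tokens:
        idx = tokens.index(label)
        if idx > 0:
            return tokens[idx - 1]
    return None


def parse_offer(tokens):
    """Turn one offer's token list into a dict with one entry per column."""
    row = {"Price": None}
    # The price, when present, is assumed to be the first token, e.g. "€ 700".
    if tokens and "€" in tokens[0]:
        row["Price"] = tokens[0].replace("€", "").replace(".", "").strip()
    for column, label in LABELS.items():
        row[column] = value_before(tokens, label)
    return row


example = ["€ 700", "2", "locali", "40", "m", "2", "1", "bagni", "4", "piano"]
print(parse_offer(example))
```

A list of such dicts can be passed to pd.DataFrame in one go, and missing fields simply stay None.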

I'm quite happy with the result. I wanted it to be as reusable as possible within the website, and it works fine with other cities, non-residential properties and listings with missing data.

Thank you all in advance.

The code:

   import requests
   from bs4 import BeautifulSoup
   import pandas as pd
   
   website = "https://www.immobiliare.it/affitto-case/bologna"
   
   def connect(web_addr):
   	resp = requests.get(web_addr)
   	return BeautifulSoup(resp.content, "html.parser")
   	
   def get_pages(main):
   	soup = connect(main)
   	labels = soup.find_all("span", class_="pagination__label")  # renamed: "max" shadows the built-in
   	last_page = int(labels[-1].contents[0])
   	pages = [main]
   	
   	for n in range(2, last_page + 1):  # + 1 so the last page is included
   		page_num = "/?pag={}".format(n)
   		pages.append(main + page_num)
   		
   	return pages
   
   def create_df(offers):
   	price = []
   	rooms = []
   	surface = []
   	bathrooms = []
   	floor = []
   	
   	for offer in offers:
   		l = list(offer.stripped_strings)
   		
   		if "€" in l[0]:
   			stripped = l[0].replace("€ ", "").replace(".","")
   			price.append(stripped)
   		else:
   			price.append(None)
   			
   		if "locali" in l:
   			r = l.index("locali")-1
   			rooms.append(l[r])
   		else:
   			rooms.append(None)
   			
   		if "m" in l:
   			s = l.index("m")-1
   			surface.append(l[s])
   		else:
   			surface.append(None)
   			
   		if "bagni" in l:
   			b = l.index("bagni")-1
   			bathrooms.append(l[b])
   		else:
   			bathrooms.append(None)
   			
   		if "piano" in l:
   			fl = l.index("piano")-1
   			floor.append(l[fl])
   		else:
   			floor.append(None)
   			
   	return pd.DataFrame.from_dict({"Price": price, "Rooms": rooms, "Surface": surface, "Bathrooms": bathrooms, "Floor": floor})
   	
   def collect():
   	pages = get_pages(website)
   	df = pd.DataFrame(columns=["Price", "Rooms", "Surface", "Bathrooms", "Floor"])
   	
   	for page in pages:
   		soup = connect(page)
   		offers = soup.find_all("ul", class_="listing-features list-piped")
   		data = create_df(offers)
   		df = pd.concat([df, data], ignore_index=True)  # DataFrame.append was removed in pandas 2.0
   				
   		
   	return df	
   		
   data = collect()
   
   print(data)
#2
You should add error checking for the request:
   def connect(web_addr):
       resp = requests.get(web_addr)
       # Comparison needs ==, not =
       if resp.status_code == 200:
           return BeautifulSoup(resp.content, "html.parser")
       else:
           print(f'Encountered status_code: {resp.status_code} in attempt to connect to: {web_addr}')
           return None
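Then, in get_pages, add a check right after what is line 12 in your listing (soup = connect(main)):
    if soup:
and indent the rest of the function body under it, so the page is only parsed when the request succeeded.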
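
A minimal sketch of what that guard could look like, assuming connect returns None on failure as in the snippet above. Here connect is passed in as a parameter and build_page_urls is a hypothetical helper; both are only there so the sketch can be tried without network access. The range runs to last_page + 1 so that the final page is included.

```python
def build_page_urls(main, last_page):
    """Return the listing URL for every page from 1 to last_page inclusive."""
    pages = [main]
    for n in range(2, last_page + 1):
        pages.append("{}/?pag={}".format(main, n))
    return pages


def get_pages(main, connect):
    """Return all page URLs, or an empty list if the main page could not be fetched."""
    soup = connect(main)
    if soup is None:
        # The request failed, so there is nothing to parse.
        return []
    labels = soup.find_all("span", class_="pagination__label")
    if not labels:
        # No pagination labels found: assume a single page of results.
        return [main]
    last_page = int(labels[-1].contents[0])
    return build_page_urls(main, last_page)


# Simulate a failed request without touching the network.
print(get_pages("https://www.immobiliare.it/affitto-case/bologna", lambda url: None))
print(build_page_urls("https://www.immobiliare.it/affitto-case/bologna", 3))
```

collect would then loop over an empty list and return an empty dataframe instead of crashing on a None soup.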
#3
Thanks!