Mar-30-2019, 11:28 AM
Hello people, I just finished my first web scraping project and I'd like to get your opinion on the code: what I did right, what I did wrong, and what can be improved. (The code works as I expected.) I chose https://www.immobiliare.it/affitto-case/bologna/ to get data about housing prices in a random city.
The aim was to build a pandas dataframe with the basic info about each house: price, rooms, surface, bathrooms and floor. That info is standard for every house, so it was a good candidate.
I wrote four functions:
1) connect: it takes a webpage address, fetches it with requests.get and returns a BeautifulSoup object
2) get_pages: it connects to the main page, looks for the last page number and returns a list with the page addresses (for example, in the city of Bologna there are 675 houses across 28 pages, with 25 houses per page)
3) create_df: the scraping was tricky because of the structure of the HTML and the fact that many houses have missing data. The best I could come up with was getting a list for each house, something like [€ 700, 2, locali, 40, m, 2, 1, bagni, 4, piano] (€700, 2 rooms, 40 square meters, 1 bathroom, 4th floor), and writing an if-else chain to retrieve the data I needed (see the short sketch after this list). It takes one house at a time, builds the columns and returns the dataframe
4) collect: it goes through every page, creating the page's dataframe and appending it on each iteration.
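To make the if-else idea in create_df concrete, here is a minimal standalone sketch of the lookup run on a hard-coded token list (the list is just the example above, not live data, and value_before is an illustrative helper, not part of the actual script):

# Illustration only: the "value before the label" lookup from create_df,
# run on a hard-coded token list instead of scraped data.
tokens = ["€ 700", "2", "locali", "40", "m", "2", "1", "bagni", "4", "piano"]

def value_before(tokens, label):
    # Return the token right before `label`, or None when the label is missing
    if label in tokens:
        return tokens[tokens.index(label) - 1]
    return None

print(value_before(tokens, "locali"))    # 2  (rooms)
print(value_before(tokens, "m"))         # 40 (surface)
print(value_before(tokens, "bagni"))     # 1  (bathrooms)
print(value_before(tokens, "piano"))     # 4  (floor)
print(value_before(tokens, "giardino"))  # None (field missing from the listing)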
I'm quite happy with the result: I wanted it to be as reusable as possible within the website, and it works fine with other cities, non-residential properties and missing data.
Thank you all in advance.
The code:
import requests
from bs4 import BeautifulSoup
import pandas as pd

website = "https://www.immobiliare.it/affitto-case/bologna"


def connect(web_addr):
    # Fetch a page and parse it into a BeautifulSoup object
    resp = requests.get(web_addr)
    return BeautifulSoup(resp.content, "html.parser")


def get_pages(main):
    # Read the last page number from the pagination labels and
    # build the full list of page addresses
    soup = connect(main)
    labels = soup.find_all("span", class_="pagination__label")  # renamed from `max`, which shadows the builtin
    last_page = int(labels[-1].contents[0])
    pages = [main]
    for n in range(2, last_page + 1):  # + 1 so the last page is not skipped
        page_num = "/?pag={}".format(n)
        pages.append(main + page_num)
    return pages


def create_df(offers):
    # Flatten each offer's text into a token list such as
    # ["€ 700", "2", "locali", "40", "m", "2", "1", "bagni", "4", "piano"]
    # and take the value preceding each known label, or None if it is absent
    price = []
    rooms = []
    surface = []
    bathrooms = []
    floor = []
    for offer in offers:
        l = list(offer.stripped_strings)
        if "€" in l[0]:
            stripped = l[0].replace("€ ", "").replace(".", "")
            price.append(stripped)
        else:
            price.append(None)
        if "locali" in l:
            r = l.index("locali") - 1
            rooms.append(l[r])
        else:
            rooms.append(None)
        if "m" in l:
            s = l.index("m") - 1
            surface.append(l[s])
        else:
            surface.append(None)
        if "bagni" in l:
            b = l.index("bagni") - 1
            bathrooms.append(l[b])
        else:
            bathrooms.append(None)
        if "piano" in l:
            fl = l.index("piano") - 1
            floor.append(l[fl])
        else:
            floor.append(None)
    return pd.DataFrame.from_dict({"Price": price, "Rooms": rooms,
                                   "Surface": surface, "Bathrooms": bathrooms,
                                   "Floor": floor})


def collect():
    # Scrape every page and concatenate the per-page dataframes
    # (pd.concat instead of DataFrame.append, which was removed in pandas 2.0)
    pages = get_pages(website)
    frames = []
    for page in pages:
        soup = connect(page)
        offers = soup.find_all("ul", class_="listing-features list-piped")
        frames.append(create_df(offers))
    return pd.concat(frames, ignore_index=True)


data = collect()
print(data)
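One caveat: all the columns come back as strings, so they need converting before any numeric analysis. A minimal follow-up sketch using pd.to_numeric on the dataframe produced above (this is a suggestion on top of the script, not part of it):

# Follow-up to the script above: convert the string columns to numbers.
for col in ["Price", "Rooms", "Surface", "Bathrooms", "Floor"]:
    # errors="coerce" turns values that aren't numbers (e.g. a ground
    # floor written as a letter) into NaN instead of raising
    data[col] = pd.to_numeric(data[col], errors="coerce")
print(data.dtypes)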