Mar-30-2019, 11:28 AM
Hello people, I just finished my first web scraping project and I'd like to get your opinion on the code: what I did right, what I did wrong, and what can be improved. (The code works as I expected.) I chose https://www.immobiliare.it/affitto-case/bologna/ to get data about housing prices in a random city.
The aim was to build a pandas dataframe with the basic info about each house: price, rooms, surface, bathrooms and floor. That info is standard for every house, so it was a good candidate.
I wrote four functions:
1) connect: it takes a webpage address, fetches it with requests.get and returns a BeautifulSoup object
2) get_pages: it connects to the main page, looks for the last page number and returns a list with the page addresses (for example, in the city of Bologna there are 675 houses across 28 pages, with 25 houses per page)
3) create_df: the scraping was tricky because of the structure of the HTML and the fact that many houses have missing data. The best I could come up with was getting a list for each house, something like [€ 700, 2, locali, 40, m, 2, 1, bagni, 4, piano] (€700, 2 rooms, 40 square meters, 1 bathroom, 4th floor), and writing an if-else chain to retrieve the data I needed (see the short sketch after this list). It takes one house at a time, builds the columns and returns the dataframe
4) collect: it goes through every page, creating the page's dataframe and appending it on each iteration.
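To make the if-else idea in create_df concrete, here is a minimal standalone sketch of the lookup run on a hard-coded token list (the list is just the example above, not live data, and value_before is an illustrative helper, not part of the actual script):

# Illustration only: the "value before the label" lookup from create_df,
# run on a hard-coded token list instead of scraped data.
tokens = ["€ 700", "2", "locali", "40", "m", "2", "1", "bagni", "4", "piano"]

def value_before(tokens, label):
    # Return the token right before `label`, or None when the label is missing
    if label in tokens:
        return tokens[tokens.index(label) - 1]
    return None

print(value_before(tokens, "locali"))    # 2  (rooms)
print(value_before(tokens, "m"))         # 40 (surface)
print(value_before(tokens, "bagni"))     # 1  (bathrooms)
print(value_before(tokens, "piano"))     # 4  (floor)
print(value_before(tokens, "giardino"))  # None (field missing from the listing)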
I'm quite happy with the result: I wanted it to be as reusable as possible within the website, and it works fine with other cities, non-residential properties and missing data.
Thank you all in advance.
The code:
import requests
from bs4 import BeautifulSoup
import pandas as pd

website = "https://www.immobiliare.it/affitto-case/bologna"


def connect(web_addr):
    # Fetch a page and parse it into a BeautifulSoup object
    resp = requests.get(web_addr)
    return BeautifulSoup(resp.content, "html.parser")


def get_pages(main):
    # Read the last page number from the pagination labels and
    # build the full list of page addresses
    soup = connect(main)
    labels = soup.find_all("span", class_="pagination__label")  # renamed from `max`, which shadows the builtin
    last_page = int(labels[-1].contents[0])
    pages = [main]
    for n in range(2, last_page + 1):  # + 1 so the last page is not skipped
        page_num = "/?pag={}".format(n)
        pages.append(main + page_num)
    return pages


def create_df(offers):
    # Flatten each offer's text into a token list such as
    # ["€ 700", "2", "locali", "40", "m", "2", "1", "bagni", "4", "piano"]
    # and take the value preceding each known label, or None if it is absent
    price = []
    rooms = []
    surface = []
    bathrooms = []
    floor = []
    for offer in offers:
        l = list(offer.stripped_strings)
        if "€" in l[0]:
            stripped = l[0].replace("€ ", "").replace(".", "")
            price.append(stripped)
        else:
            price.append(None)
        if "locali" in l:
            r = l.index("locali") - 1
            rooms.append(l[r])
        else:
            rooms.append(None)
        if "m" in l:
            s = l.index("m") - 1
            surface.append(l[s])
        else:
            surface.append(None)
        if "bagni" in l:
            b = l.index("bagni") - 1
            bathrooms.append(l[b])
        else:
            bathrooms.append(None)
        if "piano" in l:
            fl = l.index("piano") - 1
            floor.append(l[fl])
        else:
            floor.append(None)
    return pd.DataFrame.from_dict({"Price": price, "Rooms": rooms,
                                   "Surface": surface, "Bathrooms": bathrooms,
                                   "Floor": floor})


def collect():
    # Scrape every page and concatenate the per-page dataframes
    # (pd.concat instead of DataFrame.append, which was removed in pandas 2.0)
    pages = get_pages(website)
    frames = []
    for page in pages:
        soup = connect(page)
        offers = soup.find_all("ul", class_="listing-features list-piped")
        frames.append(create_df(offers))
    return pd.concat(frames, ignore_index=True)


data = collect()
print(data)
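One caveat: all the columns come back as strings, so they need converting before any numeric analysis. A minimal follow-up sketch using pd.to_numeric on the dataframe produced above (this is a suggestion on top of the script, not part of it):

# Follow-up to the script above: convert the string columns to numbers.
for col in ["Price", "Rooms", "Surface", "Bathrooms", "Floor"]:
    # errors="coerce" turns values that aren't numbers (e.g. a ground
    # floor written as a letter) into NaN instead of raising
    data[col] = pd.to_numeric(data[col], errors="coerce")
print(data.dtypes)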