Posts: 77
Threads: 35
Joined: Aug 2019
Aug-27-2019, 02:53 PM
(This post was last modified: Aug-27-2019, 02:53 PM by zarize.)
hi guys,
I would love some help getting rid of [''] while I am scraping data. Ideally I want to get organized data, and I am working on it :P
![[Image: 86l7Dss.png]](https://i.imgur.com/86l7Dss.png)
I am working on my first scraping project as I want to learn it :)
Posts: 8,165
Threads: 160
Joined: Sep 2016
Show your code. You get your results as one-element lists...
Posts: 77
Threads: 35
Joined: Aug 2019
Aug-27-2019, 03:03 PM
(This post was last modified: Aug-27-2019, 03:18 PM by buran.)
(Aug-27-2019, 03:00 PM)buran Wrote: Show your code. You get your results as one-element lists...
from bs4 import BeautifulSoup
import requests
import pandas as pd
import re
headers = {
    'Sec-Fetch-Mode': 'cors',
    'Referer': 'https://www.seloger.com/list.htm?types=1%2C2&projects=1&enterprise=0&furnished=1&places=%5B%7Bcp%3A75%7D%5D&qsVersion=1.0',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.100 Safari/537.36',
}
params = (
    ('lp', '1'),
    ('p', 'ZXZlbnRUeXBlPW51bWJlckNsaWNrcyZudW1iZXI9MCZ0aW1lPTE1NjY4OTIwNzMzODEmc2l0ZUNvZGU9bXd5cHJsZWViMiZ2aXNpdG9yQ29kZT0xbWFzeXhycDVsdWcxc3g0JnZpc2l0TnVtYmVyPTImc3RhcnRPZlZpc2l0PWZhbHNlJnNjcmlwdFZlcnNpb249MjAxOTAxMTUmbm9uY2U9QTUxMzBDQkQyNjM5Nzk4Ng=='),
)
r = requests.get('https://www.seloger.com/list.htm?types=1%2C2&projects=1&enterprise=0&furnished=1&places=%5B%7Bcp%3A75%7D%5D&qsVersion=1.0', headers=headers, params=params)
content = (r.text)
soup = BeautifulSoup(content, 'html.parser')
for col in soup.find_all(class_='c-pa-list c-pa-sl c-pa-gold cartouche '):
    for line in col.get_text().split('\n'):
        step1 = str(line.strip().replace('Site web', "").replace('Exclusivité', "").replace('Voir toutes les photos', "").replace('20', "").replace('', "").split('\n'))
        step2 = step1.replace('\\xa0', " ")
        print(step2)
    break
These are my first steps, but ideally I would like to get the data in columns, so I would appreciate any advice :)
Posts: 5,151
Threads: 396
Joined: Sep 2016
Aug-27-2019, 03:29 PM
(This post was last modified: Aug-27-2019, 03:33 PM by metulburr.)
If you change it to this:
for col in soup.find_all(class_='c-pa-list c-pa-sl c-pa-gold cartouche '):
    for line in col.get_text().split('\n'):
        stripped = line.strip()
        if stripped:
            if stripped not in ['Site web', 'Exclusivité', 'Voir toutes les photos', '20']:
                print(stripped)
    break
you will get this:
Appartement
4 p
3 ch
85 m²
3 800 €
CC
Paris 16ème
However, grabbing visible text is not the ideal way to parse HTML.
Posts: 5,151
Threads: 396
Joined: Sep 2016
A better way to grab the text is to search for the actual elements instead of grabbing the visible text. For example, if you change from this:
for col in soup.find_all(class_='c-pa-list c-pa-sl c-pa-gold cartouche '):
    for line in col.get_text().split('\n'):
        stripped = line.strip()
        if stripped:
            if stripped not in ['Site web', 'Exclusivité', 'Voir toutes les photos', '20']:
                print(stripped)
    break
to this:
for section in soup.find_all(class_='c-pa-info'):
    print(section.find('div', {'class':'c-pa-criterion'}).text.strip())
    print(section.find('span', {'class':'c-pa-cprice'}).text.strip())
    print(section.find('div', {'class':'c-pa-city'}).text.strip())
    print('---')
you will get this:
Output:
4 p
3 ch
85 m²
3 800 €
Paris 16ème
---
2 p
1 ch
33 m²
1 450 €
Paris 5ème
---
2 p
1 ch
30 m²
1 290 €
Paris 4ème
---
5 p
3 ch
195 m²
15 000 €
Paris 8ème
---
4 p
2 ch
72 m²
3 500 €
Paris 16ème
---
6 p
4 ch
150 m²
5 000 €
Paris 16ème
---
3 p
2 ch
105 m²
3 400 €
Paris 7ème
---
1 p
33 m²
1 asc
1 820 €
Paris 8ème
---
5 p
3 ch
122 m²
4 700 €
Paris 17ème
---
3 p
2 ch
106 m²
3 570 €
Paris 1er
---
4 p
2 ch
156 m²
5 000 €
Paris 16ème
---
4 p
3 ch
145 m²
7 000 €
Paris 7ème
---
3 p
2 ch
51 m²
2 800 €
Paris 1er
---
5 p
3 ch
224 m²
9 000 €
Paris 16ème
---
6 p
3 ch
165 m²
6 700 €
Paris 16ème
---
5 p
3 ch
133 m²
4 200 €
Paris 16ème
---
3 p
2 ch
67 m²
2 990 €
Paris 6ème
---
3 p
2 ch
79 m²
2 737 €
Paris 16ème
---
3 p
2 ch
92 m²
5 200 €
Paris 6ème
---
1 p
41 m²
1 asc
1 500 €
Paris 9ème
---
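One thing to watch with the section.find(...).text pattern: find returns None when a listing lacks that element, and .text on None raises AttributeError. Here is a minimal sketch of a guard. The text_or_none helper and the inline HTML are made up for illustration and only borrow the class names used above; the real page markup may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the second listing has no price element.
html = """
<div class="c-pa-info"><span class="c-pa-cprice"> 3 800 € </span><div class="c-pa-city">Paris 16ème</div></div>
<div class="c-pa-info"><div class="c-pa-city">Paris 5ème</div></div>
"""

def text_or_none(parent, name, css_class):
    """Return the stripped text of the first match, or None if nothing matches."""
    element = parent.find(name, {'class': css_class})
    return element.text.strip() if element is not None else None

soup = BeautifulSoup(html, 'html.parser')
results = [
    (text_or_none(s, 'span', 'c-pa-cprice'), text_or_none(s, 'div', 'c-pa-city'))
    for s in soup.find_all(class_='c-pa-info')
]
print(results)
```

With a guard like this, a listing that lacks a field yields None for it instead of stopping the whole loop.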
Posts: 8,165
Threads: 160
Joined: Sep 2016
First, why you get what you get:
On the step1 line, line.strip().replace('Site web', "").replace('Exclusivité', "").replace('Voir toutes les photos', "").replace('20', "").replace('', "").split('\n') produces a list with one element of type str. You then convert that list to str and try to replace '\\xa0' on the step2 line.
However, this approach is wrong. You need to use bs4 to parse the HTML source. Also note that hard-coded values (e.g. '20', the number of photos) will not work.
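To see where the [''] comes from, here is a minimal sketch that mimics the chain above. The sample strings are made up and only stand in for lines of col.get_text().

```python
# Hypothetical sample line; it stands in for one line of col.get_text().
line = "  Site web  "

cleaned = line.strip().replace('Site web', "")   # empty string
parts = cleaned.split('\n')                      # one-element list holding the empty string
as_text = str(parts)                             # the list's repr, brackets and quotes included

print(as_text)

# str() of a list uses repr() for each element, so a non-breaking space
# shows up as the literal characters \xa0 in the resulting text.
price_text = str(["3\xa0800 €".strip()])
print(price_text)
```

This is also why replacing the literal '\\xa0' appeared to work: by that point the text being edited was the printed form of a list, not the scraped string itself.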
Replace the whole for loop with
for pa in soup.find_all('div', {'class':'c-pa-list c-pa-sl c-pa-gold cartouche'}):
    pa_info = pa.find('div', {'class':'c-pa-info'})
    pa_type = pa_info.find('a', {'class':'c-pa-link'}).text.strip()
    pa_criterion = pa.find('div', {'class':'c-pa-criterion'})
    pa_p, pa_ch, pa_sq = [em.text for em in pa_criterion.find_all('em')]
    print(f'property: {pa_type}, people: {pa_p}, ch: {pa_ch}, sq.m: {pa_sq}')
and what you will get is
Output:
property: Appartement, people: 4 p, ch: 3 ch, sq.m: 85 m²
property: Appartement, people: 2 p, ch: 1 ch, sq.m: 33 m²
property: Appartement, people: 2 p, ch: 1 ch, sq.m: 30 m²
property: Appartement, people: 4 p, ch: 2 ch, sq.m: 72 m²
property: Appartement, people: 6 p, ch: 4 ch, sq.m: 150 m²
property: Appartement, people: 1 p, ch: 33 m², sq.m: 1 asc
property: Appartement, people: 5 p, ch: 3 ch, sq.m: 122 m²
property: Appartement, people: 3 p, ch: 2 ch, sq.m: 106 m²
property: Appartement, people: 4 p, ch: 3 ch, sq.m: 145 m²
property: Appartement, people: 3 p, ch: 2 ch, sq.m: 51 m²
property: Appartement, people: 3 p, ch: 2 ch, sq.m: 67 m²
property: Appartement, people: 3 p, ch: 2 ch, sq.m: 79 m²
property: Appartement, people: 3 p, ch: 2 ch, sq.m: 92 m²
property: Appartement, people: 1 p, ch: 41 m², sq.m: 1 asc
Note that there are 2 apartments with 1 p that have slightly different output. You will need to process the results more carefully to make sure the output is consistent.
@metulburr was faster than me.
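One way to make the output consistent is to key each criterion by its unit instead of relying on its position. This is a minimal sketch, assuming the criteria arrive as strings like the ones in the output above. The parse_criteria helper and the column names are hypothetical, and reading p as rooms, ch as bedrooms and asc as lift is my interpretation of the French abbreviations.

```python
# Hypothetical mapping from unit suffix to column name.
UNIT_TO_COLUMN = {
    'p': 'rooms',       # assumed meaning: pièces
    'ch': 'bedrooms',   # assumed meaning: chambres
    'm²': 'area_m2',
    'asc': 'lift',      # assumed meaning: ascenseur
}

def parse_criteria(criteria):
    """Return a dict with one key per known column; missing ones stay None."""
    row = {column: None for column in UNIT_TO_COLUMN.values()}
    for item in criteria:
        parts = item.split()          # also splits on non-breaking spaces
        if len(parts) < 2:
            continue                  # nothing that looks like "value unit"
        *value_parts, unit = parts
        column = UNIT_TO_COLUMN.get(unit)
        if column is not None:
            row[column] = ' '.join(value_parts)
    return row

print(parse_criteria(['4 p', '3 ch', '85 m²']))
print(parse_criteria(['1 p', '33 m²', '1 asc']))
```

In the loop above, the list passed in would be [em.text for em in pa_criterion.find_all('em')], so a listing without ch no longer shifts the area into the wrong field.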
Posts: 77
Threads: 35
Joined: Aug 2019
It looks really awesome guys, thank you for your support :) I'll try to make a dataframe from this to get it into columns. If I can't figure it out, I'll ask you for help :)
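For the columns step, a list of dicts (one per listing) is a convenient intermediate form. Here is a minimal sketch using only the standard library. The rows are copied from the output earlier in the thread, and the column names are my own choice, not something the site provides.

```python
import csv
import io

columns = ['rooms', 'bedrooms', 'area', 'price', 'city']

# One dict per listing; the third listing has no 'ch' value.
rows = [
    {'rooms': '4 p', 'bedrooms': '3 ch', 'area': '85 m²', 'price': '3 800 €', 'city': 'Paris 16ème'},
    {'rooms': '2 p', 'bedrooms': '1 ch', 'area': '33 m²', 'price': '1 450 €', 'city': 'Paris 5ème'},
    {'rooms': '1 p', 'area': '33 m²', 'price': '1 820 €', 'city': 'Paris 8ème'},
]

buffer = io.StringIO()
# restval fills in the columns a row does not have.
writer = csv.DictWriter(buffer, fieldnames=columns, restval='')
writer.writeheader()
writer.writerows(rows)

csv_text = buffer.getvalue()
print(csv_text)
```

The same rows list can also be passed to pd.DataFrame(rows) if you prefer pandas; keys missing from a row become NaN in that column.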
Posts: 77
Threads: 35
Joined: Aug 2019
Aug-28-2019, 12:24 PM
(This post was last modified: Aug-28-2019, 12:25 PM by zarize.)
First, I cannot edit my post (probably too old? or I am missing it, if so then sorry).
Second, sorry for my newbie questions, but I am new and I want to learn Python :P
What if I would like to make variables?
Let's say I want to write:
price = section.find('span', {'class':'c-pa-cprice'}).text.strip()
print(price)
and it doesn't work... it returns something about a tab,
so I tried to add ":" at the end of the variable line, but that did not work either.
My full attempt is below:
for section in soup.find_all(class_='c-pa-info'):
    sbathrooms = section.find('div', {'class':'c-pa-criterion'}).text.strip()
    sprice = section.find('span', {'class':'c-pa-cprice'}).text.strip()
    cc = section.find('span', {'class':'c-pa-sprice'}).text.strip()
    sneighborhood = section.find('div', {'class':'c-pa-city'}).text.strip()
    #print('---')
    print(sbathrooms)
Posts: 8,165
Threads: 160
Joined: Sep 2016
(Aug-28-2019, 12:24 PM)zarize Wrote: and it doesn't work... it returns something about tab
What does it return? Probably an error saying that you mix tabs with spaces?
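For reference, this is the kind of error Python 3 raises when one line of a block is indented with a tab and the next with spaces. The sketch below compiles a made-up snippet (not the code from the post) to trigger it; indentation_error_name is a hypothetical helper.

```python
# Hypothetical snippet: the first body line uses a tab, the second uses spaces.
source = "for i in range(2):\n\tx = i\n        print(x)\n"

def indentation_error_name(code):
    """Compile code and return the name of the indentation error, or None."""
    try:
        compile(code, '<example>', 'exec')
    except IndentationError as exc:   # TabError is a subclass of IndentationError
        return type(exc).__name__
    return None

print(indentation_error_name(source))

# The same body indented with spaces only compiles without complaint.
print(indentation_error_name(source.replace('\t', '        ')))
```

If that is the error here, re-indenting the loop body with spaces only (most editors have a "convert tabs to spaces" option) should make it go away; adding ":" after an assignment will not help.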