getting rid of [''] while print

zarize · (This post was last modified: Aug-27-2019, 02:53 PM by zarize.)

hi guys,

I would love some help getting rid of the [''] when I print my scraped data. Ideally I want organized data, and I am working on it :P
[Image: 86l7Dss.png]

I am working on my first scraping project because I want to learn it :)

**buran** · Aug-27-2019, 03:00 PM

Show your code. You get your results as one-element lists...

zarize · (This post was last modified: Aug-27-2019, 03:18 PM by buran.)

(Aug-27-2019, 03:00 PM)buran Wrote: Show your code. You get your results as one-element lists...

from bs4 import BeautifulSoup
import requests
import pandas as pd
import re

headers = {
    'Sec-Fetch-Mode': 'cors',
    'Referer': 'https://www.seloger.com/list.htm?types=1%2C2&projects=1&enterprise=0&furnished=1&places=%5B%7Bcp%3A75%7D%5D&qsVersion=1.0',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.100 Safari/537.36',
}

params = (
    ('lp', '1'),
    ('p', 'ZXZlbnRUeXBlPW51bWJlckNsaWNrcyZudW1iZXI9MCZ0aW1lPTE1NjY4OTIwNzMzODEmc2l0ZUNvZGU9bXd5cHJsZWViMiZ2aXNpdG9yQ29kZT0xbWFzeXhycDVsdWcxc3g0JnZpc2l0TnVtYmVyPTImc3RhcnRPZlZpc2l0PWZhbHNlJnNjcmlwdFZlcnNpb249MjAxOTAxMTUmbm9uY2U9QTUxMzBDQkQyNjM5Nzk4Ng=='),
)

r = requests.get('https://www.seloger.com/list.htm?types=1%2C2&projects=1&enterprise=0&furnished=1&places=%5B%7Bcp%3A75%7D%5D&qsVersion=1.0', headers=headers, params=params)
content = (r.text)
soup = BeautifulSoup(content, 'html.parser')


for col in soup.find_all(class_='c-pa-list c-pa-sl c-pa-gold cartouche '):
	for line in col.get_text().split('\n'):
		step1 = str(line.strip().replace('Site web', "").replace('Exclusivité', "").replace('Voir toutes les photos', "").replace('20', "").replace('', "").split('\n'))
		step2 = step1.replace('\\xa0', " ")
		print(step2)
	break

These are my first steps, but ideally I would like to get the data into columns, so I would appreciate any advice :)

***metulburr*** · (This post was last modified: Aug-27-2019, 03:33 PM by metulburr.)

If you change it to this

for col in soup.find_all(class_='c-pa-list c-pa-sl c-pa-gold cartouche '):
    for line in col.get_text().split('\n'):
        stripped = line.strip()
        if stripped:
            if stripped not in ['Site web', 'Exclusivité', 'Voir toutes les photos', '20']:
                print(stripped)
    break

you will get this

Appartement
4 p
3 ch
85 m²
3 800 €
CC
Paris 16ème

However grabbing visible text is not the ideal way to parse HTML.

zarize · Aug-27-2019, 03:30 PM

woah, thanks!! :)

***metulburr*** · Aug-27-2019, 03:42 PM

A better way to get the text is to search for the specific elements instead of grabbing all the visible text. For example,

if you change from this:

for col in soup.find_all(class_='c-pa-list c-pa-sl c-pa-gold cartouche '):
    for line in col.get_text().split('\n'):
        stripped = line.strip()
        if stripped:
            if stripped not in ['Site web', 'Exclusivité', 'Voir toutes les photos', '20']:
                print(stripped)
    break

to this:

for section in soup.find_all(class_='c-pa-info'):
    print(section.find('div', {'class':'c-pa-criterion'}).text.strip())
    print(section.find('span', {'class':'c-pa-cprice'}).text.strip())
    print(section.find('div', {'class':'c-pa-city'}).text.strip())
    print('---')

You will get this

Output:
4 p
3 ch
85 m²
3 800 €
Paris 16ème
---
2 p
1 ch
33 m²
1 450 €
Paris 5ème
---
2 p
1 ch
30 m²
1 290 €
Paris 4ème
---
5 p
3 ch
195 m²
15 000 €
Paris 8ème
---
4 p
2 ch
72 m²
3 500 €
Paris 16ème
---
6 p
4 ch
150 m²
5 000 €
Paris 16ème
---
3 p
2 ch
105 m²
3 400 €
Paris 7ème
---
1 p
33 m²
1 asc
1 820 €
Paris 8ème
---
5 p
3 ch
122 m²
4 700 €
Paris 17ème
---
3 p
2 ch
106 m²
3 570 €
Paris 1er
---
4 p
2 ch
156 m²
5 000 €
Paris 16ème
---
4 p
3 ch
145 m²
7 000 €
Paris 7ème
---
3 p
2 ch
51 m²
2 800 €
Paris 1er
---
5 p
3 ch
224 m²
9 000 €
Paris 16ème
---
6 p
3 ch
165 m²
6 700 €
Paris 16ème
---
5 p
3 ch
133 m²
4 200 €
Paris 16ème
---
3 p
2 ch
67 m²
2 990 €
Paris 6ème
---
3 p
2 ch
79 m²
2 737 €
Paris 16ème
---
3 p
2 ch
92 m²
5 200 €
Paris 6ème
---
1 p
41 m²
1 asc
1 500 €
Paris 9ème
---

**buran** · Aug-27-2019, 03:56 PM

First, why you get what you get:
on line 24 of your code,

line.strip().replace('Site web', "").replace('Exclusivité', "").replace('Voir toutes les photos', "").replace('20', "").replace('', "").split('\n')

will produce a list with one element of type str. You then convert that list to a str and try to replace '\\xa0' in it on line 25.
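A minimal sketch of that behaviour, using a made-up sample line (the text and the non-breaking space in it are assumptions for illustration, not copied from the page). `split('\n')` returns a list even when there is nothing to split, and `str()` of a list gives its repr, brackets and quotes included:

```python
# Sample line; the non-breaking space (\xa0) is assumed for illustration.
line = "  3\xa0800 €  "

as_list = line.strip().split("\n")   # split always returns a list
as_text = str(as_list)               # repr of the list, brackets included

print(as_list)
print(as_text)

# Working on the string itself avoids both the brackets and the escaped \xa0.
clean = line.strip().replace("\xa0", " ")
print(clean)
```

Because `as_text` is the repr of the list, the non-breaking space shows up in it as the four characters `\xa0`, which is why the original code had to replace `'\\xa0'` rather than `'\xa0'`.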

However, this approach is wrong. You need to use bs4 to parse the HTML source. Also note that filtering out hard-coded values (e.g. 29, the number of photos) will not work, because such values change from listing to listing.

Replace lines 22-27 with

for pa in soup.find_all('div', {'class':'c-pa-list c-pa-sl c-pa-gold cartouche'}):
    pa_info = pa.find('div', {'class':'c-pa-info'})
    pa_type = pa_info.find('a', {'class':'c-pa-link'}).text.strip()
    pa_criterion = pa.find('div', {'class':'c-pa-criterion'})
    pa_p, pa_ch, pa_sq = [em.text for em in pa_criterion.find_all('em')] 
    print(f'property: {pa_type}, people: {pa_p}, ch: {pa_ch}, sq.m: {pa_sq}')

and what you will get is

Output:
property: Appartement, people: 4 p, ch: 3 ch, sq.m: 85 m²
property: Appartement, people: 2 p, ch: 1 ch, sq.m: 33 m²
property: Appartement, people: 2 p, ch: 1 ch, sq.m: 30 m²
property: Appartement, people: 4 p, ch: 2 ch, sq.m: 72 m²
property: Appartement, people: 6 p, ch: 4 ch, sq.m: 150 m²
property: Appartement, people: 1 p, ch: 33 m², sq.m: 1 asc
property: Appartement, people: 5 p, ch: 3 ch, sq.m: 122 m²
property: Appartement, people: 3 p, ch: 2 ch, sq.m: 106 m²
property: Appartement, people: 4 p, ch: 3 ch, sq.m: 145 m²
property: Appartement, people: 3 p, ch: 2 ch, sq.m: 51 m²
property: Appartement, people: 3 p, ch: 2 ch, sq.m: 67 m²
property: Appartement, people: 3 p, ch: 2 ch, sq.m: 79 m²
property: Appartement, people: 3 p, ch: 2 ch, sq.m: 92 m²
property: Appartement, people: 1 p, ch: 41 m², sq.m: 1 asc

Note that the two apartments with 1 p produce slightly different output: they have no "ch" value, so the area and "1 asc" end up in the wrong fields. You will need to process the result more carefully to make sure the output is consistent.
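One possible way to make the fields consistent is to look at the unit at the end of each value instead of relying on its position. The sketch below is only an illustration: `parse_criteria` is a hypothetical helper, and the list of units is an assumption read off the sample output in this thread.

```python
def parse_criteria(items):
    """Map criterion strings such as '85 m²' to fields keyed by their unit."""
    # Units seen in the sample output; other listings may use more (assumption).
    units = ("p", "ch", "m²", "asc")
    record = dict.fromkeys(units)          # every field starts as None
    for item in items:
        # Normalise non-breaking spaces, then split off the trailing unit.
        text = item.replace("\xa0", " ").strip()
        value, _, unit = text.rpartition(" ")
        if unit in record:
            record[unit] = value
    return record

print(parse_criteria(["4 p", "3 ch", "85 m²"]))
print(parse_criteria(["1 p", "33 m²", "1 asc"]))
```

With this, a listing that has no "ch" entry simply keeps `None` in that field, and the area no longer slides into the wrong slot.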

@metulburr was faster than me

zarize · Aug-28-2019, 10:00 AM

It looks really awesome guys, thank you for your support :) I'll try to make a dataframe from this to get it into columns. If I can't figure it out, I'll ask you for help :)
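A rough sketch of getting such values into columns: collect one dict per listing, then write the dicts out as a table. The rows below are copied from the sample output earlier in the thread, and the column names are made up for illustration. Only the standard library is used here; `pandas.DataFrame(rows)` accepts the same kind of list of dicts.

```python
import csv
import io

# One dict per listing; in the scraper these would be appended inside the loop.
rows = [
    {"p": "4 p", "ch": "3 ch", "area": "85 m²", "price": "3 800 €", "city": "Paris 16ème"},
    {"p": "2 p", "ch": "1 ch", "area": "33 m²", "price": "1 450 €", "city": "Paris 5ème"},
]

# Write to an in-memory buffer; a real script would open a file instead.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["p", "ch", "area", "price", "city"])
writer.writeheader()
writer.writerows(rows)

print(buffer.getvalue())
```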

zarize · (This post was last modified: Aug-28-2019, 12:25 PM by zarize.)

Firstly, I cannot edit my post (probably it is too old? or I am missing something, if so then sorry)
Secondly, sorry for my newbie questions, but I am new and I want to learn Python :P

What if I would like to store the values in variables?
Let's say I want to do:
price = section.find('span', {'class':'c-pa-cprice'}).text.strip()
print(price)

and it doesn't work... it returns something about a tab,
so I tried adding ":" at the end of the assignment line, but that did not work either.

My full attempt is below:

for section in soup.find_all(class_='c-pa-info'):
    sbathrooms = section.find('div', {'class':'c-pa-criterion'}).text.strip()
    sprice = section.find('span', {'class':'c-pa-cprice'}).text.strip()
    cc = section.find('span', {'class':'c-pa-sprice'}).text.strip()
    sneighborhood = section.find('div', {'class':'c-pa-city'}).text.strip()
    #print('---')
	print(sbathrooms)

**buran** · Aug-28-2019, 12:31 PM

(Aug-28-2019, 12:24 PM)zarize Wrote: and it doesn't work... it returns something about tab

What does it return? Probably an error saying that you mix tabs with spaces?
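A small sketch of that kind of failure, assuming the loop body is indented with four spaces and the last line with a tab, as in the snippet above. The source string is a made-up stand-in for the real loop. Python 3 normally reports this as `TabError: inconsistent use of tabs and spaces in indentation`:

```python
# A four-space indent followed by a tab-indented line in the same block.
source = "for i in range(2):\n    x = i\n\tprint(x)\n"

try:
    compile(source, "<example>", "exec")
    error_name = None
except IndentationError as exc:      # TabError is a subclass of IndentationError
    error_name = type(exc).__name__
print(error_name)

# Converting the tab to four spaces makes the indentation consistent again.
fixed = source.expandtabs(4)
compile(fixed, "<example>", "exec")  # compiles without raising
```

Most editors can be set to insert spaces when the Tab key is pressed, which avoids the problem in the first place.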
