Python Forum

Full Version: getting rid of [''] while print
Pages: 1 2
Hi guys,

I would love some help getting rid of the [''] that shows up in my output while I am scraping data. Ideally I want organized data, and I am working on it :P
[Image: 86l7Dss.png]
This is my first scraping project, as I want to learn it :)
Show your code. You get your results as one-element lists...
(Aug-27-2019, 03:00 PM)buran Wrote: Show your code. You get your results as one-element lists...

from bs4 import BeautifulSoup
import requests
import pandas as pd
import re

headers = {
    'Sec-Fetch-Mode': 'cors',
    'Referer': 'https://www.seloger.com/list.htm?types=1%2C2&projects=1&enterprise=0&furnished=1&places=%5B%7Bcp%3A75%7D%5D&qsVersion=1.0',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.100 Safari/537.36',
}

params = (
    ('lp', '1'),
    ('p', 'ZXZlbnRUeXBlPW51bWJlckNsaWNrcyZudW1iZXI9MCZ0aW1lPTE1NjY4OTIwNzMzODEmc2l0ZUNvZGU9bXd5cHJsZWViMiZ2aXNpdG9yQ29kZT0xbWFzeXhycDVsdWcxc3g0JnZpc2l0TnVtYmVyPTImc3RhcnRPZlZpc2l0PWZhbHNlJnNjcmlwdFZlcnNpb249MjAxOTAxMTUmbm9uY2U9QTUxMzBDQkQyNjM5Nzk4Ng=='),
)

r = requests.get('https://www.seloger.com/list.htm?types=1%2C2&projects=1&enterprise=0&furnished=1&places=%5B%7Bcp%3A75%7D%5D&qsVersion=1.0', headers=headers, params=params)
content = (r.text)
soup = BeautifulSoup(content, 'html.parser')


for col in soup.find_all(class_='c-pa-list c-pa-sl c-pa-gold cartouche '):
	for line in col.get_text().split('\n'):
		step1 = str(line.strip().replace('Site web', "").replace('Exclusivité', "").replace('Voir toutes les photos', "").replace('20', "").replace('', "").split('\n'))
		step2 = step1.replace('\\xa0', " ")
		print(step2)
	break
These are my first steps, but ideally I would like to get the data into columns, so I would appreciate any advice :)
If you change it to this
for col in soup.find_all(class_='c-pa-list c-pa-sl c-pa-gold cartouche '):
    for line in col.get_text().split('\n'):
        stripped = line.strip()
        if stripped:
            if stripped not in ['Site web', 'Exclusivité', 'Voir toutes les photos', '20']:
                print(stripped)
    break
you will get this
Appartement
4 p
3 ch
85 m²
3 800 €
CC
Paris 16ème
However, grabbing the visible text is not the ideal way to parse HTML.
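The filtering shown above does not depend on the page itself, so it can be tried on a plain string first. This is only a minimal sketch: the sample text is made up to resemble what get_text() might return, and the names UNWANTED and clean_lines are hypothetical, not part of the original script.

```python
# Labels to drop, taken from the list used in the loop above.
UNWANTED = {"Site web", "Exclusivité", "Voir toutes les photos", "20"}


def clean_lines(text):
    """Return the non-empty, stripped lines of text that are not unwanted labels."""
    cleaned = []
    for line in text.split("\n"):
        stripped = line.strip()
        # Skip blank lines and the labels we do not care about.
        if stripped and stripped not in UNWANTED:
            cleaned.append(stripped)
    return cleaned


# Made-up sample that imitates the visible text of one listing.
sample = "\n  Exclusivité \n\nAppartement\n 4 p \n3 ch\n85 m²\n20\nSite web\n"

for value in clean_lines(sample):
    print(value)
```

Collecting the values in a list instead of printing them right away makes it easier to reuse them later, for example as one row of a table.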
woah, thanks!! :)
A better way to grab the text is to search for the actual elements instead of taking all the visible text. For example, if you change from this:
for col in soup.find_all(class_='c-pa-list c-pa-sl c-pa-gold cartouche '):
    for line in col.get_text().split('\n'):
        stripped = line.strip()
        if stripped:
            if stripped not in ['Site web', 'Exclusivité', 'Voir toutes les photos', '20']:
                print(stripped)
    break
to this:
for section in soup.find_all(class_='c-pa-info'):
    print(section.find('div', {'class':'c-pa-criterion'}).text.strip())
    print(section.find('span', {'class':'c-pa-cprice'}).text.strip())
    print(section.find('div', {'class':'c-pa-city'}).text.strip())
    print('---')
You will get this
First, why you get what you get:
On line 24 of your script, line.strip().replace('Site web', "").replace('Exclusivité', "").replace('Voir toutes les photos', "").replace('20', "").replace('', "").split('\n') produces a list with one element of type str. You then convert that list to str, which is where the [''] in your output comes from, and on line 25 you try to replace '\\xa0' in the result.
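The effect described here can be reproduced without any scraping. A small sketch, assuming a sample value with a non-breaking space like the price in the listing (the variable names are only for illustration):

```python
# Sample value with a non-breaking space (\xa0), similar to the scraped price.
line = "3\xa0800 €"

# There is no newline inside the value, so split() returns a one-element list.
as_list = line.strip().split("\n")

# str() on a list gives its repr: brackets, quotes and escaped characters included.
as_text = str(as_list)

print(len(as_list))  # 1
print(as_text)       # ['3\xa0800 €']

# Working on the string itself avoids both the brackets and the escaped \xa0.
clean = line.strip().replace("\xa0", " ")
print(clean)         # 3 800 €
```

This is also why the original script had to replace the escaped text '\\xa0' rather than the character itself: after str(), the non-breaking space is no longer one character but a backslash followed by xa0.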

However, this approach is wrong. You need to use bs4 to parse the HTML source. Also note that hard-coded values (e.g. 29, the number of photos) will not work.

Replace lines 22-27 (the whole for col ... loop) with
for pa in soup.find_all('div', {'class':'c-pa-list c-pa-sl c-pa-gold cartouche'}):
    pa_info = pa.find('div', {'class':'c-pa-info'})
    pa_type = pa_info.find('a', {'class':'c-pa-link'}).text.strip()
    pa_criterion = pa.find('div', {'class':'c-pa-criterion'})
    pa_p, pa_ch, pa_sq = [em.text for em in pa_criterion.find_all('em')] 
    print(f'property: {pa_type}, people: {pa_p}, ch: {pa_ch}, sq.m: {pa_sq}')
and what you will get is
Output:
property: Appartement, people: 4 p, ch: 3 ch, sq.m: 85 m²
property: Appartement, people: 2 p, ch: 1 ch, sq.m: 33 m²
property: Appartement, people: 2 p, ch: 1 ch, sq.m: 30 m²
property: Appartement, people: 4 p, ch: 2 ch, sq.m: 72 m²
property: Appartement, people: 6 p, ch: 4 ch, sq.m: 150 m²
property: Appartement, people: 1 p, ch: 33 m², sq.m: 1 asc
property: Appartement, people: 5 p, ch: 3 ch, sq.m: 122 m²
property: Appartement, people: 3 p, ch: 2 ch, sq.m: 106 m²
property: Appartement, people: 4 p, ch: 3 ch, sq.m: 145 m²
property: Appartement, people: 3 p, ch: 2 ch, sq.m: 51 m²
property: Appartement, people: 3 p, ch: 2 ch, sq.m: 67 m²
property: Appartement, people: 3 p, ch: 2 ch, sq.m: 79 m²
property: Appartement, people: 3 p, ch: 2 ch, sq.m: 92 m²
property: Appartement, people: 1 p, ch: 41 m², sq.m: 1 asc
Note that there are 2 apartments with 1 p that give slightly different output: the second value is the area and the third is "1 asc", so the labels are shifted. You will need to process the result more carefully to make sure the output is consistent.
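One possible way to make the result consistent is to look at the unit at the end of each value instead of relying on its position. This is a rough sketch only: parse_criteria is a hypothetical helper, the field names simply follow the labels printed above, and units other than p, ch and m² (such as asc) are ignored.

```python
def parse_criteria(items):
    """Sort criterion strings such as '4 p' or '85 m²' into named fields by unit."""
    # Keys follow the labels used in the post above; missing values stay None.
    fields = {"p": None, "ch": None, "sq_m": None}
    unit_to_field = {"p": "p", "ch": "ch", "m²": "sq_m"}
    for item in items:
        # Normalize non-breaking spaces, then split "85 m²" into ("85", "m²").
        value, _, unit = item.replace("\xa0", " ").strip().rpartition(" ")
        field = unit_to_field.get(unit)
        if field is not None:
            fields[field] = value
    return fields


print(parse_criteria(["4 p", "3 ch", "85 m²"]))
print(parse_criteria(["1 p", "33 m²", "1 asc"]))
```

Unlike unpacking into exactly three names, this does not raise a ValueError when a listing has fewer or more than three values; anything it does not recognize is skipped.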
@metulburr was faster than me
It looks really awesome guys, thank you for your support :) I'll try to make a DataFrame from this to get it into columns. If I can't figure it out, I'll ask you for help :)
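For getting the data into columns, one common pattern is to collect one dict per listing and turn the list of dicts into a table at the end. The sketch below uses only the standard csv module so it runs anywhere; the same list of dicts can also be passed to pandas.DataFrame. The column names and the rows_to_csv helper are assumptions for illustration, and the second row leaves price and city empty because the thread does not show them.

```python
import csv
import io

# Column order for the table; the names are chosen for this example only.
COLUMNS = ["type", "p", "ch", "sq_m", "price", "city"]

rows = [
    {"type": "Appartement", "p": "4", "ch": "3", "sq_m": "85",
     "price": "3 800 €", "city": "Paris 16ème"},
    # Price and city of this listing are not shown in the thread, so left empty.
    {"type": "Appartement", "p": "2", "ch": "1", "sq_m": "33",
     "price": None, "city": None},
]


def rows_to_csv(records, columns):
    """Write a list of dicts as CSV text with one column per key."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=columns)
    writer.writeheader()
    writer.writerows(records)  # None values are written as empty cells
    return buffer.getvalue()


print(rows_to_csv(rows, COLUMNS))
```

Inside the scraping loop, each pass would append one such dict to rows instead of printing the values.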
Firstly, I cannot edit my post (probably too old? Or I am missing it; if so, sorry).
Secondly, sorry for my newbie questions, but I am new and I want to learn Python :P

What if I want to store the values in variables?
Let's say I want to do:
price = section.find('span', {'class':'c-pa-cprice'}).text.strip()
print(price)

It doesn't work... it returns something about a tab.
So I tried adding ":" at the end of the assignment line, but that did not work either.

My full attempt is below:
for section in soup.find_all(class_='c-pa-info'):
    sbathrooms = section.find('div', {'class':'c-pa-criterion'}).text.strip()
    sprice = section.find('span', {'class':'c-pa-cprice'}).text.strip()
    cc = section.find('span', {'class':'c-pa-sprice'}).text.strip()
    sneighborhood = section.find('div', {'class':'c-pa-city'}).text.strip()
    #print('---')
	print(sbathrooms)
(Aug-28-2019, 12:24 PM)zarize Wrote: and it doesn't work... it returns something about tab
What does it return? Probably an error saying that you are mixing tabs with spaces?
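If that is the cause, it can be checked without running the scraper. The sketch below compiles a tiny loop with the same shape as the one above (body indented with spaces, last line indented with a tab); Python 3 normally reports this as a TabError. The helper check_indentation and the sample source are made up for the example, and the 4 spaces per level used for the fix are an assumption.

```python
# Same shape as the loop above: body indented with spaces, last line with a tab.
mixed = (
    "for n in [1, 2]:\n"
    "    doubled = n * 2\n"
    "\tprint(doubled)\n"
)


def check_indentation(source):
    """Return None if the source compiles, else the name of the indentation error."""
    try:
        compile(source, "<example>", "exec")
    except IndentationError as err:  # TabError is a subclass of IndentationError
        return type(err).__name__
    return None


print(check_indentation(mixed))

# Replacing each tab with spaces (4 per level is an assumption) makes it compile.
fixed = mixed.expandtabs(4)
print(check_indentation(fixed))
```

Most editors can convert tabs to spaces for a whole file, which has the same effect as expandtabs here. Adding ":" after an assignment does not help, because a colon only belongs at the end of lines that open a block, such as for or if.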