Splitting HTML code with br tag

yokaso · (This post was last modified: Aug-04-2019, 07:04 AM by yokaso.)

Hi, I am new to scraping with Python, and I apologize for any mistakes.

I would like to get text out of HTML code. The target text sits between <br> tags. I tried the following code, but it gives me the whole text.
Any ideas?

item_phone_type= items.find('span', class_='annonce_get_description', itemprop="description").text.split('<br>')
print(item_phone_type)

Quote:
Smartphones 
Double puces
 
Mémoire : 64 GO
 
Bluetooth Wifi 4G
 
Ecran 5.8 pouces
 
Appareil photo 12 MP
 
Bon état

***snippsat*** · Aug-04-2019, 09:36 AM

To test, here is the HTML as code that can be run. Runnable code makes it easier for people to help.

from bs4 import BeautifulSoup
import re

html = '''\
<span class="annonce_get_description" itemprop="description">
Smartphones<br>
<b>Double puces</b>
<br>
Mémoire : 64 GO
<br>
Bluetooth Wifi <b>4G</b>
<br>
Ecran 5.8 pouces
<br>
Appareil photo 12 MP
<br>
Bon état
<br>
<span class="annonce_description_preview "> </span></span>'''

soup = BeautifulSoup(html, 'lxml')

Use:

>>> tags = soup.find(class_="annonce_get_description")
>>> print(tags.text)

Smartphones
Double puces

Mémoire : 64 GO

Bluetooth Wifi 4G

Ecran 5.8 pouces

Appareil photo 12 MP

Bon état

 
>>> print(repr(tags.text.strip()))
'Smartphones\nDouble puces\n\nMémoire : 64 GO\n\nBluetooth Wifi 4G\n\nEcran 5.8 pouces\n\nAppareil photo 12 MP\n\nBon état'

With .text the <br> tags themselves are dropped and only the newlines around them remain. As repr() shows, splitting on \n\n should keep the structure.

>>> br_tags = tags.text.strip().split('\n\n')
>>> br_tags
['Smartphones\nDouble puces',
 'Mémoire : 64 GO',
 'Bluetooth Wifi 4G',
 'Ecran 5.8 pouces',
 'Appareil photo 12 MP',
 'Bon état']

>>> print(br_tags[0])
Smartphones
Double puces

>>> print(br_tags[2])
Bluetooth Wifi 4G
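
If the goal is to split exactly where each <br> sits, regardless of how many newlines surround it, one possible sketch is to swap every <br> for a placeholder before reading the text. The MARKER value is a made-up placeholder assumed not to occur in the ad text, the HTML is a shortened copy of the test HTML above, and 'html.parser' is used only because it ships with Python.

```python
from bs4 import BeautifulSoup

# Shortened copy of the test HTML used above.
html = '''\
<span class="annonce_get_description" itemprop="description">
Smartphones<br>
<b>Double puces</b>
<br>
Mémoire : 64 GO
<br>
Bluetooth Wifi <b>4G</b>
<br>
</span>'''

# 'html.parser' is the parser bundled with Python (an assumption for portability).
soup = BeautifulSoup(html, 'html.parser')
tags = soup.find(class_="annonce_get_description")

MARKER = '|||'  # hypothetical placeholder, assumed absent from the ad text

# Swap every <br> element for the marker so it survives in the plain text.
for br in tags.find_all('br'):
    br.replace_with(MARKER)

# Split on the marker, trim whitespace and drop empty pieces.
parts = [part.strip() for part in tags.get_text().split(MARKER)]
parts = [part for part in parts if part]
print(parts)
```

Unlike the split on \n\n, this separates 'Smartphones' and 'Double puces', because there is a <br> between them, while 'Bluetooth Wifi 4G' stays in one piece since <b> is not a split point.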

yokaso · Aug-05-2019, 07:14 AM

I tried your code, but it doesn't give me the same result as yours.

Quote:IndexError Traceback (most recent call last)
<ipython-input-15-6f66d1a27188> in <module>()
13 print(br_tags[0])
14 print("######################################")
---> 15 print(br_tags[2])
16 print("######################################")
17 print(br_tags[4])

IndexError: list index out of range

***snippsat*** · (This post was last modified: Aug-05-2019, 09:42 AM by snippsat.)

You are getting output that my code can never give, so you are testing against more HTML than I am.
Here it is as one script, with itemprop="description" added, which may be needed when testing the code on the site you use.

from bs4 import BeautifulSoup

# Simulate code on a web-site 
html = '''\
<span class="annonce_get_description" itemprop="description">
Smartphones<br>
<b>Double puces</b>
<br>
Mémoire : 64 GO
<br>
Bluetooth Wifi <b>4G</b>
<br>
Ecran 5.8 pouces
<br>
Appareil photo 12 MP
<br>
Bon état
<br>
<span class="annonce_description_preview "> </span></span>'''

soup = BeautifulSoup(html, 'lxml')
tags = soup.find(class_="annonce_get_description", itemprop="description")
br_tags = tags.text.strip().split('\n\n')
print(br_tags)
print('-' * 15)
print(br_tags[0])
print('-' * 15)
print(br_tags[2])

Just to make it clear: the code above is standalone; it does not need a URL.

Output:E:\div_code\scrape
λ python br_tags.py
['Smartphones\nDouble puces', 'Mémoire : 64 GO', 'Bluetooth Wifi 4G', 'Ecran 5.8 pouces', 'Appareil photo 12 MP', 'Bon état']
---------------
Smartphones
Double puces
---------------
Bluetooth Wifi 4G
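
An IndexError like the one quoted above means the list had fewer pieces than the index asked for. A minimal sketch of a guarded lookup follows; the helper name pick and the sample list are made up for illustration.

```python
def pick(parts, index, default=None):
    """Return parts[index], or default when the list is too short."""
    if -len(parts) <= index < len(parts):
        return parts[index]
    return default


# Hypothetical result where the text came back as a single piece.
short = ['SmartphonesMémoire : 64 GO Bon état']

print(len(short))      # check how many pieces there are before indexing
print(pick(short, 0))
print(pick(short, 2))  # no third piece, so the default (None) is returned
```

Printing len() of the list first is usually enough to see whether the split produced what was expected.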

yokaso · Aug-06-2019, 08:01 AM

Thank you, and you're right.
When I try your code with the HTML it works well, but when I try it on the website I get errors. I don't know why, and I would like to.
I will send you the website; maybe you can enlighten me?
https://www.ouedkniss.com/telephones
Thank you again.

***snippsat*** · Aug-06-2019, 10:58 AM

Something like this. Note that id="ann-20047560" changes all the time on this site.
This is a general way to split, as shown; you may need to adjust it to get what you want, since not all advertisement texts are the same.

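import requests
from bs4 import BeautifulSoup

url = 'https://www.ouedkniss.com/telephones'
response = requests.get(url)
soup = BeautifulSoup(response.content, 'lxml')
# The ad id changes all the time; id_find is not used further below
id_find = soup.find('div', id="ann-20047560")
# First description span found on the page
text_tag = soup.find('span', class_="annonce_get_description", itemprop="description")
print(text_tag.text)
# split
print('-' * 20)
# .text normally contains no tags, so this split leaves the text in one piece
tag_spilt = text_tag.text.split('<br/>')
# The actual split happens here, on the \r\n line breaks in the text
lst = tag_spilt[0].split('\r\n')
print(lst)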

yokaso · Aug-06-2019, 11:49 AM

But this code doesn't split the content?

***snippsat*** · (This post was last modified: Aug-06-2019, 12:15 PM by snippsat.)

The output is now this; it is always changing.

Output:SmartphonesMémoire : 64 GO Bon état Je vends 2 iphone se produit européen :
>> capacité : 64 go
>> couleur : rose gold
>> État : 10/10 
>> fourni avec tt ses accessoires 

--------------------
['SmartphonesMémoire : 64 GO Bon état Je vends 2 iphone se produit européen :', '>> capacité : 64 go', '>> couleur : rose gold', '>> État : 10/10 ', '>> fourni avec tt ses accessoires ', '']

lst is a list with the split content.

>>> lst[0]
'SmartphonesMémoire : 64 GO Bon état Je vends 2 iphone se produit européen :'
>>> lst[1]
'>> capacité : 64 go'
>>> lst[2]
'>> couleur : rose gold'
>>> lst[3]
'>> État : 10/10 '
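
The list above still has trailing spaces and an empty last item. A small sketch of a cleanup, assuming the text is the list shown above joined back together with \r\n: str.splitlines() handles both \r\n and \n, and the filter drops empty lines.

```python
# Text rebuilt from the list printed above by joining it with '\r\n'.
text = ('SmartphonesMémoire : 64 GO Bon état Je vends 2 iphone se produit européen :\r\n'
        '>> capacité : 64 go\r\n'
        '>> couleur : rose gold\r\n'
        '>> État : 10/10 \r\n'
        '>> fourni avec tt ses accessoires \r\n')

# Split on any line ending, trim each line and skip the empty ones.
lines = [line.strip() for line in text.splitlines() if line.strip()]
print(lines)
```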

yokaso · (This post was last modified: Aug-06-2019, 12:55 PM by yokaso.)

import requests
from bs4 import BeautifulSoup
 
url = 'https://www.ouedkniss.com/telephones'
response = requests.get(url)
soup = BeautifulSoup(response.content, 'lxml')
id_find = soup.find('div', id="ann-20047560")
text_tag = soup.find('span', class_="annonce_get_description", itemprop="description")
print(text_tag.text)
# split
print('-' * 20)
tag_spilt = text_tag.text.split('<br/>')
lst = tag_spilt[0].split('\r\n')
print(lst)
print('-'*20)
print(lst[0])

Quote:SmartphonesBluetooth Wifi 4G Produit neuf jamais utilisé Paiement à la livraisonCaractéristiques de la série 4 (gps)
boîtier en aluminium gris space gps intégré, glonass, galileo et qzss s4 avec processeur dual-core 64 bits w3 apple puce sans fil altimètre barométrique capacité 16 go 1 capteur cardiaque optique capteur cardiaque électrique accéléromètre amélioré
--------------------
['SmartphonesBluetooth Wifi 4G Produit neuf jamais utilisé Paiement à la livraisonCaractéristiques de la série 4 (gps)', 'boîtier en aluminium gris space gps intégré, glonass, galileo et qzss s4 avec processeur dual-core 64 bits w3 apple puce sans fil altimètre barométrique capacité 16 go 1 capteur cardiaque optique capteur cardiaque électrique accéléromètre amélioré ']
--------------------
SmartphonesBluetooth Wifi 4G Produit neuf jamais utilisé Paiement à la livraisonCaractéristiques de la série 4 (gps)

I will try some more and see.

if you see the picture we can split it from the 

[Image: IWEXCIR]

***snippsat*** · Aug-06-2019, 02:35 PM

(Aug-06-2019, 12:39 PM)yokaso Wrote: if you see the picture we can split it from the

Yes, that is what I do; look at the code again. The text comes back with \r\n for each new line, and that is what I split on.
You have to look at what the code gets back with repr(); you cannot look only at the website's source.
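
To illustrate the point about repr(), here is a short sketch with a made-up sample string. print() hides the line endings, repr() shows them, and the separator to split on can then be chosen from what is actually in the text.

```python
# Made-up short sample with Windows-style line endings.
sample = 'Smartphones\r\n>> capacité : 64 go\r\n'

print(sample)        # the line endings are invisible here
print(repr(sample))  # the \r\n escapes become visible

# Pick the separator that is really present in the text.
separator = '\r\n' if '\r\n' in sample else '\n'
parts = [part for part in sample.split(separator) if part]
print(parts)
```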
