Python Forum
spliting html code with br tag - Printable Version



spliting html code with br tag - yokaso - Aug-04-2019

Hi, I am new to web scraping with Python, so I apologize for any mistakes.

I would like to extract text from some HTML where the target text is separated by <br> tags. I tried the following code, but it gives me the whole text instead of the separate pieces.
Any ideas?


item_phone_type= items.find('span', class_='annonce_get_description', itemprop="description").text.split('<br>')
print(item_phone_type)
Quote:
<span class="annonce_get_description" itemprop="description">
Smartphones<br>
<b>Double puces</b>
<br>
Mémoire : 64 GO
<br>
Bluetooth Wifi <b>4G</b>
<br>
Ecran 5.8 pouces
<br>
Appareil photo 12 MP
<br>
Bon état
<br>
<span class="annonce_description_preview "> </span></span>



RE: spliting html code with br tag - snippsat - Aug-04-2019

Here is a test. Note that this code can be run as it stands, which makes it easier for people to help.
from bs4 import BeautifulSoup
import re

html = '''\
<span class="annonce_get_description" itemprop="description">
Smartphones<br>
<b>Double puces</b>
<br>
Mémoire : 64 GO
<br>
Bluetooth Wifi <b>4G</b>
<br>
Ecran 5.8 pouces
<br>
Appareil photo 12 MP
<br>
Bon état
<br>
<span class="annonce_description_preview "> </span></span>'''

soup = BeautifulSoup(html, 'lxml')
Use:
>>> tags = soup.find(class_="annonce_get_description")
>>> print(tags.text)

Smartphones
Double puces

Mémoire : 64 GO

Bluetooth Wifi 4G

Ecran 5.8 pouces

Appareil photo 12 MP

Bon état

 
>>> print(repr(tags.text.strip()))
'Smartphones\nDouble puces\n\nMémoire : 64 GO\n\nBluetooth Wifi 4G\n\nEcran 5.8 pouces\n\nAppareil photo 12 MP\n\nBon état'
.text drops the <br> tags but keeps the line breaks around them. As repr() shows, each <br> on its own line leaves \n\n behind, so splitting on \n\n should keep the structure.
>>> br_tags = tags.text.strip().split('\n\n')
>>> br_tags
['Smartphones\nDouble puces',
 'Mémoire : 64 GO',
 'Bluetooth Wifi 4G',
 'Ecran 5.8 pouces',
 'Appareil photo 12 MP',
 'Bon état']

>>> print(br_tags[0])
Smartphones
Double puces

>>> print(br_tags[2])
Bluetooth Wifi 4G
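
As a side note, the original attempt with .text.split('<br>') cannot work, because .text has already removed every tag. Below is a minimal sketch of one alternative, which is not the method used above: replace each <br> with a marker and split on that. It assumes the shortened sample HTML shown here, and it uses the standard-library 'html.parser' instead of 'lxml' only so that it runs without extra installs. The name MARKER is made up for the illustration.

```python
from bs4 import BeautifulSoup

# Shortened copy of the sample HTML from the thread
html = '''\
<span class="annonce_get_description" itemprop="description">
Smartphones<br>
<b>Double puces</b>
<br>
Mémoire : 64 GO
<br>
Bluetooth Wifi <b>4G</b>
<br>
</span>'''

# Assumption: 'html.parser' stands in for the 'lxml' parser used in the thread
soup = BeautifulSoup(html, 'html.parser')
tag = soup.find('span', class_='annonce_get_description')

# .text contains no tags, so there is no '<br>' to split on: one item comes back
whole = tag.text.split('<br>')
print(len(whole))

# Swap every <br> for a marker that cannot appear in the text, then split on it
MARKER = '\x00'
for br in tag.find_all('br'):
    br.replace_with(MARKER)

# Collapse whitespace inside each piece and drop the empty leftovers
parts = [' '.join(piece.split()) for piece in tag.get_text().split(MARKER)]
parts = [p for p in parts if p]
print(parts)
```

Unlike the \n\n split, this does not depend on how the HTML source happens to be laid out, so 'Smartphones' and 'Double puces' end up as separate items.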



RE: spliting html code with br tag - yokaso - Aug-05-2019

I tried your code, but it doesn't give me the same result as yours.
Quote:
IndexError Traceback (most recent call last)
<ipython-input-15-6f66d1a27188> in <module>()
13 print(br_tags[0])
14 print("######################################")
---> 15 print(br_tags[2])
16 print("######################################")
17 print(br_tags[4])

IndexError: list index out of range



RE: spliting html code with br tag - snippsat - Aug-05-2019

You are getting output that my code can never produce, so you must be testing against more HTML than I am.
Here it is as one script, this time also matching on itemprop="description", which may be needed if you test the code on the site you use.
from bs4 import BeautifulSoup

# Simulate code on a web-site 
html = '''\
<span class="annonce_get_description" itemprop="description">
Smartphones<br>
<b>Double puces</b>
<br>
Mémoire : 64 GO
<br>
Bluetooth Wifi <b>4G</b>
<br>
Ecran 5.8 pouces
<br>
Appareil photo 12 MP
<br>
Bon état
<br>
<span class="annonce_description_preview "> </span></span>'''

soup = BeautifulSoup(html, 'lxml')
tags = soup.find(class_="annonce_get_description", itemprop="description")
br_tags = tags.text.strip().split('\n\n')
print(br_tags)
print('-' * 15)
print(br_tags[0])
print('-' * 15)
print(br_tags[2])
Just to make it clear: the code above is standalone, it does not need a URL.
Output:
E:\div_code\scrape
λ python br_tags.py
['Smartphones\nDouble puces', 'Mémoire : 64 GO', 'Bluetooth Wifi 4G', 'Ecran 5.8 pouces', 'Appareil photo 12 MP', 'Bon état']
---------------
Smartphones
Double puces
---------------
Bluetooth Wifi 4G
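
The IndexError reported earlier means the list had fewer than three items, most likely because the text from the live page did not contain the \n\n that the sample HTML has. A small sketch of how to check this before indexing follows. The helper name pick and the sample string are made up for the illustration, they are not from the site or from any library.

```python
def pick(items, index, default=None):
    """Return items[index], or default when the list is too short."""
    return items[index] if -len(items) <= index < len(items) else default

# Illustrative text with no blank lines in it, unlike the sample HTML
text = 'SmartphonesMémoire : 64 GO Bon état'
br_tags = text.strip().split('\n\n')

# Look at what the split really returned before indexing into it
print(len(br_tags), repr(br_tags))

first = pick(br_tags, 0)
third = pick(br_tags, 2, default='<missing>')
print(first)
print(third)
```

Printing len() and repr() of the list first usually shows right away that the separator was never found.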



RE: spliting html code with br tag - yokaso - Aug-06-2019

Thank you, you are right.
When I run your code on that HTML it works well, but when I try it on the website I get errors. I don't know why, and I would like to understand.
Here is the website, maybe you can enlighten me?
https://www.ouedkniss.com/telephones
Thank you again.


RE: spliting html code with br tag - snippsat - Aug-06-2019

Something like this. Note that id="ann-20047560" changes all the time on this site.
This shows a general way to split; you may need to adjust it to get what you want, as not all advertisement texts are the same.
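import requests
from bs4 import BeautifulSoup

url = 'https://www.ouedkniss.com/telephones'
response = requests.get(url)
soup = BeautifulSoup(response.content, 'lxml')
# The advert id changes all the time, so this may return None; it is not used below
id_find = soup.find('div', id="ann-20047560")
# find() returns only the first matching description on the page
text_tag = soup.find('span', class_="annonce_get_description", itemprop="description")
print(text_tag.text)
# split
print('-' * 20)
# .text contains no tags, so this split leaves the text in one piece
tag_spilt = text_tag.text.split('<br/>')
# The line breaks in the returned text are \r\n, so this is the split that does the work
lst = tag_spilt[0].split('\r\n')
print(lst)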



RE: spliting html code with br tag - yokaso - Aug-06-2019

But this code doesn't split the content?


RE: spliting html code with br tag - snippsat - Aug-06-2019

The output is now this; it is always changing.
Output:
SmartphonesMémoire : 64 GO Bon état Je vends 2 iphone se produit européen :
>> capacité : 64 go
>> couleur : rose gold
>> État : 10/10
>> fourni avec tt ses accessoires

--------------------
['SmartphonesMémoire : 64 GO Bon état Je vends 2 iphone se produit européen :', '>> capacité : 64 go', '>> couleur : rose gold', '>> État : 10/10 ', '>> fourni avec tt ses accessoires ', '']
The lst is a list with split content.
>>> lst[0]
'SmartphonesMémoire : 64 GO Bon état Je vends 2 iphone se produit européen :'
>>> lst[1]
'>> capacité : 64 go'
>>> lst[2]
'>> couleur : rose gold'
>>> lst[3]
'>> État : 10/10 '
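
Since the advert ids keep changing and find() only returns the first description, one option is to loop over every advert instead. This is only a sketch: the HTML below is a made-up fragment, and it assumes each description span sits inside a div whose id starts with "ann-", which may not match the real page. It also uses the standard-library 'html.parser' so that it runs without extra installs.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical page fragment with two adverts; the real page will differ
html = '''\
<div id="ann-1"><span class="annonce_get_description" itemprop="description">
Smartphones<br>Mémoire : 64 GO<br>Bon état</span></div>
<div id="ann-2"><span class="annonce_get_description" itemprop="description">
Smartphones<br>Bluetooth Wifi<br>Produit neuf</span></div>'''

soup = BeautifulSoup(html, 'html.parser')

ads = {}
# Match every div whose id starts with "ann-", whatever number follows
for div in soup.find_all('div', id=re.compile(r'^ann-')):
    span = div.find('span', class_='annonce_get_description', itemprop='description')
    if span is None:
        continue  # advert without a description
    # stripped_strings yields each piece of text between tags, whitespace trimmed
    ads[div['id']] = list(span.stripped_strings)

for ad_id, lines in ads.items():
    print(ad_id, lines)
```

Keep in mind that stripped_strings breaks at every tag, not only at <br>, so text inside <b> would come out as its own item.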



RE: spliting html code with br tag - yokaso - Aug-06-2019

import requests
from bs4 import BeautifulSoup
 
url = 'https://www.ouedkniss.com/telephones'
response = requests.get(url)
soup = BeautifulSoup(response.content, 'lxml')
id_find = soup.find('div', id="ann-20047560")
text_tag = soup.find('span', class_="annonce_get_description", itemprop="description")
print(text_tag.text)
# split
print('-' * 20)
tag_spilt = text_tag.text.split('<br/>')
lst = tag_spilt[0].split('\r\n')
print(lst)
print('-'*20)
print(lst[0])
Quote:
SmartphonesBluetooth Wifi 4G Produit neuf jamais utilisé Paiement à la livraisonCaractéristiques de la série 4 (gps)
boîtier en aluminium gris space gps intégré, glonass, galileo et qzss s4 avec processeur dual-core 64 bits w3 apple puce sans fil altimètre barométrique capacité 16 go 1 capteur cardiaque optique capteur cardiaque électrique accéléromètre amélioré
--------------------
['SmartphonesBluetooth Wifi 4G Produit neuf jamais utilisé Paiement à la livraisonCaractéristiques de la série 4 (gps)', 'boîtier en aluminium gris space gps intégré, glonass, galileo et qzss s4 avec processeur dual-core 64 bits w3 apple puce sans fil altimètre barométrique capacité 16 go 1 capteur cardiaque optique capteur cardiaque électrique accéléromètre amélioré ']
--------------------
SmartphonesBluetooth Wifi 4G Produit neuf jamais utilisé Paiement à la livraisonCaractéristiques de la série 4 (gps)


I will keep trying and see.

If you look at the pictures, we should be able to split it on the <br> tags.


[Image: IWEXCIR]


[Image: uHv0L8C]


RE: spliting html code with br tag - snippsat - Aug-06-2019

(Aug-06-2019, 12:39 PM)yokaso Wrote: If you look at the pictures, we should be able to split it on the <br> tags.
Yes, that is what I do, look at the code again. The text that comes back contains \r\n at the line breaks, and that is what I split on.
You have to inspect the text you actually get back with repr(); looking only at the website's source is not enough.
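
To illustrate the repr() point, here is a small sketch using only plain strings. The sample text is made up and only mimics the \r\n line breaks seen in the output earlier. str.splitlines() is used here as an alternative to split('\r\n'), because it also copes with a bare \n.

```python
# Made-up text that mimics what .text returned from the site
text = 'Je vends 2 iphone :\r\n>> capacité : 64 go\n>> couleur : rose gold \r\n'

# print() hides the difference between \r\n and \n; repr() shows it
print(text)
print(repr(text))

# splitlines() breaks on \r\n as well as \n; strip and drop empty lines
lines = [line.strip() for line in text.splitlines() if line.strip()]
print(lines)
```

With split('\r\n') alone, the middle line break in this sample would be missed, since it is only a \n.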