Python Forum
Splitting HTML code with br tag
#1
Hi, I am new to scraping with Python, and I apologize for any mistakes.

I would like to get text from HTML code where the target text sits between <br> tags. I tried the following code, but it gives me the whole text.
Any ideas?


item_phone_type= items.find('span', class_='annonce_get_description', itemprop="description").text.split('<br>')
print(item_phone_type)
Quote:<span class="annonce_get_description" itemprop="description">
Smartphones<br>
<b>Double puces</b>
<br>
Mémoire : 64 GO
<br>
Bluetooth Wifi <b>4G</b>
<br>
Ecran 5.8 pouces
<br>
Appareil photo 12 MP
<br>
Bon état
<br>
<span class="annonce_description_preview "> </span></span>
#2
To test this, also make sure the code you post can be run; that makes it easier for people to help.
from bs4 import BeautifulSoup
import re

html = '''\
<span class="annonce_get_description" itemprop="description">
Smartphones<br>
<b>Double puces</b>
<br>
Mémoire : 64 GO
<br>
Bluetooth Wifi <b>4G</b>
<br>
Ecran 5.8 pouces
<br>
Appareil photo 12 MP
<br>
Bon état
<br>
<span class="annonce_description_preview "> </span></span>'''

soup = BeautifulSoup(html, 'lxml')
Use:
>>> tags = soup.find(class_="annonce_get_description")
>>> print(tags.text)

Smartphones
Double puces

Mémoire : 64 GO

Bluetooth Wifi 4G

Ecran 5.8 pouces

Appareil photo 12 MP

Bon état

 
>>> print(repr(tags.text.strip()))
'Smartphones\nDouble puces\n\nMémoire : 64 GO\n\nBluetooth Wifi 4G\n\nEcran 5.8 pouces\n\nAppareil photo 12 MP\n\nBon état'
Calling .text drops all the br tags and leaves only the line breaks. As repr() shows, splitting on \n\n should keep the structure.
>>> br_tags = tags.text.strip().split('\n\n')
>>> br_tags
['Smartphones\nDouble puces',
 'Mémoire : 64 GO',
 'Bluetooth Wifi 4G',
 'Ecran 5.8 pouces',
 'Appareil photo 12 MP',
 'Bon état']

>>> print(br_tags[0])
Smartphones
Double puces

>>> print(br_tags[2])
Bluetooth Wifi 4G
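Splitting on \n\n depends on how the line breaks happen to fall in the source. A possible alternative, as a rough sketch only: swap each <br> element for a marker while the markup is still parsed, then split on that marker. The marker string SEP is made up for this example, and html.parser is used here instead of lxml only so that nothing beyond bs4 is needed.

```python
from bs4 import BeautifulSoup

html = '''\
<span class="annonce_get_description" itemprop="description">
Smartphones<br>
<b>Double puces</b>
<br>
Mémoire : 64 GO
<br>
Bluetooth Wifi <b>4G</b>
<br>
<span class="annonce_description_preview "> </span></span>'''

SEP = '||BR||'  # hypothetical marker; any string that cannot occur in the ad text works

soup = BeautifulSoup(html, 'html.parser')
tag = soup.find('span', class_='annonce_get_description')

# Swap every <br> element for the marker while the markup is still a tree
for br in tag.find_all('br'):
    br.replace_with(SEP)

# Split on the marker, collapse leftover whitespace, drop empty pieces
pieces = [' '.join(p.split()) for p in tag.get_text().split(SEP)]
pieces = [p for p in pieces if p]
print(pieces)
```

Here 'Smartphones' and 'Double puces' end up as separate items, because there is a real <br> between them.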
#3
I tried your code, but it doesn't give me the same result as yours.
Quote:IndexError Traceback (most recent call last)
<ipython-input-15-6f66d1a27188> in <module>()
13 print(br_tags[0])
14 print("######################################")
---> 15 print(br_tags[2])
16 print("######################################")
17 print(br_tags[4])

IndexError: list index out of range
#4
You are getting output that my code can never give, so you are testing against more HTML than I do.
Here it is as one script, this time including itemprop="description", which may be needed when testing the code on the site you use.
from bs4 import BeautifulSoup

# Simulate code on a web-site 
html = '''\
<span class="annonce_get_description" itemprop="description">
Smartphones<br>
<b>Double puces</b>
<br>
Mémoire : 64 GO
<br>
Bluetooth Wifi <b>4G</b>
<br>
Ecran 5.8 pouces
<br>
Appareil photo 12 MP
<br>
Bon état
<br>
<span class="annonce_description_preview "> </span></span>'''

soup = BeautifulSoup(html, 'lxml')
tags = soup.find(class_="annonce_get_description", itemprop="description")
br_tags = tags.text.strip().split('\n\n')
print(br_tags)
print('-' * 15)
print(br_tags[0])
print('-' * 15)
print(br_tags[2])
Just to make it clear, the code above is standalone; it does not need a URL.
Output:
E:\div_code\scrape
λ python br_tags.py
['Smartphones\nDouble puces', 'Mémoire : 64 GO', 'Bluetooth Wifi 4G', 'Ecran 5.8 pouces', 'Appareil photo 12 MP', 'Bon état']
---------------
Smartphones
Double puces
---------------
Bluetooth Wifi 4G
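When the text being split has fewer blocks than this sample, an index like br_tags[2] raises IndexError, as in post #3. A small guard can avoid the crash. This is a minimal sketch; the helper name pick and the one-block sample text are made up for illustration.

```python
def pick(parts, index, default=None):
    """Return parts[index], or default when the list is too short."""
    if -len(parts) <= index < len(parts):
        return parts[index]
    return default

# A description that contains only one block, so the split gives one item
text = 'Smartphones\nDouble puces'
br_tags = text.strip().split('\n\n')

print(len(br_tags))        # check the length before trusting any index
print(pick(br_tags, 0))
print(pick(br_tags, 2))    # None instead of an IndexError
```

Printing len(br_tags) or repr() of the text first usually shows right away why an index is out of range.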
#5
Thank you, and you're right.
When I try your code with the HTML string it works well, but when I try it on the website I get errors. I don't know why, but I would like to know.
I will send you the website; maybe you can enlighten me?
https://www.ouedkniss.com/telephones
Thank you again.
#6
Something like this; the id="ann-20047560" changes all the time on this site.
This is a general way to split, as shown. You may need to adjust it to get what you want, as not all advertisement texts are the same.
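import requests
from bs4 import BeautifulSoup

url = 'https://www.ouedkniss.com/telephones'
response = requests.get(url)
soup = BeautifulSoup(response.content, 'lxml')
# Not used below; the id changes on this site, so this lookup may return None
id_find = soup.find('div', id="ann-20047560")
# First description span on the page
text_tag = soup.find('span', class_="annonce_get_description", itemprop="description")
print(text_tag.text)
# split
print('-' * 20)
# .text holds no markup, so splitting on '<br/>' leaves a one-item list
tag_spilt = text_tag.text.split('<br/>')
# The actual split happens here, on the \r\n line endings
lst = tag_spilt[0].split('\r\n')
print(lst)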
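Since the ad id keeps changing, one option is to match every id that follows the same pattern instead of a fixed one. The sketch below runs on stand-in HTML; the example ids, the "ann-" plus digits pattern, and the span being nested inside the div are all assumptions based on the single id seen above, so check them against the real page.

```python
import re
from bs4 import BeautifulSoup

# Stand-in markup; the ids and the nesting of span inside div are assumptions
html = '''\
<div id="ann-111"><span class="annonce_get_description" itemprop="description">Smartphones<br/>Mémoire : 64 GO</span></div>
<div id="ann-222"><span class="annonce_get_description" itemprop="description">Smartphones<br/>Bon état</span></div>
<div id="footer">not an ad</div>'''

soup = BeautifulSoup(html, 'html.parser')

ads = {}
# Match any div whose id is "ann-" followed by digits
for div in soup.find_all('div', id=re.compile(r'^ann-\d+$')):
    span = div.find('span', class_='annonce_get_description', itemprop='description')
    if span is None:  # skip ads that have no description
        continue
    # Join the text pieces with a space and strip surrounding whitespace
    ads[div['id']] = span.get_text(' ', strip=True)

print(ads)
```

This keeps one entry per ad, so the splitting step can then be applied to each description separately.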
#7
But this code doesn't split the content?
#8
The output is now this; it is always changing.
Output:
SmartphonesMémoire : 64 GO Bon état Je vends 2 iphone se produit européen :
>> capacité : 64 go
>> couleur : rose gold
>> État : 10/10 
>> fourni avec tt ses accessoires 
--------------------
['SmartphonesMémoire : 64 GO Bon état Je vends 2 iphone se produit européen :', '>> capacité : 64 go', '>> couleur : rose gold', '>> État : 10/10 ', '>> fourni avec tt ses accessoires ', '']
lst is a list that holds the split content.
>>> lst[0]
'SmartphonesMémoire : 64 GO Bon état Je vends 2 iphone se produit européen :'
>>> lst[1]
'>> capacité : 64 go'
>>> lst[2]
'>> couleur : rose gold'
>>> lst[3]
'>> État : 10/10 '
#9
import requests
from bs4 import BeautifulSoup
 
url = 'https://www.ouedkniss.com/telephones'
response = requests.get(url)
soup = BeautifulSoup(response.content, 'lxml')
id_find = soup.find('div', id="ann-20047560")
text_tag = soup.find('span', class_="annonce_get_description", itemprop="description")
print(text_tag.text)
# split
print('-' * 20)
tag_spilt = text_tag.text.split('<br/>')
lst = tag_spilt[0].split('\r\n')
print(lst)
print('-'*20)
print(lst[0])
Quote:SmartphonesBluetooth Wifi 4G Produit neuf jamais utilisé Paiement à la livraisonCaractéristiques de la série 4 (gps)
boîtier en aluminium gris space gps intégré, glonass, galileo et qzss s4 avec processeur dual-core 64 bits w3 apple puce sans fil altimètre barométrique capacité 16 go 1 capteur cardiaque optique capteur cardiaque électrique accéléromètre amélioré
--------------------
['SmartphonesBluetooth Wifi 4G Produit neuf jamais utilisé Paiement à la livraisonCaractéristiques de la série 4 (gps)', 'boîtier en aluminium gris space gps intégré, glonass, galileo et qzss s4 avec processeur dual-core 64 bits w3 apple puce sans fil altimètre barométrique capacité 16 go 1 capteur cardiaque optique capteur cardiaque électrique accéléromètre amélioré ']
--------------------
SmartphonesBluetooth Wifi 4G Produit neuf jamais utilisé Paiement à la livraisonCaractéristiques de la série 4 (gps)


I will try some more and see.

If you look at the pictures, we can split it at the <br>.


[Image: IWEXCIR]


[Image: uHv0L8C]
#10
(Aug-06-2019, 12:39 PM)yokaso Wrote: if you see the picture we can split it from the <br>
Yes, that is what I do; look at the code again. The text that comes back has \r\n at each new line, and that is what I split on.
You have to look at what the code gets back with repr(); you cannot look only at the website's source.
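To make the point about repr() concrete, here is a small sketch on a sample string. The string only stands in for text_tag.text, since the real page content keeps changing.

```python
# Sample text standing in for text_tag.text (the live content changes constantly)
text = ('Je vends 2 iphone se produit européen :\r\n'
        '>> capacité : 64 go\r\n'
        '>> couleur : rose gold\r\n')

# repr() makes the hidden \r\n line endings visible
print(repr(text))

# '<br/>' is markup, and .text has already dropped the markup,
# so splitting on it leaves the string in one piece
print(len(text.split('<br/>')))

# splitlines() handles \r\n, \n and \r alike; blank entries are dropped
lines = [line.strip() for line in text.splitlines() if line.strip()]
print(lines)
```

Using splitlines() instead of split('\r\n') also avoids the empty string at the end of the list.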