Python Forum
Beautifull Soap. Split page using a value and not a tag.


Beautifull Soap. Split page using a value and not a tag. - lillo123 - Apr-15-2021

Hi, please look at this page:
https://mob.processotelematico.giustizia.it/proxy/index_mobile.php?version=1.1.11&platform=Android%208.0.0&uuid=137cd993b81df224&devicename=SM-G955F&token=c0ba723983c804d8eef1c9ee74cfcb99&azione=direttarg_siecic_mobile&tipoufficio=1&registro=PC&idufficio=0350330099&aaproc=2015&numproc=161&;

It is a normal page that I can handle easily with BeautifulSoup, because it is a single block of data.

The problem comes when I have this kind of page:

https://mob.processotelematico.giustizia.it/proxy/index_mobile.php?version=1.1.11&platform=Android%208.0.0&uuid=137cd993b81df224&devicename=SM-G955F&token=c0ba723983c804d8eef1c9ee74cfcb99&azione=direttarg_siecic_mobile&tipoufficio=1&registro=PC&idufficio=0580910098&aaproc=2018&numproc=1&;

As you can see, this page contains many blocks of data (all with the same structure), but unfortunately there are few tags: only <li> and <ul>. The only way I have to tell one block from the next is the value "Parti fascicolo". So it is a value, not a tag.

How can I split the page into multiple blocks using the "Parti fascicolo" values? If I can split it, I can process each block the way I do for the first page.


RE: Beautifull Soap. Split page using a value and not a tag. - buran - Apr-15-2021

It looks like every "chunk" consists of 4 <ul> tags.
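
If that holds for every case on the page (an assumption, not checked against other pages), the list of <ul> tags could simply be cut into groups of four. A rough sketch on made-up HTML: the sample markup, the "Altro" heading and the name CHUNK_SIZE are placeholders, not taken from the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the real page: two cases, four <ul> each.
# "Altro" is a placeholder heading, not one seen on the real page.
html = "".join(
    f"<ul><li>Parti fascicolo</li><li>party {n}</li></ul>"
    f"<ul><li>Ruolo Generale</li><li>N. {n}/2018</li></ul>"
    f"<ul><li>Oggetto</li><li>subject {n}</li></ul>"
    f"<ul><li>Altro</li><li>other {n}</li></ul>"
    for n in (1, 2)
)
soup = BeautifulSoup(html, "html.parser")

CHUNK_SIZE = 4  # assumption: every case really has exactly four <ul> tags
uls = soup.find_all("ul")

# Slice the flat list of <ul> tags into consecutive groups of CHUNK_SIZE.
chunks = [uls[i:i + CHUNK_SIZE] for i in range(0, len(uls), CHUNK_SIZE)]

for chunk in chunks:
    # Show the heading (first <li>) of each <ul> in the chunk.
    print([ul.li.get_text() for ul in chunk])
```

This breaks as soon as one case has more or fewer than four <ul> tags, so it is only worth using if the count is really fixed.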


RE: Beautifull Soap. Split page using a value and not a tag. - snippsat - Apr-15-2021

Use CSS selectors to split it up into parts.
BeautifulSoup has this built in through select() and select_one().
Example:
import requests
from bs4 import BeautifulSoup

url = 'https://mob.processotelematico.giustizia.it/proxy/index_mobile.php?version=1.1.11&platform=Android%208.0.0&uuid=137cd993b81df224&devicename=SM-G955F&token=c0ba723983c804d8eef1c9ee74cfcb99&azione=direttarg_siecic_mobile&tipoufficio=1&registro=PC&idufficio=0580910098&aaproc=2018&numproc=1&'
response = requests.get(url)
soup = BeautifulSoup(response.content, 'lxml')
part_1 = soup.select_one('body > ul:nth-child(1)')
part_5 = soup.select_one('body > ul:nth-child(24)')
Usage test, showing that nth-child can split it up.
Output:
>>> part_1.select_one('li')
<li data-role="list-divider">Parti fascicolo</li>
>>> part_1.select('li')[:5]
[<li data-role="list-divider">Parti fascicolo</li>, <li>L**** *****<i> (Debitore)</i><br/>Avv. P****** F****</li>]
>>>
>>> part_5.select_one('li')
<li data-role="list-divider">Parti fascicolo</li>
>>> part_5.select('li')[:5]
[<li data-role="list-divider">Parti fascicolo</li>, <li>M**** G****<i> (Creditore)</i></li>, <li>B**** S****<i> (Curatore)</i></li>, <li>E**** *****<i> (Debitore)</i></li>, <li>C**** S****<i> (Creditore)</i></li>]



RE: Beautifull Soap. Split page using a value and not a tag. - lillo123 - Apr-16-2021

Thank you, but I don't understand very well.
I do not know in advance how many "Parti del fascicolo" blocks I will have. So how can I intercept these blocks? In this case there are 6, but in others there are 1, 2, 3 or more.
Also, I have a lot of ul tags (about 48). Within these tags, how can I split the page, starting a new block each time at the value "Parti del fascicolo"?
"Parti del fascicolo" is the only way to tell that a new block is about to begin.

Thank you.
Carlo


RE: Beautifull Soap. Split page using a value and not a tag. - snippsat - Apr-16-2021

(Apr-16-2021, 09:18 AM)lillo123 Wrote: Also, I have a lot of ul tags (about 48). Within these tags, how can I split the page, starting a new block each time at the value "Parti del fascicolo"?
"Parti del fascicolo" is the only way to tell that a new block is about to begin.
You can search for the tags that contain the text, and then go up to the parent ul tag of each match.
That ul will then hold all the li tags of the block that contains the search word.
Example:
import requests
from bs4 import BeautifulSoup

url = 'https://mob.processotelematico.giustizia.it/proxy/index_mobile.php?version=1.1.11&platform=Android%208.0.0&uuid=137cd993b81df224&devicename=SM-G955F&token=c0ba723983c804d8eef1c9ee74cfcb99&azione=direttarg_siecic_mobile&tipoufficio=1&registro=PC&idufficio=0580910098&aaproc=2018&numproc=1&'
response = requests.get(url)
soup = BeautifulSoup(response.content, 'lxml')
result = soup.select('li:-soup-contains("Parti fascicolo")')
Output:
>>> result
[<li data-role="list-divider">Parti fascicolo</li>,
 <li data-role="list-divider">Parti fascicolo</li>,
 <li data-role="list-divider">Parti fascicolo</li>,
 <li data-role="list-divider">Parti fascicolo</li>,
 <li data-role="list-divider">Parti fascicolo</li>,
 <li data-role="list-divider">Parti fascicolo</li>]
At this stage it has found all the "Parti fascicolo" tags.
So now we need to go up to the parent ul, which holds the block of tags related to "Parti fascicolo".
Output:
>>> result[0].find_parent('ul')
<ul data-dividertheme="e" data-inset="true" data-role="listview"><li data-role="list-divider">Parti fascicolo</li><li>L**** *****<i> (Debitore)</i><br/>Avv. P****** F*****</li></ul>
>>>
>>> result[2].find_parent('ul')
<ul data-dividertheme="e" data-inset="true" data-role="listview"><li data-role="list-divider">Parti fascicolo</li><li>M**** G****<i> (Creditore)</i></li><li>B**** S****<i> (Curatore)</i></li><li>E**** *****<i> (Debitore)</i></li><li>C**** S****<i> (Creditore)</i></li><li>C**** A****<i> (Creditore)</i></li><li>D**** G****<i> (Creditore)</i></li><li>D**** P****<i> (Creditore)</i></li><li>D**** M****<i> (Creditore)</i></li><li>D**** C****<i> (Creditore)</i></li><li>F**** F****<i> (Creditore)</i></li><li>F**** O****<i> (Creditore)</i></li><li>I**** C****<i> (Creditore)</i></li><li>L**** E****<i> (Creditore)</i></li><li>A**** *****<i> (Creditore)</i></li><li>C**** *****<i> (Creditore)</i></li><li>A**** C****<i> (Creditore)</i></li><li>A**** C****<i> (Creditore)</i></li><li>A**** A****<i> (Creditore)</i></li><li>A**** G****<i> (Creditore)</i></li><li>B**** S****<i> (Creditore)</i></li><li>B**** E****<i> (Creditore)</i></li><li>C**** V****<i> (Creditore)</i></li><li>C**** M****<i> (Creditore)</i></li><li>L**** M****<i> (Creditore)</i></li><li>L**** R****<i> (Creditore)</i></li><li>L**** M****<i> (Creditore)</i></li><li>R**** L****<i> (Creditore)</i></li><li>R**** C****<i> (Creditore)</i></li><li>V**** D****<i> (Creditore)</i></li><li>V**** E****<i> (Creditore)</i></li><li>A**** F****<i> (Creditore)</i></li><li>A**** F****<i> (Creditore)</i></li><li>A**** I****<i> (Creditore)</i></li><li>B**** G****<i> (Creditore)</i></li><li>C**** G****<i> (Creditore)</i></li><li>C**** M****<i> (Creditore)</i></li><li>C**** M****<i> (Creditore)</i></li><li>D**** M****<i> (Creditore)</i></li><li>D**** V****<i> (Creditore)</i></li><li>F**** R****<i> (Creditore)</i></li><li>G**** R****<i> (Creditore)</i></li><li>G**** F****<i> (Creditore)</i></li><li>G**** M****<i> (Creditore)</i></li><li>M**** G****<i> (Creditore)</i></li><li>P**** F****<i> (Creditore)</i></li><li>P**** S****<i> (Creditore)</i></li><li>R**** M****<i> (Creditore)</i></li><li>T**** C****<i> (Creditore)</i></li><li>D**** D****<i> (Creditore)</i></li><li>G**** V****<i> (Creditore)</i></li><li>V**** M****<i> (Creditore)</i></li><li>Z**** A****<i> (Creditore)</i></li><li>M**** F****<i> (Creditore)</i></li><li>M**** M****<i> (Creditore)</i></li><li>M**** A****<i> (Creditore)</i></li><li>M**** D****<i> (Creditore)</i></li><li>M**** G****<i> (Creditore)</i></li><li>M**** G****<i> (Creditore)</i></li><li>O**** D****<i> (Creditore)</i></li><li>P**** G****<i> (Creditore)</i></li><li>R**** E****<i> (Creditore)</i></li><li>R**** F****<i> (Creditore)</i></li><li>R**** R****<i> (Creditore)</i></li><li>T**** C****<i> (Creditore)</i></li><li>M**** C****<i> (Creditore)</i></li><li>B**** D****<i> (Creditore)</i></li><li>O**** V****<i> (Creditore)</i></li></ul>
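
Since the number of "Parti fascicolo" blocks is not known in advance, the same lookup can run in a loop over every match instead of indexing result[0], result[2] by hand. A minimal sketch on shortened, made-up HTML: the party names and the variable names dividers and parties_per_block are placeholders, and :-soup-contains needs a reasonably recent soupsieve.

```python
from bs4 import BeautifulSoup

# Hypothetical, shortened stand-in for the real page (names are made up).
html = """
<ul><li data-role="list-divider">Parti fascicolo</li><li>Party A<i> (Debitore)</i></li></ul>
<ul><li data-role="list-divider">Oggetto</li><li>ACCORDI DI RISTRUTTURAZIONE</li></ul>
<ul><li data-role="list-divider">Parti fascicolo</li><li>Party B<i> (Creditore)</i></li><li>Party C<i> (Curatore)</i></li></ul>
"""
soup = BeautifulSoup(html, "html.parser")

# One match per block, however many blocks the page happens to have.
dividers = soup.select('li:-soup-contains("Parti fascicolo")')

parties_per_block = []
for divider in dividers:
    ul = divider.find_parent("ul")
    # Every <li> in this <ul> except the divider itself is one party.
    parties = [
        li.get_text(" ", strip=True)
        for li in ul.find_all("li")
        if li is not divider
    ]
    parties_per_block.append(parties)

print(len(parties_per_block), "blocks found")
for parties in parties_per_block:
    print(parties)
```

The length of dividers gives the number of cases on the page, so nothing has to be hard-coded.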



RE: Beautifull Soap. Split page using a value and not a tag. - lillo123 - Apr-21-2021

Hi,
the problem is that the command "result[0].find_parent('ul')" only gives me the parent of the first "Parti del fascicolo":
Output:
>>> result[0].find_parent('ul')
<ul data-dividertheme="e" data-inset="true" data-role="listview"><li data-role="list-divider">Parti fascicolo</li><li>L**** *****<i> (Debitore)</i><br/>Avv. P****** F*****</li></ul>
But how can I get the other ul tags? I mean all the <ul> tags that come before the next "Parti fascicolo", like:

Output:
<ul data-inset="true" data-role="listview" data-dividertheme="e">
<li data-role="list-divider">Ruolo Generale</li>
<li>N. <NumeroRuolo>1</NumeroRuolo>/<AnnoRuolo>2018</AnnoRuolo></li>
<li>Registro: <Registro style="display:none;">PC</Registro>Procedure Concorsuali</li>
<li>Ufficio: <descUfficio>Tribunale Ordinario di Roma</descUfficio><IdUfficio style="display:none;">0580910098</IdUfficio></li>
<li>iscritto al ruolo il 20/02/2018</li>
</ul>
or this one:

Output:
<ul data-inset="true" data-role="listview" data-dividertheme="e">
<li data-role="list-divider">Oggetto</li>
<li>ACCORDI DI RISTRUTTURAZIONE</li>
</ul>
Thank you.
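
One possible way to collect every <ul> that belongs to a block is to walk over all <ul> tags in document order and start a new group each time a <ul> opens with the "Parti fascicolo" divider. This is only a sketch on made-up HTML: it assumes the <ul> tags are not nested and that the divider is always the first <li> of its <ul>; MARKER, starts_block and blocks are hypothetical names.

```python
from bs4 import BeautifulSoup

# Hypothetical, shortened stand-in for the real page.
html = """
<ul><li data-role="list-divider">Parti fascicolo</li><li>Party A</li></ul>
<ul><li data-role="list-divider">Ruolo Generale</li><li>N. 1/2018</li></ul>
<ul><li data-role="list-divider">Oggetto</li><li>ACCORDI DI RISTRUTTURAZIONE</li></ul>
<ul><li data-role="list-divider">Parti fascicolo</li><li>Party B</li></ul>
<ul><li data-role="list-divider">Oggetto</li><li>Something else</li></ul>
"""
soup = BeautifulSoup(html, "html.parser")

MARKER = "Parti fascicolo"


def starts_block(ul):
    """Return True if the first <li> of this <ul> is the marker divider."""
    first_li = ul.find("li")
    return first_li is not None and first_li.get_text(strip=True) == MARKER


blocks = []
for ul in soup.find_all("ul"):
    if starts_block(ul):
        blocks.append([ul])      # a new case begins here
    elif blocks:
        blocks[-1].append(ul)    # belongs to the case opened last
    # <ul> tags before the first marker are ignored

for block in blocks:
    print([ul.li.get_text(strip=True) for ul in block])
# ['Parti fascicolo', 'Ruolo Generale', 'Oggetto']
# ['Parti fascicolo', 'Oggetto']
```

Each entry of blocks is then a list of <ul> tags for one case, which can be processed the same way as the single-case page.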