Jun-07-2019, 04:53 PM
I'm trying to scrape data from the UN sanctions list website.
import requests
from bs4 import BeautifulSoup

r = requests.get("https://scsanctions.un.org/r/", headers={'User-agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:61.0) Gecko/20100101 Firefox/61.0'})
c = r.content
soup = BeautifulSoup(c, "html.parser")
#print(soup.prettify())
all = soup.find_all("tr", {"class": "rowtext"})

This is how the HTML appears for one particular section:
print(all[2])
Output:
Out[40]:
<tr class="rowtext"><td>
<strong>CDi.003 </strong><strong>Name: </strong>1: GASTON 2: IYAMUREMYE 3: na 4: na<br/><span><strong> Title: </strong>na<strong> Designation: </strong><strong> a) </strong>FDLR Interim President<strong> b) </strong>FDLR-FOCA 1st Vice-President<strong> c) </strong>FDLR-FOCA Major General<strong> DOB: </strong>1948<strong> POB: </strong><strong> a) </strong>Musanze District, Northern Province, Rwanda <strong> b) </strong>Ruhengeri, Rwanda <strong> Good quality a.k.a.: </strong><strong> a) </strong>Byiringiro Victor Rumuli<strong> b) </strong>Victor Rumuri<strong> c) </strong>Michel Byiringiro<strong> Low quality a.k.a.: </strong>Rumuli<strong> Nationality: </strong>Rwanda<strong> Passport no: </strong>na<strong> National identification no: </strong>na<strong> Address: </strong>North Kivu Province, Democratic Republic of the Congo (as of June 2016) <strong> Listed on: </strong>1 Dec. 2010
(
amended on 13 Oct. 2016
)
<strong> Other information: </strong> INTERPOL-UN Security Council Special Notice web link: https://www.interpol.int/en/How-we-work/Notices/View-UN-Notices-Individuals<span class="emptyspace"> </span><a href="https://www.interpol.int/en/How-we-work/Notices/View-UN-Notices-Individuals">click here</a></span>
</td></tr>
The " Designation: " and " POB: " labels are each followed by multiple values in the HTML:
" Designation: "
<strong> a) </strong>
FDLR Interim President
FDLR-FOCA 1st Vice-President
FDLR-FOCA Major General
" POB: "
<strong> a) </strong>
Musanze District, Northern Province, Rwanda
Ruhengeri, Rwanda
Michel Byiringiro
However, my code only gets the <strong> a) </strong> tag instead of the values:
designation = all[2].find("strong", text=" Designation: ").next_sibling
print(designation)
Out[42]: <strong> a) </strong>

pob = all[2].find("strong", text=" POB: ").next_sibling
print(pob)
Out[44]: <strong> a) </strong>

I want to get these multiple values as a list.
Output:
Expected_designation
Out[49]:
['FDLR Interim President',
'FDLR-FOCA 1st Vice-President',
'FDLR-FOCA Major General']
Expected_pob
Out[50]:
['Musanze District, Northern Province',
'Rwanda,Ruhengeri, Rwanda',
'Michel Byiringiro']
I'd appreciate it if someone could help me get this done.
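One possible approach, offered as a rough sketch rather than a definitive answer: `.next_sibling` returns only the single node right after the label, which here is the `<strong> a) </strong>` marker. Walking `.next_siblings` instead, skipping the `a)`, `b)` markers and stopping at the next field label, collects every value. The helper name `field_values`, the `SUB_LABEL` pattern, and the trimmed `HTML` string (copied from the row shown above so the example runs offline) are assumptions made for illustration. Note that in the HTML shown, "Michel Byiringiro" sits under " Good quality a.k.a.: ", so this sketch does not return it for " POB: ".

```python
import re
from bs4 import BeautifulSoup
from bs4.element import Tag

# Trimmed copy of the row from the question, so the sketch runs without a request.
HTML = (
    '<tr class="rowtext"><td>'
    '<strong>CDi.003 </strong><strong>Name: </strong>1: GASTON 2: IYAMUREMYE 3: na 4: na<br/>'
    '<span><strong> Title: </strong>na'
    '<strong> Designation: </strong><strong> a) </strong>FDLR Interim President'
    '<strong> b) </strong>FDLR-FOCA 1st Vice-President'
    '<strong> c) </strong>FDLR-FOCA Major General'
    '<strong> DOB: </strong>1948'
    '<strong> POB: </strong><strong> a) </strong>Musanze District, Northern Province, Rwanda '
    '<strong> b) </strong>Ruhengeri, Rwanda '
    '<strong> Good quality a.k.a.: </strong><strong> a) </strong>Byiringiro Victor Rumuli'
    '</span></td></tr>'
)

# Assumption: sub-item markers always look like " a) ", " b) ", and so on.
SUB_LABEL = re.compile(r"^\s*[a-z]\)\s*$")


def field_values(row, label):
    """Collect the text nodes that follow `label` up to the next field label."""
    # Find the <strong> whose stripped text equals the label, e.g. "POB:".
    start = next(
        (s for s in row.find_all("strong") if s.get_text().strip() == label),
        None,
    )
    if start is None:
        return []
    values = []
    for node in start.next_siblings:
        if isinstance(node, Tag):
            if node.name == "strong" and not SUB_LABEL.match(node.get_text()):
                break  # reached the next field label, e.g. " DOB: "
            continue  # skip " a) " markers and any other tags
        text = str(node).strip()
        if text:
            values.append(text)
    return values


row = BeautifulSoup(HTML, "html.parser").find("tr", {"class": "rowtext"})
designation = field_values(row, "Designation:")
pob = field_values(row, "POB:")
print(designation)
print(pob)
```

With the live page, the same helper could presumably be called as `field_values(all[2], "Designation:")`, assuming the row markup matches what is shown above.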