Web scraping using bs4 - Printable Version

Web scraping using bs4 - klllmmm - Jun-07-2019

I'm trying to scrape data from the UN sanctions list website.

import requests
from bs4 import BeautifulSoup
 
r = requests.get("https://scsanctions.un.org/r/", headers={'User-agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:61.0) Gecko/20100101 Firefox/61.0'})

c = r.content
soup = BeautifulSoup(c, "html.parser")
#print(soup.prettify())
all = soup.find_all("tr", {"class": "rowtext"})

This is how the HTML appears for one particular entry.
print(all[2])

Output:
Out[40]: 
<tr class="rowtext"><td>
<strong>CDi.003 </strong><strong>Name: </strong>1: GASTON 2: IYAMUREMYE 3: na 4: na<br/><span><strong> Title: </strong>na<strong> Designation: </strong><strong> a) </strong>FDLR Interim President<strong> b) </strong>FDLR-FOCA 1st Vice-President<strong> c) </strong>FDLR-FOCA Major General<strong> DOB: </strong>1948<strong> POB: </strong><strong> a) </strong>Musanze District, Northern Province, Rwanda <strong> b) </strong>Ruhengeri, Rwanda <strong> Good quality a.k.a.: </strong><strong> a) </strong>Byiringiro Victor Rumuli<strong> b) </strong>Victor Rumuri<strong> c) </strong>Michel Byiringiro<strong> Low quality a.k.a.: </strong>Rumuli<strong> Nationality: </strong>Rwanda<strong> Passport no: </strong>na<strong> National identification no: </strong>na<strong> Address: </strong>North Kivu Province, Democratic Republic of the Congo (as of June 2016) <strong> Listed on: </strong>1 Dec. 2010 
						(
						amended on 13 Oct. 2016
						) 
					<strong> Other information: </strong> INTERPOL-UN Security Council Special Notice web link: https://www.interpol.int/en/How-we-work/Notices/View-UN-Notices-Individuals<span class="emptyspace"> </span><a href="https://www.interpol.int/en/How-we-work/Notices/View-UN-Notices-Individuals">click here</a></span>
</td></tr>

For the " Designation: " and " POB: " labels, there are multiple values in the HTML.

" Designation: "
<strong> a) </strong>
FDLR Interim President
FDLR-FOCA 1st Vice-President
FDLR-FOCA Major General

" POB: "
<strong> a) </strong>
Musanze District, Northern Province, Rwanda
Ruhengeri, Rwanda

However, my code only returns the <strong> a) </strong> marker tag, not the values that follow it.

designation = all[2].find("strong", text=" Designation: ").next_sibling
print(designation)
Out[42]: <strong> a) </strong>

pob = all[2].find("strong", text=" POB: ").next_sibling
print(pob)
Out[44]: <strong> a) </strong>

I want to get these multiple values as a list.

Output:
Expected_designation
Out[49]: 
['FDLR Interim President',
 'FDLR-FOCA 1st Vice-President',
 'FDLR-FOCA Major General']

Expected_pob
Out[50]: 
['Musanze District, Northern Province, Rwanda',
 'Ruhengeri, Rwanda']

I'd appreciate it if someone could help me get this done.
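
For reference, .next_sibling returns only the single node right after the label, which in this row is the <strong> a) </strong> marker itself. One possible approach is to walk .next_siblings instead, skip the "a)", "b)" markers, and stop at the next field label. This is a minimal sketch, assuming every row follows the layout shown above; the helper name field_values, the MARKER pattern, and the trimmed HTML sample are illustrative and not part of the original code.

```python
import re

from bs4 import BeautifulSoup
from bs4.element import Tag

# Trimmed copy of the row shown above (assumption: real rows share this layout).
HTML = (
    '<tr class="rowtext"><td>'
    '<strong>CDi.003 </strong><strong>Name: </strong>'
    '1: GASTON 2: IYAMUREMYE 3: na 4: na<br/><span>'
    '<strong> Title: </strong>na'
    '<strong> Designation: </strong>'
    '<strong> a) </strong>FDLR Interim President'
    '<strong> b) </strong>FDLR-FOCA 1st Vice-President'
    '<strong> c) </strong>FDLR-FOCA Major General'
    '<strong> DOB: </strong>1948'
    '<strong> POB: </strong>'
    '<strong> a) </strong>Musanze District, Northern Province, Rwanda '
    '<strong> b) </strong>Ruhengeri, Rwanda '
    '<strong> Good quality a.k.a.: </strong>'
    '<strong> a) </strong>Byiringiro Victor Rumuli'
    '</span></td></tr>'
)

# Sub-item markers look like "a)", "b)", ...; any other <strong> is a field label.
MARKER = re.compile(r"^[a-z]\)$")


def field_values(row, label):
    """Return the text values that follow the <strong> tag whose text is `label`."""
    start = None
    for tag in row.find_all("strong"):
        if tag.get_text(strip=True) == label:
            start = tag
            break
    if start is None:
        return []

    values = []
    for node in start.next_siblings:
        if isinstance(node, Tag):
            if node.name == "strong" and not MARKER.match(node.get_text(strip=True)):
                break  # reached the next field label
            continue  # skip "a)" markers and any other tags
        text = str(node).strip()
        if text:
            values.append(text)
    return values


row = BeautifulSoup(HTML, "html.parser").find("tr", class_="rowtext")
designation = field_values(row, "Designation:")
pob = field_values(row, "POB:")
print(designation)
print(pob)
```

A field with a single value, such as DOB, comes back as a one-element list, so every field can be handled the same way.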

RE: Web scraping using bs4 - Larz60+ - Jun-07-2019

Here's some base code you can use.
It fetches the page, caches it so you don't have to download it on every run, and extracts the table containing the text
and links in td blocks.
You can use this as a starting point.
Requires Python 3.6 or newer.

import requests
from bs4 import BeautifulSoup
import PrettifyPage
from pathlib import Path
import os
import sys


class GetTitles:
    def __init__(self):
        # anchor save directory
        os.chdir(os.path.abspath(os.path.dirname(__file__)))
        self.pp = PrettifyPage.PrettifyPage()
        self.url = 'https://scsanctions.un.org/r/'

        homepath = Path('.')
        self.cachefile = homepath / 'unscsanctions.html'
        self.prettyfile = homepath / 'unscsanctions_pretty.html'
        self.get_titles()
    
    def get_titles(self):
        pp = self.pp.prettify
        # Fetch file with external cache
        if self.cachefile.exists():
            with self.cachefile.open('rb') as fp:
                page = fp.read()
        else:
            response = requests.get(self.url)
            if response.status_code == 200:
                page = response.content
                with self.cachefile.open('wb') as fp:
                    fp.write(page)
            else:
                print(f'Problem fetching page: {self.url}')
                sys.exit(-1)
        
        soup = BeautifulSoup(page, 'lxml')
        # Create a prettyfile so you can look at it easier than raw data
        if not self.prettyfile.exists():
            with self.prettyfile.open('w') as fp:
                fp.write(pp(soup, 2))

        table = soup.select('table.display:nth-child(11)')[0]
        trs = table.tbody.find_all('tr')
        for n, tr in enumerate(trs):
            tds = tr.find_all('td')
            for n1, td in enumerate(tds):
                print(f'\n========================= tr_{n}, td{n1} =========================')
                print(f'{pp(td, 2)}')
                td_text = td.text.strip()
                print(f'\n========================= contents =========================')
                print(f'\ntd_text: {td_text}\n')
                ll = td.find_all('a')
                for link in ll:
                    href = link.get('href')
                    print(f'link: {link}')

if __name__ == '__main__':
    GetTitles()

This also needs the following module (name it PrettifyPage.py and keep it in the same directory as the script above):

# PrettifyPage.py

from bs4 import BeautifulSoup
import requests
import pathlib


class PrettifyPage:
    def __init__(self):
        pass

    def prettify(self, soup, indent):
        pretty_soup = str()
        previous_indent = 0
        for line in soup.prettify().split("\n"):
            current_indent = str(line).find("<")
            if current_indent == -1 or current_indent > previous_indent + 2:
                current_indent = previous_indent + 1
            previous_indent = current_indent
            pretty_soup += self.write_new_line(line, current_indent, indent)
        return pretty_soup

    def write_new_line(self, line, current_indent, desired_indent):
        new_line = ""
        spaces_to_add = (current_indent * desired_indent) - current_indent
        if spaces_to_add > 0:
            new_line += " " * spaces_to_add
        new_line += str(line) + "\n"
        return new_line

if __name__ == '__main__':
    pp = PrettifyPage()

Sample output for the first td:

Output:
========================= tr_0, td0 =========================
<td>
  <strong>
    CDi.001
  </strong>
  <strong>
    Name:
  </strong>
   1: ERIC 2: BADEGE 3: na 4: na
  <br/>
  <span>
    <strong>
      Title:
    </strong>
     na
    <strong>
      Designation:
    </strong>
     na
    <strong>
      DOB:
    </strong>
     1971
    <strong>
      POB:
    </strong>
     na
    <strong>
      Good quality a.k.a.:
    </strong>
     na
    <strong>
      Low quality a.k.a.:
    </strong>
     na
    <strong>
      Nationality:
    </strong>
     Democratic Republic of the Congo
    <strong>
      Passport no:
    </strong>
     na
    <strong>
      National identification no:
    </strong>
     na
    <strong>
      Address:
    </strong>
     Rwanda (as of early 2016)
    <strong>
      Listed on:
    </strong>
     31 Dec. 2012 
                                                (
                                                amended on 13 Oct. 2016
                                                )
    <strong>
      Other information:
    </strong>
     He fled to Rwanda in March 2013 and is still living there as of early 2016. INTERPOL-UN Security Council Special Notice web link: https://www.interpol.int/en/How-we-work/Notices/View-UN-Notices-Individuals
    <span class="emptyspace">
    </span>
    <a href="https://www.interpol.int/en/How-we-work/Notices/View-UN-Notices-Individuals">
      click here
    </a>
  </span>
</td>


========================= contents =========================

td_text: CDi.001 Name: 1: ERIC 2: BADEGE 3: na 4: na Title: na Designation: na DOB: 1971 POB: na Good quality a.k.a.: na Low quality a.k.a.: na Nationality: Democratic Republic of the Congo Passport no: na National identification no: na Address: Rwanda (as of early 2016)  Listed on: 31 Dec. 2012 
                                                (
                                                amended on 13 Oct. 2016
                                                ) 
                                         Other information: He fled to Rwanda in March 2013 and is still living there as of early 2016. INTERPOL-UN Security Council Special Notice web link: https://www.interpol.int/en/How-we-work/Notices/View-UN-Notices-Individuals click here

link: <a href="https://www.interpol.int/en/How-we-work/Notices/View-UN-Notices-Individuals">click here</a>

RE: Web scraping using bs4 - klllmmm - Jun-08-2019

Thanks for the reply.

I'm using Python 3.6.7.

I tried the code you posted and got the following error.

Error:
Traceback (most recent call last):
  File ".\unscsanctions_bs4.py", line 59, in <module>
    GetTitles()
  File ".\unscsanctions_bs4.py", line 19, in __init__
    self.get_titles()
  File ".\unscsanctions_bs4.py", line 41, in get_titles
    fp.write(pp(soup, 2))
  File "D:\Anaconda3\lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode characters in position 170290-170293: character maps to <undefined>

Please note that my main goal is to get the data into a pandas DataFrame. The issue I'm facing is how to get the text into a single field when a label such as Designation or POB has multiple values.

E.g., extracts from the third entry:

Output:
<strong> Designation: </strong><strong> a) </strong>FDLR Interim President<strong> b) </strong>FDLR-FOCA 1st Vice-President<strong> c) </strong>FDLR-FOCA Major General<strong>

Output:
<strong> POB: </strong><strong> a) </strong>Musanze District, Northern Province, Rwanda <strong> b) </strong>Ruhengeri, Rwanda <strong>

I'd appreciate any input on this.
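
One way to get each entry into a single table row is to turn every tr into a dict that maps a label to its values joined into one string. This is only a sketch under the assumption that the labelled fields all sit inside the first <span> of each row, as in the extracts above; row_to_record, MARKER, the "; " separator, and the two-row HTML sample are illustrative names and data, not taken from the original code.

```python
import re

from bs4 import BeautifulSoup
from bs4.element import Tag

# Two shortened rows based on the extracts in this thread (assumed layout).
HTML = (
    '<table>'
    '<tr class="rowtext"><td><strong>CDi.001 </strong><strong>Name: </strong>'
    '1: ERIC 2: BADEGE 3: na 4: na<br/><span>'
    '<strong> Title: </strong>na<strong> Designation: </strong>na'
    '<strong> DOB: </strong>1971<strong> POB: </strong>na</span></td></tr>'
    '<tr class="rowtext"><td><strong>CDi.003 </strong><strong>Name: </strong>'
    '1: GASTON 2: IYAMUREMYE 3: na 4: na<br/><span>'
    '<strong> Title: </strong>na<strong> Designation: </strong>'
    '<strong> a) </strong>FDLR Interim President'
    '<strong> b) </strong>FDLR-FOCA 1st Vice-President'
    '<strong> c) </strong>FDLR-FOCA Major General'
    '<strong> DOB: </strong>1948<strong> POB: </strong>'
    '<strong> a) </strong>Musanze District, Northern Province, Rwanda '
    '<strong> b) </strong>Ruhengeri, Rwanda </span></td></tr>'
    '</table>'
)

# "a)", "b)", ... mark sub-items; every other <strong> starts a new field.
MARKER = re.compile(r"^[a-z]\)$")


def row_to_record(row, sep="; "):
    """Map each field label in a row to its values joined by `sep`."""
    fields = {}
    first_strong = row.find("strong")
    if first_strong is not None:
        fields["ID"] = [first_strong.get_text(strip=True)]

    span = row.find("span")
    label = None
    for node in (span.children if span is not None else []):
        if isinstance(node, Tag):
            if node.name == "strong":
                text = node.get_text(strip=True)
                if not MARKER.match(text):
                    label = text.rstrip(":")  # new field label
                    fields[label] = []
            continue  # markers and other tags carry no value text
        text = " ".join(str(node).split())  # collapse runs of whitespace
        if label is not None and text:
            fields[label].append(text)

    return {key: sep.join(values) for key, values in fields.items()}


soup = BeautifulSoup(HTML, "html.parser")
records = [row_to_record(tr) for tr in soup.find_all("tr", class_="rowtext")]
for record in records:
    print(record)
```

If pandas is installed, passing this list of dicts to pandas.DataFrame(records) should give one column per label and one row per entry.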

RE: Web scraping using bs4 - Larz60+ - Jun-10-2019

Sorry, this is the first time I've checked in since my last post, so I'm only reading your reply now.
It looks like an encoding problem when the file is written; you probably have to specify utf-8 (or whatever codec you're using) when opening it.
I'm not an expert on pandas and haven't used it much, as I do very little reporting (most of my code is system level).
Perhaps someone else will pick up on this.
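
The traceback above points at fp.write(...) on a file opened in text mode, where Windows falls back to cp1252 unless an encoding is given. A minimal sketch of the suggested fix, assuming UTF-8 is acceptable for the saved file; write_text_utf8 and the sample string are made up for illustration.

```python
import tempfile
from pathlib import Path


def write_text_utf8(path, text):
    """Write text with an explicit UTF-8 encoding instead of the platform default."""
    with path.open("w", encoding="utf-8") as fp:
        fp.write(text)


# Hypothetical sample containing characters that cp1252 cannot encode.
sample = "<td>Name (original script): Иван 李</td>\n"

with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "unscsanctions_pretty.html"
    write_text_utf8(target, sample)
    # Read back with the same encoding to confirm nothing was lost.
    round_trip = target.read_text(encoding="utf-8")

print(round_trip == sample)
```

In the script above, the equivalent change would be to pass encoding="utf-8" to self.prettyfile.open('w').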