Web scraping using bs4 - Python Forum (https://python-forum.io), Python Coding, Web Scraping & Web Development, thread /thread-18949.html
Web scraping using bs4 - klllmmm - Jun-07-2019

I'm trying to scrape data from the UN sanctions list website.

    import requests
    from bs4 import BeautifulSoup

    r = requests.get(
        "https://scsanctions.un.org/r/",
        headers={'User-agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:61.0) Gecko/20100101 Firefox/61.0'},
    )
    c = r.content
    soup = BeautifulSoup(c, "html.parser")
    # print(soup.prettify())
    all = soup.find_all("tr", {"class": "rowtext"})

This is how the HTML data appears for one particular section:

    print(all[2])

For the " Designation: " and " POB: " search texts, there are multiple values in the HTML:

    " Designation: " <strong> a) </strong> FDLR Interim President FDLR-FOCA 1st Vice-President FDLR-FOCA Major General
    " POB: " <strong> a) </strong> Musanze District, Northern Province, Rwanda Ruhengeri, Rwanda Michel Byiringiro

However, my code only gets the value a):

    designation = all[2].find("strong", text=" Designation: ").next_sibling
    print(designation)
    Out[42]: <strong> a) </strong>

    pob = all[2].find("strong", text=" POB: ").next_sibling
    print(pob)
    Out[44]: <strong> a) </strong>

I want to get these multiple values as a list. I'd appreciate it if someone could help me get this done.
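One possible way to collect every value after a label, rather than only the first sibling, is to walk `next_siblings` and stop at the next label. This is only a sketch: the HTML fragment below is a hypothetical reconstruction of the markup described in the post (labels and the a), b), c) markers all in `<strong>` tags inside one cell), and `values_after` is an illustrative helper name, not part of bs4. The real page may nest these nodes differently.

```python
from bs4 import BeautifulSoup, NavigableString

# Hypothetical fragment modeled on the post; not copied from the live page.
html = """
<table><tr class="rowtext"><td>
<strong> Designation: </strong><strong> a) </strong> FDLR Interim President
<strong> b) </strong> FDLR-FOCA 1st Vice-President
<strong> c) </strong> FDLR-FOCA Major General
<strong> POB: </strong><strong> a) </strong> Musanze District, Northern Province, Rwanda
<strong> b) </strong> Ruhengeri, Rwanda
</td></tr></table>
"""


def values_after(row, label):
    """Return the text values that follow `label`, up to the next label."""
    # Match the <strong> whose text equals the label, ignoring padding.
    start = row.find("strong", string=lambda s: s and s.strip() == label)
    if start is None:
        return []
    values = []
    for node in start.next_siblings:
        if isinstance(node, NavigableString):
            text = node.strip()
            if text:  # skip whitespace-only nodes
                values.append(text)
        elif node.name == "strong" and node.get_text(strip=True).endswith(":"):
            break  # assumed convention: a <strong> ending in ":" is the next label
    return values


row = BeautifulSoup(html, "html.parser").find("tr", {"class": "rowtext"})
designations = values_after(row, "Designation:")
pobs = values_after(row, "POB:")
print(designations)
print(pobs)
```

Matching on the stripped text avoids depending on the exact spaces around the label, which is one reason an exact `text=" POB: "` lookup can be fragile.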
RE: Web scraping using bs4 - Larz60+ - Jun-07-2019

Here's some base code you can use. It gets the page, caches it so you don't have to download it on each pass, and extracts the table containing the text and links in td blocks. You can use this as a starting point. Requires Python 3.6 or newer.

    import requests
    from bs4 import BeautifulSoup
    import PrettifyPage
    from pathlib import Path
    import os
    import sys


    class GetTitles:
        def __init__(self):
            # anchor save directory
            os.chdir(os.path.abspath(os.path.dirname(__file__)))
            self.pp = PrettifyPage.PrettifyPage()
            self.url = 'https://scsanctions.un.org/r/'
            homepath = Path('.')
            self.cachefile = homepath / 'unscsanctions.html'
            self.prettyfile = homepath / 'unscsanctions_pretty.html'
            self.get_titles()

        def get_titles(self):
            pp = self.pp.prettify

            # Fetch file with external cache
            if self.cachefile.exists():
                with self.cachefile.open('rb') as fp:
                    page = fp.read()
            else:
                response = requests.get(self.url)
                if response.status_code == 200:
                    page = response.content
                    with self.cachefile.open('wb') as fp:
                        fp.write(page)
                else:
                    print(f'Problem fetching page: {self.url}')
                    sys.exit(-1)

            soup = BeautifulSoup(page, 'lxml')

            # Create a prettyfile so you can look at it easier than raw data
            if not self.prettyfile.exists():
                with self.prettyfile.open('w') as fp:
                    fp.write(pp(soup, 2))

            table = soup.select('table.display:nth-child(11)')[0]
            trs = table.tbody.find_all('tr')
            for n, tr in enumerate(trs):
                tds = tr.find_all('td')
                for n1, td in enumerate(tds):
                    print(f'\n========================= tr_{n}, td{n1} =========================')
                    print(f'{pp(td, 2)}')
                    td_text = td.text.strip()
                    print(f'\n========================= contents =========================')
                    print(f'\ntd_text: {td_text}\n')
                    ll = td.find_all('a')
                    for link in ll:
                        href = link.get('href')
                        print(f'link: {link}')


    if __name__ == '__main__':
        GetTitles()

It also needs this module (name it PrettifyPage.py and keep it in the same directory as the script above):

    # PrettifyPage.py
    from bs4 import BeautifulSoup
    import requests
    import pathlib


    class PrettifyPage:
        def __init__(self):
            pass

        def prettify(self, soup, indent):
            pretty_soup = str()
            previous_indent = 0
            for line in soup.prettify().split("\n"):
                current_indent = str(line).find("<")
                if current_indent == -1 or current_indent > previous_indent + 2:
                    current_indent = previous_indent + 1
                previous_indent = current_indent
                pretty_soup += self.write_new_line(line, current_indent, indent)
            return pretty_soup

        def write_new_line(self, line, current_indent, desired_indent):
            new_line = ""
            spaces_to_add = (current_indent * desired_indent) - current_indent
            if spaces_to_add > 0:
                for i in range(spaces_to_add):
                    new_line += " "
            new_line += str(line) + "\n"
            return new_line


    if __name__ == '__main__':
        pp = PrettifyPage()

Sample output, 1st td:
RE: Web scraping using bs4 - klllmmm - Jun-08-2019

Thanks for the reply. I'm using Python 3.6.7. I tried the code you posted and got the following error:

Please note that my main problem is getting the data into a pandas DataFrame. The issue I face is how to get the texts into one field when there are multiple values for some search tags such as Designation, POB, etc. E.g., extracts from the 3rd ID:
I'd appreciate any input you can give on this.
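One way to put several values into a single field is to join each field's list into one string before building the table. The sketch below assumes the values have already been scraped into lists; the `entries` data, the second placeholder entry, the `flatten` helper, and the "; " separator are all illustrative choices, not something from the thread.

```python
# Hypothetical scraped values: one dict per list entry, one list per field.
entries = [
    {
        "Designation": [
            "FDLR Interim President",
            "FDLR-FOCA 1st Vice-President",
            "FDLR-FOCA Major General",
        ],
        "POB": ["Musanze District, Northern Province, Rwanda", "Ruhengeri, Rwanda"],
    },
    # Placeholder entry showing a field with no values at all.
    {"Designation": ["Example designation"], "POB": []},
]


def flatten(entry, sep="; "):
    """Join each field's list of values into a single string."""
    return {field: sep.join(values) for field, values in entry.items()}


records = [flatten(entry) for entry in entries]
for record in records:
    print(record)
```

A list of dicts shaped like `records` can then be passed to `pandas.DataFrame(records)`, which gives one row per entry and one column per field. Keeping the lists as-is in a column is also possible, but joined strings are usually easier to export and filter.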
RE: Web scraping using bs4 - Larz60+ - Jun-10-2019

Sorry, this is the first time I've checked in since my last post; I'm just reading your post now. It seems to be a problem with the cache encoding; you probably have to specify utf-8 or whatever codec you're using. I'm not an expert on pandas and haven't used it much, as I do very little reporting (most of my code is system level). Perhaps someone else will pick up on this.
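As a rough illustration of the utf-8 suggestion: the traceback is not shown in the thread, so this assumes the failure came from writing text with the platform's default codec, which can reject non-ASCII characters. Passing `encoding` explicitly to `Path.open` avoids relying on that default. The sample text and the temporary directory are stand-ins for the real page and working directory.

```python
from pathlib import Path
import tempfile

# Stand-in for prettified page text containing non-ASCII characters.
text = "<td>République démocratique du Congo</td>\n"

with tempfile.TemporaryDirectory() as tmp:
    prettyfile = Path(tmp) / "unscsanctions_pretty.html"

    # Name the codec explicitly instead of using the platform default.
    with prettyfile.open("w", encoding="utf-8") as fp:
        fp.write(text)

    # Read it back with the same codec to confirm nothing was lost.
    with prettyfile.open("r", encoding="utf-8") as fp:
        round_trip = fp.read()

print(round_trip == text)
```

In the base code above, the equivalent change would be adding `encoding="utf-8"` to the `self.prettyfile.open('w')` call; the binary cache file (`'rb'`/`'wb'`) does not take an encoding.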