Thread: web scraping to csv formatting problems (/thread-19511.html)
Python Forum (https://python-forum.io), Forum: Web Scraping & Web Development (https://python-forum.io/forum-13.html)

web scraping to csv formatting problems - bluethundr - Jul-02-2019

Hello,

I am trying to scrape a web page and send the result to CSV. I am able to get the content I want into the CSV. However, the content is being repeated down the page, and the unique info is written across the page instead of down the page under the headers.

This is the result I'm getting: CSV Output

The CSV should list the accounts one per line, going down and not across as in this example.

This is the original wiki page that I'm scraping (had to block out company info): Original Wiki Page

This is the code I am using:

    import csv
    import os
    import requests
    from requests import get
    from requests.exceptions import RequestException
    from contextlib import closing
    from bs4 import BeautifulSoup

    output_dir = os.path.join('..', 'output_files', 'aws_accounts_list')
    source = 'aws_wiki_page'
    destination = os.path.join(output_dir, source + '.csv')
    url = 'https://wiki.us.cworld.company.com/display/6TO/AWS+Accounts'
    page = requests.get(url, auth=('me', 'secret'))
    headers = ['Company Account Name', 'AWS Account Name', 'Description', 'LOB', 'AWS Account Number', 'Connected to Homebase', 'Peninsula or Island', 'URL', 'Owner', 'Engagement Code', 'CloudOps Access Type']
    soup = BeautifulSoup(page.text, 'lxml')
    rows = []
    for tr in soup.select('tr'):
        rows.append([td.text for td in soup.select('td')])

    with open(destination, 'w+', newline='') as csvfile:
        writer = csv.writer(csvfile, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)
        writer.writerow(headers)
        for row in rows:
            writer.writerow(row)
            print(row)

What am I doing wrong?

RE: web scraping to csv formatting problems - Larz60+ - Jul-03-2019

The URL posted doesn't appear to be valid

RE: web scraping to csv formatting problems - perfringo - Jul-03-2019

I have no understanding of the specifics of this task. It seems to me that an HTML table needs to be scraped, and in this case I would skip the low-level coding and let pandas handle that.

Something along those lines:

    >>> import pandas as pd
    >>> df = pd.read_html("https://en.wikipedia.org/wiki/Comparison_of_programming_languages")
    >>> df[1].to_csv('comparison_table.csv')

This code grabs the second table (135 rows and 11 columns) on the webpage https://en.wikipedia.org/wiki/Comparison_of_programming_languages and writes it to comparison_table.csv in the present working directory.

For further information you can look at the forum thread pandas library tricks

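
For a page that cannot be reached by URL alone (for example one behind a login, as in the original post), the same idea can be tried on HTML text that has already been downloaded. The following is only a minimal sketch: it assumes pandas and an HTML parser such as lxml are installed, and `SAMPLE_HTML` with its account names is invented sample data, not the real wiki page. The text is wrapped in `io.StringIO` because recent pandas versions warn when literal HTML is passed as a plain string.

```python
import io

import pandas as pd

# Hypothetical stand-in for page.text from an authenticated requests.get() call.
SAMPLE_HTML = """
<table>
  <tr><th>AWS Account Name</th><th>LOB</th></tr>
  <tr><td>dev-sandbox</td><td>Engineering</td></tr>
  <tr><td>prod-main</td><td>Finance</td></tr>
</table>
"""

# read_html returns a list with one DataFrame per <table> found in the HTML.
tables = pd.read_html(io.StringIO(SAMPLE_HTML))
df = tables[0]

# index=False leaves out the numeric row index, so the CSV has only the table columns.
csv_text = df.to_csv(index=False)
print(csv_text)
```

To write a file instead of a string, a path can be passed as the first argument, as in `df.to_csv('comparison_table.csv', index=False)`.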
RE: web scraping to csv formatting problems - bluethundr - Jul-03-2019

(Jul-03-2019, 02:32 AM)Larz60+ Wrote: The URL posted doesn't appear to be valid

Yes, I am aware. Perhaps I should have clarified in the OP. This is a company URL, and I am not allowed to post specifics of company information in a public forum. Also, the real link, even if I could post it, would not work off of our network.

RE: web scraping to csv formatting problems - Larz60+ - Jul-04-2019

It's mighty difficult to give advice without looking at the page. The usual layout for a table is to have multiple tr's and multiple td's within each tr. Here's an example of this on a simple page with only one table:

    table = soup.find('table', {'summary': 'This table displays Connecticut towns and the year of their establishment.'})
    trs = table.tbody.find_all('tr')
    for n, tr in enumerate(trs):
        for n1, td in enumerate(self.get_td(tr)):
            print(f'==================================== tr {n}, td: {n1} ====================================')
            print(f'{self.pp.prettify(td, 2)}')

This will give you a layout of the page and make it easier to determine how to proceed. The prettify method is in the module PrettifyPage.py, which is a modified version of BeautifulSoup's prettify that allows changing the indent size:

    from bs4 import BeautifulSoup
    import requests
    import pathlib


    class PrettifyPage:
        def __init__(self):
            pass

        def prettify(self, soup, indent):
            pretty_soup = str()
            previous_indent = 0
            for line in soup.prettify().split("\n"):
                current_indent = str(line).find("<")
                if current_indent == -1 or current_indent > previous_indent + 2:
                    current_indent = previous_indent + 1
                previous_indent = current_indent
                pretty_soup += self.write_new_line(line, current_indent, indent)
            return pretty_soup

        def write_new_line(self, line, current_indent, desired_indent):
            new_line = ""
            spaces_to_add = (current_indent * desired_indent) - current_indent
            if spaces_to_add > 0:
                for i in range(spaces_to_add):
                    new_line += " "
            new_line += str(line) + "\n"
            return new_line
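
Given that layout (several tr's, each holding its own td's), one likely cause of the output in the original post is the inner loop: it calls `soup.select('td')`, which collects every cell on the page for every row, so each CSV line holds all cells across and the same line repeats going down. Below is a minimal sketch of selecting the cells from each `tr` instead. It assumes BeautifulSoup is installed and uses the built-in `html.parser`; `SAMPLE_HTML`, `table_rows`, `rows_to_csv` and the account names are hypothetical and only stand in for the real wiki page.

```python
import csv
import io

from bs4 import BeautifulSoup

# Hypothetical sample standing in for the real (private) wiki page.
SAMPLE_HTML = """
<table>
  <tr><th>AWS Account Name</th><th>LOB</th></tr>
  <tr><td>dev-sandbox</td><td>Engineering</td></tr>
  <tr><td>prod-main</td><td>Finance</td></tr>
</table>
"""


def table_rows(html):
    """Return one list of cell texts per table row that contains <td> cells."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for tr in soup.select("tr"):
        # Select from tr, not soup, so only the cells of this row are collected.
        cells = [td.get_text(strip=True) for td in tr.select("td")]
        if cells:  # the header row has only <th> cells, so it is skipped here
            rows.append(cells)
    return rows


def rows_to_csv(headers, rows):
    """Write the headers and rows to an in-memory CSV and return it as text."""
    buffer = io.StringIO(newline="")
    writer = csv.writer(buffer)
    writer.writerow(headers)
    writer.writerows(rows)
    return buffer.getvalue()


rows = table_rows(SAMPLE_HTML)
csv_text = rows_to_csv(["AWS Account Name", "LOB"], rows)
print(csv_text)
```

In the original script the equivalent change would be `tr.select('td')` in place of `soup.select('td')` inside the `for tr in soup.select('tr')` loop. If the page holds more than one table, narrowing the selection to the right table first, as in the `soup.find('table', ...)` example above, would keep unrelated rows out of the CSV.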