Python Forum
web scraping to csv formatting problems
#1
Hello,

I am trying to scrape a web page and send the result to CSV. I am able to get the content I want in the CSV. However, each row's content is repeated down the page, and the unique values run across the page instead of down it under the headers.

This is the result I'm getting: CSV Output

The CSV should list the accounts one per line, going down and not across as in this example. This is the original wiki page that I'm scraping (had to block out company info): Original Wiki Page

This is the code I am using:
import csv
import os
import requests
from requests import get
from requests.exceptions import RequestException
from contextlib import closing
from bs4 import BeautifulSoup

output_dir = os.path.join( '..', 'output_files', 'aws_accounts_list')
source = 'aws_wiki_page'
destination = os.path.join(output_dir, source + '.csv' )
url = 'https://wiki.us.cworld.company.com/display/6TO/AWS+Accounts'
page = requests.get(url, auth=('me', 'secret'))

headers = ['Company Account Name', 'AWS Account Name', 'Description', 'LOB', 'AWS Account Number', 'Connected to Homebase', 'Peninsula or Island', 'URL', 'Owner', 'Engagement Code', 'CloudOps Access Type']

soup = BeautifulSoup(page.text, 'lxml')

rows = []
for tr in soup.select('tr'):
    rows.append([td.text for td in soup.select('td')])

with open(destination, 'w+', newline='') as csvfile:
    writer = csv.writer(csvfile, delimiter=',',
                            quotechar='"', quoting=csv.QUOTE_MINIMAL)
    writer.writerow(headers)
    for row in rows:
        writer.writerow(row)
        print(row)
What am I doing wrong?
Reply
#2
The URL posted doesn't appear to be valid.
Reply
#3
I don't know the specifics of this task, but it seems an HTML table needs to be scraped. In that case I would skip the low-level coding and let pandas handle it. Something along these lines:

>>> import pandas as pd
>>> df = pd.read_html("https://en.wikipedia.org/wiki/Comparison_of_programming_languages")
>>> df[1].to_csv('comparison_table.csv')
This code grabs the second table (135 rows and 11 columns; the list returned by read_html is zero-indexed) on the webpage https://en.wikipedia.org/wiki/Comparison_of_programming_languages and writes it to comparison_table.csv in the present working directory.

Output:
,Language,Intended use,Imperative,Object-oriented,Functional,Procedural,Generic,Reflective,Event-driven,Other paradigm(s),Standardized?
0,1C:Enterprise,"Application, RAD, business, general, web, mobile",Yes,,Yes,Yes,Yes,Yes,Yes,"Object-based, Prototype-based programming",No
1,ActionScript 3.0,"Application, client-side, web",Yes,Yes,Yes,,,,Yes,,"1996, ECMA"
2,Ada,"Application, embedded, realtime, system",Yes,Yes[2],,Yes[3],Yes[4],,,"concurrent,[5] distributed,[6]","1983, 2005, 2012, ANSI, ISO, GOST 27831-88[7]"
3,Aldor,"Highly domain-specific, symbolic computing",Yes,Yes,Yes,,,,,,No
4,ALGOL 58,Application,Yes,,,,,,,,No
/.../
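A side note on that output: the leading unnamed column is the DataFrame's index, which to_csv writes by default. If you don't want it in the file, pass index=False. A minimal sketch with a small stand-in frame (the column names here are just examples, not from the scraped table):

```python
import pandas as pd

# Stand-in for a scraped table
df = pd.DataFrame({'Language': ['Ada', 'Aldor'], 'Imperative': ['Yes', 'Yes']})

# index=False suppresses the leading unnamed index column
csv_text = df.to_csv(index=False)
print(csv_text)
```

The same keyword works when writing to a file: df[1].to_csv('comparison_table.csv', index=False).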
For further information you can look at forum thread pandas library tricks
Reply
#4
(Jul-03-2019, 02:32 AM)Larz60+ Wrote: The URL posted doesn't appear to be valid

Yes, I am aware. Perhaps I should have clarified in the OP. This is a company URL, and I am not allowed to post specifics of company information in a public forum. Also, the real link, even if I could post it, would not work outside our network.
Reply
#5
It's mighty difficult to give advice without looking at the page.
The usual layout for a table is multiple tr elements, with multiple td elements within each tr.
Here's an example from a simple page with only one table:
table = soup.find('table', {'summary': 'This table displays Connecticut towns and the year of their establishment.'})
trs = table.tbody.find_all('tr')

pp = PrettifyPage()
for n, tr in enumerate(trs):
    for n1, td in enumerate(tr.find_all('td')):
        print(f'==================================== tr {n}, td: {n1} ====================================')
        print(f'{pp.prettify(td, 2)}')
This will give you a layout of the page and make it easier to determine how to proceed.
The prettify method is in the module PrettifyPage.py, which is a modified version of BeautifulSoup's prettify that allows changing the indent size:

class PrettifyPage:
    """Re-indent BeautifulSoup's prettify() output with a configurable indent size."""

    def prettify(self, soup, indent):
        pretty_soup = str()
        previous_indent = 0
        for line in soup.prettify().split("\n"):
            # Infer nesting depth from where the tag starts on the line
            current_indent = str(line).find("<")
            if current_indent == -1 or current_indent > previous_indent + 2:
                current_indent = previous_indent + 1
            previous_indent = current_indent
            pretty_soup += self.write_new_line(line, current_indent, indent)
        return pretty_soup

    def write_new_line(self, line, current_indent, desired_indent):
        # Widen the indent from 1 space per level to desired_indent spaces
        new_line = ""
        spaces_to_add = (current_indent * desired_indent) - current_indent
        if spaces_to_add > 0:
            new_line += " " * spaces_to_add
        new_line += str(line) + "\n"
        return new_line
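One thing that does stand out in the code you posted, even without seeing the page: inside your loop you call soup.select('td'), which collects every cell on the whole page for every tr, so each CSV row repeats the entire table. Selecting from the current tr instead keeps each row's cells together. A minimal sketch with a stand-in table (the cell values are made up, since the real wiki page isn't visible):

```python
from bs4 import BeautifulSoup

# Stand-in HTML; the real page is behind the company wiki
html = """
<table>
  <tr><td>acct-1</td><td>prod</td></tr>
  <tr><td>acct-2</td><td>dev</td></tr>
</table>
"""

soup = BeautifulSoup(html, 'html.parser')

rows = []
for tr in soup.select('tr'):
    # tr.select('td'), not soup.select('td'): only this row's cells
    rows.append([td.get_text(strip=True) for td in tr.select('td')])

print(rows)  # [['acct-1', 'prod'], ['acct-2', 'dev']]
```

With soup.select('td') in place of tr.select('td'), every entry in rows would be the full list of all four cells.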
Reply

