Python Forum
Formatting Output After Web Scraping
#1
Hi (again, I'm redoing my previous post with more detail),

I am scraping a site with a start date of 07/01/2019 and an end date of 07/05/2019. The page is Lobbyist Registrations.

I am scraping the table on that page and trying to put the data into a new table in Python. I can extract the data, but I have not managed to collect it into a list or a DataFrame. Any help is appreciated!

So far, I have successfully web scraped the table using the following code:

import PipPackages
import requests
from bs4 import BeautifulSoup, SoupStrainer
import pandas as pd

website_url = requests.get('https://www8.miamidade.gov/Apps/COB/LobbyistOnline/Views/Queries/Registration_ByPeriod_List.aspx?startdate=07%2f01%2f2019&enddate=07%2f05%2f2019')
content = website_url.text

soup = BeautifulSoup(content,'lxml')
table = soup.find('table', attrs={'id': 'ctl00_mainContentPlaceHolder_gvLobbyistRegList'})


rows_in_table = []


for row in table.findAll('tr')[1:]:
    cell = row.findAll('td')
    if len(rows_in_table) == 0: # 0 == 0
        rows_in_table = [None for _ in cell]
    elif len(cell) != len(rows_in_table): # 4 != 5
        for index, rowspan in enumerate(rows_in_table):
            if rowspan is not None:
                value = rowspan["value"]
                cell.insert(index, value) # when, index = 0 & value = kesti michael 
                if rows_in_table[index]["rows_left"] == 1:
                    rows_in_table[index] = None
                else:
                    rows_in_table[index]["rows_left"] -= 1
    print(cell[0].string)
    #new_list = []
    #names_list = cell[0].string
    #li = list(names_list.split("-"))

    for index, x in enumerate(cell):
        if x.has_attr("rowspan"):
            rowspan = {"rows_left": int(x["rowspan"]), "value": x}
            rows_in_table[index] = rowspan
The code above produces the following output (names only, because the print statement uses cell[0].string as its argument):

Output:
Adams, Chris
ALBANESE, MARC
BAILEY, MARIO
DIAZ DE LA PORTILLA, ESQ., MIGUEL
GONZALEZ, JOSE M
JOHNSON, ERIN E
KESTI, MICHAEL
KESTI, MICHAEL
KRISCHER, ALAN
KUIPER, KENNETH A
LASARTE, FELIX M
LERMA, LISA I
MANZANO, JESSE
MARTINEZ DE CASTRO, ORLANDO
OTAZO, JULIO O
PRAGER, RICHARD S
RIVET, JEFFREY J
RUIZ-DIAZ DE LA PORTILLA, ELINETTE
TAYLOR, MICHAEL
THOMSON, CHRISTIAN
WELLS, GARY T
WOLF, JAMES M
ZELEDON, CLARIMAR
I have tried to collect the output above into a list, so that I could build a dictionary like dictionary = {'Name': list, 'Date': date}, but it did not work, as shown below (Python code, then output):

#print(cell[0].string)
new_list = []
names_list = cell[0].string
li = list(names_list.split("-"))
print(li + new_list) # attempt #1
#print(li.append(new_list)) #attempt #2
# I TRIED TO WORK WITH A DICTIONARY IN ONE ATTEMPT
#dictionary={'Names':names_list}
#df = pd.DataFrame(dictionary) #attempt #3
#print(df) # this prints 'Names' : ...names... in a long list
Based on my first attempt, I get this:

Output:
['Adams, Chris ']
['ALBANESE, MARC ']
['BAILEY, MARIO ']
['DIAZ DE LA PORTILLA, ESQ., MIGUEL ']
['GONZALEZ, JOSE M']
['JOHNSON, ERIN E']
['KESTI, MICHAEL ']
['KESTI, MICHAEL ']
['KRISCHER, ALAN ']
['KUIPER, KENNETH A']
['LASARTE, FELIX M']
['LERMA, LISA I']
['MANZANO, JESSE ']
['MARTINEZ DE CASTRO, ORLANDO ']
['OTAZO, JULIO O']
['PRAGER, RICHARD S']
['RIVET, JEFFREY J']
['RUIZ', 'DIAZ DE LA PORTILLA, ELINETTE ']
['TAYLOR, MICHAEL ']
['THOMSON, CHRISTIAN ']
['WELLS, GARY T']
['WOLF, JAMES M']
['ZELEDON, CLARIMAR ']
['\xa0']
Can someone help me out with how I should change the code to get the output properly formatted as a table, such as a DataFrame? Any method that achieves this would be great. Thank you so much for the help!
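For reference, a minimal sketch of how the list-building attempt above could be made to work. Two things seem to go wrong in that snippet: the list is recreated for every row, so nothing accumulates, and split("-") cuts hyphenated names such as RUIZ-DIAZ in two. The sketch creates one list before the loop and appends one dictionary per row. The scraped_rows data is a hypothetical stand-in for the text of the td cells (the principals and dates are invented), and the column names are assumptions.

```python
import pandas as pd

# Hypothetical stand-in for the scraped rows: each inner list is the text
# of one row's <td> cells. Principals and dates are invented for illustration.
scraped_rows = [
    ["Adams, Chris ", "Example Principal A", "07/01/2019"],
    ["RUIZ-DIAZ DE LA PORTILLA, ELINETTE ", "Example Principal B", "07/02/2019"],
    ["\xa0", "\xa0", "\xa0"],  # filler row, like the trailing '\xa0' above
]

records = []  # created once, before the loop, so rows accumulate
for cells in scraped_rows:
    # Turn non-breaking spaces into plain spaces and trim the padding.
    cells = [text.replace("\xa0", " ").strip() for text in cells]
    if not any(cells):
        continue  # skip rows that contain no text at all
    # No split("-") here, so hyphenated names stay in one piece.
    records.append({"Name": cells[0], "Principal": cells[1], "Date": cells[2]})

df = pd.DataFrame(records)
print(df)
```

A list of dictionaries keeps each row's values together, and pandas takes the column names from the dictionary keys.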
#2
(Jul-31-2019, 08:53 PM)yoitspython Wrote: Can someone help me out with how I should implement the code to get the output properly formatted to a table, such as DataFrame?
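You can drop all this and let pandas do the job if you want the table as a DataFrame. pd.read_html returns a list of DataFrames, one per table found on the page; the table you want is at index 3.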
import pandas as pd
 
df = pd.read_html("https://www8.miamidade.gov/Apps/COB/LobbyistOnline/Views/Queries/Registration_ByPeriod_List.aspx?startdate=07%2f01%2f2019&enddate=07%2f05%2f2019")
df = df[3]
df
[Image: CIyEZZ.jpg]
Look at the pandas library tricks.
The first step is cleanup, starting with the column types from df.dtypes:
Lobbyist Name        object
Principal            object
Employed On          object
Issue Description    object
Issue Status         object
dtype: object
E.g. Employed On should be a datetime object, etc.
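A minimal sketch of that cleanup step, assuming the dates arrive as MM/DD/YYYY strings (an assumption; the thread does not show the Employed On values, so the sample data here is invented). It converts the column with pd.to_datetime and drops the all-NaN row that appears at the end of the scraped table.

```python
import pandas as pd

# Hypothetical sample with the same kind of columns as the scraped table.
# The dates are invented; the last row mimics the all-NaN row in the output.
df = pd.DataFrame({
    "Lobbyist Name": ["Adams, Chris", "ALBANESE, MARC", None],
    "Employed On": ["07/01/2019", "07/03/2019", None],
    "Issue Status": ["Active", "Active", None],
})

# Parse the strings into datetimes; values that do not match become NaT.
df["Employed On"] = pd.to_datetime(
    df["Employed On"], format="%m/%d/%Y", errors="coerce"
)

# Drop rows in which every column is missing.
df = df.dropna(how="all")

print(df.dtypes)
```

With errors="coerce", a malformed date becomes NaT instead of raising, which may or may not be what you want for real data.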
#3
ohh myy, whhhhaaat just happened? Thank you so much! Pandas is amazing.. I honestly am new to using pandas and didn't know it could do this ..wow..

But how did you get that table to display so well? I used the code you provided and got the following, with three dots (...):

Output:
                         Lobbyist Name  ...  Issue Status
0                         Adams, Chris  ...        Active
1                       ALBANESE, MARC  ...        Active
2                        BAILEY, MARIO  ...        Active
3    DIAZ DE LA PORTILLA, ESQ., MIGUEL  ...        Active
4                     GONZALEZ, JOSE M  ...        Active
5                      JOHNSON, ERIN E  ...        Active
6                       KESTI, MICHAEL  ...        Active
7                       KESTI, MICHAEL  ...        Active
8                       KRISCHER, ALAN  ...        Active
9                    KUIPER, KENNETH A  ...        Active
10                    LASARTE, FELIX M  ...        Active
11                       LERMA, LISA I  ...        Active
12                      MANZANO, JESSE  ...        Active
13         MARTINEZ DE CASTRO, ORLANDO  ...        Active
14                      OTAZO, JULIO O  ...        Active
15                   PRAGER, RICHARD S  ...        Active
16                    RIVET, JEFFREY J  ...        Active
17  RUIZ-DIAZ DE LA PORTILLA, ELINETTE  ...        Active
18                     TAYLOR, MICHAEL  ...        Active
19                  THOMSON, CHRISTIAN  ...        Active
20                       WELLS, GARY T  ...        Active
21                       WOLF, JAMES M  ...        Active
22                   ZELEDON, CLARIMAR  ...        Active
23                                 NaN  ...           NaN

[24 rows x 5 columns]
I'm using IDLE on my end. Usually, when I display something with another library or package, it opens a separate window with the graph or table. Thanks again for the help.
#4
(Aug-01-2019, 12:22 AM)yoitspython Wrote: but, How did you get that table to display so well?
Use a JupyterLab notebook to get that display. It's easier to work in a notebook: tables display better and it's easier to test things out.

Quote:I used the code you provided and got the following with three dots (...):
This can vary with the editor/IDE used (IDLE is not a good choice in any case).
To show everything, look at Options and settings.
import pandas as pd

pd.options.display.max_rows = 999
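Note that max_rows only raises the row limit. In the output from post #3 the three dots sit between columns, so the column-related display options are probably the ones that matter there. A minimal sketch, using a small hypothetical frame (the Principal, Employed On and Issue Description values are invented) and a width of 200 that is an arbitrary choice; I have not checked how IDLE itself wraps long lines.

```python
import pandas as pd

# Hypothetical two-row frame with the five column names from the thread.
df = pd.DataFrame({
    "Lobbyist Name": ["Adams, Chris", "ALBANESE, MARC"],
    "Principal": ["Example Principal A", "Example Principal B"],
    "Employed On": ["07/01/2019", "07/03/2019"],
    "Issue Description": ["Example issue", "Example issue"],
    "Issue Status": ["Active", "Active"],
})

# Option 1: temporarily remove the column limit and allow wider lines.
with pd.option_context("display.max_columns", None, "display.width", 200):
    wide_text = repr(df)
print(wide_text)

# Option 2: to_string() renders every row and column by default.
full_text = df.to_string()
print(full_text)
```

option_context restores the previous settings when the with block ends; setting pd.options.display.max_columns and pd.options.display.width directly makes the change last for the whole session.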