Posts: 89
Threads: 26
Joined: Mar 2020
Jan-09-2022, 05:47 AM
Scraping Columns with Pandas (Column Entries w/ more than 1 word writes two columns)
My code is the following:
import pandas as pd

url = "https://en.wikipedia.org/wiki/List_of_counties_in_Alabama"  # page being scraped
tables = pd.read_html(url)
df = tables[1]
print(df)
df.to_csv('AL_Counties.csv', sep='\t', encoding='utf-8', index=False)
Using Pandas, the file is written, which works great. But any entry containing more than one word in any column (for instance the county names, as shown in the following screenshots from Libre Calc) is being split across multiple columns. This makes things difficult. How do I remedy this?
Screenshot #1:
Screenshot #2:
Thank you everyone for this forum!
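A likely cause (an assumption, since the import dialog isn't shown) is the tab-separated output combined with Libre Calc's import settings also treating the space character as a delimiter, which splits two-word county names. Writing with the default comma separator, or unchecking "Space" in the import dialog, keeps each name in one cell. A minimal sketch with made-up data:

```python
import pandas as pd

# Hypothetical two-column table standing in for the scraped one
df = pd.DataFrame({"County": ["Autauga County", "Baldwin County"],
                   "Seat": ["Prattville", "Bay Minette"]})

# Default comma separator: multi-word names stay in one field
df.to_csv("AL_Counties.csv", encoding="utf-8", index=False)

# Reading it back shows the names intact
print(pd.read_csv("AL_Counties.csv").loc[0, "County"])  # → Autauga County
```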
Posts: 7,319
Threads: 123
Joined: Sep 2016
(Jan-09-2022, 05:47 AM)BrandonKastning Wrote: for instance as shown in the following screen shots in Libre Calc
Don't take the detour through df.to_csv. Pandas can save to .ods format; you need odfpy and openpyxl installed.
Then you get the same look as the DataFrame shows.
pip install odfpy openpyxl
# or, with conda:
conda install odfpy openpyxl
You can also delete the Map column.
import pandas as pd

url = "https://en.wikipedia.org/wiki/List_of_counties_in_Alabama"
tables = pd.read_html(url)
df = tables[1]
df = df.drop('Map', axis=1)
df.to_excel("wiki.ods", index=False, engine="odf")
df.to_excel('wiki.xlsx', index=False)
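For reference, df.drop('Map', axis=1) returns a copy of the frame without that column (axis=1 selects columns rather than rows). A tiny illustration with hypothetical data:

```python
import pandas as pd

df = pd.DataFrame({"Map": ["<img>"], "County": ["Autauga"], "Seat": ["Prattville"]})
df = df.drop("Map", axis=1)   # equivalent: df.drop(columns="Map")
print(list(df.columns))       # → ['County', 'Seat']
```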
BrandonKastning likes this post
Posts: 89
Threads: 26
Joined: Mar 2020
Jan-09-2022, 09:35 PM
snippsat,
Thank you! This worked wonderfully! Much better formatting!
Much obliged! :)
Best Regards,
Brandon Kastning
(Jan-09-2022, 01:54 PM)snippsat Wrote: Don't take the detour through df.to_csv. [...]
Posts: 89
Threads: 26
Joined: Mar 2020
Should I open a new thread?
How do I download the map images and store them (either in a DB or locally)?
Posts: 7,319
Threads: 123
Joined: Sep 2016
(Jan-09-2022, 09:52 PM)BrandonKastning Wrote: Should I open a new thread?
It's part of the same task, so no problem.
(Jan-09-2022, 09:52 PM)BrandonKastning Wrote: How to download the map images and store them (Either DB or Local)?
You have to give it a try; you now need to use a more common scraping tool.
Here is a demo of how to start.
import requests
from bs4 import BeautifulSoup

url = "https://en.wikipedia.org/wiki/List_of_counties_in_Alabama"
response = requests.get(url)
soup = BeautifulSoup(response.content, 'lxml')
print(soup.find('h1').text)
# First image
img = soup.find_all('a', class_="image")
img_link = img[0].find('img').get('src')
img_link = img_link.replace('//', 'http://')
print(img_link)
Output: List of counties in Alabama
http://upload.wikimedia.org/wikipedia/commons/thumb/5/54/Map_of_Alabama_highlighting_Autauga_County.svg/75px-Map_of_Alabama_highlighting_Autauga_County.svg.png
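Extending the demo into the follow-up question (downloading every map image) mostly means turning the single lookup into a loop. A sketch, not tested against the live page: map_links and download_all are hypothetical helper names, and the class="image" selector is taken from the demo above.

```python
import os
import requests
from bs4 import BeautifulSoup

def map_links(html):
    """Collect absolute image URLs from <a class="image"> links,
    the same markup the demo above matches on."""
    soup = BeautifulSoup(html, "html.parser")
    links = []
    for a in soup.find_all("a", class_="image"):
        img = a.find("img")
        if img is not None:
            links.append("https:" + img.get("src"))
    return links

def download_all(url, folder="maps"):
    """Fetch the page and save every map image into a local folder."""
    os.makedirs(folder, exist_ok=True)
    html = requests.get(url).text
    for link in map_links(html):
        name = link.rsplit("/", 1)[-1]
        with open(os.path.join(folder, name), "wb") as fp:
            fp.write(requests.get(link).content)
```

Storing in a database instead would just mean inserting the downloaded bytes (or the saved file paths) with to_sql or a plain INSERT.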
BrandonKastning likes this post
Posts: 89
Threads: 26
Joined: Mar 2020
Jan-13-2022, 04:56 AM
(This post was last modified: Jan-13-2022, 04:56 AM by BrandonKastning. Edit Reason: tag error)
snippsat,
Thank you for this! This is a great starting point for me regarding images, and since you wrote the comment "First image" I believe I will need to learn loops. I will be coming back to this.
In the meantime, I ran into some trouble with Wikipedia pandas table scraping. I changed from Counties to "Municipalities", and whether I use tables[0], tables[1], or tables[2], the result is all the wrong data compared to what is displayed on the Wikipedia article.
Code is as follows:
import pandas as pd

url = "https://en.wikipedia.org/wiki/List_of_municipalities_in_Alabama"  # assumed URL
tables = pd.read_html(url)
df = tables[1]
df.to_excel("AL_Alabama_Municipalities.ods", index=False, engine="odf")
Thank you again for this forum! How do I determine the tables[#]? Is it a guessing game, or is there an attribute or property within the browser code that could aid me in finding the correct tables[#]?
Best Regards,
Brandon Kastning
(Jan-10-2022, 06:52 PM)snippsat Wrote: Here is a demo of how to start. [...]
Posts: 7,319
Threads: 123
Joined: Sep 2016
Jan-13-2022, 02:46 PM
(This post was last modified: Jan-13-2022, 02:46 PM by snippsat.)
(Jan-13-2022, 04:56 AM)BrandonKastning Wrote: How do I determine the tables[#]? Is it a guessing game or is there an attribute or property within the browser code that could aid me in finding the correct tables[#]?
A web site can have many tables, so you have to look at the site (count them) or test out tables[0], tables[1], tables[6] ... and see if you get the wanted result.
There is a match parameter in pandas.read_html that can use a string or regex to match something in the table you want.
Example: Timeline of programming languages. Say we want the Python table; we can match the name Guido van Rossum.
import pandas as pd

url = "https://en.wikipedia.org/wiki/Timeline_of_programming_languages"
df = pd.read_html(url, match="Guido van Rossum")
df[0].head(13)
Output: Year Name Chief developer, company Predecessor(s)
0 1990 Sather Steve Omohundro Eiffel
1 1990 AMOS BASIC François Lionet and Constantin Sotiropoulos STOS BASIC
2 1990 AMPL Robert Fourer, David Gay and Brian Kernighan a... NaN
3 1990 Object Oberon H Mössenböck, J Templ, R Griesemer Oberon
4 1990 J Kenneth E. Iverson, Roger Hui at Iverson Software APL, FP
5 1990 Haskell NaN Miranda
6 1990 EuLisp NaN Common Lisp, Scheme
7 1990 Z Shell (zsh) Paul Falstad at Princeton University ksh
8 1991 GNU E David J. DeWitt, Michael J. Carey C++
9 1991 Oberon-2 Hanspeter Mössenböck, Wirth Object Oberon
10 1991 Oz Gert Smolka and his students Prolog
11 1991 Q Albert Gräf NaN
12 1991 Python Guido van Rossum ABC, C
So if there is a match, it will always be df[0].
Without the match, it would be table 9:
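The match approach can be combined with a quick survey of every table read_html finds, printing index, shape, and column names so there is no guessing. A sketch using a small inline HTML document (toy data) in place of a live page:

```python
from io import StringIO

import pandas as pd

# Two toy tables standing in for a real Wikipedia page
html = """
<table><tr><th>Fruit</th></tr><tr><td>Apple</td></tr></table>
<table><tr><th>County</th></tr><tr><td>Autauga</td></tr></table>
"""

# Survey: list every table's index, shape, and columns
tables = pd.read_html(StringIO(html))
for i, t in enumerate(tables):
    print(i, t.shape, list(t.columns))

# match narrows the result to tables containing the string,
# so the wanted table is then always at index 0
wanted = pd.read_html(StringIO(html), match="Autauga")
print(list(wanted[0].columns))
```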
BrandonKastning likes this post
Posts: 89
Threads: 26
Joined: Mar 2020
Jan-13-2022, 10:52 PM
(This post was last modified: Jan-13-2022, 10:52 PM by BrandonKastning. Edit Reason: forgot code + gratitude)
snippsat,
Thank you for this new knowledge and sharing this code. Wonderful example!
Without applying your teaching to my code (yet), I found a workaround that managed to pull the data in.
I disabled the index=False flag/parameter. Once it was disabled, Python picked up the correct table with tables[0] (strange; I am still unsure what index=False truly does at this time).
Then I had to do the following using Libre Calc.
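For what it's worth, index=False only controls whether pandas writes the DataFrame's row index (0, 1, 2, ...) as an extra first column in the output file; it has no effect on which table read_html picks up. A small illustration:

```python
import pandas as pd

df = pd.DataFrame({"City": ["Abbeville", "Adamsville"]})

# Default (index=True): a leading index column is written
print(df.to_csv())
# index=False: only the real data columns are written
print(df.to_csv(index=False))
```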
The .ods it generated contained an additional row of header names, one different from the other.
I tried removing the duplicate row with 3 column headings that sat under a single header name, but Libre Calc gave me an error, so I tried copy and paste instead, and that worked great for a CSV save in 7 steps.
The default settings on the Save As dialogs were used and worked fine!
Then I use step2.py as the payload, after manually naming the column headers in Libre Calc!
step1.py code (to generate the .ods):
import pandas as pd

url = "https://en.wikipedia.org/wiki/List_of_municipalities_in_Alabama"  # assumed URL
tables = pd.read_html(url)
df = tables[0]
df.to_excel("AL_Alabama_Cities.ods", engine="odf")
step2.py code:
import pandas as pd
import mysql.connector
from sqlalchemy import create_engine

# Connection string is a placeholder; fill in your own credentials
engine = create_engine('mysql+mysqlconnector://user:password@localhost/dbname')
myd = pd.read_csv('AL_Alabama_Cities.CSV.csv')
myd.to_sql(name='AL_Cities_CSV', con=engine, if_exists='replace', index=False)
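The same to_sql flow can be exercised locally with SQLite, so no MySQL server or credentials are needed while checking that the table lands as expected. A sketch with hypothetical data:

```python
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("sqlite://")   # in-memory database

myd = pd.DataFrame({"City": ["Abbeville", "Adamsville"]})
myd.to_sql(name="AL_Cities_CSV", con=engine, if_exists="replace", index=False)

# Read the table back to confirm the round trip
back = pd.read_sql("SELECT * FROM AL_Cities_CSV", con=engine)
print(len(back))   # → 2
```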
Thank you again snippsat and everyone for this forum time/expertise!
Best Regards,
Brandon Kastning
(Jan-13-2022, 02:46 PM)snippsat Wrote: There is a match parameter in pandas.read_html ... [...]