Posts: 5,151
Threads: 396
Joined: Sep 2016
I have obviously never used the pandas library.
Based on this answer, I was astounded that pandas can read the HTML with a simple one-liner, df = pd.read_html(str(sov_tables)), as opposed to the OP's method:

    for table in sov_tables[0]:
        headers = []
        rows = table.find_all('tr')
        for header in table.find('tr').find_all('th'):
            headers.append(header.text)
        for row in rows[1:]:
            values = []
            for col in row.find_all(['th', 'td']):
                values.append(col.text)
            if values:
                cntr_dict = {headers[i]: values[i] for i in range(len(values))}
                cntr.append(cntr_dict)

It appears the pandas read_html method is only for tables. Are there any other tricks pandas has for someone who is not familiar with it?
Posts: 7,324
Threads: 123
Joined: Sep 2016
(Jun-27-2019, 01:00 PM)metulburr Wrote: It appears the pandas read_html method is only for tables. Are there any other tricks pandas has for someone who is not familiar with it?
Pandas can do a lot, so there are many tricks.
In the other post you could just have done this to get the same result as the image shows.

    import pandas as pd

    df = df[0]
    df

There are tricks if you need to specify a more precise search.
Pandas reads a lot of formats (see IO Tools), e.g. csv, json, html table, sql, excel, etc.
It can be an eye opener how much easier it is to work with these data types and display the result, especially if using Jupyter Notebook.
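To make the IO Tools point a bit more concrete, here is a minimal sketch that reads the same small sample from CSV and from JSON. The sample values and variable names are only for illustration, and io.StringIO stands in for real files.

```python
from io import StringIO

import pandas as pd

# Small illustrative sample; StringIO stands in for files on disk
csv_text = "country,area\nBrazil,8.516\nIndia,3.286\n"
json_text = '[{"country": "Brazil", "area": 8.516}, {"country": "India", "area": 3.286}]'

from_csv = pd.read_csv(StringIO(csv_text))
from_json = pd.read_json(StringIO(json_text))

# Both readers hand back the same kind of object: a DataFrame
print(type(from_csv).__name__, type(from_json).__name__)
print(from_csv)
print(from_json)
```

Whatever the input format, the result is a DataFrame, so everything after the read step works the same way.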
Posts: 5,151
Threads: 396
Joined: Sep 2016
Jun-27-2019, 03:45 PM
(This post was last modified: Jun-27-2019, 03:45 PM by metulburr.)
Quote:

    import pandas as pd

    df = df[0]
    df

This is even more astonishing. How does pandas know that you are looking for the table as opposed to something else? I could assume that attrs would pick one out if there were multiple tables. But without it, why would this not pick up other data besides tables? In this link's case: the header, beginning text, footer, wiki side bar, images, etc. Also, if it only looks for tables, it seems odd they didn't name the method read_table instead, as read_html seems to indicate all HTML.
I just can't believe how many years I parsed sites looking for tables when I could have just used pandas to do it.
Posts: 3,458
Threads: 101
Joined: Sep 2016
https://pandas.pydata.org/pandas-docs/st...-read-html
Looks like it just returns all tables in the html. So for Wikipedia, it just happens that the first table is the one you're interested in.
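A quick way to check this without hitting Wikipedia is to feed read_html a small HTML string. This is only a sketch: the page content is made up, and it assumes one of the parsers pandas relies on (lxml, or bs4 with html5lib) is installed.

```python
from io import StringIO

import pandas as pd

# Made-up page: a heading, a paragraph, two tables and a footer
html = """
<html><body>
<h1>Page header</h1>
<p>Intro text that is not part of any table.</p>
<table>
  <thead><tr><th>country</th><th>capital</th></tr></thead>
  <tbody>
    <tr><td>Brazil</td><td>Brasilia</td></tr>
    <tr><td>India</td><td>New Delhi</td></tr>
  </tbody>
</table>
<table>
  <thead><tr><th>language</th><th>year</th></tr></thead>
  <tbody><tr><td>Python</td><td>1991</td></tr></tbody>
</table>
<footer>Page footer</footer>
</body></html>
"""

# read_html returns a list with one DataFrame per <table> element
tables = pd.read_html(StringIO(html))
print(len(tables))
print(tables[0])
```

The heading, paragraph and footer never show up; only the two tables come back, and indexing with [0] picks the first one.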
pandas and numpy are things I keep wanting to look into. They seem super powerful, but all the examples I've seen look basically like magic (like the scipy array that calculates the next step in Conway's Game of Life using whatever convolve2d is).
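For what it's worth, the convolve2d trick is less magic than it looks: the convolution just counts the live neighbours of every cell in one call. A minimal sketch, assuming scipy is installed; the function and variable names here are my own, not from any particular example:

```python
import numpy as np
from scipy.signal import convolve2d

# 3x3 kernel that sums the eight neighbours of a cell (the centre is excluded)
KERNEL = np.array([[1, 1, 1],
                   [1, 0, 1],
                   [1, 1, 1]])


def life_step(grid):
    """Return the next generation of a 0/1 grid, wrapping around at the edges."""
    neighbours = convolve2d(grid, KERNEL, mode="same", boundary="wrap")
    # Alive next step: exactly 3 neighbours, or alive now with exactly 2
    alive = (neighbours == 3) | ((grid == 1) & (neighbours == 2))
    return alive.astype(int)


blinker = np.zeros((5, 5), dtype=int)
blinker[2, 1:4] = 1  # horizontal bar of three live cells

print(life_step(blinker))  # the bar flips to vertical
```

So the whole rule set is one neighbour count plus one boolean expression applied to the entire array at once.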
Posts: 7,324
Threads: 123
Joined: Sep 2016
Jun-27-2019, 05:15 PM
(This post was last modified: Jun-27-2019, 05:15 PM by snippsat.)
(Jun-27-2019, 03:45 PM)metulburr Wrote: How does pandas know that you are looking for the table as opposed to something else?
Look at pandas.read_html in the docs; on the right there is a [Source] link.
It uses tools you will recognize:

    def _importers():
        global _IMPORTS
        if _IMPORTS:
            return

        global _HAS_BS4, _HAS_LXML, _HAS_HTML5LIB

        try:
            import bs4
            _HAS_BS4 = True
        except ImportError:
            pass

        try:
            import lxml
            _HAS_LXML = True
        except ImportError:
            pass

        try:
            import html5lib
            _HAS_HTML5LIB = True
        except ImportError:
            pass

If you look further at the code, you will see a lot of searching/parsing and building of the table.
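To give a rough idea of why only tables come back, here is a small sketch using just the standard library html.parser. This is not pandas' code (pandas delegates to lxml or bs4/html5lib, as the snippet above shows); TableCollector and the sample page are made up for illustration, and nested tables are not handled.

```python
from html.parser import HTMLParser


class TableCollector(HTMLParser):
    """Collect cell text from <table> elements and ignore everything else."""

    def __init__(self):
        super().__init__()
        self.tables = []        # one list of rows per table
        self._in_table = False  # True while inside <table>...</table>
        self._row = None        # cells of the current row, or None
        self._cell = None       # text pieces of the current cell, or None

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self._in_table = True
            self.tables.append([])
        elif tag == "tr" and self._in_table:
            self._row = []
        elif tag in ("th", "td") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Text outside a table cell (headings, paragraphs, footer) is dropped
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("th", "td") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.tables[-1].append(self._row)
            self._row = None
        elif tag == "table":
            self._in_table = False


page = """
<h1>Header text</h1>
<p>Intro paragraph</p>
<table>
  <tr><th>country</th><th>capital</th></tr>
  <tr><td>Brazil</td><td>Brasilia</td></tr>
</table>
<div id="footer">Footer text</div>
"""

collector = TableCollector()
collector.feed(page)
print(collector.tables)
```

The parser only starts recording when it meets a table tag, so the header, intro text and footer are never collected.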
Posts: 7,324
Threads: 123
Joined: Sep 2016
Jun-28-2019, 12:39 PM
(This post was last modified: Jun-28-2019, 12:39 PM by snippsat.)
Going to take a look at making an HTML table.
Start with a dictionary; it could have been another Python data structure such as a list or tuple.

    country_data = {
        "country": ["Brazil", "Russia", "India", "China", "South Africa"],
        "capital": ["Brasilia", "Moscow", "New Delhi", "Beijing", "Pretoria"],
        "area": [8.516, 17.10, 3.286, 9.597, 1.221],
        "population": [200.4, 143.5, 1252, 1357, 52.98],
    }

Then in very few steps this dictionary is turned into an HTML table, with a little trick of using Bootstrap so the table looks nicer.
![[Image: ObBD1s.jpg]](https://imagizer.imageshack.com/v2/xq90/921/ObBD1s.jpg)
Or black.
Here is the short code for this Notebook.
If you want to make a shared Notebook like this, these are the steps:
From the Notebook, save/download dict_to_table.ipynb; this gives you the raw source code of the .ipynb file.
Copy that into GitHub Gist --> Create public gist.
Now copy the link that gets created into nbviewer --> Go.
Bootstrap tricks
Look at this Pen; see that I have changed class="dataframe" to class="table table-striped".
Then enable Bootstrap in CodePen.
Also put it into a container so the table does not take up the whole screen.

    <div class="container">
      <!-- Content here -->
    </div>

In the other post I linked to IO Tools; see that there are writers there too.
That is what I used here, to_html. So once you get any data into a DataFrame (the heart of Pandas), there are also a lot of options for getting it out in different formats.
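Since the notebook code itself is not shown in the post, here is a minimal sketch of the dictionary-to-table steps described above. It assumes the class swap is done with a plain string replace, which may differ from what the notebook actually does.

```python
import pandas as pd

country_data = {
    "country": ["Brazil", "Russia", "India", "China", "South Africa"],
    "capital": ["Brasilia", "Moscow", "New Delhi", "Beijing", "Pretoria"],
    "area": [8.516, 17.10, 3.286, 9.597, 1.221],
    "population": [200.4, 143.5, 1252, 1357, 52.98],
}

df = pd.DataFrame(country_data)

# to_html() returns the table markup as a string; index=False drops the 0..4 column
table_html = df.to_html(index=False)

# Swap the default class for the Bootstrap table classes
table_html = table_html.replace('class="dataframe"', 'class="table table-striped"')

# Wrap in a Bootstrap container so the table does not span the whole page
page = f'<div class="container">\n{table_html}\n</div>'
print(page)
```

The printed markup can be pasted into a page (or a Pen) that loads Bootstrap.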
Posts: 360
Threads: 5
Joined: Jun 2019
I learned a lot about pandas watching this series on Youtube.
Kevin Markham Pandas Tutorial
Posts: 7,324
Threads: 123
Joined: Sep 2016
Jul-03-2019, 10:16 PM
(This post was last modified: Jul-03-2019, 10:16 PM by snippsat.)
More ways to find a table.
Timeline of programming languages

    import pandas as pd

    len(wiki_timeline)

When you use read_html it will find all the tables (13) on the page.
On this site you cannot use attrs= as all the tables have the same CSS class name.
To look at the parameters in Jupyter Notebook, you can press Shift+Tab over read_html.
Then you see that there is a match='.+' parameter.
This does a text search on the page, so you can use it to match text inside the table you want, e.g. match='Guido Van Rossum'

    import pandas as pd

    len(wiki_timeline)
    wiki_timeline[0].tail()

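The same idea can be tried offline. This is a sketch on a made-up two-table page (not the Wikipedia timeline), again assuming lxml or bs4 with html5lib is installed. Note that match is treated as a regular expression and is case-sensitive.

```python
from io import StringIO

import pandas as pd

# Made-up page with two tables that share the same layout
html = """
<table>
  <thead><tr><th>Year</th><th>Name</th><th>Developer</th></tr></thead>
  <tbody><tr><td>1972</td><td>C</td><td>Dennis Ritchie</td></tr></tbody>
</table>
<table>
  <thead><tr><th>Year</th><th>Name</th><th>Developer</th></tr></thead>
  <tbody><tr><td>1991</td><td>Python</td><td>Guido van Rossum</td></tr></tbody>
</table>
"""

# Without match: every table on the page
all_tables = pd.read_html(StringIO(html))

# With match: only tables that contain text matching the pattern
matched = pd.read_html(StringIO(html), match="Guido van Rossum")

print(len(all_tables), len(matched))
print(matched[0])
```

If no table contains the text, read_html raises a ValueError instead of returning an empty list.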
Basic clean-up fix: on this site, Excel Sample Data, the table has no <thead>, so we get 0 1 2 etc. as the header row.
We can pass the row number we want to use as the header.

    import pandas as pd

    df[0].head()

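A small offline sketch of the header fix. The table content is made up (it is not the real Excel Sample Data page), and lxml or bs4 with html5lib is assumed to be installed.

```python
from io import StringIO

import pandas as pd

# Made-up table with no <thead> and no <th>: the "header" is just the first <td> row
html = """
<table>
  <tr><td>OrderDate</td><td>Region</td><td>Units</td></tr>
  <tr><td>2019-01-06</td><td>East</td><td>95</td></tr>
  <tr><td>2019-01-23</td><td>Central</td><td>50</td></tr>
</table>
"""

raw = pd.read_html(StringIO(html))[0]              # columns come out as 0, 1, 2
fixed = pd.read_html(StringIO(html), header=0)[0]  # row 0 is used as the header

print(list(raw.columns))
print(list(fixed.columns))
print(fixed)
```

With header=0 the first row becomes the column names and is no longer counted as data.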
If you are new to Pandas: pandas.DataFrame is central to everything that gets read in.
So if you e.g. read in a csv, it will be in a DataFrame, same as the table above.
Output:
Users,date,count
daily_users,19-03-2017,219
daily_users,19-03-2018,5040
daily_users,19-03-2019,13579
weekly_users,19-03-2017,1767
weekly_users,19-03-2018,26664
weekly_users,19-03-2019,72166

    import pandas as pd

    forum = pd.read_csv("forum.csv")
    forum['date'] = pd.to_datetime(forum['date'])
    forum

![[Image: emFbyk.jpg]](https://imagizer.imageshack.com/v2/xq90/922/emFbyk.jpg)
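If you don't have forum.csv on disk, the same data can be read from a string. This sketch also passes an explicit date format, since the dates are day-first, and pivots the frame so each user type gets its own column. The pivot is my own addition to show how little code a reshape takes, not part of the original notebook.

```python
from io import StringIO

import pandas as pd

# Same content as forum.csv above, held in a string
csv_text = """Users,date,count
daily_users,19-03-2017,219
daily_users,19-03-2018,5040
daily_users,19-03-2019,13579
weekly_users,19-03-2017,1767
weekly_users,19-03-2018,26664
weekly_users,19-03-2019,72166
"""

forum = pd.read_csv(StringIO(csv_text))

# Dates are day-month-year, so spell the format out instead of relying on inference
forum["date"] = pd.to_datetime(forum["date"], format="%d-%m-%Y")

# One row per date, one column per user type
wide = forum.pivot(index="date", columns="Users", values="count")
print(wide)
```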
There are many ways to plot this, e.g. with the new library Altair.
Yes, it is the user growth on this forum over the last two years.

    import pandas as pd
    import altair as alt

    forum = pd.read_csv("forum.csv")
    forum['date'] = pd.to_datetime(forum['date'])

    alt.Chart(forum).mark_line().encode(
        x='date',
        y='count',
        color='Users'
    )

Rich Output: this demo notebook shows different ways to display code in different formats.