Python Forum
pandas library tricks
#1
I have obviously never used the pandas library

Based on this answer, I was astounded that pandas can read the HTML with a simple one-liner, df = pd.read_html(str(sov_tables)), as opposed to the OP's method:
for table in sov_tables[0]:
    headers = []
    rows = table.find_all('tr')
    for header in table.find('tr').find_all('th'):
        headers.append(header.text)
    for row in rows[1:]:
        values = []
        for col in row.find_all(['th', 'td']):
            values.append(col.text)
        if values:
            cntr_dict = {headers[i]: values[i] for i in range(len(values))}
            cntr.append(cntr_dict)
It appears the pandas read_html method is only for tables. Are there any other tricks pandas has for someone who is not familiar with it?
#2
(Jun-27-2019, 01:00 PM)metulburr Wrote: It appears the pandas read_html method is only for tables. Are there any other tricks pandas has for someone who is not familiar with it?
Pandas can do a lot, so there are many tricks.
In the other post you could just have done this to get the same result as shown in the image.
import pandas as pd

df = pd.read_html("https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_population")
df = df[0]
df 
There is also a trick if you need a more specific search.
df = pd.read_html("https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_population", attrs={'class': "wikitable sortable"})
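To try attrs without hitting Wikipedia, here is a minimal sketch on a made-up page. The HTML and its values are my own assumption; only the class name is borrowed from the line above. It needs lxml (or bs4 plus html5lib) installed, and the string is wrapped in StringIO because newer pandas versions warn about literal HTML strings.

```python
from io import StringIO

import pandas as pd

# Made-up page: two tables, only the second carries the class we want.
html = """
<html><body>
<table class="navbox">
  <tr><th>link</th></tr>
  <tr><td>Home</td></tr>
</table>
<table class="wikitable sortable">
  <tr><th>country</th><th>population</th></tr>
  <tr><td>China</td><td>1357</td></tr>
  <tr><td>India</td><td>1252</td></tr>
</table>
</body></html>
"""

# Without attrs every <table> on the page comes back.
all_tables = pd.read_html(StringIO(html))

# With attrs only tables whose attributes match are returned.
wanted = pd.read_html(StringIO(html), attrs={"class": "wikitable sortable"})

print(len(all_tables), len(wanted))
print(wanted[0])
```

If nothing matches the given attrs, read_html raises a ValueError instead of returning an empty list.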
It can be an eye-opener when working with many data formats, e.g. csv, json, html tables, sql, excel, etc. Pandas can read a lot of them, see IO Tools.
It also shows how much easier it can be to work with data and display it when using Jupyter Notebook.
#3
Quote:
import pandas as pd
 
df = pd.read_html("https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_population")
df = df[0]
df
This is even more astonishing. How does pandas know that you are looking for the table as opposed to something else? I could assume the attrs would pick one if there were multiple tables. But without it, why would this not pick up other data besides tables? In this link's case: the header, beginning text, footer, wiki sidebar, images, etc. Also, if it only looks for tables, it seems odd they didn't name the method read_table instead, as read_html seems to indicate all HTML.

I just can't believe how many years I parsed sites looking for tables when I could have just used pandas to do it.
#4
https://pandas.pydata.org/pandas-docs/st...-read-html

Looks like it just returns all tables in the html. So for Wikipedia, it just happens that the first table is the one you're interested in.
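A quick way to check that reading of the docs is a small sketch like this. The page is made up by me (it is not the Wikipedia page), and it assumes lxml or bs4 plus html5lib is installed.

```python
from io import StringIO

import pandas as pd

# Made-up page with a heading, some text and two tables.
html = """
<html><body>
  <h1>Page title</h1>
  <p>Intro text that is not in a table.</p>
  <table>
    <tr><th>country</th><th>population</th></tr>
    <tr><td>Brazil</td><td>200.4</td></tr>
  </table>
  <p>More text between the tables.</p>
  <table>
    <tr><th>language</th><th>year</th></tr>
    <tr><td>Python</td><td>1991</td></tr>
  </table>
</body></html>
"""

# read_html returns a list with one DataFrame per <table> element;
# the heading and paragraphs are ignored.
tables = pd.read_html(StringIO(html))

print(len(tables))
print(tables[0])
print(tables[1])
```

So the return value is always a list, which is why the earlier posts index it with df[0].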

pandas and numpy are things I keep wanting to look into. They seem super powerful, but all the examples I've seen look basically like magic (like the scipy array that calculates the next step in Conway's Game of Life using whatever convolve2d is).
#5
(Jun-27-2019, 03:45 PM)metulburr Wrote: How does pandas know that you are looking for the table as opposed to something else?
Look at pandas.read_html in the docs; on the right there is a [Source] link.
Tools you will recognize:
def _importers():
    # import things we need
    # but make this done on a first use basis

    global _IMPORTS
    if _IMPORTS:
        return

    global _HAS_BS4, _HAS_LXML, _HAS_HTML5LIB

    try:
        import bs4  # noqa
        _HAS_BS4 = True
    except ImportError:
        pass

    try:
        import lxml  # noqa
        _HAS_LXML = True
    except ImportError:
        pass

    try:
        import html5lib  # noqa
        _HAS_HTML5LIB = True
    except ImportError:
        pass
If you look further in the code you will see a lot of searching/parsing and building of the table.
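To check which of those parsers are installed on your own machine, a small sketch like this could be used. Note that available_html_backends is a hypothetical helper of mine and not part of pandas; it only uses the standard library.

```python
from importlib.util import find_spec


def available_html_backends():
    """Report which optional HTML parsing packages can be found.

    Hypothetical helper, not pandas code: it checks the same three
    package names that the pandas source above tries to import.
    """
    names = ("bs4", "lxml", "html5lib")
    return {name: find_spec(name) is not None for name in names}


backends = available_html_backends()
print(backends)
```

read_html also has a flavor parameter if you want to ask for one specific parser instead of letting pandas pick.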
#6
Let's take a look at making an HTML table.
Start with a dictionary; it could have been another Python data structure like a list or tuple.
country_data = {
    "country": ["Brazil", "Russia", "India", "China", "South Africa"],
    "capital": ["Brasilia", "Moscow", "New Delhi", "Beijing", "Pretoria"],
    "area": [8.516, 17.10, 3.286, 9.597, 1.221],
    "population": [200.4, 143.5, 1252, 1357, 52.98],
}
Then in very few steps this dictionary becomes an HTML table, with a little trick of using Bootstrap so the table looks nicer.
[Image: ObBD1s.jpg]
Or black.
[Image: Fablzi.jpg]

Here is the short code for this Notebook.
If you want to make a shared Notebook like this, these are the steps:
From the Notebook dict_to_table.ipynb choose save/download; this gives the raw source of the .ipynb.
Copy that into GitHub Gist --> Create public gist.
Then copy the link that gets created into nbviewer --> Go.

Bootstrap tricks
Look at this Pen and see that I have changed class="dataframe" to class="table table-striped".
Then enable Bootstrap in CodePen.
I also put it into a container so the table does not take up the whole screen.
<div class="container">
  <!-- Content here -->
</div>

In the other post I linked to IO Tools; note that there are writers too.
That is what I used here, to_html. So once any data is in a DataFrame (the heart of pandas), there are also a lot of options to get it out in different formats.
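Since the Notebook code is only linked, here is a minimal sketch of the dictionary to HTML step. This is my reconstruction under assumptions and not necessarily the exact code in the Notebook: it passes the Bootstrap classes through the classes argument of to_html instead of editing the class by hand.

```python
import pandas as pd

country_data = {
    "country": ["Brazil", "Russia", "India", "China", "South Africa"],
    "capital": ["Brasilia", "Moscow", "New Delhi", "Beijing", "Pretoria"],
    "area": [8.516, 17.10, 3.286, 9.597, 1.221],
    "population": [200.4, 143.5, 1252, 1357, 52.98],
}

# Dictionary keys become column names, the lists become the columns.
df = pd.DataFrame(country_data)

# classes= adds the Bootstrap classes next to the default "dataframe"
# class; index=False leaves out the 0..4 row labels.
html_table = df.to_html(index=False, classes="table table-striped")

# Wrap the table in a Bootstrap container, as in the snippet above.
page = f'<div class="container">\n{html_table}\n</div>'
print(page)
```

The string in page still needs the Bootstrap CSS loaded on the page (e.g. enabled in CodePen) before the striping shows up.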
#7
I learned a lot about pandas watching this series on Youtube.
Kevin Markham Pandas Tutorial
#8
More ways to find a table.
Timeline of programming languages
import pandas as pd

wiki_timeline = pd.read_html('https://en.wikipedia.org/wiki/Timeline_of_programming_languages')
len(wiki_timeline) # 13
When using read_html it will find all the tables (13) on the page.
On this page we cannot use attrs= to pick one, as all the tables have the same CSS class name.

To look at the parameters you can press Shift+Tab on read_html in the Notebook.
Then you see that there is a match='.+' parameter.
This does a text search (the value is treated as a regular expression), so we can use it to match text inside the table we want, e.g. match='Guido Van Rossum'
import pandas as pd

wiki_timeline = pd.read_html('https://en.wikipedia.org/wiki/Timeline_of_programming_languages', match='Guido Van Rossum')
len(wiki_timeline) # 1
wiki_timeline[0].tail()
[Image: 1p48av.jpg]
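The same idea on a tiny made-up page, as a sketch you can run offline. The two tables and their rows are my own assumption and much smaller than the Wikipedia timeline; lxml or bs4 plus html5lib is assumed to be installed.

```python
from io import StringIO

import pandas as pd

# Made-up page: two tables with the same (missing) CSS class,
# so attrs= cannot tell them apart.
html = """
<html><body>
<table>
  <tr><th>language</th><th>developer</th></tr>
  <tr><td>C</td><td>Dennis Ritchie</td></tr>
</table>
<table>
  <tr><th>language</th><th>developer</th></tr>
  <tr><td>Python</td><td>Guido van Rossum</td></tr>
</table>
</body></html>
"""

# Only tables containing text that matches the pattern are returned.
hits = pd.read_html(StringIO(html), match="Guido")

# match is a regular expression, so alternatives work too.
both = pd.read_html(StringIO(html), match="Guido|Ritchie")

print(len(hits), len(both))
print(hits[0])
```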

A basic clean-up fix: on this site, Excel Sample Data, the table has no <thead>, so we get 0, 1, 2, etc. as the header row.
We can pass the row number we want to use as the header.
import pandas as pd

df = pd.read_html('http://www.contextures.com/xlSampleData01.html', header=0)
df[0].head()
[Image: TAlg6M.jpg]
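To see the difference side by side without the site, here is a small sketch. The table is made up by me (only <td> cells, no <thead>), so the values are placeholders and not the real sample data; lxml or bs4 plus html5lib is assumed to be installed.

```python
from io import StringIO

import pandas as pd

# Made-up table where the header row is just ordinary <td> cells.
html = """
<html><body>
<table>
  <tr><td>OrderDate</td><td>Region</td><td>Units</td></tr>
  <tr><td>1/6/2019</td><td>East</td><td>95</td></tr>
  <tr><td>1/23/2019</td><td>Central</td><td>50</td></tr>
</table>
</body></html>
"""

# Default: pandas finds no header, so the columns are numbered 0, 1, 2
# and the real header ends up as the first data row.
raw = pd.read_html(StringIO(html))[0]

# header=0: use the first row as the column names instead.
fixed = pd.read_html(StringIO(html), header=0)[0]

print(raw)
print(fixed)
```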

If you are new to pandas: pandas.DataFrame is central to everything that gets read in.
So if we e.g. read in a csv, it will be in a DataFrame, the same as the table above.
Output:
Users,date,count
daily_users,19-03-2017,219
daily_users,19-03-2018,5040
daily_users,19-03-2019,13579
weekly_users,19-03-2017,1767
weekly_users,19-03-2018,26664
weekly_users,19-03-2019,72166
import pandas as pd

forum = pd.read_csv("forum.csv")
forum['date'] = pd.to_datetime(forum['date'], format='%d-%m-%Y')
forum
[Image: emFbyk.jpg]
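If you do not have a forum.csv file at hand, the same step can be sketched with the csv text held in a string. StringIO stands in for the file here, and the explicit day-first format is my addition so the dates are not left to guessing.

```python
from io import StringIO

import pandas as pd

# Same content as forum.csv, kept in a string for this sketch.
csv_text = """Users,date,count
daily_users,19-03-2017,219
daily_users,19-03-2018,5040
daily_users,19-03-2019,13579
weekly_users,19-03-2017,1767
weekly_users,19-03-2018,26664
weekly_users,19-03-2019,72166
"""

forum = pd.read_csv(StringIO(csv_text))

# The dates are day-month-year, so spell the format out.
forum["date"] = pd.to_datetime(forum["date"], format="%d-%m-%Y")

print(forum.dtypes)
print(forum)
```

Use forum["count"] rather than forum.count to get the column, since count is also a DataFrame method.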
There are many ways to plot this, e.g. with the newer library Altair.
Yes, it is the user growth on this forum over the last two years.
import pandas as pd
import altair as alt

forum = pd.read_csv("forum.csv")
forum['date'] = pd.to_datetime(forum['date'], format='%d-%m-%Y')

alt.Chart(forum).mark_line().encode(
    x='date',
    y='count',
    color='Users'
)    
[Image: hL6MsW.jpg]

Rich Output: this demo notebook shows different ways to display output in different formats.