Thread: Pandas reading html returning NaT (/thread-852.html), Python Forum (https://python-forum.io), Web Scraping & Web Development
Pandas reading html returning NaT - iFunKtion - Nov-09-2016

Hi there,

I am trying to get pandas to read a Wikipedia page that contains a table of US state abbreviations, but the one column I actually want comes back as NaT. As I understand it, NaT is pandas' counterpart of NaN for dates and times, meaning that the value is missing. The thing is, the data is not missing: it is right there on the page in front of me.

Is there a way to get pandas to read this column? It reads almost every other column in the table, whether it holds strings or dates, and I can't work out why a two-character string would be harder to read than a regular string.

The code I am using is:

    import pandas as pd

    states = pd.read_html('https://simple.wikipedia.org/wiki/List_of_U.S._states')
    print(states)

Kind regards
iFunc

RE: Pandas reading html returning NaT - snippsat - Nov-09-2016

I don't think there is a magic command in pandas that will find an HTML table in a page that has a lot of other stuff loading. I would take the HTML table out of the page and then give it to pandas, e.g. as below. If you use a Jupyter notebook you get a good-looking DataFrame from this.

    import requests
    from bs4 import BeautifulSoup
    import pandas as pd

    url = 'https://simple.wikipedia.org/wiki/List_of_U.S._states'
    url_get = requests.get(url)
    soup = BeautifulSoup(url_get.content, 'html.parser')

    # Narrow down to the article body, then take the first table inside it
    table = soup.find('div', id="mw-content-text")
    table = table.find('table')

    # Save only that table, so pandas has nothing else to parse
    with open('table.html', 'w', encoding='utf-8') as f:
        f.write(str(table))

    # header=0 uses the first table row as the column names
    states = pd.read_html('table.html', header=0)
    print(states[0][:5])
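A quick way to check the original poster's point that the abbreviations really are in the markup is to pull the first table's cells out as plain strings before pandas gets a chance to infer any types. The following is only a minimal sketch using the standard library: `FirstTableParser`, `first_table_rows`, and the inline `SAMPLE_HTML` with its column names are made up for illustration and are not taken from the Wikipedia page, and the parser assumes every `<tr>`, `<td>`, and `<th>` has a closing tag.

```python
from html.parser import HTMLParser


class FirstTableParser(HTMLParser):
    """Collect the cell text of the first <table> as lists of strings."""

    def __init__(self):
        super().__init__()
        self.rows = []       # finished rows, each a list of cell strings
        self._row = None     # row being built, or None outside a <tr>
        self._cell = None    # text pieces of the current cell, or None
        self._depth = 0      # table nesting level; 0 means outside any table
        self._done = False   # True once the first table has been closed

    def handle_starttag(self, tag, attrs):
        if self._done:
            return
        if tag == "table":
            self._depth += 1
        elif self._depth == 1 and tag == "tr":
            self._row = []
        elif self._depth == 1 and tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_endtag(self, tag):
        if self._done:
            return
        if tag == "table" and self._depth:
            self._depth -= 1
            self._done = self._depth == 0
        elif self._depth == 1 and tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif self._depth == 1 and tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None

    def handle_data(self, data):
        # Text only counts while a cell is open; links inside a cell are kept
        if self._cell is not None:
            self._cell.append(data)


def first_table_rows(html):
    """Return the rows of the first table in `html` as lists of strings."""
    parser = FirstTableParser()
    parser.feed(html)
    parser.close()
    return parser.rows


# Hypothetical sample markup standing in for the downloaded page
SAMPLE_HTML = """
<p>Intro text outside the table.</p>
<table>
  <tr><th>Name</th><th>Abbreviation</th><th>Capital</th></tr>
  <tr><td>Alabama</td><td>AL</td><td><a href="#">Montgomery</a></td></tr>
  <tr><td>Alaska</td><td>AK</td><td>Juneau</td></tr>
</table>
"""

rows = first_table_rows(SAMPLE_HTML)
header, body = rows[0], rows[1:]
abbreviations = [row[header.index("Abbreviation")] for row in body]
print(abbreviations)
```

If the two-letter codes show up here as ordinary strings, the NaT values are presumably introduced when pandas converts the column rather than by anything missing from the page. Rows extracted this way can also be handed to `pd.DataFrame(rows[1:], columns=rows[0])`, which keeps every cell as a string.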