Python Forum

Full Version: how to edit data frames and convert to a list(pandas, read_html()) ?
You're currently viewing a stripped down version of our content. View the full version with proper formatting.
I used library pandas, read_html() to import a table from a webpage.
I want to insert values from table read_html in ms msl table
but for this I must edit table read_html and convert to list.
This is difficult to do because .read_html() produces a list of dataframes.


my python code:
import requests
import pandas as pd
r = requests.get('URL')
pd.set_option('max_rows',10000) 
df = pd.read_html(r.content)
print(df)
result print(df) - dataframes:
Output:
[ 0 1 2 3 0 Number Name Plan NaN 1 NaN NaN not(selected) NaN 2 53494580 + (53)494580 NP_551 NaN 3 53494581 + (53)494581 NP_551 NaN 4 53494582 + (53)494582 NP_551 NaN 5 55110000 + (53)494583 NP_551 NaN]
I would like the following results to be written to the ms msl table:
Output:
[['1','NaN','NaN','not(selected)','NaN'], ['2','53494580','+ (53)494580','NP_551','NaN'], ['3','53494581','+ (53)494581','NP_551','NaN'], ['4','53494582','+ (53)494582','NP_551','NaN'], ['5','55110000','+ (53)494583','NP_551','NaN]']
how to edit data frames and convert to a list?
I would be grateful for any help.
we can use regex by importing re module
then find the string pattern
result = [ 0 1 2 3
0 Number Name Plan NaN
1 NaN NaN not(selected) NaN
2 53494580 + (53)494580 NP_551 NaN
3 53494581 + (53)494581 NP_551 NaN
4 53494582 + (53)494582 NP_551 NaN
5 55110000 + (53)494583 NP_551 NaN]

#make sure result is in string form

import re
convertedlist = re.compile('(\d) (\d*) (\+ \(\d*\)\d*) (NP_551) (NaN)').findall(result)
Output:
[('2', '53494580', '+ (53)494580', 'NP_551', 'NaN'), ('3', '53494581', '+ (53)494581', 'NP_551', 'NaN'), ('4', '53494582', '+ (53)494582', 'NP_551', 'NaN'), ('5', '55110000', '+ (53)494583', 'NP_551', 'NaN')]
convertedlist is in nested tuples instead of nested list. note that only last 4 items listed , the ['1','NaN','NaN','not(selected)','NaN'] left out due to unique patterns.
if you want list of lists:
try
my_list = df.values.to_list()
or use BeautifulSoup to parse the site, not pandas.

@ka06059 - please, don't post regex solution for everything. in this case OP is using pandas and the package offers enough tools of its own
Can skiprows and set name header.
Then use values.tolist() and insert header names.
Example:
>>> import pandas as pd
... from io import StringIO
...
... data = """\
... 1, 2, 3
... 3, 4, 6
... 6, 7, 8
... 9, 10, 11"""
...
... df = pd.read_csv(StringIO(data), skiprows=[0,1], names = ["foo", "bar", "spam"])

>>> df
   foo  bar  spam
0    6    7     8
1    9   10    11

>>> lst = df.values.tolist()
>>> first = list(df.columns)
>>> lst.insert(0, first)
>>> lst
[['foo', 'bar', 'spam'], [6, 7, 8], [9, 10, 11]]