URL DECODING
Python Forum (https://python-forum.io), General Coding Help (https://python-forum.io/forum-8.html), thread /thread-15060.html
URL DECODING - UnionSystems - Jan-01-2019

This script reads a text file containing an HTML table and uses pandas to parse that table into a dataframe. It then writes that dataframe as a CSV to another text file.

    import pandas as pd
    f=open('input_table.html','r')
    html = f.read()
    df = pd.read_html(html)
    df[0].to_csv('output.csv',index=False,header=False)

The problem is that some of the cells of the table randomly contain URL-encoded HTML. So for example

    <h1>My Heading</h1>

will be

    %3Ch1%3EMy%20Heading%3C%2Fh1%3E

I've tried

    import pandas as pd
    import urllib
    f=open('input_table.html','r')
    html = f.read()
    html=urllib.unquote(html).decode('utf8')
    df = pd.read_html(html)
    df[0].to_csv('output.csv',index=False,header=False)

But that gives scrambled results in the CSV.

RE: URL DECODING - snippsat - Jan-02-2019

You have to be careful when downloading HTML to disk and then reading it back again, so the encoding doesn't get messed up. You should always use Requests, then you get the correct encoding back.
Example:

    >>> import requests
    >>>
    >>> response = requests.get('https://www.contextures.com/xlSampleData01.html')
    >>> response.status_code
    200
    >>> response.encoding
    'ISO-8859-1'

To disk:

    import requests

    response = requests.get('https://www.contextures.com/xlSampleData01.html')
    html = response.text
    with open('html_raw.html', 'w', encoding='ISO-8859-1') as f_out:
        f_out.write(html)

Read saved data pandas:

I could of course also read the URL directly, without saving to disk.

    df = pd.read_html('http://www.contextures.com/xlSampleData01.html', header=0)

RE: URL DECODING - UnionSystems - Jan-02-2019

Sorry, I am a learner. What I want to do is convert the data inside the dataframe, so that

    %3Ch1%3EMy%20Heading%3C%2Fh1%3E

becomes

    <h1>My Heading</h1>
RE: URL DECODING - snippsat - Jan-02-2019

(Jan-02-2019, 01:36 AM)UnionSystems Wrote: Sorry I am a learner. What I want to do is convert data inside dataframe. So that

What I mean is that the HTML can get messed up when it is saved, before it gets into the pandas DataFrame. I cannot tell how you got that HTML data to disk.
To just decode it (answers will always be in Python 3, which you should also use):

    >>> import urllib.parse
    >>>
    >>> url = '%3Ch1%3EMy%20Heading%3C%2Fh1%3E'
    >>> urllib.parse.unquote(url)
    '<h1>My Heading</h1>'

RE: URL DECODING - UnionSystems - Jan-02-2019

I cannot change the data I am getting from the text file. So how can I apply

    urllib.parse.unquote(url)

to all rows in a dataframe? I have tried

    df=df.apply(lambda x:urllib.unquote(x).decode('utf8'))

But I get the error

    AttributeError: ("'Series' object has no attribute 'split'", u'occurred at index 0')

RE: URL DECODING - UnionSystems - Jan-02-2019

Found the solution with only 2 lines of code

    import pandas as pd
    import urllib
    f=open('input_table.html','r')
    html = f.read()
    df = pd.read_html(html)
    df = df[0].applymap(str)
    df = df.applymap(urllib.unquote_plus)
    df.to_csv('output.csv',index=False,header=False)
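
For anyone reading this on Python 3: the solution above is Python 2 code, and in Python 3 the decoding functions live in urllib.parse. The earlier AttributeError comes from DataFrame.apply handing a whole column (a Series) to the function, while unquote expects a single string, so the decoder has to be applied cell by cell. Below is a minimal sketch of that idea, not a definitive implementation. The sample DataFrame and the helper name decode_cells are made up for illustration and stand in for df[0] from pd.read_html.

```python
from urllib.parse import unquote, unquote_plus

import pandas as pd

# Hypothetical sample data standing in for df[0] returned by pd.read_html.
df = pd.DataFrame({
    "id": [1, 2],
    "content": ["%3Ch1%3EMy%20Heading%3C%2Fh1%3E", "plain+text"],
})


def decode_cells(frame, decoder=unquote_plus):
    """Return a copy of frame with every cell cast to str and URL-decoded.

    decode_cells is an illustrative helper, not part of pandas.
    """
    # astype(str) makes every cell a string so the decoder never sees numbers.
    # apply() passes each column as a Series; Series.map then calls the
    # decoder once per cell, which is what unquote/unquote_plus expect.
    return frame.astype(str).apply(lambda col: col.map(decoder))


decoded = decode_cells(df)
print(decoded)
```

Two things to watch out for. unquote_plus also turns "+" into a space, while unquote leaves "+" alone, so pass decoder=unquote if the cells may contain literal plus signs. Casting to str also turns missing values into the text "nan", just as applymap(str) does in the post above.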