Python Forum
URL DECODING
#1
This script reads a text file containing an HTML table, uses pandas to parse that table into a DataFrame, then writes that DataFrame as CSV to another file.

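import pandas as pd

# Read the saved HTML; the with-block closes the file afterwards
with open('input_table.html', 'r') as f:
    html = f.read()
# read_html returns a list of DataFrames, one per <table> found
df = pd.read_html(html)
# Write the first table as CSV without index or header row
df[0].to_csv('output.csv', index=False, header=False)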
The problem is that some cells of the table contain URL-encoded HTML. For example, <h1>My Heading</h1> appears as %3Ch1%3EMy%20Heading%3C%2Fh1%3E

I've tried:

import pandas as pd
import urllib

f=open('input_table.html','r')
html = f.read()
html=urllib.unquote(html).decode('utf8')
df = pd.read_html(html)
df[0].to_csv('output.csv',index=False,header=False)
But that produces scrambled output in the CSV.
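One plausible reason for the scrambling (an assumption; the thread never confirms the cause) is that unquoting the whole file turns the encoded cell contents into real markup before the table is parsed, so the parser sees extra tags inside the cells. A minimal Python 3 sketch using only the standard library, with a made-up one-row table:

```python
from urllib.parse import unquote

# Hypothetical table row: the first cell holds URL-encoded HTML,
# as described in the post above.
raw_row = '<tr><td>%3Ch1%3EMy%20Heading%3C%2Fh1%3E</td><td>42</td></tr>'

# Decoding the whole document before parsing injects a real <h1>
# element into the cell, changing the markup the parser will see.
decoded_row = unquote(raw_row)

print(raw_row.count('<'), 'tags before decoding')
print(decoded_row.count('<'), 'tags after decoding')
print(decoded_row)
```

If an encoded cell happened to contain something like %3C%2Ftd%3E, decoding first would even close the cell early, which suggests decoding the cell values after parsing rather than before.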
#2
You have to be careful when saving HTML to disk and then reading it back, so the encoding doesn't get messed up.
It is safer to fetch the page with Requests, which reports the encoding it used for the response.
Example:
>>> import requests
>>> 
>>> response = requests.get('https://www.contextures.com/xlSampleData01.html')
>>> response.status_code
200
>>> response.encoding
'ISO-8859-1'
To disk:
import requests

response = requests.get('https://www.contextures.com/xlSampleData01.html')
html = response.text
with open('html_raw.html', 'w', encoding='ISO-8859-1') as f_out:
    f_out.write(html)
Reading the saved data with pandas:
[Image: rZ7nNw.jpg]
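To illustrate why the encoding matters when reading the file back, here is a small standard-library sketch. The sample text and the helper name roundtrip are made up for illustration; the point is only that the encoding used for reading has to match the one used for writing. (pandas.read_html also accepts an encoding argument for the same purpose.)

```python
import tempfile
from pathlib import Path

def roundtrip(text, write_encoding, read_encoding):
    """Write text to a temporary file with one encoding, read it back with another."""
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / 'html_raw.html'
        path.write_text(text, encoding=write_encoding)
        return path.read_text(encoding=read_encoding)

sample = '<td>café</td>'  # made-up cell with a non-ASCII character

# Same encoding both ways: the text survives unchanged.
same = roundtrip(sample, 'ISO-8859-1', 'ISO-8859-1')

# Written as UTF-8 but read as ISO-8859-1: the accented letter is garbled.
garbled = roundtrip(sample, 'utf-8', 'ISO-8859-1')

print(same)
print(garbled)
```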
Of course, I could also read the URL directly, without saving to disk.
import pandas as pd
df = pd.read_html('http://www.contextures.com/xlSampleData01.html', header=0)
#3
Sorry, I am a learner. What I want to do is convert the data inside the DataFrame, so that

%3Ch1%3EMy%20Heading%3C%2Fh1%3E

becomes


<h1>My Heading</h1>
#4
(Jan-02-2019, 01:36 AM)UnionSystems Wrote: Sorry I am a learner. What I want to do is convert data inside dataframe. So that
What I mean is that the HTML can get messed up when it is saved, before it ever reaches the pandas DataFrame.
I can't tell how you got that HTML data onto disk.
To just decode it (answers will always be in Python 3, which you should use too):
>>> import urllib.parse
>>> 
>>> url = '%3Ch1%3EMy%20Heading%3C%2Fh1%3E'
>>> urllib.parse.unquote(url)
'<h1>My Heading</h1>'
#5
I cannot change the data I am getting from the text file.

So how can I apply

urllib.parse.unquote(url)
to all rows in a DataFrame?

I have tried

df=df.apply(lambda x:urllib.unquote(x).decode('utf8'))
But I get this error:


AttributeError: ("'Series' object has no attribute 'split'", u'occurred at index 0')
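The error most likely arises because DataFrame.apply hands each whole column (a Series) to the function, while unquote expects a single string. A minimal Python 3 sketch of decoding cell by cell instead; the small frame and the helper name decode_cell are made up here to stand in for the parsed table:

```python
import pandas as pd
from urllib.parse import unquote

def decode_cell(value):
    """URL-decode strings; return any other value (numbers, NaN) unchanged."""
    return unquote(value) if isinstance(value, str) else value

# Made-up stand-in for df[0] as returned by pd.read_html
df = pd.DataFrame({
    0: ['%3Ch1%3EMy%20Heading%3C%2Fh1%3E', 'plain text'],
    1: [1, 2],
})

# apply() passes one column (a Series) at a time; Series.map() then
# calls decode_cell on each individual value in that column.
decoded = df.apply(lambda column: column.map(decode_cell))

print(decoded)
```

Going through Series.map column by column is one option among several; it is used here because it leaves numeric cells untouched without first converting everything to strings.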
#6
Found the solution with only 2 lines of code:

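import pandas as pd
import urllib  # Python 2; in Python 3 unquote_plus lives in urllib.parse

# Read the saved HTML and parse its tables
with open('input_table.html', 'r') as f:
    html = f.read()
df = pd.read_html(html)
# Make every cell a string, then URL-decode each cell individually
df = df[0].applymap(str)
df = df.applymap(urllib.unquote_plus)
df.to_csv('output.csv', index=False, header=False)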
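A note on the choice of function, sketched in Python 3 (where both functions live in urllib.parse): unquote_plus also turns '+' into a space, which suits form-encoded data but would change cells that contain a literal plus sign. The sample string is made up for illustration.

```python
from urllib.parse import unquote, unquote_plus

encoded = 'a+b%20c'  # made-up cell value with both '+' and '%20'

kept_plus = unquote(encoded)           # only %XX escapes are decoded
plus_as_space = unquote_plus(encoded)  # '+' is also turned into a space

print(repr(kept_plus))
print(repr(plus_as_space))
```

Which of the two is right depends on how the cells were encoded in the first place, something the thread does not state.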