Python Forum

Full Version: HTML Decoder pandas dataframe column
I am getting HTML-encoded text that I want to decode. Decoding works with a test string, but not with my pandas dataframe column. Any suggestions?

#!/usr/bin/env python
# coding: utf-8

# import statements
import requests
import pandas as pd
import html

# constants
url = "https://chartexp1.sha.maryland.gov/CHARTExportClientService/getDMSMapDataJSON.do"

# getting response
response = requests.request("GET", url).json()

# converting to dataframe
df = pd.DataFrame(response['data'])

#adding new column/converting msgHTML Encoded to decoded
df['decodedHtml'] = html.unescape(df['msgHTML'])

# saving dataframe to csv
df.to_csv('output/response_python.csv')


##TESTING ONLY##
myHtml = "<body><h1> How to use html.unescape() in Python </h1></body>"
encodedHtml = html.escape(myHtml)
print("Encoded HTML: ", encodedHtml)
decodedHtml = html.unescape(encodedHtml)

print("Decoded HTML: ", decodedHtml)

print(html.unescape('&copy; 2023'))

 
Hi,

the information provided is a bit thin... Is `response['data']` really HTML? I tried to make an API call with the URL from your post, but I received a time-out error...

What do you get instead when exporting your dataframe to CSV?

Regards, noisefloor
Thanks for the reply. I apologize for the lack of information.

The issue is with the following line:

#adding new column/converting msgHTML Encoded to decoded
df['decodedHtml'] = html.unescape(df['msgHTML'])
df['msgHTML'] has content similar to the following:

&lt;table class='dmsMsg'&gt;&lt;tr class='dmsMsgRow'&gt;&lt;td class='dmsMsgTextCenter'&gt;I-695       15 MILES&lt;/td&gt;&lt;/tr&gt;&lt;tr class='dmsMsgRow'&gt;&lt;td class='dmsMsgTextCenter'&gt;&amp;nbsp;&lt;/td&gt;&lt;/tr&gt;&lt;tr class='dmsMsgRow'&gt;&lt;td class='dmsMsgTextCenter'&gt; 14 MINUTES&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;
What I am attempting to do is convert that to the following format:

<table class='dmsMsg'><tr class='dmsMsgRow'><td class='dmsMsgTextCenter'>I-695       15 MILES</td></tr><tr class='dmsMsgRow'><td class='dmsMsgTextCenter'>&nbsp;</td></tr><tr class='dmsMsgRow'><td class='dmsMsgTextCenter'> 14 MINUTES</td></tr></table>
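For a single string, the conversion described above can be checked directly with html.unescape. This is only a minimal sketch: the `encoded` value is a shortened, hypothetical sample in the same shape as the msgHTML content, not real API output.

```python
import html

# Shortened, hypothetical sample shaped like one msgHTML value.
encoded = "&lt;td class='dmsMsgTextCenter'&gt;&amp;nbsp;&lt;/td&gt;"

# One call decodes one level of escaping: &lt; becomes <, and
# &amp;nbsp; becomes the literal text &nbsp; (not a space character).
decoded = html.unescape(encoded)
print(decoded)

# Decoding a second time would also turn &nbsp; into a real
# non-breaking space (U+00A0), which is usually not wanted here.
decoded_twice = html.unescape(decoded)
```

So a single pass is enough to get the target format shown above, with `&nbsp;` left in place as markup.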
html.unescape(str) takes a single string, so it cannot be used as a vectorized operation on a whole column. You have to fall back to Series.apply(func), which calls it once per value:
df["msgHTML"] = df["msgHTML"].apply(html.unescape)
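If the column can contain missing values, passing them to html.unescape raises a TypeError, because a missing value is not a string. A small sketch of one way around that, assuming pandas is installed and using Series.map with na_action="ignore"; the sample rows here are made up for illustration and stand in for response['data']:

```python
import html

import pandas as pd

# Hypothetical rows standing in for the API data; None marks a missing message.
df = pd.DataFrame({
    "msgHTML": [
        "&lt;td&gt;I-695 15 MILES&lt;/td&gt;",
        "&lt;td&gt; 14 MINUTES&lt;/td&gt;",
        None,
    ]
})

# Call html.unescape once per value; na_action="ignore" leaves
# missing values untouched instead of passing them to the function.
df["decodedHtml"] = df["msgHTML"].map(html.unescape, na_action="ignore")

print(df["decodedHtml"].tolist())
```

Writing to a new column such as decodedHtml, as in the original script, keeps the encoded text available for comparison.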