Data Cleanup Help Needed

Sonali · (This post was last modified: Dec-12-2019, 08:11 PM by Sonali.)

Hello,

I am pulling CRM entity in our datalake using APIs. Most columns from this particular entity are bringing in the correct information but the column 'description' is actually brining in the complete css/html code. I need to clean up this column to show only the text.

I used

dfObj = pd.DataFrame([sub.split("|") for sub in activity])

to create the pipe delimited file and then pulled in this specific field using

dfObj.iloc[0,15]

Here is the example of what is coming in the description field:

'<html xmlns:v="urn:schemas-microsoft-com:vml" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:w="urn:schemas-microsoft-com:office:word" xmlns:m="http://schemas.microsoft.com/office/2004/12/omml" xmlns="http://www.w3.org/TR/REC-html40">\r\n<head>\r\n<meta http-equiv="Content-Type" content="text/html; charset=utf-8">\r\n<meta name="Generator" content="Microsoft Word 15 (filtered medium)">\r\n<style></style>\r\n</head>\r\n<body lang="EN-US" link="#0563C1" vlink="#954F72">\r\n<div class="WordSection1">\r\nGarreth,<o:p></o:p>\r\n<o:p> </o:p>\r\nAttached are the W9 and COI for MedEd. Looking forward to our call this afternoon. Let me know if you need any additional documents from MedEd.<o:p></o:p>\r\n<o:p> </o:p>\r\nThanks,<o:p></o:p>\r\n<o:p> </o:p>\r\n<o:p> </o:p>\r\nJason Smith \r\nDirector, Marketing<o:p></o:p>\r\n \r\n<img width="256" height="57" style="width:2.6666in;height:.5937in" id="Picture_x0020_1" src="imagepath?AttachmentType=1001&AttachmentId=8da1f035-eb7e-e811-8146-e0071b6a7141" alt="MedEd"> \r\n \r\n2827 test address \r\ncity, state, zip \r\nOffice: 317-975-7394 \r\n<a href="https://www.medxcelfm.com/">www.medxcelfm.com</a><o:p></o:p>\r\n<o:p> </o:p>\r\n<o:p> </o:p>\r\n<o:p> </o:p>\r\n</div>\r\n</body>\r\n</html>\r\n'

I have tried the following to clean up this data:

import re

descriptioncleanup = 
re.search("<[^<]+?>",dfObj.iloc[0,15])
re.sub("<[^<]+?>", "",dfObj.iloc[0,15])
result = re.sub("<[^<]+?>", "",dfObj.iloc[0,15])
result2 = re.sub("\n","",result)
result3 = re.sub("\r","",result2)
result4 = re.sub(";","",result3)
result5 = re.sub("&nbsp","",result4)
result9 = re.sub("behaior:url","",result5)
result10 = re.sub("behairurl","",result9)
result11 = re.sub("#default#VML","",result10)
result12 = re.sub(".shape","", result11)

print(result12)

This cleans up most of things but does not clean all. I would like to show only the bolded text in the column. I am very new to the Python programming and was wondering if there is a better way to clean this data. Any help is appreciated.

Thanks,
Sonali

**scidam** · Dec-13-2019, 12:17 AM

I would use BeautifulSoup or lxml for this, e.g.

import bs4

data = """<html xmlns:v="urn:schemas-microsoft-com:vml" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:w="urn:schemas-microsoft-com:office:word" xmlns:m="http://schemas.microsoft.com/office/2004/12/omml" xmlns="http://www.w3.org/TR/REC-html40">\r\n<head>\r\n<meta http-equiv="Content-Type" content="text/html; charset=utf-8">\r\n<meta name="Generator" content="Microsoft Word 15 (filtered medium)">\r\n<!--[if !mso]><style>v\\:* {behavior:url(#default#VML);}\r\no\\:* {behavior:url(#default#VML);}\r\nw\\:* {behavior:url(#default#VML);}\r\n.shape {behavior:url(#default#VML);}\r\n</style><![endif]--><style><!--\r\n/* Font Definitions */\r\n@font-face\r\n\t{font-family:"Cambria Math";\r\n\tpanose-1:2 4 5 3 5 4 6 3 2 4;}\r\n@font-face\r\n\t{font-family:Calibri;\r\n\tpanose-1:2 15 5 2 2 2 4 3 2 4;}\r\n/* Style Definitions */\r\np.MsoNormal, li.MsoNormal, div.MsoNormal\r\n\t{margin:0in;\r\n\tmargin-bottom:.0001pt;\r\n\tfont-size:11.0pt;\r\n\tfont-family:"Calibri",sans-serif;}\r\na:link, span.MsoHyperlink\r\n\t{mso-style-priority:99;\r\n\tcolor:#0563C1;\r\n\ttext-decoration:underline;}\r\na:visited, span.MsoHyperlinkFollowed\r\n\t{mso-style-priority:99;\r\n\tcolor:#954F72;\r\n\ttext-decoration:underline;}\r\nspan.EmailStyle17\r\n\t{mso-style-type:personal-compose;\r\n\tfont-family:"Calibri",sans-serif;\r\n\tcolor:windowtext;}\r\n.MsoChpDefault\r\n\t{mso-style-type:export-only;\r\n\tfont-family:"Calibri",sans-serif;}\r\n@page WordSection1\r\n\t{size:8.5in 11.0in;\r\n\tmargin:1.0in 1.0in 1.0in 1.0in;}\r\ndiv.WordSection1\r\n\t{page:WordSection1;}\r\n--></style><!--[if gte mso 9]><xml>\r\n<o:shapedefaults v:ext="edit" spidmax="1026" />\r\n</xml><![endif]--><!--[if gte mso 9]><xml>\r\n<o:shapelayout v:ext="edit">\r\n<o:idmap v:ext="edit" data="1" />\r\n</o:shapelayout></xml><![endif]-->\r\n</head>\r\n<body lang="EN-US" link="#0563C1" vlink="#954F72">\r\n<div class="WordSection1">\r\n<p class="MsoNormal">Garreth,<o:p></o:p></p>\r\n<p class="MsoNormal"><o:p>&nbsp;</o:p></p>\r\n<p class="MsoNormal">Attached are the W9 and COI for MedEd. Looking forward to our call this afternoon. Let me know if you need any additional documents from MedEd.<o:p></o:p></p>\r\n<p class="MsoNormal"><o:p>&nbsp;</o:p></p>\r\n<p class="MsoNormal">Thanks,<o:p></o:p></p>\r\n<p class="MsoNormal"><o:p>&nbsp;</o:p></p>\r\n<p class="MsoNormal"><o:p>&nbsp;</o:p></p>\r\n<p class="MsoNormal"><b><span style="font-family:&quot;Arial&quot;,sans-serif;color:black">Jason Smith</span></b><span style="font-family:&quot;Arial&quot;,sans-serif;color:black"><br>\r\nDirector, Marketing<o:p></o:p></span></p>\r\n<p class="MsoNormal"><span style="font-size:10.0pt;font-family:&quot;Arial&quot;,sans-serif;color:black"><br>\r\n<b><img width="256" height="57" style="width:2.6666in;height:.5937in" id="Picture_x0020_1" src="imagepath?AttachmentType=1001&AttachmentId=8da1f035-eb7e-e811-8146-e0071b6a7141" alt="MedEd"></b><br>\r\n<br>\r\n2827 test address<br>\r\ncity, state, zip<br>\r\nOffice: 317-975-7394<br>\r\n<a href="https://www.medxcelfm.com/"><span style="color:#5877A4">www.medxcelfm.com</span></a><o:p></o:p></span></p>\r\n<p class="MsoNormal"><o:p>&nbsp;</o:p></p>\r\n<p class="MsoNormal"><o:p>&nbsp;</o:p></p>\r\n<p class="MsoNormal"><o:p>&nbsp;</o:p></p>\r\n</div>\r\n</body>\r\n</html>\r\n"""

ss = bs4.BeautifulSoup(data)
print(ss.find('body').text)

Output:Garreth,
 
Attached are the W9 and COI for MedEd. Looking forward to our call this afternoon. Let me know if you need any additional documents from MedEd.
 
Thanks,
 
 
Jason Smith
Director, Marketing



2827 test address
city, state, zip
Office: 317-975-7394
www.medxcelfm.com

Data Cleanup Help Needed

User Panel Messages

Announcements