Python Forum

Full Version: How to remove HTML content from a column of the dataframe in Python 3.6?
Hi,

I have a CSV file that has a column named merchant product id.
This column contains unexpected data in the form of HTML, like: <p class=""MsoNormal"" style=""border:

For further processing I need to remove this from my dataframe.
How can I remove these unexpected HTML values from the column? Please find attached a sheet containing some data for testing.
The file is not CSV, it's XLSX.
Row 210 has HTML code embedded in it.
It is probably some sort of formatting, but if it's not needed, open the file in a spreadsheet and delete that row, then save it as a CSV file, which will be easier to read.

I have attached the modified file.
(Jul-25-2018, 09:38 PM)Larz60+ Wrote: The file is not CSV, it's XLSX.
Row 210 has HTML code embedded in it.
It is probably some sort of formatting, but if it's not needed, open the file in a spreadsheet and delete that row, then save it as a CSV file, which will be easier to read.

I have attached the modified file.

I thought there must be a Python way to deal with this.
I have given only a sample. In total I have 4000 records, many of which have product codes containing HTML data that should be removed.
This appears to be a script you can modify to do what you wish:
https://stackoverflow.com/questions/2010...ing-python


Don't run this code as is: it only writes a new file, and that's not what you want to do.
I'll check this post later today, and if someone hasn't answered it by then, I'll see what I can suggest.
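
For the question above, here is a minimal sketch of one Python way to strip HTML from a single column. It is not the script from the linked Stack Overflow answer. The helper names strip_html and clean_column are hypothetical, the regular expression is a simple tag stripper rather than a full HTML parser, and it assumes that an unclosed tag (like the truncated sample in the first post) should be dropped through to the end of the cell.

```python
import html
import re

# Matches a complete tag such as <p class="MsoNormal"> or </p>.
# Assumption: a "<" with no closing ">" is a truncated tag, so it is
# removed together with everything after it in the cell.
TAG_RE = re.compile(r"<[^>]*(?:>|$)")


def strip_html(value):
    """Return value without HTML tags; non-string values pass through."""
    if not isinstance(value, str):
        return value  # leave NaN, numbers, etc. untouched
    text = TAG_RE.sub("", value)   # drop the tags
    text = html.unescape(text)     # turn entities like &amp; back into &
    return " ".join(text.split())  # collapse leftover whitespace


def clean_column(df, column):
    """Return a copy of a pandas DataFrame with HTML stripped from one column.

    df is assumed to be a pandas DataFrame; column is the column label.
    """
    cleaned = df.copy()
    cleaned[column] = cleaned[column].map(strip_html)
    return cleaned
```

With pandas installed, the attached sheet could be loaded with pd.read_excel (or pd.read_csv after saving it as CSV) and passed to clean_column along with the column label. The exact label in the file may differ from "merchant product id", so check df.columns first. Cells that consist only of HTML come back as empty strings, which can then be filtered out or replaced as needed.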