Mar-01-2020, 06:56 PM
Hello to all,
I'm trying to clean an HTML file that has repeated paragraphs within its body. The input file and the expected output are linked below.
Input.html https://jsfiddle.net/97ptc0Lh/4/
Output.html https://jsfiddle.net/97ptc0Lh/1/
I've been trying the following code, which uses BeautifulSoup, but I don't know why it isn't working: the resulting list CleanHtml still contains the repeated elements (paragraphs) that I'd like to remove. I already asked here, but I haven't made much progress.
from bs4 import BeautifulSoup

fp = open("Input.html", "rb")
soup = BeautifulSoup(fp, "html5lib")

Uniques = set()
CleanHtml = []
for element in soup.html:
    if element not in Uniques:
        Uniques.add(element)
        CleanHtml.append(element)

print(CleanHtml)

Thanks in advance for any help.
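A possible explanation, offered as a sketch rather than a confirmed diagnosis: iterating over soup.html only visits its direct children (typically head and body), so the individual paragraphs inside body are never compared with each other. The example below assumes the duplicates are <p> elements with identical text; the function name remove_duplicate_paragraphs and the sample markup are hypothetical, and it uses the built-in "html.parser" instead of "html5lib" only to avoid the extra dependency.

```python
from bs4 import BeautifulSoup


def remove_duplicate_paragraphs(html: str) -> str:
    """Return the markup with repeated <p> elements removed.

    Assumption: two paragraphs count as duplicates when their
    stripped text is identical; the first occurrence is kept.
    """
    soup = BeautifulSoup(html, "html.parser")
    seen = set()
    # find_all() returns a list, so removing tags while looping is safe.
    for paragraph in soup.find_all("p"):
        key = paragraph.get_text(strip=True)
        if key in seen:
            paragraph.decompose()  # drop the repeated paragraph from the tree
        else:
            seen.add(key)
    return str(soup)


# Hypothetical sample input, not the content of Input.html.
sample = "<html><body><p>A</p><p>B</p><p>A</p></body></html>"
print(remove_duplicate_paragraphs(sample))
```

If paragraphs should only count as duplicates when their attributes also match, str(paragraph) could be used as the key instead of the text.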