How to remove multiple tags using regex
Python Forum (https://python-forum.io)
Forum: Python Coding (https://python-forum.io/forum-7.html) / Homework (https://python-forum.io/forum-9.html)
Thread: How to remove multiple tags using regex (/thread-32842.html)
How to remove multiple tags using regex - sbmonzur - Mar-09-2021

Hi all,

Newbie here! I am using Python 3.8.3 and am trying to remove tags from the attached text file:
https://drive.google.com/file/d/1V3s8w8a3NQvex91EdOhdC9rQtCAOElpm/view?usp=sharing

I want to extract three lists (titles, publication dates, and main text of the articles) and remove the tags. With the code below, I have been able to remove the tags from the titles and publication dates. However, I am not able to properly remove all tags from the main texts. In the text file, the main text starts with the tag <div class="story-element story-element-text"> and ends before the next <h1 class tag. Any help in extracting this part of the text would be highly appreciated!

The article text is in a non-English script, but all the HTML tags are in English.

import re

# Open the text file, which contains newspaper article information
# scraped off a website using BeautifulSoup
with open('listfile.txt', 'r', encoding='utf8') as my_file:
    text = my_file.read()
print(text)  # not printing output here because it's too large

# Remove tags and generate a list of newspaper article titles
titles = re.findall(r'<h1.*?>(.*?)</h1>', text)
print(titles)

# Remove tags and generate a list of newspaper article publication dates
dates = re.findall(r'<div class="storyPageMetaData-m__publish-time__19bdV"><span>(.*?)</span>', text)
print(dates)

# Remove tags and generate a list containing the main text of the articles
# This is where I am unable to remove the tags
bodytext = re.findall(r'<div class="story-element story-element-text">(.*?)</div>', text)
print(bodytext)

RE: How to remove multiple tags using regex - Larz60+ - Mar-10-2021

Why have you created three threads for the same topic? Please use the same thread if you need modifications.

RE: How to remove multiple tags using regex - sbmonzur - Mar-10-2021

(Mar-10-2021, 02:06 AM)Larz60+ Wrote: why have you created three threads for the same topic?

Sorry, I had a connection issue when posting and ended up duplicating the messages. It seems that I cannot delete any of the threads now.

RE: How to remove multiple tags using regex - buran - Mar-10-2021

Don't use regex to parse HTML. Use a parser like Beautiful Soup.
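
To illustrate the "use a parser" advice, here is a minimal sketch. It deliberately uses the standard library's html.parser instead of Beautiful Soup so that it runs without installing anything; the idea is the same. The names ArticleParser, extract_articles and sample are made up for this example, and the sample markup is only a guess at the file's structure based on the tags quoted in the question (the "headline" class in particular is invented). The sketch also assumes each body <div> has exactly the class string from the question.

```python
from html.parser import HTMLParser

# Class string quoted in the question; assumed to match exactly.
BODY_CLASS = "story-element story-element-text"


class ArticleParser(HTMLParser):
    """Group tag-free title and body text under each <h1> (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.articles = []    # one (title_parts, body_parts) pair per <h1>
        self._in_h1 = False
        self._div_depth = 0   # > 0 while inside a body <div>, counting nested <div>s

    def handle_starttag(self, tag, attrs):
        if tag == "h1":
            self.articles.append(([], []))   # a new <h1> starts a new article
            self._in_h1 = True
        elif tag == "div":
            if self._div_depth:
                self._div_depth += 1         # nested <div> inside a body block
            elif dict(attrs).get("class") == BODY_CLASS:
                self._div_depth = 1

    def handle_endtag(self, tag):
        if tag == "h1":
            self._in_h1 = False
        elif tag == "div" and self._div_depth:
            self._div_depth -= 1

    def handle_data(self, data):
        if not self.articles:
            return                           # ignore text before the first <h1>
        title_parts, body_parts = self.articles[-1]
        if self._in_h1:
            title_parts.append(data)
        elif self._div_depth:
            body_parts.append(data)


def extract_articles(html):
    """Return a list of (title, body) tuples with all tags stripped."""
    parser = ArticleParser()
    parser.feed(html)
    parser.close()

    def tidy(parts):
        # Join the text fragments and collapse runs of whitespace.
        return " ".join("".join(parts).split())

    return [(tidy(t), tidy(b)) for t, b in parser.articles]


# Made-up sample markup, only loosely modelled on the question.
sample = (
    '<h1 class="headline">First title</h1>'
    '<div class="storyPageMetaData-m__publish-time__19bdV"><span>09 Mar 2021</span></div>'
    '<div class="story-element story-element-text">'
    '<p>First <b>bold</b> paragraph.</p>\n<p>Second.</p></div>'
    '<h1 class="headline">Second title</h1>'
    '<div class="story-element story-element-text"><div><p>Nested text</p></div></div>'
)

for title, body in extract_articles(sample):
    print(title, "|", body)
```

Because the parser tracks nesting depth, a body block is only closed by its own </div>, whereas a non-greedy pattern such as (.*?)</div> stops at the first </div> it meets. Note also that in a regex, . does not match newlines unless re.DOTALL is passed, so body text spread over several lines would not be matched at all. With Beautiful Soup installed, the same grouping could presumably be written more compactly, but that variant is not shown here.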