Python Forum

Full Version: regex multi-line
Hi,
I have some pages where the content is not wrapped in any specific tag (<something><like><this>) other than <HTML>,
so I can't use BeautifulSoup to select the content.

The other problem is blocks like this:
Error:
<div> random content that i don't needed another random line </div>
I would like to remove them: anything that starts with <random> and ends with </random>, including <div>random</div>, <script language....>thescript</script>, or anything else, except <html> to </html>.
If a block is on a single line, I can do it.
But the blocks are multi-line, and the number of lines is random: it can be 1 line or dozens of lines.
So, how do I write a regex for this?
Very difficult to say anything relevant without actual data.
(Aug-27-2022, 05:25 PM)Gribouillis Wrote: Very difficult to say anything relevant without actual data.
I am trying to read the novel on my phone while I am offline:
https://lightnovelstranslations.com/the-...re-part-4/

Note: is there any legal problem with posting the link in this forum?
If you want to scrape the story, I had some success by splitting the content on horizontal lines
>>> url = "the url you gave"
>>> import requests
>>> r = requests.get(url)
>>> L = r.content.split(b'<hr>')
>>> story = L[2]
(Aug-27-2022, 08:36 PM)Gribouillis Wrote: If you want to scrape the story, I had some success by splitting the content on horizontal lines

This is what I was looking for; it solves my problem.
But in case this situation comes up again, it would be nice to have a regex that can be used.
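For reference, a minimal sketch of the multi-line regex asked about above. The sample page is made up for illustration, and the name PAIRED_TAG is hypothetical. The key part is re.DOTALL, which lets "." match newlines so one block can span any number of lines. It assumes the blocks are not nested inside a tag of the same name; nested or malformed HTML will trip it up, which is why a parser is usually the safer choice.

```python
import re

# Hypothetical sample page: tag names and text are made up for illustration.
html = """<html>
<div>
random content
another random line
</div>
Story line one.
<script language="javascript">
var x = 1;
</script>
Story line two.
</html>"""

# (?!html\b) skips the <html> tag itself.
# (\w+) captures the tag name and \1 requires the same name in the closing tag.
# re.DOTALL lets "." match newlines, so a block may span 1 line or dozens.
# .*? is lazy, so each match stops at the first matching closing tag.
PAIRED_TAG = re.compile(
    r"<(?!html\b)(\w+)\b[^>]*>.*?</\1\s*>",
    re.DOTALL | re.IGNORECASE,
)

cleaned = PAIRED_TAG.sub("", html)
print(cleaned)
```

Only the <div> and <script> blocks are removed here; the text between them and the outer <html> pair are kept.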
(Aug-27-2022, 08:54 PM)kucingkembar Wrote: it would be nice to have a regex that can be used
Regex and HTML are not best friends: the classic post 🌞 that never gets old.

I would do it like this.
import requests
from bs4 import BeautifulSoup
# pip install html2text
import html2text

url = 'https://lightnovelstranslations.com/the-galactic-navy-officer-becomes-an-adventurer/chapter-95-preparations-for-departure-part-4/'
response = requests.get(url)
soup = BeautifulSoup(response.content, 'lxml')
story = soup.select_one('#post-104395 > div')
text_maker = html2text.HTML2Text()
text_maker.ignore_links = True
text = text_maker.handle(story.prettify())
print(text) 
Output:
Chapter 95 - Preparations for Departure Part 3 * * * **Translator: SFBaka** **Editor: Thor’s Stone** * * * –Roberto’s POV– The princess and Alan-sama welcomed our arrival at the royal capital with more enthusiasm than I expected. I’m glad I prepared myself beforehand to get scolded for arbitrarily departing with an advanced party. It was already late at night, and most of the others have returned to their rooms. But some of the leaders including Adjutant Dalshim still remained in the hall to talk more with me. “So how is it? What is your impression of serving under Alan-sama, Dalshim- dono?” “In a word, splendid. I can declare without any qualms that everything we’ve accomplished so far was largely due to Alan-sama’s contributions.” .....
(Aug-27-2022, 09:53 PM)snippsat Wrote:
Regex and HTML are not best friends: the classic post 🌞 that never gets old.
I have read that link before. About the solution there (sorry if this is an off-topic question):
what does "Have you tried using an XML parser instead?" mean?
Is there any link about it?
Anyway, your code works. I have added a reputation point for you again, and for the other person who replied.
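On the "XML parser" question: that phrase generally means letting a parser walk the tags instead of matching them with a regex. BeautifulSoup in the code above already does this. Below is a rough sketch of the same idea using only the standard library's html.parser module. The class name TextExtractor, the SKIP set, and the sample page are assumptions made up for illustration, not something from the site in question.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, ignoring everything inside the SKIP tags."""

    SKIP = {"script", "style"}  # assumption: adjust this set to the page

    def __init__(self):
        super().__init__()
        self.depth = 0    # greater than 0 while inside a skipped element
        self.chunks = []  # pieces of text that were kept

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        # Keep non-empty text only when not inside a skipped element.
        if self.depth == 0 and data.strip():
            self.chunks.append(data.strip())


# Hypothetical sample page, made up for illustration.
sample = """<html>
<script language="javascript">
var x = 1;
</script>
<p>Story line one.</p>
<div>Story
line two.</div>
</html>"""

parser = TextExtractor()
parser.feed(sample)
parser.close()
text = "\n".join(parser.chunks)
print(text)
```

The parser does not care how many lines a block spans, so the multi-line problem does not come up at all.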