Python Forum
regex multi-line
#1
Hi,
I have some pages where the content I want is not wrapped in any specific tag (<something><like><this>) other than <HTML>,
so I can't use BeautifulSoup to select the content.

The other problem is unwanted blocks like this:
Error:
<div> random content that I don't need another random line </div>
I would like to remove them: anything that starts with <random> and ends with </random>, including <div>random</div>, <script language....>thescript</script>, or anything else, except <html> to </html>.
If a block sits on a single line, I can do it,
but these blocks span multiple lines, and the count varies from one line to dozens.
So how do I write a regex for this?
#2
Very difficult to say anything relevant without actual data.
#3
(Aug-27-2022, 05:25 PM)Gribouillis Wrote: Very difficult to say anything relevant without actual data.
I want to read this novel on my phone when I'm offline:
https://lightnovelstranslations.com/the-...re-part-4/

Note: is there any legal problem if I post the link in this forum?
#4
If you want to scrape the story, I had some success by splitting the content on horizontal lines
>>> url = "the url you gave"
>>> import requests
>>> r = requests.get(url)
>>> L = r.content.split(b'<hr>')
>>> story = L[2]
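The same splitting idea can be tried without a network connection. This is only a sketch: the `page` bytes below are invented sample markup standing in for `r.content`, and the helper name `split_on_rules` is hypothetical. Which index holds the story depends entirely on the real page, so index 2 is correct only for this sample.

```python
# Invented sample markup standing in for r.content (an assumption, not the real page).
page = (
    b"<html><body>"
    b"<div>site header</div><hr>"
    b"<div>navigation</div><hr>"
    b"<p>First line of the story.</p>\n<p>Second line.</p><hr>"
    b"<div>footer</div>"
    b"</body></html>"
)


def split_on_rules(content):
    """Split raw HTML bytes on literal <hr> separators."""
    return content.split(b"<hr>")


parts = split_on_rules(page)
# In this sample the story is the third chunk; on a real page, inspect the parts first.
story = parts[2].decode("utf-8")
print(story)
```

Note that this matches only the exact bytes `<hr>`; a page that writes `<hr/>` or `<hr class="...">` would need a different separator.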
#5
(Aug-27-2022, 08:36 PM)Gribouillis Wrote: If you want to scrape the story, I had some success by splitting the content on horizontal lines

This is what I was looking for; it solves my problem.
Still, in case this situation comes up again, it would be nice to have a regex that can be used.
#6
(Aug-27-2022, 08:54 PM)kucingkembar Wrote: it will nice to get that regex that can be use
Regex and HTML are not best friends; see the classic post🌞 that never gets old.

I would do it like this.
import requests
from bs4 import BeautifulSoup
# pip install html2text
import html2text

url = 'https://lightnovelstranslations.com/the-galactic-navy-officer-becomes-an-adventurer/chapter-95-preparations-for-departure-part-4/'
response = requests.get(url)
soup = BeautifulSoup(response.content, 'lxml')
story = soup.select_one('#post-104395 > div')
text_maker = html2text.HTML2Text()
text_maker.ignore_links = True
text = text_maker.handle(story.prettify())
print(text) 
Output:
Chapter 95 - Preparations for Departure Part 3 * * * **Translator: SFBaka** **Editor: Thor’s Stone** * * * –Roberto’s POV– The princess and Alan-sama welcomed our arrival at the royal capital with more enthusiasm than I expected. I’m glad I prepared myself beforehand to get scolded for arbitrarily departing with an advanced party. It was already late at night, and most of the others have returned to their rooms. But some of the leaders including Adjutant Dalshim still remained in the hall to talk more with me. “So how is it? What is your impression of serving under Alan-sama, Dalshim- dono?” “In a word, splendid. I can declare without any qualms that everything we’ve accomplished so far was largely due to Alan-sama’s contributions.” .....
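For the multi-line regex that was asked about, a minimal sketch follows, with the caveat above still in force. It assumes the unwanted blocks are `div`, `script`, or `style` elements that are not nested inside one another; the pattern name `PAIRED_TAG`, the helper `strip_blocks`, and the sample text are all made up for illustration. The key piece is `re.DOTALL`, which lets `.` match newlines so one pattern covers blocks of any number of lines.

```python
import re

# <tag ...> ... </tag> for the listed tag names, spanning line breaks.
# re.DOTALL makes "." match newlines; ".*?" is non-greedy, so each match
# stops at the first closing tag; \1 requires the same tag name to close.
PAIRED_TAG = re.compile(
    r"<(div|script|style)\b[^>]*>.*?</\1\s*>",
    re.DOTALL | re.IGNORECASE,
)


def strip_blocks(html):
    """Remove whole div/script/style blocks, however many lines they span."""
    return PAIRED_TAG.sub("", html)


# Invented sample input (an assumption, not taken from the real page).
sample = """<html>
keep this line
<div>
random content
another random line
</div>
<script language="javascript">
var x = 1;
</script>
keep this too
</html>"""

cleaned = strip_blocks(sample)
print(cleaned)
```

Because the match is non-greedy, a `<div>` nested inside another `<div>` would be cut at the inner closing tag and leave debris behind, which is one reason a parser is usually the safer choice.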
#7
(Aug-27-2022, 09:53 PM)snippsat Wrote:
Regex and Html are not best friends the classics post🌞 that never get old.
I have read that link before. About the solution it gives:
sorry if this is off-topic, but what is the "Have you tried using an XML parser instead?" it mentions?
Is there a link for it?
Anyway, your code works. I added a reputation point for you again, and for the other person who replied.
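On the parser question: "using a parser" generally means letting a library read the tag structure instead of pattern-matching the text. BeautifulSoup in the code quoted above is one such tool. As a rough sketch of the same idea with only the standard library, `html.parser.HTMLParser` can collect visible text while skipping `script` and `style` contents. The class name `TextExtractor`, the helper `extract_text`, and the sample string are hypothetical.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes, ignoring anything inside <script> or <style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while inside a skipped element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside skipped elements.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def extract_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


# Invented sample input (an assumption, not the real page).
sample = (
    "<html><body><script>var x = 1;</script>"
    "<p>First paragraph</p><div><p>Nested</p></div></body></html>"
)
text = extract_text(sample)
print(text)
```

The parser tracks opening and closing tags itself, so multi-line or nested blocks need no special handling.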