Python Forum
[SOLVED] [regex] Why isn't possible substring ignored?
#1
Hello,

I need to loop through a list of URLs to grab each page's title, which might contain a substring I want to ignore.

For some reason, the substring isn't removed:

import html
import re

import requests

with open('list.txt') as f:
	for line in f:
		url = line.strip()
		print(url)
		n = requests.get(url)
		al = n.text
		# Doesn't remove possible ( - dummy)?
		d = re.search(r'<\W*title\W*(.*)( - dummy)?</title', al, re.IGNORECASE)
		title = html.unescape(d.group(1))
		print(title)
How is my regex wrong?

Thank you.
#2
What do you mean by "the substring isn't removed"? Can you give a concrete example of the data?
#3
Some titles look like this:
<title>My title - dummy</title>

Others look like this:
<title>My title</title>

If it's there, how can I get rid of the " - dummy" part?

I expected this to work, but it's ignored: ( - dummy)?
#4
(Apr-08-2023, 01:43 PM)Winfried Wrote: If it's there, how can I get rid of the " - dummy" part?
Yes, if the format is the same in all titles.
Input (contents of url_lst.txt):
<title>My title - dummy</title>
<title>Site about cars - car 99</title>
<title>Numbers - 12345 678</title>
import re

with open('url_lst.txt') as f:
    for line in f:
        # Lazy (.*?) stops at the first " - <word>" suffix, if there is one
        d = re.search(r'<\W*title\W*(.*?)( - \w.*)?</title', line)
        title = d.group(1)
        print(title)
Output:
My title
Site about cars
Numbers
#5
Thanks, it works, although I don't understand why I need to 1) make it ungreedy since the part can only occur as the last token, and 2) add a trailing ".*" for it to work since it's a single word and nothing can possibly follow.

I'll investigate further.
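
For anyone wondering about the same two points, here is a minimal sketch comparing the greedy and the lazy version on the two sample titles from post #3. The names GREEDY, LAZY and first_group are made up for this illustration. As far as I can tell, a greedy (.*) first grabs everything up to the end of the line and then backs off only as far as needed for "</title" to match. Since ( - dummy)? is optional, the engine is satisfied with matching it zero times, so the suffix stays inside group 1. With the lazy (.*?), group 1 grows one character at a time and the optional group gets a chance to match first. For a fixed suffix, the trailing ".*" does not seem to be required.

```python
import re

# Greedy capture: (.*) swallows " - dummy" and the optional group matches nothing.
GREEDY = re.compile(r'<title>(.*)( - dummy)?</title', re.IGNORECASE)
# Lazy capture: (.*?) stops as soon as the rest of the pattern can match.
LAZY = re.compile(r'<title>(.*?)( - dummy)?</title', re.IGNORECASE)

samples = [
    '<title>My title - dummy</title>',
    '<title>My title</title>',
]


def first_group(pattern, text):
    """Return group 1 of the first match, or None when nothing matches."""
    m = pattern.search(text)
    return m.group(1) if m else None


for s in samples:
    print('greedy:', repr(first_group(GREEDY, s)),
          '| lazy:', repr(first_group(LAZY, s)))
```

On the first sample the greedy pattern should return the title with " - dummy" still attached, while the lazy one returns just "My title". On the second sample both return "My title".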

--
Edit: This works. Maybe it's not a plain space that separates "dummy" from the rest.

d = re.search('<title>(.+) (- dummy)?</title', al, re.IGNORECASE)
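
One caveat, as a hedged sketch rather than a definitive fix: the pattern in the edit requires a space right before the optional group, so it appears to match only titles that actually carry the suffix. For a plain <title>My title</title> the search returns None and d.group(1) would raise an AttributeError. An alternative is to capture the whole title and strip the suffix afterwards. The names EDITED, TITLE and clean_title are hypothetical, str.removesuffix needs Python 3.9 or later, and unlike re.IGNORECASE it compares the suffix case-sensitively.

```python
import re

# The pattern from the edit above: the space before the group is mandatory.
EDITED = re.compile(r'<title>(.+) (- dummy)?</title', re.IGNORECASE)
# Capture the full title text; the suffix is removed in a separate step.
TITLE = re.compile(r'<title[^>]*>(.*?)</title', re.IGNORECASE | re.DOTALL)


def clean_title(page, suffix=' - dummy'):
    """Return the page title without the trailing suffix, or None if no title."""
    m = TITLE.search(page)
    if m is None:
        return None
    # removesuffix leaves the string unchanged when the suffix is absent.
    return m.group(1).strip().removesuffix(suffix)


for page in ('<title>My title - dummy</title>', '<title>My title</title>'):
    print(EDITED.search(page) is not None, repr(clean_title(page)))
```

Checking for None before calling group() also avoids a crash on pages that have no title tag at all.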