Python Forum
Reading a Regex pattern
#1
Hi,

I have a Regex pattern I'm trying to read, I'll explain how I understand it, please correct me if wrong (which I believe I am)

pattern = re.compile(r'<.*?>')
I think it reads:
Quote:Find an open bracket in the text, followed by any character except a newline, it could be zero or more of them, then followed by close bracket

I'm really not sure about that explanation, but I know I'm kinda close, please help.
#2
It is not an open bracket but a less-than sign; similarly, the last character is a greater-than sign. The question mark after the asterisk makes the match non-greedy: it matches as few characters as possible before the next greater-than sign. Apart from that, the explanation looks correct.
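For example, a quick check with findall shows what the pattern picks out:

```python
import re

pattern = re.compile(r'<.*?>')
text = '<b>bold</b> and <i>italic</i>'

# Each match runs from a '<' to the nearest following '>'
print(pattern.findall(text))  # ['<b>', '</b>', '<i>', '</i>']
```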
#3
I think they are called angle brackets.

The interpretation is correct. If you use the re.DOTALL flag, the dot also matches newlines.
The .*? is a non-greedy regex.
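A small demo of the difference (greedy vs. non-greedy, and the effect of re.DOTALL):

```python
import re

text = '<p>first</p>\n<p>second</p>'

# Greedy: '.*' grabs as much as it can on each line
print(re.findall(r'<.*>', text))             # ['<p>first</p>', '<p>second</p>']
# Non-greedy: '.*?' stops at the first '>'
print(re.findall(r'<.*?>', text))            # ['<p>', '</p>', '<p>', '</p>']
# With re.DOTALL the dot also matches '\n', so the greedy match spans both lines
print(re.findall(r'<.*>', text, re.DOTALL))  # ['<p>first</p>\n<p>second</p>']
```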
Almost dead, but too lazy to die: https://sourceserver.info
All humans together. We don't need politicians!
#4
I have seen that pattern many times, sometimes used in a "bad" way.
The idea: if you have some HTML and just remove all the tags, you get the text out of the HTML.
import re

html = '''\
<!DOCTYPE html>
<html>
<head>
  <meta charset="UTF-8">
  <title>Page Title</title>
</head>
<body>
  <h1>This is a Heading</h1>
  <p>This is a paragraph.</p>
</body>
</html>'''
Output:
>>> r = re.sub(r'<.*?>', '', html)
>>> print(r.strip())
Page Title
This is a Heading
This is a paragraph.
So it kind of works, but a parser is a better choice.
from bs4 import BeautifulSoup

html = '''\
<!DOCTYPE html>
<html>
<head>
  <meta charset="UTF-8">
  <title>Page Title</title>
</head>
<body>
  <h1>This is a Heading</h1>
  <p>This is a paragraph.</p>
</body>
</html>'''

soup = BeautifulSoup(html, 'lxml')
Output:
>>> print(soup.text.strip())
Page Title
This is a Heading
This is a paragraph.
A parser also gives better control; regex is a bad choice for hand-parsing HTML.
Output:
>>> soup.find('body')
<body>
<h1>This is a Heading</h1>
<p>This is a paragraph.</p>
</body>
>>> print(soup.find('body').text.strip())
This is a Heading
This is a paragraph.
This was just a guess at what the pattern is used for; I could of course be wrong here.
#5
Hi,

Thanks for the response.

I think I should mention that I'm very new to programming. I was taking Python tutorials, and now I'm trying to learn with a "real life" problem, so I got some code to read.

Now, the file I'm reading is part HTML, part something else. It does start with HTML tags, but overall the file is a mess and not consistent HTML, so they clean out the HTML tags first and then work through the rest as they find it.

(Apr-24-2019, 04:53 PM)snippsat Wrote: I have seen that pattern many times, sometimes used in a "bad" way. [...]
#6
I tried to use the parser, but when I execute this
>>> print(soup.text.strip())
I get this error:
Error:
bs4.FeatureNotFound: Couldn't find a tree builder with the features you requested: lxml. Do you need to install a parser library?
#7
You can install lxml (it's a fast XML/HTML parser):
pip install lxml
Often this module is already installed because another dependency pulled it in.
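If you'd rather not install anything extra, Beautiful Soup can also fall back on Python's built-in html.parser (slower than lxml, but no extra dependency):

```python
from bs4 import BeautifulSoup

# 'html.parser' ships with the standard library, so no pip install is needed
soup = BeautifulSoup('<html><body><p>Hello World</p></body></html>', 'html.parser')
print(soup.find('p').text)  # Hello World
```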

When you instantiate your BeautifulSoup, you can pass the parser to use:
from bs4 import BeautifulSoup
bs_html = BeautifulSoup('<html><body><p>Hello World</p></body></html>', 'html')
bs_lxml = BeautifulSoup('<html><body><p>Hello World</p></body></html>', 'lxml')

print(bs_html.find('p').text)
print(bs_lxml.find('p').text)
Don't use regex for HTML: https://blog.codinghorror.com/parsing-ht...hulhu-way/
#8
Is there a way to extract only the body, but without the tags, like the one below:
Output:
<body>
<h1>This is a Heading</h1>
<p>This is a paragraph.</p>
</body>
But rather
Output:
This is a Heading
#9
Yes, and there is more than one way.
The method find() returns only the first matching element.
There is also find_all(), which returns a list of all elements that match.

result = bs.find('h1')
text_in_tag = result.text
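For example (a sketch using the built-in html.parser and the sample HTML from earlier in the thread):

```python
from bs4 import BeautifulSoup

html = '<body><h1>This is a Heading</h1><p>This is a paragraph.</p></body>'
bs = BeautifulSoup(html, 'html.parser')

# find() returns the first matching element (or None if nothing matches)
print(bs.find('h1').text)                   # This is a Heading
# find_all() returns a list of every matching element
print([p.text for p in bs.find_all('p')])   # ['This is a paragraph.']
```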
#10
(Apr-25-2019, 07:32 AM)stahorse Wrote: Is there a way that you could extract only the body, but not with tags like the one below:
Take a look at Web-Scraping part-1

