Python Forum
Reading a Regex pattern - Printable Version


Reading a Regex pattern - stahorse - Apr-24-2019

Hi,

I have a regex pattern I'm trying to read. I'll explain how I understand it; please correct me if I'm wrong (which I believe I am).

pattern = re.compile(r'<.*?>')
I think it reads:
Quote: Find an open bracket in the text, followed by zero or more of any character except a newline, then followed by a close bracket.

I'm really not sure about that explanation, but I know I'm kinda close. Please help.


RE: Reading a Regex pattern - Gribouillis - Apr-24-2019

It is not an open bracket but a less-than sign. Similarly the last character is a greater-than sign. The question mark in the regex is unnecessary as 'zero or more' already implies 'optional'. Apart from that, the explanation looks correct.


RE: Reading a Regex pattern - DeaD_EyE - Apr-24-2019

I think they are called angle brackets.

The interpretation is correct. If you pass re.DOTALL as a flag, the dot also matches a newline.
The .*? is a non-greedy match: it consumes as few characters as possible before the next '>'.
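
To make the difference concrete, here is a small sketch using only the standard re module. The sample strings and variable names are made up for illustration; they are not from the original poster's file.

```python
import re

one_line = "<h1>Title</h1>"

# Greedy: .* takes as much as it can, so the match runs from the
# first '<' to the last '>' on the line.
greedy = re.findall(r'<.*>', one_line)

# Non-greedy: .*? stops at the first '>' it reaches.
lazy = re.findall(r'<.*?>', one_line)

print(greedy)  # ['<h1>Title</h1>']
print(lazy)    # ['<h1>', '</h1>']

# A tag split over two lines (hypothetical sample).
multi_line = "<p\nclass=x>Body</p>"

# Without re.DOTALL the dot does not match '\n', so the opening tag is missed.
without_dotall = re.findall(r'<.*?>', multi_line)

# With re.DOTALL the dot also matches '\n', so both tags are found.
with_dotall = re.findall(r'<.*?>', multi_line, re.DOTALL)

print(without_dotall)  # ['</p>']
print(with_dotall)     # ['<p\nclass=x>', '</p>']
```

So the question mark after the star is not redundant: it changes how much text each match swallows.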


RE: Reading a Regex pattern - snippsat - Apr-24-2019

I have seen that pattern many times, sometimes also used in a "bad" way.
If you have some HTML, you might get the idea that by just removing all the tags you will get the text out of the HTML.
import re

html = '''\
<!DOCTYPE html>
<html>
<head>
  <meta charset="UTF-8">
  <title>Page Title</title>
</head>
<body>
  <h1>This is a Heading</h1>
  <p>This is a paragraph.</p>
</body>
</html>'''
Output:
>>> r = re.sub(r'<.*?>', '', html)
>>> print(r.strip())
Page Title
This is a Heading
This is a paragraph.
So it kind of works, but a parser is a better choice.
from bs4 import BeautifulSoup

html = '''\
<!DOCTYPE html>
<html>
<head>
  <meta charset="UTF-8">
  <title>Page Title</title>
</head>
<body>
  <h1>This is a Heading</h1>
  <p>This is a paragraph.</p>
</body>
</html>'''

soup = BeautifulSoup(html, 'lxml')
Output:
>>> print(soup.text.strip())
Page Title
This is a Heading
This is a paragraph.
A parser also gives you better control, as regex is a bad choice for HTML.
Output:
>>> soup.find('body')
<body>
<h1>This is a Heading</h1>
<p>This is a paragraph.</p>
</body>
>>> print(soup.find('body').text.strip())
This is a Heading
This is a paragraph.
This was just a guess at what the pattern is used for; I could of course be wrong here.
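
As a rough illustration of why the regex only "kind of" works, here is a sketch where it breaks on a '>' inside an attribute value. To keep it runnable without extra installs it uses the standard library's html.parser module rather than BeautifulSoup; the TextExtractor class name and the sample string are made up for this example.

```python
import re
from html.parser import HTMLParser

# Hypothetical sample: the attribute value contains a '>' character.
html = '<p title="a > b">Hello World</p>'

# The regex ends the "tag" at the first '>' it sees, even inside quotes,
# so part of the attribute leaks into the result.
stripped = re.sub(r'<.*?>', '', html)
print(repr(stripped))  # ' b">Hello World'


class TextExtractor(HTMLParser):
    """Collects only the text between tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        # Called by the parser for text content, never for tag markup.
        self.parts.append(data)


extractor = TextExtractor()
extractor.feed(html)
extractor.close()
text = ''.join(extractor.parts)
print(repr(text))  # 'Hello World'
```

A real parser knows where a quoted attribute value ends, which the regex cannot know.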


RE: Reading a Regex pattern - stahorse - Apr-25-2019

Hi,

Thanks for the response.

I think I should mention that I'm very new to programming. I was taking Python tutorials, and now I'm trying to learn with a "real life" problem, so I got some code to read.

Now, the file that I'm reading is part HTML, part something else. It does start with HTML tags, but the rest of the file is a mess and is not consistent HTML, so they start by cleaning out the HTML tags first and then work through the rest as they find it.




RE: Reading a Regex pattern - stahorse - Apr-25-2019

I tried to use the parser, but when I execute this
>>> print(soup.text.strip())
Page Title
I get this error:
Error:
bs4.FeatureNotFound: Couldn't find a tree builder with the features you requested: lxml. Do you need to install a parser library?



RE: Reading a Regex pattern - DeaD_EyE - Apr-25-2019

You can install lxml (it's a fast XML and HTML parsing library).
pip install lxml
This module is often already installed when other installed packages depend on it.

When you instantiate your BeautifulSoup object, you can use:
from bs4 import BeautifulSoup
bs_html = BeautifulSoup('<html><body><p>Hello World</p></body></html>', 'html')
bs_lxml = BeautifulSoup('<html><body><p>Hello World</p></body></html>', 'lxml')

print(bs_html.find('p').text)
print(bs_lxml.find('p').text)
Don't use regex for HTML: https://blog.codinghorror.com/parsing-html-the-cthulhu-way/


RE: Reading a Regex pattern - stahorse - Apr-25-2019

Is there a way that you could extract only the body, but not with tags like the one below:
Output:
<body>
<h1>This is a Heading</h1>
<p>This is a paragraph.</p>
</body>
But rather
Output:
This is a heading



RE: Reading a Regex pattern - DeaD_EyE - Apr-25-2019

Yes, and there is more than one way.
The find() method returns only the first matching element.
Then you also have find_all(), which returns a list of all elements that match.

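# bs is assumed to be an existing BeautifulSoup object built from your HTML
result = bs.find('h1')      # first <h1> element
text_in_tag = result.text   # its text, without the tags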



RE: Reading a Regex pattern - snippsat - Apr-25-2019

(Apr-25-2019, 07:32 AM)stahorse Wrote: Is there a way that you could extract only the body, but not with tags like the one below:
Take a look at Web-Scraping part-1