Reading a Regex pattern - Printable Version

Reading a Regex pattern - stahorse - Apr-24-2019

Hi,

I have a regex pattern I'm trying to read. I'll explain how I understand it; please correct me if I'm wrong (which I believe I am).

pattern = re.compile(r'<.*?>')

I think it reads:

Quote: Find an open bracket in the text, followed by any character except a newline (zero or more of them), then followed by a close bracket

I'm really not sure about that explanation, but I know I'm kinda close, please help.

RE: Reading a Regex pattern - Gribouillis - Apr-24-2019

It is not an open bracket but a less-than sign. Similarly the last character is a greater-than sign. The question mark in the regex is unnecessary as 'zero or more' already implies 'optional'. Apart from that, the explanation looks correct.

RE: Reading a Regex pattern - DeaD_EyE - Apr-24-2019

I think they are called angle brackets.

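The interpretation is correct. If you pass re.DOTALL as a flag, the dot also matches a newline.
The .*? is a non-greedy match: the ? after * is not redundant, it makes the match stop at the first > it can reach instead of the last one.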
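
To make the difference concrete, here is a minimal sketch comparing the non-greedy .*? with the greedy .*, and showing what re.DOTALL changes. The sample strings are made up for illustration and are not from the original poster's file.

```python
import re

text = '<b>bold</b> and <i>italic</i>'  # made-up sample string

# Non-greedy: .*? stops at the first '>' it can reach.
lazy = re.findall(r'<.*?>', text)

# Greedy: .* runs on to the last '>' on the line.
greedy = re.findall(r'<.*>', text)

# By default the dot does not match a newline, so a tag split across
# two lines is only matched when re.DOTALL is passed.
multiline = '<p\nclass="x">'
no_flag = re.findall(r'<.*?>', multiline)
with_flag = re.findall(r'<.*?>', multiline, flags=re.DOTALL)

print(lazy)       # ['<b>', '</b>', '<i>', '</i>']
print(greedy)     # ['<b>bold</b> and <i>italic</i>']
print(no_flag)    # []
print(with_flag)  # ['<p\nclass="x">']
```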

RE: Reading a Regex pattern - snippsat - Apr-24-2019

I have seen that pattern many times, sometimes used in a "bad" way.
The idea is that if you have some HTML and just remove all the tags, you are left with the text of the page.

import re

html = '''\
<!DOCTYPE html>
<html>
<head>
  <meta charset="UTF-8">
  <title>Page Title</title>
</head>
<body>
  <h1>This is a Heading</h1>
  <p>This is a paragraph.</p>
</body>
</html>'''

Output:
>>> r = re.sub(r'<.*?>', '', html)
>>> print(r.strip())
Page Title


  This is a Heading
  This is a paragraph.

So it kind of works, but a parser is a better choice.
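
As a rough illustration of why it only "kind of" works, the sketch below runs the same pattern on a few inputs it does not handle well. The input strings are invented for this example; real pages can fail in other ways too.

```python
import re

TAG = re.compile(r'<.*?>')

# A '>' inside an attribute value ends the match too early,
# so part of the tag leaks into the result.
attr_result = TAG.sub('', '<a title="a > b">link</a>')
print(repr(attr_result))    # ' b">link'

# A tag split across two lines is missed, because '.' does not
# match a newline unless re.DOTALL is used.
split_result = TAG.sub('', '<p\nclass="x">text</p>')
print(repr(split_result))   # '<p\nclass="x">text'

# Script content is not a tag, so it is left behind as if it were text.
script_result = TAG.sub('', '<script>var x = 1;</script><p>Hello</p>')
print(repr(script_result))  # 'var x = 1;Hello'
```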

from bs4 import BeautifulSoup

html = '''\
<!DOCTYPE html>
<html>
<head>
  <meta charset="UTF-8">
  <title>Page Title</title>
</head>
<body>
  <h1>This is a Heading</h1>
  <p>This is a paragraph.</p>
</body>
</html>'''

soup = BeautifulSoup(html, 'lxml')

Output:
>>> print(soup.text.strip())
Page Title


This is a Heading
This is a paragraph.

A parser also gives better control, as regex is a bad choice with HTML.

Output:
>>> soup.find('body')
<body>
<h1>This is a Heading</h1>
<p>This is a paragraph.</p>
</body>

>>> print(soup.find('body').text.strip())
This is a Heading
This is a paragraph.

This was just a guess at what the pattern can do; I could of course be wrong here.

RE: Reading a Regex pattern - stahorse - Apr-25-2019

Hi,

Thanks for the response.

I think I should mention that I'm very new to programming. I was taking Python tutorials, and now I'm trying to learn with a "real life" problem, so I got some code to read.

Now, the file that I'm reading is part HTML, part something else. It does start with HTML tags, but the rest of the file is a mess and not consistent HTML, so they start by cleaning out the HTML tags first, then work through the rest as they find them.

RE: Reading a Regex pattern - stahorse - Apr-25-2019

I tried to use the parser, but when I execute this

>>> print(soup.text.strip())
Page Title

I get this error:

Error:
bs4.FeatureNotFound: Couldn't find a tree builder with the features you requested: lxml. Do you need to install a parser library?

RE: Reading a Regex pattern - DeaD_EyE - Apr-25-2019

You can install lxml (it's a fast xml implementation).

pip install lxml

This module is often already installed, because other installed packages depend on it.
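
If you are not sure whether lxml is available, one possible approach is to check before choosing a parser name. This is only a sketch: pick_parser is a hypothetical helper, not part of bs4, and it falls back to 'html.parser', the parser that ships with Python.

```python
import importlib.util


def pick_parser():
    """Return 'lxml' if it can be imported, else the built-in 'html.parser'."""
    # find_spec returns None when the top-level module is not installed.
    if importlib.util.find_spec('lxml') is not None:
        return 'lxml'
    return 'html.parser'


parser_name = pick_parser()
print(parser_name)
# The name can then be passed as the second argument,
# e.g. BeautifulSoup(html, parser_name).
```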

When you instantiate BeautifulSoup, you can name the parser as the second argument:

from bs4 import BeautifulSoup
bs_html = BeautifulSoup('<html><body><p>Hello World</p></body></html>', 'html')
bs_lxml = BeautifulSoup('<html><body><p>Hello World</p></body></html>', 'lxml')

print(bs_html.find('p').text)
print(bs_lxml.find('p').text)

Don't use regex for HTML: https://blog.codinghorror.com/parsing-html-the-cthulhu-way/

RE: Reading a Regex pattern - stahorse - Apr-25-2019

Is there a way that you could extract only the body, but not with tags like the one below:

Output:
<body>
<h1>This is a Heading</h1>
<p>This is a paragraph.</p>
</body>

But rather

Output:
This is a heading

RE: Reading a Regex pattern - DeaD_EyE - Apr-25-2019

Yes, and there is more than one way.
find() returns only the first matching element (or None if nothing matches).
There is also find_all(), which returns a list of all elements that match.

# bs is the BeautifulSoup object created from the HTML
result = bs.find('h1')
text_in_tag = result.text  # text inside the tag, without the tag itself
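
If installing extra packages is a problem, the same idea (text of the body without tags) can be sketched with Python's built-in html.parser module. This is not BeautifulSoup and not what the posts above use; the BodyText class is a made-up name for this example, and the HTML is a shortened copy of the sample page from earlier in the thread.

```python
from html.parser import HTMLParser


class BodyText(HTMLParser):
    """Collect the text found between <body> and </body>."""

    def __init__(self):
        super().__init__()
        self.in_body = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == 'body':
            self.in_body = True

    def handle_endtag(self, tag):
        if tag == 'body':
            self.in_body = False

    def handle_data(self, data):
        # Keep non-empty text only while inside <body>.
        if self.in_body and data.strip():
            self.chunks.append(data.strip())


html = '''\
<html>
<head><title>Page Title</title></head>
<body>
  <h1>This is a Heading</h1>
  <p>This is a paragraph.</p>
</body>
</html>'''

parser = BodyText()
parser.feed(html)
parser.close()
print('\n'.join(parser.chunks))
```

The title is skipped because it sits in the head, so only the heading and paragraph text are collected.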

RE: Reading a Regex pattern - snippsat - Apr-25-2019

(Apr-25-2019, 07:32 AM)stahorse Wrote: Is there a way that you could extract only the body, but not with tags like the one below:

Take a look at Web-Scraping part-1