Reading a Regex pattern
Python Forum (https://python-forum.io), General Coding Help. Thread: /thread-17792.html
Reading a Regex pattern - stahorse - Apr-24-2019

Hi, I have a regex pattern I'm trying to read. I'll explain how I understand it; please correct me if I'm wrong (which I believe I am).

pattern = re.compile(r'<.*?>')

I think it reads:

Quote: Find an open bracket in the text, followed by any character except a newline, zero or more of them, then followed by a close bracket.

I'm really not sure about that explanation, but I know I'm kinda close. Please help.

RE: Reading a Regex pattern - Gribouillis - Apr-24-2019

It is not an open bracket but a less-than sign. Similarly, the last character is a greater-than sign. The question mark in the regex is unnecessary as 'zero or more' already implies 'optional'. Apart from that, the explanation looks correct.

RE: Reading a Regex pattern - DeaD_EyE - Apr-24-2019

I think they are called angle brackets. The interpretation is correct. If you use re.DOTALL as a flag, the dot also matches a newline. The .*? is a non-greedy regex.
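To make the non-greedy point concrete, here is a minimal sketch comparing `<.*>` with `<.*?>` and showing what re.DOTALL changes. The sample strings are made up for illustration and are not from the thread. Note that the `?` after `*` is not redundant here: it does not mean "optional" but switches the quantifier to non-greedy matching.

```python
import re

# Hypothetical sample text, not from the original thread.
text = "<b>bold</b> and <i>italic</i>"

# Greedy: .* grabs as much as possible, so a single match runs from the
# first '<' to the last '>'.
greedy = re.findall(r'<.*>', text)

# Non-greedy: .*? grabs as little as possible, so each tag matches on its own.
lazy = re.findall(r'<.*?>', text)

print(greedy)  # ['<b>bold</b> and <i>italic</i>']
print(lazy)    # ['<b>', '</b>', '<i>', '</i>']

# Without re.DOTALL the dot does not match a newline, so an opening tag
# that is split across two lines is skipped; with re.DOTALL it is found.
multiline = '<a\nhref="x">link</a>'
no_dotall = re.findall(r'<.*?>', multiline)
with_dotall = re.findall(r'<.*?>', multiline, flags=re.DOTALL)

print(no_dotall)    # ['</a>']
print(with_dotall)  # ['<a\nhref="x">', '</a>']
```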
RE: Reading a Regex pattern - snippsat - Apr-24-2019

I have seen that pattern many times, sometimes also used in a "bad" way. If you have some HTML, you can get the idea that if you just remove all the tags, you will get the text out of the HTML.

import re

html = '''\
<!DOCTYPE html>
<html>
<head>
<meta charset="UTF-8">
<title>Page Title</title>
</head>
<body>
<h1>This is a Heading</h1>
<p>This is a paragraph.</p>
</body>
</html>'''

So it kind of works, but a parser is a better choice.

from bs4 import BeautifulSoup

html = '''\
<!DOCTYPE html>
<html>
<head>
<meta charset="UTF-8">
<title>Page Title</title>
</head>
<body>
<h1>This is a Heading</h1>
<p>This is a paragraph.</p>
</body>
</html>'''

soup = BeautifulSoup(html, 'lxml')

A parser also gives better control, as regex is a bad choice with HTML. This was just a guess at what the pattern can do; I could of course be wrong here.
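As a rough sketch of the "remove all tags" idea, the pattern can be passed to re.sub with an empty replacement. The exact call used in the thread is not shown, so the names `TAG`, `stripped`, `tricky` and `mangled` are assumptions for illustration. The second half shows one way this approach can go wrong on text that is not really a tag.

```python
import re

# Sample HTML as posted in the thread (line breaks assumed).
html = '''\
<!DOCTYPE html>
<html>
<head>
<meta charset="UTF-8">
<title>Page Title</title>
</head>
<body>
<h1>This is a Heading</h1>
<p>This is a paragraph.</p>
</body>
</html>'''

TAG = re.compile(r'<.*?>')

# Replace everything that looks like a tag with an empty string,
# then drop the blank lines that are left behind.
stripped = TAG.sub('', html)
lines = [line.strip() for line in stripped.splitlines() if line.strip()]
print(lines)

# Where it breaks: '<' and '>' used as comparison operators look like a
# tag to the regex, so the text between them is swallowed.
tricky = '<p>1 < 2 and 3 > 2</p>'
mangled = TAG.sub('', tricky)
print(repr(mangled))
```

On well-formed input like the sample this leaves only the text, which is why it "kind of works"; the `tricky` string is the sort of case where a parser is the safer choice.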
RE: Reading a Regex pattern - stahorse - Apr-25-2019

Hi, thanks for the response. I think I should mention that I'm very new to programming. I was taking Python tutorials, and now I'm trying to learn with a "real life" problem, so I got some code to read. The file I'm reading is part HTML, part something else. It does start with HTML tags, but throughout the file it's a mess and not consistent with HTML, so they start by cleaning out the HTML tags first, then work through the rest as they find it.

(Apr-24-2019, 04:53 PM)snippsat Wrote: I have seen that pattern many times, sometimes also used in "bad" way.

RE: Reading a Regex pattern - stahorse - Apr-25-2019

I tried to use the parser, but when I execute this

>>> print(soup.text.strip())
Page Title

I get this error:
RE: Reading a Regex pattern - DeaD_EyE - Apr-25-2019

You can install lxml (it's a fast XML implementation).

pip install lxml

Often this module is already installed, if other installed dependencies rely on it. When you instantiate your BeautifulSoup, you can use:

from bs4 import BeautifulSoup

bs_html = BeautifulSoup('<html><body><p>Hello World</p></body></html>', 'html')
bs_lxml = BeautifulSoup('<html><body><p>Hello World</p></body></html>', 'lxml')
print(bs_html.find('p').text)
print(bs_lxml.find('p').text)

Don't use regex for HTML: https://blog.codinghorror.com/parsing-html-the-cthulhu-way/

RE: Reading a Regex pattern - stahorse - Apr-25-2019

Is there a way that you could extract only the body, but not with tags like the one below: But rather
RE: Reading a Regex pattern - DeaD_EyE - Apr-25-2019

Yes, and there is more than one way. The method find() returns only one result (the first match). Then you also have find_all(), which returns a list of all elements that match the pattern.

result = bs.find('h1')
text_in_tag = result.text

RE: Reading a Regex pattern - snippsat - Apr-25-2019

(Apr-25-2019, 07:32 AM)stahorse Wrote: Is there a way that you could extract only the body, but not with tags like the one below:

Take a look at Web-Scraping part-1
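Assuming the question is about getting the text inside <body> without the surrounding tags, here is a minimal sketch that does it with a parser rather than a regex. It swaps in the standard library's html.parser instead of BeautifulSoup so it runs without extra installs; the class name `BodyText` and its attributes are made up for this illustration.

```python
from html.parser import HTMLParser


class BodyText(HTMLParser):
    """Collect text that appears inside <body>, ignoring the tags themselves."""

    def __init__(self):
        super().__init__()
        self.in_body = False   # True while we are between <body> and </body>
        self.chunks = []       # non-empty pieces of text found in the body

    def handle_starttag(self, tag, attrs):
        if tag == 'body':
            self.in_body = True

    def handle_endtag(self, tag):
        if tag == 'body':
            self.in_body = False

    def handle_data(self, data):
        # Keep only text inside <body>, skipping whitespace-only runs.
        if self.in_body and data.strip():
            self.chunks.append(data.strip())


# Sample HTML as posted earlier in the thread (line breaks assumed).
html = '''\
<!DOCTYPE html>
<html>
<head>
<meta charset="UTF-8">
<title>Page Title</title>
</head>
<body>
<h1>This is a Heading</h1>
<p>This is a paragraph.</p>
</body>
</html>'''

parser = BodyText()
parser.feed(html)
parser.close()

body_text = parser.chunks
print('\n'.join(body_text))
```

The title is skipped because it sits in <head>, not <body>. With BeautifulSoup the same idea is usually shorter, along the lines of the find() and .text calls shown in the posts above.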