Python Forum
[Learning:bs4, re.search] - RegEx string cutoff
#1
All I'm trying to do is test-print an HTML string matched by a regex pattern, but the result is always incomplete and I can't figure out why. I'm new to Python, and a coding amateur in general... bla bla... But all the regex training sites lead me to believe my pattern will work for this seemingly simple HTML capture, yet it keeps getting cut off in the build. I've been trying different flags but I don't think that's the issue. I also know it's not the re.py cache. It's gotta be an escape char that I can't figure out, right?

GOAL:
Trying to print: "https://newjersey.craigslist.orgparlin-chevrolet-colorado-call/7014860327.html"
compile result is: "https://newjersey.craigslist.orgparlin-chevrolet-"


from urllib.request import urlopen
from urllib.error import HTTPError
from urllib.error import URLError
from bs4 import BeautifulSoup
import re



str1 = (""" 
bhcgHf4AWry,1:00N0N_iBTHgJR0p0p_2hkovkPhFZk,1:00101_bX2XWbjP0wA,1:00j0j_5naXGGGbBUK,1:00j0j_gbiQHGBLUjL,1:00k0k_fnTDHBeHrt5,1:00s0s_375GQT7ladO" href="https://newjersey.craigslist.orgparlin-chevrolet-colorado-call/7014860327.html">
<span class="result-price">$18000</span>
</a>
""")


print(str1)
reSearch1 = re.search(r'(https:).*(.html)', str1, flags=re.UNICODE)
print(reSearch1)
Output:
bhcgHf4AWry,1:00N0N_iBTHgJR0p0p_2hkovkPhFZk,1:00101_bX2XWbjP0wA,1:00j0j_5naXGGGbBUK,1:00j0j_gbiQHGBLUjL,1:00k0k_fnTDHBeHrt5,1:00s0s_375GQT7ladO" href="https://newjersey.craigslist.orgparlin-chevrolet-colorado-call/7014860327.html">
<span class="result-price">$18000</span>
</a>
<re.Match object; span=(153, 231), match='https://newjersey.craigslist.orgparlin-chevrolet->
[Finished in 0.2s]
Thanks for any help gents,
Jr
#2
Don't use regex to parse HTML, plain and simple.

This is obviously an <a> tag, so use BeautifulSoup:
from bs4 import BeautifulSoup
html = '''<a href="https://newjersey.craigslist.orgparlin-chevrolet-colorado-call/7014860327.html">
<span class="result-price">$18000</span>
</a>'''
soup = BeautifulSoup(html, 'html.parser')
a_tag = soup.find('a')
print(a_tag.get('href'))
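A side note on the regex from the first post: the pattern did match the whole URL. The span=(153, 231) in the printed output covers 78 characters, which is exactly the length of the full link. What gets cut off is only the printed representation of the match object, which shows just the start of the matched text. Below is a minimal sketch, assuming a shortened stand-in for str1 (the names text, match and url are illustrative, not from the original post). It reads the matched text with .group(0), escapes the dot, and uses a non-greedy quantifier:

```python
import re

# Shortened stand-in for the str1 in the first post (assumption: same href value).
text = '''ladO" href="https://newjersey.craigslist.orgparlin-chevrolet-colorado-call/7014860327.html">
<span class="result-price">$18000</span>
</a>'''

# Escape the dot so it matches a literal "." and use a non-greedy
# quantifier so the match stops at the first ".html".
match = re.search(r'https:.*?\.html', text)

# print(match) would show a shortened repr of the match object;
# .group(0) returns the complete matched string.
if match:
    url = match.group(0)
    print(url)
```

This only shows why the output looked truncated. For pulling attributes out of real pages, the parser approach above is still the more robust choice.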
Also look at our tutorial: part1
If you can't explain it to a six year old, you don't understand it yourself, Albert Einstein
How to Ask Questions The Smart Way: link and another link
Create MCV example
Debug small programs

#3
Well, Thanks Buran. I'm off and running with the function I was working on. But I gotta say, I'm shocked/disappointed that the two books I have in front of me at the moment both have numerous examples using regex to parse html. It all seemed so easy... Anyway, you got me to change my approach and things came together fairly quickly. Thanks. Also, neat link, very 'house of leaves'.
#4
Both Beautiful Soup and lxml have built-in support for regex.
You don't often need this regex support, though, since there is also support for CSS selectors, which are powerful for finding specific elements.
from bs4 import BeautifulSoup

# Simulate a web page
html = '''\
<div>
  <p class="vehicle">
    <a class="Taxi" href="Taxi link"></a>
    <a class="bmw" href="Link to bmw">Type of car is BMW</a>
    <a class="opel" href="Link to opel">Type of car is Opel</a>
    <a class="Bus" href="Bus link"></a>
  </p>
</div>'''

soup = BeautifulSoup(html, 'lxml')
# Select href attribute that begins with Link
print(soup.select('a[href^="Link"]'))
Output:
[<a class="bmw" href="Link to bmw">Type of car is BMW</a>, <a class="opel" href="Link to opel">Type of car is Opel</a>]
The parser's job ends once you call .text; from that point on, regex may be needed on the text output.
Using the code above as an example, here is how to find just the car type in the text:
>>> import re 
>>> 
>>> link = soup.select('a[href^="Link"]')
>>> car_text = '\n'.join([tag.text for tag in link])
>>> print(car_text)
Type of car is BMW
Type of car is Opel
>>> 
>>> re.findall(r"car is\s(\w+)", car_text)
['BMW', 'Opel']
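The built-in regex support mentioned at the top of this post can be sketched roughly as follows. Beautiful Soup's find_all() accepts a compiled pattern as an attribute filter, so the same two links can be selected without a CSS selector. This sketch uses 'html.parser' instead of 'lxml' only so that it runs without extra installs; the variable names links and hrefs are illustrative.

```python
import re
from bs4 import BeautifulSoup

# Same simulated web page as above
html = '''\
<div>
  <p class="vehicle">
    <a class="Taxi" href="Taxi link"></a>
    <a class="bmw" href="Link to bmw">Type of car is BMW</a>
    <a class="opel" href="Link to opel">Type of car is Opel</a>
    <a class="Bus" href="Bus link"></a>
  </p>
</div>'''

soup = BeautifulSoup(html, 'html.parser')

# Pass a compiled regex as the href filter: keep only <a> tags
# whose href value starts with "Link".
links = soup.find_all('a', href=re.compile(r'^Link'))
hrefs = [tag['href'] for tag in links]
print(hrefs)
```

The regex here is applied to an attribute value the parser has already extracted, not to the raw HTML.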
#5
(Nov-09-2019, 07:37 AM)buran Wrote: don't use regex to parse html - plain and simple.

Short question about that statement.

What exactly is wrong, or shouldn't be done, if I parse the website with bs4 and then search for, say, an ajaxToken?
Getting that with bs4 alone would be a pain. So isn't regex a nice option there?

I would be glad to get an answer,
because I personally mix regex and bs4 in some of my projects.

Thanks in advance! :)
#6
I think the link I posted explains what's wrong with using regex. There's no use repeating the same discussion here.
As @snippsat explained, you can use regex at the point where the parser stops being useful. For specific cases, a mix of bs4 and regex to narrow down the exact piece of info from a string MAY work.
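A rough sketch of that kind of mix, under stated assumptions: the page below is made up, and so are the "ajaxToken" key and its value, since the question does not show the real markup. bs4 narrows the search down to the script tags, and the regex only runs on the text inside them.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical page: the token name and value are invented for illustration.
html = '''\
<html><head>
<script>var config = {"ajaxToken": "abc123", "lang": "en"};</script>
</head><body><p>Hello</p></body></html>'''

soup = BeautifulSoup(html, 'html.parser')

token = None
for script in soup.find_all('script'):
    # script.string is None for an empty or external script tag,
    # so fall back to an empty string before searching.
    found = re.search(r'"ajaxToken"\s*:\s*"([^"]+)"', script.string or '')
    if found:
        token = found.group(1)
        break

print(token)
```

The regex never has to deal with tags or nesting here, which is the part that makes regex on raw HTML fragile.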
