[Learning:bs4, re.search] - RegEx string cutoff

jarmerfohn · (This post was last modified: Nov-09-2019, 06:47 AM by jarmerfohn.)

All I'm trying to do is test-print an HTML string matched by a regex pattern, but the result is always incomplete and I can't figure out why. I'm new to Python, and a coding amateur in general... bla bla... But all the regex training sites lead me to believe my pattern will work for this seemingly simple HTML capture, yet it keeps getting cut off in the build. I've been trying different flags but I don't think that's the issue. I also know it's not the re.py cache. It's gotta be an escape char that I can't figure out, right?

GOAL:
Trying to print: "https://newjersey.craigslist.orgparlin-chevrolet-colorado-call/7014860327.html"
compile result is: "https://newjersey.craigslist.orgparlin-chevrolet-"

from urllib.request import urlopen
from urllib.error import HTTPError
from urllib.error import URLError
from bs4 import BeautifulSoup
import re



str1 = (""" 
bhcgHf4AWry,1:00N0N_iBTHgJR0p0p_2hkovkPhFZk,1:00101_bX2XWbjP0wA,1:00j0j_5naXGGGbBUK,1:00j0j_gbiQHGBLUjL,1:00k0k_fnTDHBeHrt5,1:00s0s_375GQT7ladO" href="https://newjersey.craigslist.orgparlin-chevrolet-colorado-call/7014860327.html">
<span class="result-price">$18000</span>
</a>
""")


print(str1)
reSearch1 = re.search(r'(https:).*(.html)', str1, flags=re.UNICODE)
print(reSearch1)

Output: 
bhcgHf4AWry,1:00N0N_iBTHgJR0p0p_2hkovkPhFZk,1:00101_bX2XWbjP0wA,1:00j0j_5naXGGGbBUK,1:00j0j_gbiQHGBLUjL,1:00k0k_fnTDHBeHrt5,1:00s0s_375GQT7ladO" href="https://newjersey.craigslist.orgparlin-chevrolet-colorado-call/7014860327.html">
<span class="result-price">$18000</span>
</a>

<re.Match object; span=(153, 231), match='https://newjersey.craigslist.orgparlin-chevrolet->
[Finished in 0.2s]

Thanks for any help gents,
Jr
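
One thing worth checking in the output above: the span (153, 231) covers 78 characters, which is the length of the full URL, so the match itself looks complete. What gets shortened appears to be only the display of the match object, because print(reSearch1) prints the object's repr rather than the matched text. A minimal sketch, assuming a shortened stand-in for str1 (the html variable below is not from the original post) and a slightly tightened pattern with the dot escaped:

```python
import re

# Shortened stand-in for the asker's string; only the href part matters here.
html = '''<a href="https://newjersey.craigslist.orgparlin-chevrolet-colorado-call/7014860327.html">
<span class="result-price">$18000</span>
</a>'''

# Same idea as the original pattern, but with the dot in ".html" escaped
# and a non-greedy quantifier so the match stops at the first ".html".
match = re.search(r'https:.*?\.html', html)

if match:
    url = match.group(0)  # the full matched text, not the abbreviated repr
    print(url)
    # The length of the text equals the width of the span.
    print(len(url), match.end() - match.start())
```

Calling .group(0) (or just .group()) on the match returns the whole matched string, so no flag or extra escape character should be needed for that part.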

**buran** · (This post was last modified: Nov-09-2019, 07:37 AM by buran.)

Don't use regex to parse HTML, plain and simple.

This is obviously part of an <a> tag, so use BeautifulSoup:

from bs4 import BeautifulSoup
html = '''<a href="https://newjersey.craigslist.orgparlin-chevrolet-colorado-call/7014860327.html">
<span class="result-price">$18000</span>
</a>'''
soup = BeautifulSoup(html, 'html.parser')
a_tag = soup.find('a')
print(a_tag.get('href'))

also look at our tutorial:
part1

jarmerfohn · Nov-09-2019, 11:56 PM

Well, Thanks Buran. I'm off and running with the function I was working on. But I gotta say, I'm shocked/disappointed that the two books I have in front of me at the moment both have numerous examples using regex to parse html. It all seemed so easy... Anyway, you got me to change my approach and things came together fairly quickly. Thanks. Also, neat link, very 'house of leaves'.

***snippsat*** · Nov-10-2019, 01:42 AM

Both Beautiful Soup and lxml have built-in support for regex.
You won't often need that regex support, though, since there is also support for CSS selectors, which are powerful for finding specific elements.

from bs4 import BeautifulSoup

# Simulate a web page
html = '''\
<div>
  <p class="vehicle">
    <a class="Taxi" href="Taxi link"></a>
    <a class="bmw" href="Link to bmw">Type of car is BMW</a>
    <a class="opel" href="Link to opel">Type of car is Opel</a>
    <a class="Bus" href="Bus link"></a>
  </p>
</div>'''

soup = BeautifulSoup(html, 'lxml')
# Select href attribute that begins with Link
print(soup.select('a[href^="Link"]'))

Output:
[<a class="bmw" href="Link to bmw">Type of car is BMW</a>, <a class="opel" href="Link to opel">Type of car is Opel</a>]

Where the parser stops helping is once you use .text; at that point regex may be needed on the text output.
Using the code above as an example, to find just the car type from the text:

>>> import re 
>>> 
>>> link = soup.select('a[href^="Link"]')
>>> car_text = '\n'.join([tag.text for tag in link])
>>> print(car_text)
Type of car is BMW
Type of car is Opel
>>> 
>>> re.findall(r"car is\s(\w+)", car_text)
['BMW', 'Opel']
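
The built-in regex support mentioned at the start of this post can be sketched roughly as follows: Beautiful Soup's find_all() accepts a compiled pattern as an attribute filter. This reuses the simulated page from above and swaps in 'html.parser' (an assumption, so that lxml does not have to be installed); the variable names links and hrefs are only for illustration.

```python
import re
from bs4 import BeautifulSoup

# Same simulated web page as in the CSS selector example.
html = '''\
<div>
  <p class="vehicle">
    <a class="Taxi" href="Taxi link"></a>
    <a class="bmw" href="Link to bmw">Type of car is BMW</a>
    <a class="opel" href="Link to opel">Type of car is Opel</a>
    <a class="Bus" href="Bus link"></a>
  </p>
</div>'''

soup = BeautifulSoup(html, 'html.parser')

# Pass a compiled regex as the href filter: keep <a> tags whose
# href attribute starts with "Link".
links = soup.find_all('a', href=re.compile(r'^Link'))
hrefs = [tag['href'] for tag in links]
print(hrefs)
```

This should select the same two tags as the a[href^="Link"] selector; the CSS form is usually shorter, and the regex form is there for patterns a selector cannot express.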

Fre3k · Nov-23-2019, 08:58 AM

(Nov-09-2019, 07:37 AM)buran Wrote: don't use regex to parse html - plain and simple.

Short question about that statement.

What exactly is wrong, or shouldn't be done, when I parse the website with bs4 and search for, say, an ajaxToken?
It would be a pain in the ass using bs4 alone. So regex is a nice option there, or not?

I'd be glad to get an answer, because I personally mix regex and bs4 in some of my projects.

Thanks in advance! :)

**buran** · (This post was last modified: Nov-23-2019, 09:33 AM by buran.)

I think the link I posted explains what's wrong with using regex. No use repeating the same discussion here.
As @snippsat explained, you can use regex where the parser stops helping. For specific cases, a mix of bs4 and regex to narrow down the exact piece of info from a string MAY work.
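
A rough sketch of that kind of mix, for the ajaxToken case asked about above: let bs4 locate the <script> tags, then run a regex only on the script text. The page content, the config variable, and the token value here are all made up for illustration; a real site may embed its token quite differently.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical page: the script contents and the token value are invented.
html = '''<html><head>
<script>var config = {"ajaxToken": "abc123", "lang": "en"};</script>
</head><body><p>hello</p></body></html>'''

soup = BeautifulSoup(html, 'html.parser')

token = None
# Step 1: bs4 narrows the document down to the <script> tags.
for script in soup.find_all('script'):
    text = script.string or ''  # .string can be None for some script tags
    # Step 2: regex picks the value out of the plain script text.
    found = re.search(r'"ajaxToken"\s*:\s*"([^"]+)"', text)
    if found:
        token = found.group(1)
        break

print(token)
```

The regex here never touches the HTML structure itself, only a string the parser has already isolated, which is the situation where it tends to be least fragile.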
