Need To Scrape Some Links

digitalmatic7 · Oct-09-2018, 12:18 AM

All links I need to scrape contain "www/delivery"

Example snippets of how they can look in the source code: https://pastebin.com/3Lgyb3Cd

As you can see, there's so much variation that I find it very hard to grab these links.

***micseydel*** · Oct-09-2018, 12:30 AM

For future reference, providing that in output tags instead of linking off-site would be appreciated. I'm in between tasks at work at the moment but I usually ignore these kinds of posts. (I say this purely to be helpful, not trying to call you out or anything.)

Having looked at your link... Regular expressions aren't good for parsing arbitrary HTML, but they should be great for handling regular links. Have you tried writing a simple regex? Detecting the end in the general case is the tricky part (though I'm sure there's a Stack Overflow answer to that), but in your case it might be as simple as looking for quotes.
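A minimal sketch of the "look for quotes" idea, assuming every link sits inside a matching pair of single or double quotes and contains no embedded quotes. The sample markup, the `ads.example.com` host, and the names `PATTERN` and `find_delivery_links` are made up for illustration:

```python
import re

# Hypothetical sample input, loosely modeled on the markup discussed in this thread.
html = '''
<div data-advertentie-url="https://service.sportsads.nl/www/delivery/asyncjs.php"></div>
<script src='//ads.example.com/www/delivery/ajs.php?zoneid=1'></script>
<a href="https://example.com/about">About</a>
'''

# Opening quote, then a run of non-quote characters that contains the
# footprint, then the same kind of quote again (backreference \1).
PATTERN = re.compile(r'''(["'])([^"']*www/delivery[^"']*)\1''', re.IGNORECASE)


def find_delivery_links(text):
    """Return every quoted string in text that contains 'www/delivery'."""
    return [match.group(2) for match in PATTERN.finditer(text)]


links = find_delivery_links(html)
print(links)
```

This only sees quoted strings, so an unquoted attribute value or a URL assembled piece by piece in JavaScript would be missed.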

digitalmatic7 · Oct-09-2018, 02:33 AM

I linked that way because I noticed a few of the links I'm scraping are connected to porn, spammy ads, and malware. I haven't really vetted them; they're completely random. What exactly do output tags do? How can I use them?

I should have included some code that I tried:

import re

from bs4 import BeautifulSoup

html = """
<html><head></head>
<body>
<div id="nx646" data-advertentie-url="https://service.sportsads.nl/www/delivery/asyncjs.php"></div>
</body>
</html>
"""

soup = BeautifulSoup(html, 'lxml')

# Note: text= only searches text nodes, not attribute values such as data-advertentie-url
result = soup.body.findAll(text=re.compile(r'www/delivery', re.IGNORECASE | re.DOTALL))

if len(result) > 0:
    result = "Link Found"

print(result)

I've never written any regex; it's completely foreign to me. I tried using a generator but it was confusing. I need to somehow mix this code:

re.findall(r'"([^"]*)"', inputString)

With a query that finds the other footprint: "www/delivery"

So ONLY text inside of quotations that ALSO matches the footprint.
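When the link lives in an attribute value, as in the sample above, one possible alternative to a regex is to walk every tag's attributes and keep the values that contain the footprint. The sketch below uses the standard library's `html.parser` instead of BeautifulSoup so that it runs without extra installs. The names `FOOTPRINT` and `DeliveryLinkParser` are made up for illustration. With BeautifulSoup, a similar loop over each tag's `attrs` dictionary should work the same way.

```python
from html.parser import HTMLParser

FOOTPRINT = "www/delivery"


class DeliveryLinkParser(HTMLParser):
    """Collects every attribute value that contains the footprint."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value is None for
        # attributes written without a value.
        for _name, value in attrs:
            if value and FOOTPRINT in value.lower():
                self.links.append(value)


html = """
<html><head></head>
<body>
<div id="nx646" data-advertentie-url="https://service.sportsads.nl/www/delivery/asyncjs.php"></div>
</body>
</html>
"""

parser = DeliveryLinkParser()
parser.feed(html)
print(parser.links)
```

This checks attribute values only. Links that appear inside inline script text would still need something like a quoted-string regex.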
