Python Forum
Need To Scrape Some Links


Need To Scrape Some Links - digitalmatic7 - Oct-09-2018

All the links I need to scrape contain "www/delivery".

Example snippets of how they can look in the source code: https://pastebin.com/3Lgyb3Cd

As you can see, there's so much variation that I find it very hard to grab these links.


RE: Need To Scrape Some Links - micseydel - Oct-09-2018

For future reference, providing that in output tags instead of linking off-site would be appreciated. I'm in between tasks at work at the moment but I usually ignore these kinds of posts. (I say this purely to be helpful, not trying to call you out or anything.)

Having looked at your link... Regular expressions aren't good for parsing arbitrary HTML, but they should be great for handling regular links. Have you tried writing a simple regex? Detecting the end in the general case is the tricky part (though I'm sure there's a Stack Overflow answer to that), but in your case it might be as simple as looking for quotes.


RE: Need To Scrape Some Links - digitalmatic7 - Oct-09-2018


I linked that way because I noticed a few of the links I'm scraping are connected to porn, spammy ads, or malware. I haven't really vetted them; they're completely random. What exactly do output tags do? How can I use them?

I should have included some code that I tried:

import re

from bs4 import BeautifulSoup

html = """
<html><head></head>
<body>
<div id="nx646" data-advertentie-url="https://service.sportsads.nl/www/delivery/asyncjs.php"></div>
</body>
</html>
"""

soup = BeautifulSoup(html, 'lxml')

# Note: text= only searches text nodes, so a URL stored in an attribute value is not matched here
result = soup.body.findAll(text=re.compile(r'www/delivery', re.IGNORECASE | re.DOTALL))

if len(result) > 0:
    result = "Link Found"

print(result)
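
One reason the attempt above comes up empty is that the footprint sits in an attribute value, while the text= search only looks at text nodes. A minimal sketch of scanning attribute values instead, swapping in the standard library's html.parser so it runs without extra packages. FootprintFinder and sample_html are hypothetical names used for illustration, not something from the thread:

```python
from html.parser import HTMLParser

FOOTPRINT = "www/delivery"  # the footprint named in the post


class FootprintFinder(HTMLParser):
    """Hypothetical helper: collect attribute values that contain the footprint."""

    def __init__(self):
        super().__init__()
        self.matches = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value is None for bare attributes
        for name, value in attrs:
            if value and FOOTPRINT in value.lower():
                self.matches.append(value)


sample_html = """
<html><head></head>
<body>
<div id="nx646" data-advertentie-url="https://service.sportsads.nl/www/delivery/asyncjs.php"></div>
</body>
</html>
"""

finder = FootprintFinder()
finder.feed(sample_html)
print(finder.matches)
```

This only covers links that live in tag attributes. Links embedded in inline JavaScript or other text would need a different approach, such as a regex over the raw source.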
I've never written any regex; it's completely foreign to me. I tried using a generator but it was confusing. I need to somehow mix this code:

re.findall(r'"([^"]*)"', inputString)
with a query that finds the other footprint: "www/delivery"

In other words, ONLY text inside quotation marks that ALSO contains the footprint.
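
One possible way to combine the two conditions is to put the footprint inside the quoted-text pattern itself, so a match must be a run of non-quote characters that contains "www/delivery" and is bounded by quotes. This is only a sketch under assumptions: it accepts both single and double quotes, the function name find_footprint_links is made up for illustration, and the ads.example.com line is an invented test case rather than one of the pastebin snippets.

```python
import re

# A quote, then a run of non-quote characters that contains the footprint, then a quote.
FOOTPRINT_RE = re.compile(r'''["']([^"']*www/delivery[^"']*)["']''', re.IGNORECASE)


def find_footprint_links(text):
    """Return every quoted string in text that contains the footprint."""
    return FOOTPRINT_RE.findall(text)


sample = '''
<div id="nx646" data-advertentie-url="https://service.sportsads.nl/www/delivery/asyncjs.php"></div>
<script src='//ads.example.com/www/delivery/ajs.php'></script>
<a href="https://example.com/about">About</a>
'''

for link in find_footprint_links(sample):
    print(link)
```

Because the pattern does not check that the opening and closing quotes belong together, unquoted text that happens to sit between two quoted strings could also match, so the results are worth a sanity check on real pages.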