 Need To Scrape Some Links
#1
All the links I need to scrape contain "www/delivery".

Example snippets of how they can look in the source code: https://pastebin.com/3Lgyb3Cd

As you can see, there's so much variation that I find it very hard to grab these links.
#2
For future reference, providing that in output tags instead of linking off-site would be appreciated. I'm in between tasks at work at the moment but I usually ignore these kinds of posts. (I say this purely to be helpful, not trying to call you out or anything.)

Having looked at your link... Regular expressions aren't good for parsing arbitrary HTML, but they should be great for handling regular links. Have you tried writing a simple regex? Detecting the end in the general case is the tricky part (though I'm sure there's a Stack Overflow answer to that), but in your case it might be as simple as looking for quotes.
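
To make the "look for quotes" idea concrete, here is a minimal sketch, not a vetted solution. It assumes every link sits inside a pair of single or double quotes and does not itself contain a quote character. The names FOOTPRINT_LINK and find_delivery_links, and the single-quoted sample URL, are made up for illustration.

```python
import re

# A quote, then any run of non-quote characters that contains the
# footprint, then a closing quote. Only the part between the quotes
# is captured.
FOOTPRINT_LINK = re.compile(r'''["']([^"']*www/delivery[^"']*)["']''', re.IGNORECASE)


def find_delivery_links(source):
    """Return every quoted string in source that contains 'www/delivery'."""
    return FOOTPRINT_LINK.findall(source)


sample = (
    '<div id="nx646" data-advertentie-url='
    '"https://service.sportsads.nl/www/delivery/asyncjs.php"></div>'
    "<script src='//ads.example.invalid/www/delivery/ajs.php'></script>"
)

for link in find_delivery_links(sample):
    print(link)
```

Because the pattern never lets a match cross a quote character, the closing quote marks the end of the link. Links that are not quoted at all would be missed and would need a different end condition.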
Feel like you're not getting the answers you want? Check out the help/rules for things like what to include/not include in a post, how to use code tags, how to ask smart questions, and more.

Pro-tip - there's an inverse correlation between the number of lines of code posted and my enthusiasm for helping with a question :)
#3
(Oct-09-2018, 12:30 AM)micseydel Wrote: For future reference, providing that in output tags instead of linking off-site would be appreciated. I'm in between tasks at work at the moment but I usually ignore these kinds of posts. (I say this purely to be helpful, not trying to call you out or anything.)

Having looked at your link... Regular expressions aren't good for parsing arbitrary HTML, but they should be great for handling regular links. Have you tried writing a simple regex? Detecting the end in the general case is the tricky part (though I'm sure there's a Stack Overflow answer to that), but in your case it might be as simple as looking for quotes.

I linked that way because I noticed a few of the links I'm scraping are connected to porn, spammy ads, or malware. I haven't really vetted them; they're completely random. What exactly do output tags do? How can I use them?

I should have included some code that I tried:

import re

from bs4 import BeautifulSoup

html = """
<html><head></head>
<body>
<div id="nx646" data-advertentie-url="https://service.sportsads.nl/www/delivery/asyncjs.php"></div>
</body>
</html>
"""

soup = BeautifulSoup(html, 'lxml')

# Search the text nodes inside <body> for the footprint
result = soup.body.findAll(text=re.compile(r'www/delivery', re.IGNORECASE | re.DOTALL))

if len(result) > 0:
    result = "Link Found"

print(result)
I've never written any regex; it's completely foreign to me. I tried using a generator, but it was confusing. I need to somehow mix this code:

re.findall(r'"([^"]*)"', inputString)
with a query that finds the other footprint: "www/delivery"

So I want ONLY text inside quotation marks that ALSO contains the footprint.
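
One way to mix the two, sketched here under assumptions rather than as a definitive answer: keep the quoted-string pattern from the post as it is, then filter its results with a plain substring check. It assumes the links are always in double quotes. The function name quoted_strings_with_footprint and the variable page_source are hypothetical.

```python
import re

FOOTPRINT = "www/delivery"


def quoted_strings_with_footprint(input_string, footprint=FOOTPRINT):
    """Return double-quoted strings from input_string that contain footprint."""
    # Step 1: every piece of text between a pair of double quotes.
    quoted = re.findall(r'"([^"]*)"', input_string)
    # Step 2: keep only the pieces containing the footprint (case-insensitive).
    return [text for text in quoted if footprint.lower() in text.lower()]


page_source = """
<html><head></head>
<body>
<div id="nx646" data-advertentie-url="https://service.sportsads.nl/www/delivery/asyncjs.php"></div>
</body>
</html>
"""

print(quoted_strings_with_footprint(page_source))
```

Splitting the work into two steps keeps the regex as simple as the one already in the post. The filtering step is ordinary string handling, so it can be changed without touching the pattern.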