 Need To Scrape Some Links
#1
All links I need to scrape contain "www/delivery"

Here are example snippets of how it can look in the source code: https://pastebin.com/3Lgyb3Cd

As you can see, there's so much variation that I find it very hard to grab these links.
#2
For future reference, providing that in output tags instead of linking off-site would be appreciated. I'm in between tasks at work at the moment but I usually ignore these kinds of posts. (I say this purely to be helpful, not trying to call you out or anything.)

Having looked at your link... Regular expressions aren't good for parsing arbitrary HTML, but they should be great for handling regular links. Have you tried writing a simple regex? Detecting the end in the general case is the tricky part (though I'm sure there's a Stack Overflow answer to that), but in your case it might be as simple as looking for quotes.
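Something along these lines is what I have in mind. Completely untested, and the URL is just one I made up with your footprint in it, but since attribute values sit between double quotes, the closing quote marks the end of the link:

import re

html = '<script src="https://example.com/www/delivery/ajs.php?zoneid=1"></script>'

# grab everything between a pair of double quotes that starts with http(s)
urls = re.findall(r'"(https?://[^"]*)"', html)
print(urls)  # ['https://example.com/www/delivery/ajs.php?zoneid=1']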
#3
(Oct-09-2018, 12:30 AM)micseydel Wrote: For future reference, providing that in output tags instead of linking off-site would be appreciated. I'm in between tasks at work at the moment but I usually ignore these kinds of posts. (I say this purely to be helpful, not trying to call you out or anything.)

Having looked at your link... Regular expressions aren't good for parsing arbitrary HTML, but they should be great for handling regular links. Have you tried writing a simple regex? Detecting the end in the general case is the tricky part (though I'm sure there's a Stack Overflow answer to that), but in your case it might be as simple as looking for quotes.

I linked that way because I noticed a few of the links I'm scraping are connected to porn, spammy ads, and malware. I haven't really vetted them; they're completely random. What exactly do output tags do? How can I use them?

I should have included some code that I tried:

html = """
<html><head></head>
<body>
<div id="nx646" data-advertentie-url="https://service.sportsads.nl/www/delivery/asyncjs.php"></div>
</body>
</html>
"""

soup = BeautifulSoup(html, 'lxml')

result = soup.body.findAll(text=re.compile(r'www/delivery', re.IGNORECASE | re.DOTALL))

if len(result) > 0:
    result = "Link Found"
    
print(result)

I've never written any regex; it's completely foreign to me. I tried using a generator but it was confusing. I need to somehow mix this code:

re.findall(r'"([^"]*)"', inputString)
with a query that finds the other footprint: "www/delivery".

So ONLY text inside of quotations that ALSO matches the footprint.
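
Something like this is what I'm picturing, just pieced together from my snippet above, so I have no idea if it's even close:

import re

inputString = '<div id="nx646" data-advertentie-url="https://service.sportsads.nl/www/delivery/asyncjs.php"></div>'

# only quoted strings that also contain the www/delivery footprint
links = re.findall(r'"([^"]*www/delivery[^"]*)"', inputString, re.IGNORECASE)
print(links)  # hoping this prints ['https://service.sportsads.nl/www/delivery/asyncjs.php']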
