Scraping external URLs from pages

I'd like to start this off by saying I know nothing at all about Python. If what I want to do is possible, I'll need to find someone to help me to do it.

I have a list of a few million URLs that I gathered using ScrapeBox, and I would like to scrape all of those pages for external domains. My end goal is to find available domains that can be registered. I'm looking for a cheaper way to scrape external URLs: ScrapeBox requires Windows hosting, and scaling it to what I need isn't cost effective.

How difficult would this be to do in Python? Also, would it be costly to churn through millions of URLs?
I'm not sure I've fully understood the idea, but consider this:

First you have to read the list of domains, modify them, and then ping each new address.
If the ping returns OK, you know that the modified domain is taken.
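
A minimal sketch of that check could look something like this, using a DNS lookup in place of ping (the file name is an assumption on my part, and note that a failed lookup only suggests the domain might be free; a WHOIS query would be needed to confirm it):

import socket

# Assumes "domains.txt" holds one candidate domain per line.
with open("domains.txt") as infile:
    for domain in infile:
        domain = domain.strip()
        if not domain:
            continue
        try:
            # If the name resolves, the domain is certainly registered.
            socket.gethostbyname(domain)
            print(domain, "resolves (taken)")
        except socket.gaierror:
            # No DNS record: the domain *may* be available.
            print(domain, "does not resolve (possibly available)")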
I do a little scraping; maybe I can help you.
Let me know.
Thank you
Renny
@Blue Dog: This is not a post in the Jobs section. If you can and want to contribute to the discussion, please share your thoughts in a public post in the thread.
@Apook, let me know if you want this moved to the Jobs section, that is, if you want to hire someone to do it for you.
This isn't a post seeking someone to do it, and at this point I can't afford to hire anyone; I just want to know whether it's possible. I have a text file with millions of URLs, each one a web page. I need all the links on each of those pages scraped and written to a text file.
Yes, it is possible, and probably easier than you expect :)

The modules requests (to fetch the page at each URL) and BeautifulSoup (to quickly and easily parse each page) will make this very easy for you.
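
A minimal sketch of that approach could look something like this. The file names, the timeout, and treating "different domain than the page itself" as the test for an external link are assumptions on my part, not something from the thread:

from urllib.parse import urlparse

import requests
from bs4 import BeautifulSoup

def external_links(page_url, html):
    """Yield absolute links whose domain differs from the page's own."""
    page_domain = urlparse(page_url).netloc
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup.find_all("a", href=True):
        href = tag["href"]
        domain = urlparse(href).netloc
        # Relative links have no netloc and are skipped as internal.
        if domain and domain != page_domain:
            yield href

# Assumes "urls.txt" holds one URL per line; every external link found
# is written to "external_links.txt".
with open("urls.txt") as infile, open("external_links.txt", "w") as outfile:
    for url in infile:
        url = url.strip()
        if not url:
            continue
        try:
            response = requests.get(url, timeout=10)
            response.raise_for_status()
        except requests.RequestException:
            continue  # skip pages that fail to load or return errors
        for link in external_links(url, response.text):
            outfile.write(link + "\n")

At a few million URLs, a single-threaded loop like this will be slow, so you would probably want to add some concurrency (e.g. concurrent.futures.ThreadPoolExecutor) and deduplicate the extracted domains before checking which ones are available.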