Downloading Page Source From URL List

**deanhystad** · (This post was last modified: Jun-07-2024, 05:51 PM by deanhystad.)

(Jun-06-2024, 06:59 PM)zunebuggy Wrote: My sites.txt is just a list of urls like:
https://abc.com
https://def.com
https://ghi.com
Thanks.

If that is what your URLs look like, https: is not the only problem. There are also problems with this:

        myfold = myurl[8:10]
        myfn = myurl[8:12]

"https://abc.com"[8:10] == "ab", not "abc". 10-8 = 2 characters, not 3. You would need [8:11], or [7:10] after you fix the https issue. Better yet you should use pattern matching to extract the file and folder names.

import re

URL = r"http://gibberish.com\more_gibberish?page="

if match := re.search(r"(\w+\.\w+)\\(.*)", URL):
    website, document = match.groups()
    name, ext = website.split(".")
    print(name, ext, document)

Output:
gibberish com more_gibberish?page=
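
To check the slice arithmetic against the sample URLs from sites.txt, something like the following might help. It is only a sketch: strip_scheme is a made-up helper name, not part of the original script, and it assumes the list contains only http:// and https:// addresses.

```python
def strip_scheme(url):
    """Drop a leading https:// or http:// so later slices do not depend on the scheme length."""
    for scheme in ("https://", "http://"):
        if url.startswith(scheme):
            return url[len(scheme):]
    return url  # no known scheme: leave the string unchanged


url = "https://abc.com"

# A slice [a:b] holds b - a characters.
print(url[8:10])  # 2 characters
print(url[8:11])  # 3 characters

# After stripping the scheme, the same offsets work for http and https.
host = strip_scheme(url)
print(host[:3])
```

Stripping the scheme first means the slice offsets no longer have to change between [8:11] and [7:10] depending on whether the address starts with https or http.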

Pedroski55 · Jun-08-2024, 06:40 AM

I don't think you will use \ in hyperlinks.

If your web address looks like:

URL = r"http://gibberish-rubbish-trash.com/more_gibberish?page="

and your regex is like this, with / in place of \\ as the separator:

e = re.compile(r"(\w+\.\w+)/(.*)")

You will not find the whole web address, because - is not matched by \w, so the match can only start after the last hyphen:

e.search(URL)

Output:
<re.Match object; span=(25, 55), match='trash.com/more_gibberish?page='>

I had trouble with web addresses containing -

This finds the base web address:

URL = r"http://gibberish-rubbish-trash.com/more_gibberish?page="
e = re.compile(r"//(\S+\.\w+)")
res = e.search(URL)
res.group(1)

Output:
'gibberish-rubbish-trash.com'
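
One caveat with that pattern: \S+ is greedy, so if the path itself contains a dot (for example /index.html) the match runs up to the last dot in the string rather than stopping at the host. A possible alternative, if the standard library is acceptable, is urllib.parse.urlsplit, which copes with hyphens and with dots in the path. This is a sketch, not the original poster's method: host_and_name is a hypothetical helper, and using the first label of the host as the folder name is my assumption about what the script needs.

```python
from urllib.parse import urlsplit


def host_and_name(url):
    """Return (hostname, first dot-separated label of the hostname) for an absolute URL."""
    # hostname is None when the URL has no network location, so fall back to "".
    host = urlsplit(url).hostname or ""
    return host, host.split(".")[0]


print(host_and_name("http://gibberish-rubbish-trash.com/more_gibberish?page="))
print(host_and_name("https://abc.com"))
```

Note that for an address such as https://www.abc.com the first label would be www, so the split would need adjusting if the list contains such entries.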
