Python Forum
Downloading Page Source From URL List - Printable Version


RE: Downloading Page Source From URL List - deanhystad - Jun-07-2024

(Jun-06-2024, 06:59 PM)zunebuggy Wrote: My sites.txt is just a list of urls like:
https://abc.com
https://def.com
https://ghi.com
Thanks.
If that is what your URLs look like, https: is not the only problem. These slices are wrong too:
        myfold = myurl[8:10]
        myfn = myurl[8:12]
"https://abc.com"[8:10] == "ab", not "abc". A slice [8:10] covers 10 - 8 = 2 characters, not 3. You would need [8:11], or [7:10] after you fix the https issue. Better yet, you should use pattern matching to extract the file and folder names.
import re

URL = r"http://gibberish.com\more_gibberish?page="

if match := re.search(r"(\w+\.\w+)\\(.*)", URL):
    website, document = match.groups()
    name, ext = website.split(".")
    print(name, ext, document)
Output:
gibberish com more_gibberish?page=
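
The slice arithmetic above can be checked directly. This is only a minimal sketch using the https://abc.com entry from sites.txt; the variable names other than myurl are made up for illustration.

```python
myurl = "https://abc.com"

# A slice [start:stop] returns stop - start characters.
two_chars = myurl[8:10]    # 10 - 8 = 2 characters
three_chars = myurl[8:11]  # 11 - 8 = 3 characters
four_chars = myurl[8:12]   # 12 - 8 = 4 characters, includes the dot

# "http://" is one character shorter than "https://", so the same
# three letters start at index 7 instead of 8.
http_chars = "http://abc.com"[7:10]

print(two_chars, three_chars, four_chars, http_chars)
```

This prints ab abc abc. abc, which shows why fixed offsets break as soon as the scheme or the length of the site name changes.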



RE: Downloading Page Source From URL List - Pedroski55 - Jun-08-2024

I don't think you will see \ in hyperlinks; URLs use / as the separator.

If your web address contains hyphens, like this (keeping the \ from the example above so that the regex can still match):

URL = r"http://gibberish-rubbish-trash.com\more_gibberish?page="
and your regex is like this:

e = re.compile(r"(\w+\.\w+)\\(.*)")
you will not find the whole web address, because - is not matched by \w, so the match only starts after the last hyphen:

Output:
e.search(URL) <re.Match object; span=(25, 55), match='trash.com\\more_gibberish?page='>
I had trouble with web addresses containing - for that reason.

This finds the base web address:

URL = r"http://gibberish-rubbish-trash.com/more_gibberish?page="
e = re.compile(r"//(\S+\.\w+)")
res = e.search(URL)
res.group(1)
Output:
'gibberish-rubbish-trash.com'
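
One caveat with the regex above: \S+ is greedy, so if the path itself contains a dot (for example /index.html) the match runs past the host name. As a possible alternative, not something proposed in the thread, the standard library's urllib.parse.urlsplit can pull out the host without a regex. The sketch below assumes every line in sites.txt starts with a scheme such as https://; host_parts is a hypothetical helper name.

```python
from urllib.parse import urlsplit


def host_parts(url):
    """Return (host, first label of host) for a URL.

    Hypothetical helper for illustration; assumes the URL has a scheme.
    """
    # hostname is None when the URL has no network location part.
    host = urlsplit(url).hostname or ""
    # The first dot-separated label, e.g. "abc" for "abc.com".
    name = host.split(".")[0]
    return host, name


print(host_parts("http://gibberish-rubbish-trash.com/more_gibberish?page="))
print(host_parts("https://abc.com"))
```

Hyphens need no special handling here, and the first label could serve as the folder or file name. Note that for a host like www.abc.com the first label would be www, so that case would need extra handling.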