Downloading Page Source From URL List
Python Forum (https://python-forum.io), Forum: General Coding Help, Thread: /thread-42257.html
RE: Downloading Page Source From URL List - deanhystad - Jun-07-2024

(Jun-06-2024, 06:59 PM)zunebuggy Wrote: My sites.txt is just a list of urls like:

If that is what your URLs look like, https: is not the only problem. There are also problems with this:

```python
myfold = myurl[8:10]
myfn = myurl[8:12]
```

"https://abc.com"[8:10] == "ab", not "abc". 10 - 8 = 2 characters, not 3. You would need [8:11], or [7:10] after you fix the https issue.

Better yet, you should use pattern matching to extract the file and folder names.

```python
import re

URL = r"http://gibberish.com\more_gibberish?page="
if match := re.search(r"(\w+\.\w+)\\(.*)", URL):
    website, document = match.groups()
    name, ext = website.split(".")
    print(name, ext, document)
```
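To make the slice arithmetic above concrete, here is a minimal sketch. The helper name `host_prefix` is hypothetical (it is not from the original script), and it assumes the goal is the first few characters after the scheme. It locates the end of the scheme with `str.partition` rather than hard-coding 7 or 8.

```python
def host_prefix(url: str, length: int = 3) -> str:
    """Return the first `length` characters after '://' (hypothetical helper)."""
    # partition() splits on the first "://"; `rest` is everything after it.
    _scheme, sep, rest = url.partition("://")
    if not sep:
        # No scheme found: treat the whole string as the host part.
        rest = url
    return rest[:length]


# A slice [a:b] yields b - a characters.
print("https://abc.com"[8:10])          # ab
print("https://abc.com"[8:11])          # abc
print("http://abc.com"[7:10])           # abc

# The helper gives the same answer for either scheme.
print(host_prefix("https://abc.com"))   # abc
print(host_prefix("http://abc.com"))    # abc
```

With this approach the offset no longer depends on whether a line in sites.txt starts with http:// or https://.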
RE: Downloading Page Source From URL List - Pedroski55 - Jun-08-2024

I don't think you will use \ in hyperlinks.

If your web address looks like:

```python
URL = r"http://gibberish-rubbish-trash.com/more_gibberish?page="
```

and your regex is like this:

```python
e = re.compile(r"(\w+\.\w+)\\(.*)")
```

you will not find the whole web address. Besides the \\ in the pattern never matching a /, the - is not a \w character, so \w+ stops at each hyphen. I had trouble with web addresses containing -.

This finds the base web address:

```python
import re

URL = r"http://gibberish-rubbish-trash.com/more_gibberish?page="
e = re.compile(r"//(\S+\.\w+)")
res = e.search(URL)
res.group(1)  # 'gibberish-rubbish-trash.com'
```
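As an alternative to a regex, the standard library's `urllib.parse.urlsplit` can pull out the host regardless of scheme or hyphens. This is only a sketch, not what either poster used: `split_site` is a hypothetical name, and the second URL is a made-up example with a dot in its path, a case where the greedy `\S+` in the pattern above would capture part of the path as well.

```python
from typing import Tuple
from urllib.parse import urlsplit


def split_site(url: str) -> Tuple[str, str]:
    """Return (host, rest) for a URL that includes a scheme (hypothetical helper)."""
    parts = urlsplit(url)
    # hostname is None when the URL has no network location.
    host = parts.hostname or ""
    # Drop the leading "/" so the path can be used as a file name.
    rest = parts.path.lstrip("/")
    if parts.query:
        rest += "?" + parts.query
    return host, rest


print(split_site("http://gibberish-rubbish-trash.com/more_gibberish?page="))
# ('gibberish-rubbish-trash.com', 'more_gibberish?page=')

print(split_site("https://abc.com/index.html"))
# ('abc.com', 'index.html')
```

Note that `rest` may still contain characters such as ? that are not valid in file names on every platform, so it would need cleaning before being used as a file or folder name.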