Python Forum

Full Version: a better way to code this to fix a URL?
You're currently viewing a stripped down version of our content. View the full version with proper formatting.
Pages: 1 2
i have this silly code meant to fix a URL that is missing as many as 7 start characters:
if url.startswith('//'): url = 'https:' + url
if url.startswith('://'): url = 'https' + url
if url.startswith('s://'): url = 'http' + url
if url.startswith('ps://'): url = 'htt' + url
if url.startswith('tps://'): url = 'ht' + url
if url.startswith('ttps://'): url = 'h' + url
is there a nice clean and pythonic way to refactor this code into 3 lines or fewer?

i hope you got a good laugh out of that code above.
It seems that url will always contain // so why not just split and join and not bother whether is missing something or not:

>>> urls = ["//example.com", "://example.com", "s://example.com", "https://example.com"]
>>> for url in urls:
...     print("".join(["https://", url.split("//")[-1]]))
...
https://example.com
https://example.com
https://example.com
https://example.com
If the url should always start with https even, if only a hostname is given, you could use this:

from urllib.parse import urlsplit, urlunsplit


def conv2(url):
    return urlunsplit(("https", *urlsplit(url)[1:]))
But I would prefer an error message, that this scheme is not supported or that the url is mistyped.
Automatic correction of human input, leads to funny outcomes.

And sometimes gene sequences are also renamed because Excel has interpreted the text input and turned the name into a date.
if characters are not missing, i want them to be correct. correctness will be tested later. this code needs to just pass provided characters as is so they can be tested. later code will test for "http://" and file:///" and whatever else i might want to look for. it's "https://" that i am focusing on. the typical errors come from copy and paste sloppiness.
(Mar-19-2024, 09:15 AM)DeaD_EyE Wrote: [ -> ]And sometimes gene sequences are also renamed because Excel has interpreted the text input and turned the name into a date.
so that must be why i have so much difficulty with decorators Cry
You could try this perhaps
import re

url = re.sub(r'^(?:(?:(?:(?:(?:h?t)?t)?p)?s)?\:)?//', 'https://', url)
(Mar-22-2024, 04:58 PM)Gribouillis Wrote: [ -> ]re.sub(r'^(?:(?:(?:(?:(?:h?t)?t)?p)?s)?\:)?//', 'https://', url)
i don't understand it, but it seems to work.
(Mar-25-2024, 03:26 AM)Skaperen Wrote: [ -> ]i don't understand it
It is easy to understand. With a regular expression, if you want to write « spam optionally preceded by eggs », you write
r'(eggs)?spam'
This matches 'spam' and also 'eggsspam'. The problem is that (eggs) is a capturing group, which you don't need. So you can replace it by a non capturing group by using the (?: syntax. It gives
r'(?:eggs)?spam'
Now, instead of « spam optionally preceded by eggs », the above regular expression means
Output:
beginning of the string then '//' optionally preceded by ( ':' optionally preceded by ( 's' optionally preceded by ( 'p' optionallly preceded by ( 't' optionally preceded by ( 't' optionally preceded by ( 'h' ))))))
now, i'm totally lost. i just want to keep it simple and obvious or well explained in comments.
(Mar-25-2024, 11:59 PM)Skaperen Wrote: [ -> ]i just want to keep it simple and obvious or well explained in comments.
You can simplify it a bit because you don't need to match the 'h'. Here it is with a clear comment
import re
 
# replace any of //, ://, s://, ps://, tps://, ttps:// by https:// at beginning of url
url = re.sub(r'^(?:(?:(?:(?:t?t)?p)?s)?\:)?//', 'https://', url)
Pages: 1 2