Python Forum
a better way to code this to fix a URL?
Thread Rating:
  • 0 Vote(s) - 0 Average
  • 1
  • 2
  • 3
  • 4
  • 5
a better way to code this to fix a URL?
#1
i have this silly code meant to fix a URL that is missing as many as 7 start characters:
if url.startswith('//'): url = 'https:' + url
if url.startswith('://'): url = 'https' + url
if url.startswith('s://'): url = 'http' + url
if url.startswith('ps://'): url = 'htt' + url
if url.startswith('tps://'): url = 'ht' + url
if url.startswith('ttps://'): url = 'h' + url
is there a nice clean and pythonic way to refactor this code into 3 lines or fewer?

i hope you got a good laugh out of that code above.
Tradition is peer pressure from dead people

What do you call someone who speaks three languages? Trilingual. Two languages? Bilingual. One language? American.
Reply
#2
It seems that url will always contain // so why not just split and join and not bother whether is missing something or not:

>>> urls = ["//example.com", "://example.com", "s://example.com", "https://example.com"]
>>> for url in urls:
...     print("".join(["https://", url.split("//")[-1]]))
...
https://example.com
https://example.com
https://example.com
https://example.com
I'm not 'in'-sane. Indeed, I am so far 'out' of sane that you appear a tiny blip on the distant coast of sanity. Bucky Katt, Get Fuzzy

Da Bishop: There's a dead bishop on the landing. I don't know who keeps bringing them in here. ....but society is to blame.
Reply
#3
If the url should always start with https even, if only a hostname is given, you could use this:

from urllib.parse import urlsplit, urlunsplit


def conv2(url):
    return urlunsplit(("https", *urlsplit(url)[1:]))
But I would prefer an error message, that this scheme is not supported or that the url is mistyped.
Automatic correction of human input, leads to funny outcomes.

And sometimes gene sequences are also renamed because Excel has interpreted the text input and turned the name into a date.
Almost dead, but too lazy to die: https://sourceserver.info
All humans together. We don't need politicians!
Reply
#4
if characters are not missing, i want them to be correct. correctness will be tested later. this code needs to just pass provided characters as is so they can be tested. later code will test for "http://" and file:///" and whatever else i might want to look for. it's "https://" that i am focusing on. the typical errors come from copy and paste sloppiness.
Tradition is peer pressure from dead people

What do you call someone who speaks three languages? Trilingual. Two languages? Bilingual. One language? American.
Reply
#5
(Mar-19-2024, 09:15 AM)DeaD_EyE Wrote: And sometimes gene sequences are also renamed because Excel has interpreted the text input and turned the name into a date.
so that must be why i have so much difficulty with decorators Cry
Tradition is peer pressure from dead people

What do you call someone who speaks three languages? Trilingual. Two languages? Bilingual. One language? American.
Reply
#6
You could try this perhaps
import re

url = re.sub(r'^(?:(?:(?:(?:(?:h?t)?t)?p)?s)?\:)?//', 'https://', url)
« We can solve any problem by introducing an extra level of indirection »
Reply
#7
(Mar-22-2024, 04:58 PM)Gribouillis Wrote: re.sub(r'^(?:(?:(?:(?:(?:h?t)?t)?p)?s)?\:)?//', 'https://', url)
i don't understand it, but it seems to work.
Tradition is peer pressure from dead people

What do you call someone who speaks three languages? Trilingual. Two languages? Bilingual. One language? American.
Reply
#8
(Mar-25-2024, 03:26 AM)Skaperen Wrote: i don't understand it
It is easy to understand. With a regular expression, if you want to write « spam optionally preceded by eggs », you write
r'(eggs)?spam'
This matches 'spam' and also 'eggsspam'. The problem is that (eggs) is a capturing group, which you don't need. So you can replace it by a non capturing group by using the (?: syntax. It gives
r'(?:eggs)?spam'
Now, instead of « spam optionally preceded by eggs », the above regular expression means
Output:
beginning of the string then '//' optionally preceded by ( ':' optionally preceded by ( 's' optionally preceded by ( 'p' optionallly preceded by ( 't' optionally preceded by ( 't' optionally preceded by ( 'h' ))))))
perfringo likes this post
« We can solve any problem by introducing an extra level of indirection »
Reply
#9
now, i'm totally lost. i just want to keep it simple and obvious or well explained in comments.
Tradition is peer pressure from dead people

What do you call someone who speaks three languages? Trilingual. Two languages? Bilingual. One language? American.
Reply
#10
(Mar-25-2024, 11:59 PM)Skaperen Wrote: i just want to keep it simple and obvious or well explained in comments.
You can simplify it a bit because you don't need to match the 'h'. Here it is with a clear comment
import re
 
# replace any of //, ://, s://, ps://, tps://, ttps:// by https:// at beginning of url
url = re.sub(r'^(?:(?:(?:(?:t?t)?p)?s)?\:)?//', 'https://', url)
« We can solve any problem by introducing an extra level of indirection »
Reply


Forum Jump:

User Panel Messages

Announcements
Announcement #1 8/1/2020
Announcement #2 8/2/2020
Announcement #3 8/6/2020