Nov-14-2021, 03:27 AM
Thanks, that worked perfectly for the format of the sample data. I did run into some problems when the data was in a different format. The code below works for the original type of sample data plus one other type. There may be a better way of addressing this, but I have tried to comment here and there.
# Removing the unwanted data from a file. First, test the sample data
#string = 'family/Smallville, Robert & Mary/28134: Bioelectro healing - https://t.me/bioelectromagnetic_healing'
string = 'Python Forums/Friends/35361:<a href=3D"https://python-forum.io/thread-35532.html" rel=3D"noreferrer" target=3D='
print(string)
my_list = string.split(' ')  # split the string into a list/array
iter_len = len(my_list)
for m in my_list:  # go through the list & print each element
    print(m)
matches = []
for match in my_list:  # find the element that has the string 'https' in it
    if "https" in match:
        matches.append(match)
print(matches)
# Often there is no space or dash chars, but html encoding and other strange chars
# so find the position of the 'https'
# Initializing string
ini_string1 = matches[0]
# Character to find
c = "https"
# printing initial string and character
print("initial_strings : ", ini_string1, "\ncharacter_to_find : ", c)
# Using index Method
try:
    res = ini_string1.index(c)
    print("Character {} in string {} is present at {}".format(c, ini_string1, str(res + 1)))
except ValueError as e:
    print("No such character available in string {}".format(ini_string1))
Output:
Python Forums/Friends/35361:<a href=3D"https://python-forum.io/thread-35532.html" rel=3D"noreferrer" target=3D=
Python
Forums/Friends/35361:<a
href=3D"https://python-forum.io/thread-35532.html"
rel=3D"noreferrer"
target=3D=
['href=3D"https://python-forum.io/thread-35532.html"']
initial_strings : href=3D"https://python-forum.io/thread-35532.html"
character_to_find : https
Character https in string href=3D"https://python-forum.io/thread-35532.html" is present at 9
So now I have position 9 as the starting character of the URI (counting from 1, so index 8 in Python), which I can use to slice the URI out of the string. However, this assumes the URI is complete and has no 'garbage' at the end of it. I think the code is nearly ready to run through the whole file, printing each 'https' it finds, and can then be modified further to cater for any gotchas.
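One possible way to deal with the 'garbage at the end' worry is a regular expression instead of split() plus index(). This is only a rough sketch, not tested against the real file. It assumes a URI ends at the first whitespace, quote, or angle bracket, which holds for both sample strings above but may not hold for every line. The names URL_PATTERN, find_urls, and print_urls_in_file are made up for this example, and the file path is whatever you pass in.

```python
import re

# Assumption: a URI runs from "https://" up to the first whitespace,
# single or double quote, or angle bracket.
URL_PATTERN = re.compile(r"""https://[^\s"'<>]+""")


def find_urls(line):
    """Return a list of every https URI found in one line of text."""
    return URL_PATTERN.findall(line)


def print_urls_in_file(path):
    """Print each URI found in the file at `path` (hypothetical helper)."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        for line_number, line in enumerate(handle, start=1):
            for url in find_urls(line):
                print(line_number, url)


# The two sample strings from the post.
samples = [
    'family/Smallville, Robert & Mary/28134: Bioelectro healing - https://t.me/bioelectromagnetic_healing',
    'Python Forums/Friends/35361:<a href=3D"https://python-forum.io/thread-35532.html" rel=3D"noreferrer" target=3D=',
]

for sample in samples:
    print(find_urls(sample))
```

Because the pattern stops at the closing quote, the trailing `" rel=3D"noreferrer"` part is never included, so no separate position lookup is needed. Lines without an 'https' simply give an empty list rather than raising an error. Trailing punctuation such as a full stop directly after a URI would still be captured, so that may be one of the gotchas to handle later.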