Removing the unwanted data from a file

jehoshua · Nov-14-2021, 03:27 AM

Thanks, that worked perfectly for the format of the sample data. I did run into some problems when the data was different. This is what works for the same type of sample data, plus another type. There may be a better way of addressing this, but I have tried to comment here and there.

# Removing the unwanted data from a file. First, test the sample data

#string = 'family/Smallville, Robert & Mary/28134: Bioelectro healing - https://t.me/bioelectromagnetic_healing'
string = 'Python Forums/Friends/35361:<a href=3D"https://python-forum.io/thread-35532.html" rel=3D"noreferrer" target=3D='

print(string)
my_list = string.split(' ')     # split the string on spaces into a list

for m in my_list:           # go through the list & print each element
    print(m)
    
matches = []
 
for match in my_list:       # find the element that has the string 'https' in it
    if "https" in match:
        matches.append(match)
 
print(matches)    

# Often there is no space or dash chars, but html encoding and other strange chars
#       so find the position of the 'https'

# Take the first matching element (raises IndexError if nothing matched)
ini_string1 = matches[0]

# Substring to find
c = "https"
# Print the initial string and the substring being searched for
print("initial_strings : ", ini_string1,
      "\ncharacter_to_find : ", c)

# str.index() returns the 0-based position, or raises ValueError if absent
try:
    res = ini_string1.index(c)
    # res + 1 reports the position counting from 1
    print("Character {} in string {} is present at {}".format(
        c, ini_string1, str(res + 1)))
except ValueError:
    print("No such character available in string {}".format(ini_string1))

Output:
Python Forums/Friends/35361:<a href=3D"https://python-forum.io/thread-35532.html" rel=3D"noreferrer" target=3D=
Python
Forums/Friends/35361:<a
href=3D"https://python-forum.io/thread-35532.html"
rel=3D"noreferrer"
target=3D=
['href=3D"https://python-forum.io/thread-35532.html"']
initial_strings :  href=3D"https://python-forum.io/thread-35532.html" 
character_to_find :  https
Character https in string href=3D"https://python-forum.io/thread-35532.html" is present at 9

So now I have position 9 as the starting character (counting from 1; the 0-based index returned by index() is 8), from which to strip out the URI. Yet this assumes the URI is complete, with no 'garbage' at the end of it. I guess the code may be nearly ready to run through the file, print each successfully found 'https', and possibly be modified further to cater for any gotchas.
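
A rough sketch of that next step might look like the following. It slices from the 0-based position of 'https' and stops at the first character that is assumed not to belong to a URI. The names extract_url, iter_urls and STOP_CHARS are made up for illustration, and the set of stop characters (whitespace, quotes, angle brackets) is only a guess based on the two sample lines, so it may need adjusting for other 'garbage'.

```python
# Assumption: these characters cannot be part of the URI and mark its end.
STOP_CHARS = set(' \t\r\n"\'<>')


def extract_url(line, marker="https"):
    """Return the text starting at the first `marker` in `line`, or None."""
    start = line.find(marker)      # 0-based index, -1 when the marker is absent
    if start == -1:
        return None
    end = start
    # Walk forward until the end of the line or the first stop character.
    while end < len(line) and line[end] not in STOP_CHARS:
        end += 1
    return line[start:end]


def iter_urls(lines):
    """Yield one URI per line that contains one; `lines` may be an open file."""
    for line in lines:
        url = extract_url(line)
        if url is not None:
            yield url


# The two sample lines from the test code above.
samples = [
    'family/Smallville, Robert & Mary/28134: Bioelectro healing - https://t.me/bioelectromagnetic_healing',
    'Python Forums/Friends/35361:<a href=3D"https://python-forum.io/thread-35532.html" rel=3D"noreferrer" target=3D=',
]

for url in iter_urls(samples):
    print(url)
```

Because iter_urls accepts any iterable of strings, an open file object could be passed in place of the samples list to run through the real file line by line.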
