Removing the unwanted data from a file
Python Forum (https://python-forum.io) > Forum: Python Coding (https://python-forum.io/forum-7.html) > Forum: General Coding Help (https://python-forum.io/forum-8.html) > Thread: Removing the unwanted data from a file (/thread-35532.html)
Removing the unwanted data from a file - jehoshua - Nov-14-2021

I've done a grep search through all these emails and folders. The output is in a file. The data looks like this:

Quote:
family/Smallville, Robert & Mary/28134: Bioelectro healing - https://t.me/bioelectromagnetic_healing

All I want is the URI, so that I'm left with:

Quote:
https://t.me/bioelectromagnetic_healing

I assume that in Python I simply search for the string "https", then remove the data from position zero up to the position where the search found that string. That looks simple for the example data above, yet many lines in the file have data after the end of the complete URI. I guess I search again to find a blank/space?

From https://www.w3schools.com/python/ref_string_find.asp, an example:

    txt = "Hello, welcome to my world."
    x = txt.find("e")
    print(x)

Should I use 'find' or 'search' for this? No doubt it is just an "open" of the file, then a "for" loop to check for the string.

RE: Removing the unwanted data from a file - menator01 - Nov-14-2021

Could do something like this:

    string = 'family/Smallville, Robert & Mary/28134: Bioelectro healing - https://t.me/bioelectromagnetic_healing'
    print(string)
    new_string = string.split('-')
    new_string = new_string[1].strip()
    print(new_string)
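The `find`-based idea from the question can be sketched roughly as below. `str.find` returns the index of the first match, or -1 when there is none, so the line can be sliced from that index and then cut at the next whitespace. The helper name `extract_url` and the choice to stop at whitespace are assumptions for illustration, not something taken from the thread; quotes or HTML stuck directly to the URI would still come through. Unlike `split('-')`, this does not cut inside hosts that contain a hyphen, such as python-forum.io.

```python
def extract_url(line, marker="http"):
    """Return the text from the first `marker` up to the next whitespace.

    Hypothetical helper: returns None when the marker is not in the line.
    """
    start = line.find(marker)       # -1 when the marker is absent
    if start == -1:
        return None
    rest = line[start:]             # drop everything before the marker
    return rest.split(None, 1)[0]   # keep only the text before the next whitespace


sample = ('family/Smallville, Robert & Mary/28134: Bioelectro healing - '
          'https://t.me/bioelectromagnetic_healing')
print(extract_url(sample))
print(extract_url("a line with a link https://python-forum.io/forum-8.html and more text"))
print(extract_url("a line with no link"))
```

The same function can be called inside a `for line in file_object:` loop once the samples behave as expected.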
RE: Removing the unwanted data from a file - jehoshua - Nov-14-2021

Thanks, that worked perfectly for the format of the sample data. I did run into some problems when the data was different. This is what works for the same type of sample data, plus another type. There may be a better way of addressing this, but I have tried to comment here and there.

    # Removing the unwanted data from a file. First, test the sample data
    #string = 'family/Smallville, Robert & Mary/28134: Bioelectro healing - https://t.me/bioelectromagnetic_healing'
    string = 'Python Forums/Friends/35361:<a href=3D"https://python-forum.io/thread-35532.html" rel=3D"noreferrer" target=3D='
    print(string)

    my_list = string.split(' ')  # split the string into a list/array
    iter_len = len(my_list)

    for m in my_list:  # go through the list & print each element
        print(m)

    matches = []
    for match in my_list:  # find the element that has the string 'https' in it
        if "https" in match:
            matches.append(match)
    print(matches)

    # Often there is no space or dash chars, but html encoding and other strange chars
    # so find the position of the 'https'

    # Initializing string
    ini_string1 = matches[0]

    # Character to find
    c = "https"

    # printing initial string and character
    print("initial_strings : ", ini_string1, "\ncharacter_to_find : ", c)

    # Using index Method
    try:
        res = ini_string1.index(c)
        print("Character {} in string {} is present at {}".format(c, ini_string1, str(res + 1)))
    except ValueError as e:
        print("No such character available in string {}".format(ini_string1))

So now I have position 9 as the starting character, to then strip out the URI. Yet it assumes the URI is complete, with no 'garbage' at the end of it. I guess the code is nearly ready to run through the file and print each successfully found 'https'; I may modify it further to cater for any gotchas.
RE: Removing the unwanted data from a file - jehoshua - Nov-14-2021

I added this line right at the bottom:

    new_string = ini_string1[res,len(ini_string1)]

but got
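The output of that line is not shown here. For reference, `ini_string1[res, len(ini_string1)]` passes a tuple as the index, which Python strings reject with a `TypeError`; a slice is written with a colon instead. A minimal sketch, reusing the variable names from the post and assuming `ini_string1` holds the list element that contained 'https':

```python
# Assumed value: the list element that matched 'https' in the earlier script.
ini_string1 = 'href=3D"https://python-forum.io/thread-35532.html"'

res = ini_string1.index("https")   # 0-based index of the first 'https'

# Slice from res to the end; same as ini_string1[res:len(ini_string1)].
new_string = ini_string1[res:]
print(new_string)

# Indexing with a comma builds a tuple index, which str does not accept.
try:
    ini_string1[res, len(ini_string1)]
except TypeError as exc:
    print("TypeError:", exc)
```

The slice still carries the trailing double quote, which is the 'garbage at the end' problem raised earlier in the thread.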
RE: Removing the unwanted data from a file - menator01 - Nov-14-2021

Another way:

    urls = [
        'Python Forums/Friends/35361:<a href=3D"https://python-forum.io/thread-35532.html" rel=3D"noreferrer" target=3D=',
        'family/Smallville, Robert & Mary/28134: Bioelectro healing - https://t.me/bioelectromagnetic_healing'
    ]

    splitters = ['\'', '"']

    for url in urls:
        url = url[url.find('http'):]
        for splitter in splitters:
            if splitter in url:
                url = url.split(splitter)
                url = url[0]
        print(url)
RE: Removing the unwanted data from a file - ndc85430 - Nov-14-2021

(Nov-14-2021, 12:39 AM)jehoshua Wrote: I've done a grep search through all these emails and folders. The output is in a file. The data is like this

I haven't looked at all your cases, but why even bother with Python for this? AWK is a great little language for text processing tasks like this. It breaks lines into fields, which by default are separated by spaces and indexed from 1. Hence

The point I'm really trying to make is that there may be more appropriate tools for a given task. For text processing tasks like this, it is well worth learning some AWK, regular expressions and sed, and you already mentioned grep. Since those tools are made for the job, there's less to write yourself, which means less to test and debug and less to go wrong. The Grymoire has great tutorials on these things.

I'm a programmer at work and I often have to deal with text processing tasks like these. Sometimes I'll have lists of items that need investigation, or I have to manually reprocess them in our systems, and those kinds of tools help me focus on that instead of having to write a lot.

RE: Removing the unwanted data from a file - jehoshua - Nov-14-2021

(Nov-14-2021, 05:17 AM)menator01 Wrote: Another way

Thanks, that worked really well on a lot of differently formatted data. Some of the results still had garbage (spaces, HTML code) appended, however your code is easier to follow than the one I came up with.

RE: Removing the unwanted data from a file - jehoshua - Nov-14-2021

Thanks, I tried that 'echo' code with the 'awk' being piped, however I got ..

Synaptic told me that 'mawk' was installed, so I tried it with that being piped, but got the same error message. I'm running Kubuntu 20.04.3.
RE: Removing the unwanted data from a file - snippsat - Nov-14-2021

Copy without the $

(Nov-14-2021, 07:47 AM)jehoshua Wrote: Thanks, I tried that 'echo' code with the 'awk' being pipped, however I got ..

    echo "family/Smallville, Robert & Mary/28134: Bioelectro healing - https://t.me/bioelectromagnetic_healing" | awk '{print $8}'

With regex, which is what grep and awk use under the hood:

    import re

    data = '''\
    Python Forums/Friends/35361:<a href=3D"https://python-forum.io/thread-35532.html" rel=3D"noreferrer" target=3D='
    family/Smallville, Robert & Mary/28134: Bioelectro healing - https://t.me/bioelectromagnetic_healing"
    '''

    pattern = re.compile(r'(http.*?)\"')
    for match in pattern.finditer(data):
        print(match.group(1))

There are a lot of patterns available on the web, for example The Perfect URL Regular Expression. For training, see Regex101.

    import re

    data = '''\
    Python Forums/Friends/35361:<a href=3D"https://python-forum.io/thread-35532.html" rel=3D"noreferrer" target=3D='
    family/Smallville, Robert & Mary/28134: Bioelectro healing - https://t.me/bioelectromagnetic_healing"
    '''

    pattern = re.compile(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+')
    for match in pattern.finditer(data):
        print(match.group())
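The first pattern above only matches when a double quote follows the URL, which is why the second sample line carries a trailing `"`. A rough sketch of one alternative is a negated character class that stops at whitespace, quotes or angle brackets, applied line by line so it can be fed an open file. The pattern, the name `unique_urls` and the order-preserving de-duplication are assumptions for illustration, not part of the replies in this thread.

```python
import re

# Assumed pattern: a URL runs until whitespace, a quote, or an angle bracket.
URL_RE = re.compile(r'https?://[^\s"\'<>]+')


def unique_urls(lines):
    """Return the URLs found in an iterable of lines, in first-seen order."""
    found = {}                      # dict keys keep insertion order
    for line in lines:
        for match in URL_RE.finditer(line):
            found.setdefault(match.group(0), None)
    return list(found)


sample_lines = [
    'Python Forums/Friends/35361:<a href=3D"https://python-forum.io/thread-35532.html" rel=3D"noreferrer" target=3D=',
    'family/Smallville, Robert & Mary/28134: Bioelectro healing - https://t.me/bioelectromagnetic_healing',
    'family/Smallville, Robert & Mary/28134: Bioelectro healing - https://t.me/bioelectromagnetic_healing',
]

for url in unique_urls(sample_lines):
    print(url)
```

Because a file object is itself an iterable of lines, `unique_urls(fh)` inside a `with open(...) as fh:` block would process the grep output directly; the file name is left to the reader.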
RE: Removing the unwanted data from a file - jehoshua - Nov-14-2021

(Nov-14-2021, 11:41 AM)snippsat Wrote: Copy without the $

Thank you, I missed that I was adding in the $. The code you supplied worked just fine, thanks. I added some more test data to both scripts and they worked fine also. I can see potential problems with using awk because it expects the URI to be the eighth field, yet the data does not reflect that as a constant.
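The concern about field position can be illustrated in Python, where `str.split()` with no arguments splits on runs of whitespace, much like awk's default field splitting. The helpers `field` and `url_field_number` are hypothetical names used only for this sketch; field numbers are 1-based to mirror awk's `$8`.

```python
def field(line, n):
    """Return the n-th whitespace-separated field (1-based), or None if absent."""
    parts = line.split()
    return parts[n - 1] if 1 <= n <= len(parts) else None


def url_field_number(line, marker="http"):
    """Return the 1-based number of the first field containing `marker`, or None."""
    for number, part in enumerate(line.split(), start=1):
        if marker in part:
            return number
    return None


line_a = ('family/Smallville, Robert & Mary/28134: Bioelectro healing - '
          'https://t.me/bioelectromagnetic_healing')
line_b = ('Python Forums/Friends/35361:<a href=3D"https://python-forum.io/thread-35532.html" '
          'rel=3D"noreferrer" target=3D=')

for line in (line_a, line_b):
    print(url_field_number(line), field(line, 8))
```

For the two sample lines the URI sits in different fields, so a fixed field number only suits input whose layout is known to be constant.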