Python Forum

Full Version: Removing the unwanted data from a file
I've done a grep search through all these emails and folders. The output is in a file. The data is like this

Quote:family/Smallville, Robert & Mary/28134: Bioelectro healing - https://t.me/bioelectromagnetic_healing

All I want is the URI, so that I'm left with

Quote:https://t.me/bioelectromagnetic_healing

I assume that in Python I simply search for the string "https", then remove everything from position zero up to where the search found it. That looks simple for the example data above, yet many lines in the file have data after the end of the complete URI. I guess a second search to find a blank/space?

From https://www.w3schools.com/python/ref_string_find.asp, an example:

txt = "Hello, welcome to my world."

x = txt.find("e")

print(x)
Should I use 'find' or 'search' for this? No doubt it is just an "open" of the file, then a "for" loop to check each line for the string.
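For a single literal like "https", str.find is enough; re.search is the regex counterpart. A minimal sketch of both, assuming the URI runs up to the next whitespace (the sample line is the one from the post above):

```python
import re

line = ('family/Smallville, Robert & Mary/28134: '
        'Bioelectro healing - https://t.me/bioelectromagnetic_healing')

# str.find returns the index of the first occurrence, or -1 if absent
pos = line.find('https')
if pos != -1:
    print(line[pos:].split()[0])   # slice from 'https', keep up to the next space

# re.search returns a match object (or None); handy once patterns grow
m = re.search(r'https\S+', line)   # 'https' followed by non-whitespace chars
if m:
    print(m.group())
```

Both print the same URL here; the difference only starts to matter when the delimiters vary from line to line.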
Could do something like this

string = 'family/Smallville, Robert & Mary/28134: Bioelectro healing - https://t.me/bioelectromagnetic_healing'
print(string)
new_string = string.split('-')
new_string = new_string[1].strip()
print(new_string)
Output:
family/Smallville, Robert & Mary/28134: Bioelectro healing - https://t.me/bioelectromagnetic_healing
https://t.me/bioelectromagnetic_healing
Thanks, that worked perfectly for the format of the sample data. I did run into some problems when the data was different. This is what works for the same type of sample data, plus another type. There may be a better way of addressing this, but I have tried to comment here and there.

#Removing the unwanted data from a file. First, test the sample data

#string = 'family/Smallville, Robert & Mary/28134: Bioelectro healing - https://t.me/bioelectromagnetic_healing'
string = 'Python Forums/Friends/35361:<a href=3D"https://python-forum.io/thread-35532.html" rel=3D"noreferrer" target=3D='

print(string)
my_list = string.split(' ')     #split the string into a list/array

iter_len = len(my_list)

for m in my_list:           # go through the list & print each element
    print(m)
    
matches = []
 
for match in my_list:       # find the element that has the string 'https' in it
    if "https" in match:
        matches.append(match)
 
print(matches)    

# Often there are no space or dash chars, but HTML encoding and other
#       strange chars, so find the position of the 'https'

# Initializing string
ini_string1 = matches[0]
 
# Character to find
c = "https"
# printing initial string and character
print ("initial_strings : ", ini_string1,
             "\ncharacter_to_find : ", c)
 
# Using index Method
try:
    res = ini_string1.index(c)
    print ("Character {} in string {} is present at {}".format(
                                  c, ini_string1, str(res + 1)))
except ValueError as e:
    print ("No such character available in string {}".format(ini_string1))
    
Output:
Python Forums/Friends/35361:<a href=3D"https://python-forum.io/thread-35532.html" rel=3D"noreferrer" target=3D=
Python
Forums/Friends/35361:<a
href=3D"https://python-forum.io/thread-35532.html"
rel=3D"noreferrer"
target=3D=
['href=3D"https://python-forum.io/thread-35532.html"']
initial_strings :  href=3D"https://python-forum.io/thread-35532.html"
character_to_find :  https
Character https in string href=3D"https://python-forum.io/thread-35532.html" is present at 9
so now I have position 9 as the starting character from which to strip out the URI. Yet it assumes the URI is complete, with no garbage at the end of it. I guess it is nearly ready to run through the file, print each successfully found 'https', and possibly be modified further to cater for any gotchas.
I added this line right at the bottom

new_string = ini_string1[res,len(ini_string1)]
but got

Output:
TypeError: string indices must be integers
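The comma in `ini_string1[res,len(ini_string1)]` creates a tuple index, which strings reject; slicing needs a colon. A sketch of the fix, reusing the href sample from above:

```python
ini_string1 = 'href=3D"https://python-forum.io/thread-35532.html"'

res = ini_string1.index('https')   # 0-based position of 'https'
new_string = ini_string1[res:]     # colon, not comma; the end index defaults to len()
print(new_string)                  # the trailing quote still needs trimming
```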
Another way
urls = [
    'Python Forums/Friends/35361:<a href=3D"https://python-forum.io/thread-35532.html" rel=3D"noreferrer" target=3D=',
    'family/Smallville, Robert & Mary/28134: Bioelectro healing - https://t.me/bioelectromagnetic_healing'
    ]
splitters = ['\'','"']
for url in urls:
    url = url[url.find('http'):]
    for splitter in splitters:
        if splitter in url:
            url = url.split(splitter)
            url = url[0]
    print(url)
Output:
https://python-forum.io/thread-35532.html
https://t.me/bioelectromagnetic_healing
(Nov-14-2021, 12:39 AM)jehoshua Wrote: [ -> ]I've done a grep search through all these emails and folders. The output is in a file. The data is like this

Quote:family/Smallville, Robert & Mary/28134: Bioelectro healing - https://t.me/bioelectromagnetic_healing

All I want is the URI, so that I'm left with

Quote:https://t.me/bioelectromagnetic_healing

I haven't looked at all your cases, but why even bother with Python for this? AWK is a great little language for text-processing tasks like this. It breaks lines into fields, which by default are separated by whitespace and indexed from 1. Hence:

Output:
$ echo "family/Smallville, Robert & Mary/28134: Bioelectro healing - https://t.me/bioelectromagnetic_healing" | awk '{print $8}'
https://t.me/bioelectromagnetic_healing
The point I'm really trying to make is that there may be more appropriate tools for a given task. For text-processing tasks like this, it is well worth learning some AWK, regular expressions and sed; you already mentioned grep. Since those tools are made for the job, there's less to write yourself, which means less to test and debug and less to go wrong. The Grymoire has great tutorials on these things.

I'm a programmer at work and I often have to deal with text-processing tasks like these. Sometimes I'll have lists of items that need investigation, or I have to do something to manually reprocess them in our systems, and those kinds of tools help me focus on that instead of having to write a lot.
(Nov-14-2021, 05:17 AM)menator01 Wrote: [ -> ]Another way
urls = [
    'Python Forums/Friends/35361:<a href=3D"https://python-forum.io/thread-35532.html" rel=3D"noreferrer" target=3D=',
    'family/Smallville, Robert & Mary/28134: Bioelectro healing - https://t.me/bioelectromagnetic_healing'
    ]
splitters = ['\'','"']
for url in urls:
    url = url[url.find('http'):]
    for splitter in splitters:
        if splitter in url:
            url = url.split(splitter)
            url = url[0]
    print(url)
Output:
https://python-forum.io/thread-35532.html
https://t.me/bioelectromagnetic_healing

Thanks, that worked really well on a lot of differently formatted data. Some of the results still had garbage (spaces, HTML code) appended, however your code is easier to follow than the one I came up with.
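One hypothetical way to also trim that trailing garbage is to cut at the first of several terminator characters (quotes, space, start of an HTML tag) instead of looping over split(). This is just a sketch; `extract_uri` and the terminator set are made up for illustration:

```python
terminators = '\'" <'    # quote chars, space, '<' for a following HTML tag

def extract_uri(line):
    """Return the first http(s) URI in line, cut at the first terminator."""
    start = line.find('http')
    if start == -1:
        return None
    url = line[start:]
    # keep everything up to the earliest terminator present, if any
    cut = min((url.index(c) for c in terminators if c in url), default=len(url))
    return url[:cut]

for line in (
    'Python Forums/Friends/35361:<a href=3D"https://python-forum.io/thread-35532.html" rel=3D"noreferrer"',
    'family/Smallville, Robert & Mary/28134: Bioelectro healing - https://t.me/bioelectromagnetic_healing',
):
    print(extract_uri(line))
```

Lines with no URI return None, so the caller can skip them.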
Thanks, I tried that 'echo' code with the 'awk' being piped, however I got ..

Output:
$: command not found
Synaptic told me that 'mawk' was installed, so tried it with that being piped, but got the same error message. I'm running Kubuntu 20.04.3.
Copy without the $
(Nov-14-2021, 07:47 AM)jehoshua Wrote: [ -> ]Thanks, I tried that 'echo' code with the 'awk' being piped, however I got ..
Output:
$: command not found
echo "family/Smallville, Robert & Mary/28134: Bioelectro healing - https://t.me/bioelectromagnetic_healing" | awk '{print $8}'
With regex, which is what grep and awk use under the hood:
import re

data = '''\
Python Forums/Friends/35361:<a href=3D"https://python-forum.io/thread-35532.html" rel=3D"noreferrer" target=3D='
family/Smallville, Robert & Mary/28134: Bioelectro healing - https://t.me/bioelectromagnetic_healing"
'''

pattern = re.compile(r'(http.*?)\"')
for match in pattern.finditer(data):
    print(match.group(1))
Output:
https://python-forum.io/thread-35532.html
https://t.me/bioelectromagnetic_healing
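The `.*?` in that pattern is the lazy (non-greedy) quantifier: it stops at the first `"` it can, whereas a greedy `.*` would run on to the last quote on the line. A small illustration of the difference, using a sample string assumed from the data above:

```python
import re

s = 'href=3D"https://python-forum.io/thread-35532.html" rel=3D"noreferrer"'

# greedy: .* grabs as much as possible, stopping only at the last '"'
greedy = re.search(r'(http.*)"', s).group(1)
# lazy: .*? stops at the first '"' after the match start
lazy = re.search(r'(http.*?)"', s).group(1)

print(greedy)
print(lazy)
```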
There are a lot of patterns available on the web, e.g. The Perfect URL Regular Expression.
For training, try Regex101.
import re

data = '''\
Python Forums/Friends/35361:<a href=3D"https://python-forum.io/thread-35532.html" rel=3D"noreferrer" target=3D='
family/Smallville, Robert & Mary/28134: Bioelectro healing - https://t.me/bioelectromagnetic_healing"
'''

pattern = re.compile(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+')
for match in pattern.finditer(data):
    print(match.group())
Output:
https://python-forum.io/thread-35532.html
https://t.me/bioelectromagnetic_healing
(Nov-14-2021, 11:41 AM)snippsat Wrote: [ -> ]Copy without the $

Thank you, I missed that I was adding in the $

The code you supplied worked just fine, thanks. I added some more test data to both scripts and they worked fine also. I can see potential problems with using awk, because it expects the URI to be the eighth field, yet the data does not reflect that as a constant.