Posts: 122
Threads: 24
Joined: Dec 2017
I've done a grep search through all these emails and folders. The output is in a file. The data is like this
Quote:family/Smallville, Robert & Mary/28134: Bioelectro healing - https://t.me/bioelectromagnetic_healing
All I want is the URI, so that I'm left with
Quote:https://t.me/bioelectromagnetic_healing
I assume in Python, I simply search for the string "https", then remove the data from position zero up to the position where the search found that string. It looks simple in the example data above, yet many lines in the file have data after the end of the complete URI. I guess I search again to find a blank/space?
From https://www.w3schools.com/python/ref_string_find.asp an example ..
txt = "Hello, welcome to my world."
x = txt.find("e")
print(x)

Should I use 'find' or 'search' for this? No doubt it is just an "open" of the file, then a "for" loop to check for the string.
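(Editor's note: `str.find` is a plain substring search that returns the first index, or -1 if the substring is absent; `re.search` is the regex equivalent. A minimal sketch of the find-then-slice idea described above, using the sample line from this post:)

```python
line = 'family/Smallville, Robert & Mary/28134: Bioelectro healing - https://t.me/bioelectromagnetic_healing'

# find() returns the index of the first match, or -1 if absent
start = line.find('https')
if start != -1:
    # slice from the match to the end of the line
    url = line[start:]
    # if anything follows the URI, cut at the first space
    end = url.find(' ')
    if end != -1:
        url = url[:end]
    print(url)  # https://t.me/bioelectromagnetic_healing
```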
Posts: 1,144
Threads: 114
Joined: Sep 2019
Could do something like this
string = 'family/Smallville, Robert & Mary/28134: Bioelectro healing - https://t.me/bioelectromagnetic_healing'
print(string)
new_string = string.split('-')
new_string = new_string[1].strip()
print(new_string)

Output:
family/Smallville, Robert & Mary/28134: Bioelectro healing - https://t.me/bioelectromagnetic_healing
https://t.me/bioelectromagnetic_healing
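(Editor's note: one caveat with a bare `split('-')` is that a hyphen inside the URI itself, e.g. `python-forum.io`, would also be split. A sketch of a safer variant, splitting on `' - '` with `maxsplit=1` so hyphens inside the URI survive:)

```python
sample = 'family/Smallville, Robert & Mary/28134: Bioelectro healing - https://t.me/bioelectromagnetic_healing'

# splitting on ' - ' (space-dash-space) with maxsplit=1 keeps any
# hyphens inside the URI intact
parts = sample.split(' - ', 1)
if len(parts) > 1:
    url = parts[1].strip()
    print(url)  # https://t.me/bioelectromagnetic_healing
```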
Posts: 122
Threads: 24
Joined: Dec 2017
Thanks, that worked perfectly for the format of the sample data. I did run into some problems when the data was different. This is what works for that type of sample data, plus another type. There may be a better way of addressing this, but I have tried to comment here and there.
#Removing the unwanted data from a file. First, test the sample data
#string = 'family/Smallville, Robert & Mary/28134: Bioelectro healing - https://t.me/bioelectromagnetic_healing'
string = 'Python Forums/Friends/35361:<a href=3D"https://python-forum.io/thread-35532.html" rel=3D"noreferrer" target=3D='
print(string)
my_list = string.split(' ') #split the string into a list/array
iter_len = len(my_list)
for m in my_list: # go through the list & print each element
    print(m)
matches = []
for match in my_list: # find the element that has the string 'https' in it
    if "https" in match:
        matches.append(match)
print(matches)
# Often there is no space or dash chars, but html encoding and other strange chars
# so find the position of the 'https'
# Initializing string
ini_string1 = matches[0]
# Character to find
c = "https"
# printing initial string and character
print("initial_strings : ", ini_string1,
      "\ncharacter_to_find : ", c)
# Using index Method
try:
    res = ini_string1.index(c)
    print("Character {} in string {} is present at {}".format(
        c, ini_string1, str(res + 1)))
except ValueError as e:
    print("No such character available in string {}".format(ini_string1))
Output:
Python Forums/Friends/35361:<a href=3D"https://python-forum.io/thread-35532.html" rel=3D"noreferrer" target=3D=
Python
Forums/Friends/35361:<a
href=3D"https://python-forum.io/thread-35532.html"
rel=3D"noreferrer"
target=3D=
['href=3D"https://python-forum.io/thread-35532.html"']
initial_strings : href=3D"https://python-forum.io/thread-35532.html"
character_to_find : https
Character https in string href=3D"https://python-forum.io/thread-35532.html" is present at 9
So now I have position 9 as the starting character from which to strip out the URI. Yet that assumes the URI is complete, with no 'garbage' at the end of it. I guess the code is nearly ready to run through the file, print each successfully found 'https', and possibly be modified further to cater for any gotchas.
Posts: 122
Threads: 24
Joined: Dec 2017
I added this line right at the bottom
new_string = ini_string1[res,len(ini_string1)]

but got

Output:
TypeError: string indices must be integers
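(Editor's note: the TypeError comes from the comma inside the brackets; Python slice syntax uses a colon, `[start:stop]`. A corrected version of that line, using the values from the post above:)

```python
ini_string1 = 'href=3D"https://python-forum.io/thread-35532.html"'
res = ini_string1.index('https')

# a slice is [start:stop]; a comma would be interpreted as a tuple index
new_string = ini_string1[res:len(ini_string1)]
# equivalently, and more idiomatic -- omitting stop means "to the end"
new_string = ini_string1[res:]
print(new_string)
```

Note the trailing `"` from the source data is still attached; that is the 'garbage' the later posts deal with.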
Posts: 1,144
Threads: 114
Joined: Sep 2019
Another way
urls = [
'Python Forums/Friends/35361:<a href=3D"https://python-forum.io/thread-35532.html" rel=3D"noreferrer" target=3D=',
'family/Smallville, Robert & Mary/28134: Bioelectro healing - https://t.me/bioelectromagnetic_healing'
]
splitters = ['\'','"']
for url in urls:
    url = url[url.find('http'):]
    for splitter in splitters:
        if splitter in url:
            url = url.split(splitter)
            url = url[0]
    print(url)

Output:
https://python-forum.io/thread-35532.html
https://t.me/bioelectromagnetic_healing
Posts: 1,838
Threads: 2
Joined: Apr 2017
(Nov-14-2021, 12:39 AM)jehoshua Wrote: I've done a grep search through all these emails and folders. The output is in a file. The data is like this
Quote:family/Smallville, Robert & Mary/28134: Bioelectro healing - https://t.me/bioelectromagnetic_healing
All I want is the URI, so that I'm left with
Quote:https://t.me/bioelectromagnetic_healing
I haven't looked at all your cases, but why even bother with Python for this? AWK is a great little language for text processing things like this. It breaks lines into fields, which by default are separated by spaces and indexed from 1. Hence
Output: $ echo "family/Smallville, Robert & Mary/28134: Bioelectro healing - https://t.me/bioelectromagnetic_healing" | awk '{print $8}'
https://t.me/bioelectromagnetic_healing
The point I'm really trying to make is that there may be more appropriate tools for a given task. For text processing tasks like this, it is well worth learning some AWK, regular expressions and sed, and you already mentioned grep. Since those tools are made for the job, there's less to write yourself, which means less to test and debug and less to go wrong. The Grymoire has great tutorials on these things.
I'm a programmer at work and I often have to deal with text processing tasks like these - sometimes I'll have lists of items that need investigation or I have to do something to manually reprocess them in our systems and those kinds of tools help me focus on that instead of having to write a lot.
Posts: 122
Threads: 24
Joined: Dec 2017
(Nov-14-2021, 05:17 AM)menator01 Wrote: Another way
urls = [
'Python Forums/Friends/35361:<a href=3D"https://python-forum.io/thread-35532.html" rel=3D"noreferrer" target=3D=',
'family/Smallville, Robert & Mary/28134: Bioelectro healing - https://t.me/bioelectromagnetic_healing'
]
splitters = ['\'','"']
for url in urls:
    url = url[url.find('http'):]
    for splitter in splitters:
        if splitter in url:
            url = url.split(splitter)
            url = url[0]
    print(url)

Output:
https://python-forum.io/thread-35532.html
https://t.me/bioelectromagnetic_healing
Thanks, that worked really well on a lot of differently formatted data. Some of the results still had garbage (spaces, html code) appended; however, your code is easier to follow than the one I came up with.
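(Editor's note: one way to handle that leftover garbage is to extend the splitter list with the terminator characters seen in the data. The extra entries here, space and `<`, are assumptions based on the samples in this thread:)

```python
lines = [
    'Python Forums/Friends/35361:<a href=3D"https://python-forum.io/thread-35532.html" rel=3D"noreferrer" target=3D=',
    'family/Smallville, Robert & Mary/28134: Bioelectro healing - https://t.me/bioelectromagnetic_healing trailing text',
]

# cut at the first occurrence of any of these characters after the URI
# starts; space and '<' are added to drop trailing text and html tags
splitters = ['\'', '"', ' ', '<']

results = []
for line in lines:
    url = line[line.find('http'):]
    for splitter in splitters:
        if splitter in url:
            url = url.split(splitter)[0]
    results.append(url)
    print(url)
```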
Posts: 122
Threads: 24
Joined: Dec 2017
Thanks, I tried that 'echo' code with the 'awk' being piped, however I got ..
Output: $: command not found
Synaptic told me that 'mawk' was installed, so I tried it with that being piped, but got the same error message. I'm running Kubuntu 20.04.3.
Posts: 7,312
Threads: 123
Joined: Sep 2016
Nov-14-2021, 11:41 AM
(This post was last modified: Nov-14-2021, 11:43 AM by snippsat.)
Copy without the $
(Nov-14-2021, 07:47 AM)jehoshua Wrote: Thanks, I tried that 'echo' code with the 'awk' being piped, however I got ..
Output: $: command not found
echo "family/Smallville, Robert & Mary/28134: Bioelectro healing - https://t.me/bioelectromagnetic_healing" | awk '{print $8}'

Regex is what grep and awk use under the hood; the same can be done in Python:
import re
data = '''\
Python Forums/Friends/35361:<a href=3D"https://python-forum.io/thread-35532.html" rel=3D"noreferrer" target=3D='
family/Smallville, Robert & Mary/28134: Bioelectro healing - https://t.me/bioelectromagnetic_healing"
'''
pattern = re.compile(r'(http.*?)\"')
for match in pattern.finditer(data):
    print(match.group(1))

Output:
https://python-forum.io/thread-35532.html
https://t.me/bioelectromagnetic_healing
There are a lot of patterns available on the web, e.g. The Perfect URL Regular Expression.
For training, try Regex101.
import re
data = '''\
Python Forums/Friends/35361:<a href=3D"https://python-forum.io/thread-35532.html" rel=3D"noreferrer" target=3D='
family/Smallville, Robert & Mary/28134: Bioelectro healing - https://t.me/bioelectromagnetic_healing"
'''
pattern = re.compile(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+')
for match in pattern.finditer(data):
    print(match.group())

Output:
https://python-forum.io/thread-35532.html
https://t.me/bioelectromagnetic_healing
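(Editor's note: to run the regex approach over the whole grep output, something like the sketch below works. It is demonstrated on an inline sample here so it is self-contained; the commented-out `open()` call and the filename `grep_output.txt` are assumptions to be replaced with the real file:)

```python
import re

# the character class excludes whitespace, quotes and angle brackets,
# which are the URI terminators seen in this thread's sample data
pattern = re.compile(r'https?://[^\s\'"<>]+')

data = '''\
Python Forums/Friends/35361:<a href=3D"https://python-forum.io/thread-35532.html" rel=3D"noreferrer"
family/Smallville, Robert & Mary/28134: Bioelectro healing - https://t.me/bioelectromagnetic_healing
'''

urls = pattern.findall(data)
for url in urls:
    print(url)

# for the real file, replace the inline data with something like:
# with open('grep_output.txt') as f:
#     urls = pattern.findall(f.read())
```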
Posts: 122
Threads: 24
Joined: Dec 2017
(Nov-14-2021, 11:41 AM)snippsat Wrote: Copy without the $
Thank you, I missed that I was adding in the $
The code you supplied worked just fine, thanks. I added some more test data to both scripts and they worked fine also. I can see potential problems with using awk, because it expects the URI to be the eighth field, yet the data does not keep that position constant.