Python Forum
Removing the unwanted data from a file
#1
I've done a grep search through all these emails and folders. The output is in a file. The data is like this

Quote:family/Smallville, Robert & Mary/28134: Bioelectro healing - https://t.me/bioelectromagnetic_healing

All I want is the URI, so that I'm left with

Quote:https://t.me/bioelectromagnetic_healing

I assume in Python I'd simply search for the string "https", then remove everything from position zero up to where the search found it. That looks simple for the example data above, yet many lines in the file have data after the end of the complete URI. I guess I'd search again for a blank/space?

From https://www.w3schools.com/python/ref_string_find.asp an example ..

txt = "Hello, welcome to my world."

x = txt.find("e")

print(x)
Should I use 'find' or 'search' for this? No doubt it is just an open() of the file, then a for loop to check each line for the string.
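str.find() returns the index of the first match (or -1 when absent), and slicing from that index drops the leading junk. A minimal sketch along those lines; trimming at the first whitespace is a simplifying assumption that may need refining for messier lines:

```python
# Sketch: locate "https" with str.find() and slice from there.
# Stopping at the first whitespace is an assumption; real lines
# may carry other trailing garbage (quotes, HTML, etc.).
line = "family/Smallville, Robert & Mary/28134: Bioelectro healing - https://t.me/bioelectromagnetic_healing"
pos = line.find("https")          # index of the match, or -1 if absent
if pos != -1:
    uri = line[pos:].split()[0]   # drop the prefix, stop at first space
    print(uri)                    # https://t.me/bioelectromagnetic_healing
```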
#2
Could do something like this

string = 'family/Smallville, Robert & Mary/28134: Bioelectro healing - https://t.me/bioelectromagnetic_healing'
print(string)
new_string = string.split('-')
new_string = new_string[1].strip()
print(new_string)
Output:
family/Smallville, Robert & Mary/28134: Bioelectro healing - https://t.me/bioelectromagnetic_healing
https://t.me/bioelectromagnetic_healing
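One caveat worth noting: split('-') also splits inside URIs that themselves contain hyphens (e.g. thread-35532.html). A possible sketch is to split once on ' - ' (dash with surrounding spaces), assuming the separator always appears spaced like that in the data:

```python
# split('-') would break a URI like .../thread-35532.html into pieces;
# splitting once on ' - ' (spaced dash) leaves the URI intact.
string = 'family/Smallville, Robert & Mary/28134: Bioelectro healing - https://python-forum.io/thread-35532.html'
new_string = string.split(' - ', 1)[1].strip()
print(new_string)  # https://python-forum.io/thread-35532.html
```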
I welcome all feedback.
The only dumb question, is one that doesn't get asked.
My Github
How to post code using bbtags


#3
Thanks, that worked perfectly for the format of the sample data. I did run into some problems when the data was different. This is what works for the same type of sample data, plus another type. There may be a better way of addressing this, but I have tried to comment here and there.

#Removing the unwanted data from a file. First, test the sample data

#string = 'family/Smallville, Robert & Mary/28134: Bioelectro healing - https://t.me/bioelectromagnetic_healing'
string = 'Python Forums/Friends/35361:<a href=3D"https://python-forum.io/thread-35532.html" rel=3D"noreferrer" target=3D='

print(string)
my_list = string.split(' ')     #split the string into a list/array

iter_len = len(my_list)

for m in my_list:           # go through the list & print each element
    print(m)
    
matches = []
 
for match in my_list:       # find the element that has the string 'https' in it
    if "https" in match:
        matches.append(match)
 
print(matches)    

# Often there is no space or dash chars, but html encoding and other strange chars
#       so find the position of the 'https'

# Initializing string
ini_string1 = matches[0]
 
# Character to find
c = "https"
# printing initial string and character
print ("initial_strings : ", ini_string1,
             "\ncharacter_to_find : ", c)
 
# Using index Method
try:
    res = ini_string1.index(c)
    print ("Character {} in string {} is present at {}".format(
                                  c, ini_string1, str(res + 1)))
except ValueError as e:
    print ("No such character available in string {}".format(ini_string1))
    
Output:
Python Forums/Friends/35361:<a href=3D"https://python-forum.io/thread-35532.html" rel=3D"noreferrer" target=3D=
Python
Forums/Friends/35361:<a
href=3D"https://python-forum.io/thread-35532.html"
rel=3D"noreferrer"
target=3D=
['href=3D"https://python-forum.io/thread-35532.html"']
initial_strings :  href=3D"https://python-forum.io/thread-35532.html"
character_to_find :  https
Character https in string href=3D"https://python-forum.io/thread-35532.html" is present at 9
So now I have position 9 as the starting character, from which to strip out the URI. Yet that assumes the URI is complete with no 'garbage' at the end. I guess the code is nearly ready to run through the file, print each successfully found 'https', and possibly be modified further to cater for any gotchas.
#4
I added this line right at the bottom

new_string = ini_string1[res,len(ini_string1)]
but got

Output:
TypeError: string indices must be integers
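The comma is the culprit: ini_string1[res, len(ini_string1)] indexes the string with a tuple, which is exactly what raises that TypeError. Slicing uses a colon, and the end index can be omitted to mean "to the end". A minimal fix as a sketch; note the trailing quote from this sample data survives and would still need stripping:

```python
# Slicing uses a colon; [res, len(...)] passes a tuple as the index,
# which triggers "TypeError: string indices must be integers".
ini_string1 = 'href=3D"https://python-forum.io/thread-35532.html"'
res = ini_string1.index("https")
new_string = ini_string1[res:]    # from the match to the end of the string
print(new_string)                 # https://python-forum.io/thread-35532.html"
```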
#5
Another way
urls = [
    'Python Forums/Friends/35361:<a href=3D"https://python-forum.io/thread-35532.html" rel=3D"noreferrer" target=3D=',
    'family/Smallville, Robert & Mary/28134: Bioelectro healing - https://t.me/bioelectromagnetic_healing'
    ]
splitters = ['\'','"']
for url in urls:
    url = url[url.find('http'):]
    for splitter in splitters:
        if splitter in url:
            url = url.split(splitter)
            url = url[0]
    print(url)
Output:
https://python-forum.io/thread-35532.html
https://t.me/bioelectromagnetic_healing


#6
(Nov-14-2021, 12:39 AM)jehoshua Wrote: I've done a grep search through all these emails and folders. The output is in a file. The data is like this

Quote:family/Smallville, Robert & Mary/28134: Bioelectro healing - https://t.me/bioelectromagnetic_healing

All I want is the URI, so that I'm left with

Quote:https://t.me/bioelectromagnetic_healing

I haven't looked at all your cases, but why even bother with Python for this? AWK is a great little language for text processing things like this. It breaks lines into fields, which by default are separated by spaces and indexed from 1. Hence

Output:
$ echo "family/Smallville, Robert & Mary/28134: Bioelectro healing - https://t.me/bioelectromagnetic_healing" | awk '{print $8}'
https://t.me/bioelectromagnetic_healing
The point I'm really trying to make is that there may be more appropriate tools for a given task. For text processing tasks like this, it is well worth learning some AWK, regular expressions and sed; you already mentioned grep. Since those tools are made for the job, there's less to write yourself, which means less to test and debug and less to go wrong. The Grymoire has great tutorials on these things.

I'm a programmer at work and I often have to deal with text processing tasks like these - sometimes I'll have lists of items that need investigation or I have to do something to manually reprocess them in our systems and those kinds of tools help me focus on that instead of having to write a lot.
#7
(Nov-14-2021, 05:17 AM)menator01 Wrote: Another way
urls = [
    'Python Forums/Friends/35361:<a href=3D"https://python-forum.io/thread-35532.html" rel=3D"noreferrer" target=3D=',
    'family/Smallville, Robert & Mary/28134: Bioelectro healing - https://t.me/bioelectromagnetic_healing'
    ]
splitters = ['\'','"']
for url in urls:
    url = url[url.find('http'):]
    for splitter in splitters:
        if splitter in url:
            url = url.split(splitter)
            url = url[0]
    print(url)
Output:
https://python-forum.io/thread-35532.html
https://t.me/bioelectromagnetic_healing

Thanks, that worked really well on a lot of differently formatted data. Some of the results still had garbage (spaces, HTML code) appended; however, your code is easier to follow than the one I came up with.
#8
Thanks, I tried that 'echo' command with the output piped to 'awk', however I got ..

Output:
$: command not found
Synaptic told me that 'mawk' was installed, so I tried piping to that instead, but got the same error message. I'm running Kubuntu 20.04.3.
#9
Copy without the $
(Nov-14-2021, 07:47 AM)jehoshua Wrote: Thanks, I tried that 'echo' code with the 'awk' being pipped, however I got ..
Output:
$: command not found
echo "family/Smallville, Robert & Mary/28134: Bioelectro healing - https://t.me/bioelectromagnetic_healing" | awk '{print $8}'
With regex, which is what grep and awk use under the hood:
import re

data = '''\
Python Forums/Friends/35361:<a href=3D"https://python-forum.io/thread-35532.html" rel=3D"noreferrer" target=3D='
family/Smallville, Robert & Mary/28134: Bioelectro healing - https://t.me/bioelectromagnetic_healing"
'''

pattern = re.compile(r'(http.*?)\"')
for match in pattern.finditer(data):
    print(match.group(1))
Output:
https://python-forum.io/thread-35532.html
https://t.me/bioelectromagnetic_healing
There are a lot of URL patterns available on the web, e.g. The Perfect URL Regular Expression. For training, try Regex101.
import re

data = '''\
Python Forums/Friends/35361:<a href=3D"https://python-forum.io/thread-35532.html" rel=3D"noreferrer" target=3D='
family/Smallville, Robert & Mary/28134: Bioelectro healing - https://t.me/bioelectromagnetic_healing"
'''

pattern = re.compile(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+')
for match in pattern.finditer(data):
    print(match.group())
Output:
https://python-forum.io/thread-35532.html
https://t.me/bioelectromagnetic_healing
#10
(Nov-14-2021, 11:41 AM)snippsat Wrote: Copy without the $

Thank you, I missed that I was adding in the $

The code you supplied worked just fine, thanks. I added some more test data to both scripts and they worked fine also. I can see potential problems with using awk, because it expects the URI to be the eighth field, yet the data does not reflect that consistently.

