Find and delete above a certain line in text file
So I have a text file with the following line structure (in some cases the file has thousands of rows):

3/8/22, 4:06 PM, String Value
3/8/22, 4:12 PM, String Value
3/8/22, 4:12 PM, String Value
3/8/22, 4:13 PM, String Value
3/8/22, 4:14 PM, String Value
3/8/22, 4:14 PM, String Value
3/8/22, 4:15 PM, String Value
3/8/22, 4:15 PM, String Value
3/8/22, 4:15 PM, String Value
3/8/22, 4:15 PM, String Value
3/8/22, 4:17 PM, String Value
3/8/22, 4:17 PM, String Value

I would like to be able to run a script against this file to delete rows above a given date/time.

So in the above example, say my date/time was "3/8/22, 4:15 PM"; in that case, I would need to delete the following rows:

DELETE THESE
3/8/22, 4:06 PM, String Value
3/8/22, 4:12 PM, String Value
3/8/22, 4:12 PM, String Value
3/8/22, 4:13 PM, String Value
3/8/22, 4:14 PM, String Value
3/8/22, 4:14 PM, String Value

UPDATED FILE TO ONLY CONTAIN THESE
3/8/22, 4:15 PM, String Value
3/8/22, 4:15 PM, String Value
3/8/22, 4:15 PM, String Value
3/8/22, 4:15 PM, String Value
3/8/22, 4:17 PM, String Value
3/8/22, 4:17 PM, String Value

I have found a few examples of using "seek" and some examples using pointers to get the position, but those examples were very simple searches for a specific word/string, and I wasn't able to apply them to my data to accomplish the same thing.

In my case I need to match the date and time to find that row, then delete everything above that row. And since duplicate date/times are possible (as in the example above), I need to make sure it deletes everything above the first row found with that date/time combo.
There is no deleting. You create a new file that only contains the items you want to keep. The new file may have the same name as the old and be in the same folder.
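For example, here is a minimal sketch of that rewrite pattern (the file name "Dates.txt" and the keep_line() test are placeholders; the date/time comparison itself is shown in the example further down):

# Minimal sketch: read everything, then rewrite the file with only the lines to keep.
# "Dates.txt" and keep_line() are placeholders for your real file and your real test.
def keep_line(line):
    return True  # replace with the date/time comparison shown below

with open("Dates.txt") as src:
    kept = [line for line in src if keep_line(line)]

with open("Dates.txt", "w") as dst:  # may have the same name as the original file
    dst.writelines(kept)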

Your date and time strings are not formatted well for comparisons. 9/9/21 is older than any of your examples, but a string comparison would say it is the newest, because strings are compared character by character. For easy comparison your date should be YYYY/MM/DD and the time should be 24 hour time.
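For example:

# String comparison goes character by character, not chronologically.
print("9/9/21" > "3/8/22")          # True, even though 9/9/21 is months earlier
print("2021/09/09" > "2022/03/08")  # False, YYYY/MM/DD strings compare correctly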

To compare times and dates you'll need to convert the strings to something that can be compared. I would convert them to datetime objects. This means you will have to read each line, get the date and time string, convert the string to a datetime object, and compare the datetime object to the datetime object for your cutoff time.
from datetime import datetime

dates = [
    "3/8/22, 4:06 PM, String Value",
    "3/8/22, 4:12 PM, String Value",
    "3/8/22, 4:12 PM, String Value",
    "3/8/22, 4:13 PM, String Value",
    "3/8/22, 4:14 PM, String Value",
    "3/8/22, 4:14 PM, String Value",
    "3/8/22, 4:15 PM, String Value",
    "3/8/22, 4:15 PM, String Value",
    "I am the walrus, koo koo kachoo"
    "3/8/22, 4:15 PM, String Value",
    "3/8/22, 4:15 PM, String Value",
    "3/8/22, 4:17 PM, String Value",
    "3/8/22, 4:17 PM, String Value",
]

def get_time(line):
    try:
        date, time, *extra = line.split(", ")
        return datetime.strptime(f"{date} {time}", "%m/%d/%y %I:%M %p")
    except ValueError:
        return None  # Return None if the line does not start with a date and time

cutoff = get_time("3/8/22, 4:14 PM, String Value")

for line in dates:
    dt = get_time(line)
    if dt and dt >= cutoff:
        print(line)
Output:
3/8/22, 4:14 PM, String Value
3/8/22, 4:14 PM, String Value
3/8/22, 4:15 PM, String Value
3/8/22, 4:15 PM, String Value
3/8/22, 4:15 PM, String Value
3/8/22, 4:15 PM, String Value
3/8/22, 4:17 PM, String Value
3/8/22, 4:17 PM, String Value
You'll need to decide what to do with lines that don't start with date, time. In this example I ignore them.

This example assumes that lines start with date followed by time and the time and date are always formatted the same way. If the date/time is not always at the start of the line you could do a search using a regular expression to find text that matches the pattern used for date and time.
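For example, a minimal sketch of that regular expression approach (the pattern below assumes M/D/YY dates and 12-hour times like your sample; adjust it if the real export differs):

import re
from datetime import datetime

# Assumed pattern: an M/D/YY date, a comma, then a 12-hour time, e.g. "3/8/22, 4:15 PM".
STAMP = re.compile(r"(\d{1,2}/\d{1,2}/\d{2}),\s*(\d{1,2}:\d{2}\s*[AP]M)")

def find_time(line):
    match = STAMP.search(line)  # search anywhere in the line, not just the start
    if match:
        date, time = match.groups()
        return datetime.strptime(f"{date} {time}", "%m/%d/%y %I:%M %p")
    return None  # no date/time found in this line

print(find_time("some prefix 3/8/22, 4:15 PM, String Value"))  # 2022-03-08 16:15:00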
I will take a look at that. I have no control over the source file; it is exported out of another system and provided to me. Currently I have to do a data import into Excel to update a master file, but the source file contains more records than we need, so that's the reason for the removal of rows.

Since it is comma delimited, would converting the date/time be easier or harder because of the row structure?
If the rows always have the same structure, there are tools to make this easier. Pandas will read the data in column form and do the date and time conversions for you. Then you can use pandas filters to remove items. Essentially it will do what I did in my example, but faster and easier.
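A minimal sketch of that pandas approach, assuming the file is named Dates.txt, always has the three columns shown above, and uses M/D/YY dates (the cutoff value is just an example):

import pandas as pd

# Assumptions: file is "Dates.txt", three comma-separated columns, M/D/YY dates.
df = pd.read_csv("Dates.txt", header=None, names=["Date", "Time", "Comment"],
                 skipinitialspace=True)
df["DateTime"] = pd.to_datetime(df["Date"] + " " + df["Time"])  # combine and convert

cutoff = pd.Timestamp("2022-03-08 16:15:00")  # example cutoff: 3/8/22 4:15 PM
df = df[df["DateTime"] >= cutoff]             # keep rows at or after the cutoff

# Write the kept rows back out (note: to_csv writes "," with no space after it).
df[["Date", "Time", "Comment"]].to_csv("Dates.txt", header=False, index=False)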
I have used pandas before, but only once or twice for different projects. I didn't even consider that, but I will take a look at your example above and at pandas as an alternative solution.
So I played around with both your suggestion and pandas and ended up with the pandas version; a lot fewer lines, and it achieved the same results.

Here is a simple working example with my data file

import pandas as pd
df = pd.read_csv("Dates.txt", usecols=range(3), header=None, names=["Date", "Time", "Comment"])
df['Date'] = pd.to_datetime(df['Date'])
mask = (df['Date'] > '2022-03-08')
df = df.loc[mask]
print(df)

So I got this working great based on just the date, but I'm finding that I need to also filter based on the time as well.

So I was trying this, but my DataFrame comes back empty:

import pandas as pd

df = pd.read_csv("Dates.txt", usecols=range(3), header=None, names=["Date", "Time", "Comment"])
df['Date'] = pd.to_datetime(df['Date'])
mask = (df['Date'] >= '2022-03-08') & (df['Time'] == '4:14 PM')
df = df.loc[mask]
print(df)
Since the time portion is in its own column in the raw data, how can I combine the 2 conditions to find that specific row?

This is what I get in the console:

Empty DataFrame
Columns: [Date, Time, Comment]
Index: []

So in the sample data below, I want to make sure that I find all records greater than 3/8/22 4:14 PM.

DELETE THESE
3/8/22, 4:06 PM, String Value
3/8/22, 4:12 PM, String Value
3/8/22, 4:12 PM, String Value
3/8/22, 4:13 PM, String Value
3/8/22, 4:14 PM, String Value
3/8/22, 4:14 PM, String Value

UPDATED FILE TO ONLY CONTAIN THESE
3/8/22, 4:15 PM, String Value
3/8/22, 4:15 PM, String Value
3/8/22, 4:15 PM, String Value
3/8/22, 4:15 PM, String Value
3/8/22, 4:17 PM, String Value
3/8/22, 4:17 PM, String Value
You want to filter using datetime, not date or time. Here I create a new Date Time column from Date and Time. Then I use the Date Time column to do my filtering.
import pandas as pd

df = pd.DataFrame([
    ["3/6/22", "4:06 PM", "String Value"],
    ["3/7/22", "4:12 PM", "String Value"],
    ["3/8/22", "3:12 PM", "String Value"],
    ["3/8/22", "4:13 PM", "String Value"],
    ["3/9/22", "2:14 PM", "String Value"],
    ["3/9/22", "4:14 PM", "String Value"],
    ["3/10/22", "1:15 PM", "String Value"],
    ["3/10/22", "2:15 PM", "String Value"],
    ["3/10/22", "3:15 PM", "String Value"],
    ["3/10/22", "4:15 PM", "String Value"],
    ["3/12/22", "4:17 AM", "String Value"],
    ["3/13/22", "4:17 PM", "String Value"]])
df.columns=["Date", "Time", "String"]

df["Date Time"] = pd.to_datetime(df['Date'] + ' ' + df['Time'], errors='ignore')
cutoff = df["Date Time"][7]  # Grab a datetime object for filtering
print("Cutoff =", cutoff)
df = df.loc[df["Date Time"] >= cutoff]
print(df)
Output:
Cutoff = 2022-03-10 14:15:00
       Date     Time        String           Date Time
7   3/10/22  2:15 PM  String Value 2022-03-10 14:15:00
8   3/10/22  3:15 PM  String Value 2022-03-10 15:15:00
9   3/10/22  4:15 PM  String Value 2022-03-10 16:15:00
10  3/12/22  4:17 AM  String Value 2022-03-12 04:17:00
11  3/13/22  4:17 PM  String Value 2022-03-13 16:17:00
You would remove the "Date Time" column before printing.
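For example (a short sketch, continuing from the DataFrame above):

df = df.drop(columns=["Date Time"])  # drop the helper column before printing or writing
print(df)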
deanhystad, Pandas has its own date and time support, so using datetime from the standard library is not needed.
I agree that date and time could be joined together.
Here's an example.
import pandas as pd
from io import StringIO

data = """\
3/8/22, 4:06 PM, String Value
3/8/22, 4:12 PM, String Value
3/8/22, 4:12 PM, String Value
3/8/22, 4:13 PM, String Value
3/8/22, 4:14 PM, String Value
3/8/22, 4:14 PM, String Value
3/8/22, 4:15 PM, String Value
3/8/22, 4:15 PM, String Value
3/8/22, 4:15 PM, String Value
3/8/22, 4:15 PM, String Value
3/8/22, 4:17 PM, String Value
3/8/22, 4:17 PM, String Value"""

df = pd.read_csv(StringIO(data), sep=",", names=["Date", "Time", "Comment"])
df["DateTime"] = df["Date"] + df["Time"]
#df = df.drop(columns=['Date', 'Time']) # Just keep DateTime
df['DateTime'] = pd.to_datetime(df['DateTime']) 
So I always check dtypes to see that DateTime is now datetime64, Pandas' own datetime type.
>>> df.dtypes
Date                object
Time                object
Comment             object
DateTime    datetime64[ns]
dtype: object
Putting it together, using a Boolean mask to filter:
import pandas as pd
from io import StringIO

data = """\
3/8/22, 4:06 PM, String Value
3/8/22, 4:12 PM, String Value
3/8/22, 4:12 PM, String Value
3/8/22, 4:13 PM, String Value
3/8/22, 4:14 PM, String Value
3/8/22, 4:14 PM, String Value
3/8/22, 4:15 PM, String Value
3/8/22, 4:15 PM, String Value
3/8/22, 4:15 PM, String Value
3/8/22, 4:15 PM, String Value
3/8/22, 4:17 PM, String Value
3/8/22, 4:17 PM, String Value"""

df = pd.read_csv(StringIO(data), sep=",", names=["Date", "Time", "Comment"])
df["DateTime"] = df["Date"] + df["Time"]
#df = df.drop(columns=['Date', 'Time']) # Just keep DateTime
df['DateTime'] = pd.to_datetime(df['DateTime'])
mask = (df['DateTime'] > '2022-03-08 16:06:00') & (df['DateTime'] >= '2022-03-08 16:15:00')
df_new = df.loc[mask]
print(df_new)
Output:
      Date     Time       Comment            DateTime
6   3/8/22  4:15 PM  String Value 2022-03-08 16:15:00
7   3/8/22  4:15 PM  String Value 2022-03-08 16:15:00
8   3/8/22  4:15 PM  String Value 2022-03-08 16:15:00
9   3/8/22  4:15 PM  String Value 2022-03-08 16:15:00
10  3/8/22  4:17 PM  String Value 2022-03-08 16:17:00
11  3/8/22  4:17 PM  String Value 2022-03-08 16:17:00
Uncomment the df.drop(columns=['Date', 'Time']) line (or instead drop the DateTime column) if you want the original format:
Output:
         Comment            DateTime
6   String Value 2022-03-08 16:15:00
7   String Value 2022-03-08 16:15:00
8   String Value 2022-03-08 16:15:00
9   String Value 2022-03-08 16:15:00
10  String Value 2022-03-08 16:17:00
11  String Value 2022-03-08 16:17:00
How is this:
df["DateTime"] = df["Date"] + df["Time"]
df['DateTime'] = pd.to_datetime(df['DateTime']) 
different than this?
df["Date Time"] = pd.to_datetime(df['Date'] + ' ' + df['Time'], errors='ignore')
Is it better to make a column of strings and then convert to datetime? I assumed that is what happens here:
pd.to_datetime(df['Date'] + ' ' + df['Time'])
Is it the str concatenation?
(Mar-17-2022, 01:32 AM) deanhystad Wrote: Is it better to make a column of strings and then convert to datetime? I assumed that is what happens here:
I could have done it in one step, which would have saved me a line.
Looking at it again, your post #8 is fine; just grabbing a datetime object and using it for filtering is good 👍
I was thinking more about post #2, where you used datetime from the standard library.

So here is updated code that does it in one step and, as an example, does it the other way around: dropping DateTime if you want the original split into Date and Time.
import pandas as pd
from io import StringIO

data = """\
3/8/22, 4:06 PM, String Value
3/8/22, 4:12 PM, String Value
3/8/22, 4:12 PM, String Value
3/8/22, 4:13 PM, String Value
3/8/22, 4:14 PM, String Value
3/8/22, 4:14 PM, String Value
3/8/22, 4:15 PM, String Value
3/8/22, 4:15 PM, String Value
3/8/22, 4:15 PM, String Value
3/8/22, 4:15 PM, String Value
3/8/22, 4:17 PM, String Value
3/8/22, 4:17 PM, String Value"""

df = pd.read_csv(StringIO(data), sep=",", names=["Date", "Time", "Comment"])
df['DateTime'] = pd.to_datetime(df["Date"] + df["Time"])
mask = (df['DateTime'] > '2022-03-08 16:06:00') & (df['DateTime'] >= '2022-03-08 16:15:00')
df_new = df.loc[mask]
df_new = df_new.drop(columns=['DateTime'])
print(df_new)
Output:
      Date     Time       Comment
6   3/8/22  4:15 PM  String Value
7   3/8/22  4:15 PM  String Value
8   3/8/22  4:15 PM  String Value
9   3/8/22  4:15 PM  String Value
10  3/8/22  4:17 PM  String Value
11  3/8/22  4:17 PM  String Value
@snippsat can I ask why the mask uses 2 dates?

mask = (df['DateTime'] > '2022-03-08 16:06:00') & (df['DateTime'] >= '2022-03-08 16:15:00')
I'll be trying both suggestions above shortly after my meeting, so thank you again for the suggestions.
Also, just to be clear, I can add "columns" by assigning to the DataFrame after reading in the file, like below? And if I want 2 columns, I can just define another one below this line and it will add both to the end of the existing columns, correct?

df["DateTime"] = df["Date"] + df["Time"]

So I implemented the suggestions above, but I'm getting an error:

import pandas as pd
from io import StringIO

df = pd.read_csv(StringIO("Dates.txt"), sep=",", names=["Date", "Time", "Comment"])

df['Date'] = pd.to_datetime(df['Date'])
df['DateTime'] = pd.to_datetime(df["Date"] + df["Time"])

mask = (df['DateTime'] > '2022-03-08 15:18:00')

df_new = df.loc[mask]
df_new = df_new.drop(columns=['DateTime'])

print(df_new)
Error:
ParserError: Too many columns specified: expected 3 and found 1

Figured it out. Because I had left this line in place, it was causing the issue:

df['Date'] = pd.to_datetime(df['Date'])
Removed it, and it runs and works like a charm. Thank you so much for all the suggestions and help.