Find and delete above a certain line in text file - Printable Version +- Python Forum (https://python-forum.io) +-- Forum: Python Coding (https://python-forum.io/forum-7.html) +--- Forum: General Coding Help (https://python-forum.io/forum-8.html) +--- Thread: Find and delete above a certain line in text file (/thread-36661.html) Pages:
1
2
|
Find and delete above a certain line in text file - cubangt - Mar-15-2022 So i have a text file with the following line structure: (In some cases the text file has thousands of rows) 3/8/22, 4:06 PM, String Value 3/8/22, 4:12 PM, String Value 3/8/22, 4:12 PM, String Value 3/8/22, 4:13 PM, String Value 3/8/22, 4:14 PM, String Value 3/8/22, 4:14 PM, String Value 3/8/22, 4:15 PM, String Value 3/8/22, 4:15 PM, String Value 3/8/22, 4:15 PM, String Value 3/8/22, 4:15 PM, String Value 3/8/22, 4:17 PM, String Value 3/8/22, 4:17 PM, String Value I would be like to be able to run a script against this file to delete rows above a given date/time.. So in the above example, say my date/time was "3/8/22, 4:15 PM", so in that case, i would need to delete the following rows: DELETE THESE 3/8/22, 4:06 PM, String Value 3/8/22, 4:12 PM, String Value 3/8/22, 4:12 PM, String Value 3/8/22, 4:13 PM, String Value 3/8/22, 4:14 PM, String Value 3/8/22, 4:14 PM, String Value UPDATED FILE TO ONLY CONTAIN THESE 3/8/22, 4:15 PM, String Value 3/8/22, 4:15 PM, String Value 3/8/22, 4:15 PM, String Value 3/8/22, 4:15 PM, String Value 3/8/22, 4:17 PM, String Value 3/8/22, 4:17 PM, String Value I have found a few example of using "seek" and found some examples using pointers to get the position, but those examples were very simple search for specific word/string and wasnt able to use my data to accomplish the same thing. In my case i need to match the date and time to find that row, then delete everything above that row. And in the above case, which is possible, i need to make sure that it deletes above the first row found with that date/time combo.. RE: Find and delete above a certain line in text file - deanhystad - Mar-15-2022 There is no deleting. You create a new file that only contains the items you want to keep. The new file may have the same name as the old and be in the same folder. Your date and time strings are not formatted well for comparisons. 12/31/21 is older than any of your examples, but a string comparison would say it is newest. For easy comparison your date should be YYYY/MM/DD and time should be 24 hour time. To compare times and dates you 'll need to convert the strings to something that can be compared. I would convert them to datetime objects. This means you will have to read each line, get the date and time string, convert the string to a datetime object, and compare the datetime object to the datetime object for your cutoff time. from datetime import datetime dates = [ "3/8/22, 4:06 PM, String Value", "3/8/22, 4:12 PM, String Value", "3/8/22, 4:12 PM, String Value", "3/8/22, 4:13 PM, String Value", "3/8/22, 4:14 PM, String Value", "3/8/22, 4:14 PM, String Value", "3/8/22, 4:15 PM, String Value", "3/8/22, 4:15 PM, String Value", "I am the walrus, koo koo kachoo" "3/8/22, 4:15 PM, String Value", "3/8/22, 4:15 PM, String Value", "3/8/22, 4:17 PM, String Value", "3/8/22, 4:17 PM, String Value", ] def get_time(line): try: date, time, *extra = line.split(", ") return datetime.strptime(f"{date} {time}", "%d/%m/%y %I:%M %p") except ValueError: return None # Return none if line does not start with date, time cutoff = get_time("3/8/22, 4:14 PM, String Value") for line in dates: dt = get_time(line) if dt and dt >= cutoff: print(line) You'll need to decide what to do with lines that don't start with date, time. In this example I ignore them.This example assumes that lines start with date followed by time and the time and date are always formatted the same way. If the date/time is not always at the start of the line you could do a search using a regular expression to find text that matches the pattern used for date and time. RE: Find and delete above a certain line in text file - cubangt - Mar-15-2022 i will take a look at that, i have no control over the source file, that is exported out of another system and provided to me, currently i have to do an data import into excel to update a master file, but the source file contains more records than we need.. so thats the reason for the removal of rows.. Since it is comma delimited, would the converting of the date/time be easier or harder because of the row structure? RE: Find and delete above a certain line in text file - deanhystad - Mar-15-2022 If it is always the same rows, there are tools to make this easier. Pandas will read data in column form and do the date and time conversions for you. Then you can use pandas filters to remove items. Essentially it will do what I did in my example, but faster and easier. RE: Find and delete above a certain line in text file - cubangt - Mar-15-2022 I have used pandas before, but only once or twice for different projects.. didnt even consider that.. but will take a look at your above example and pandas as an alternative solution.. So i played around with both your suggestion and pandas and ended up with the pandas version, alot less lines and achieved the same results Here is a simple working example with my data file import pandas as pd df = pd.read_csv("Dates.txt", usecols=range(3), header=None, names=["Date", "Time", "Comment"]) df['Date'] = pd.to_datetime(df['Date']) mask = (df['Date'] > '2022-03-08') df = df.loc[mask] print(df) So i got this working great based on just the date, but finding that i need to also filter based on the time as well.. So i was trying this, but my dataframe comes back empty.. import pandas as pd df = pd.read_csv("Dates.txt", usecols=range(3), header=None, names=["Date", "Time", "Comment"]) df['Date'] = pd.to_datetime(df['Date']) mask = (df['Date'] >= '2022-03-08') & (df['Time'] == '4:14 PM') df = df.loc[mask] print(df)Since the time portion is in its own column in the raw data, how can i combine the 2 conditions to find that specific row This is what i get in the console: Empty DataFrame Columns: [Date, Time, Comment] Index: [] So in the sample data below, i would want make sure that i find all records greater than 3/8/22 4:14pm DELETE THESE 3/8/22, 4:06 PM, String Value 3/8/22, 4:12 PM, String Value 3/8/22, 4:12 PM, String Value 3/8/22, 4:13 PM, String Value 3/8/22, 4:14 PM, String Value 3/8/22, 4:14 PM, String Value UPDATED FILE TO ONLY CONTAIN THESE 3/8/22, 4:15 PM, String Value 3/8/22, 4:15 PM, String Value 3/8/22, 4:15 PM, String Value 3/8/22, 4:15 PM, String Value 3/8/22, 4:17 PM, String Value 3/8/22, 4:17 PM, String Valu RE: Find and delete above a certain line in text file - deanhystad - Mar-16-2022 You want to filter using datetime, not date or time. Here I create a new Date Time column from Date and Time. Then I use the Date Time column to do my filtering. df = pd.DataFrame([ ["3/6/22", "4:06 PM", "String Value"], ["3/7/22", "4:12 PM", "String Value"], ["3/8/22", "3:12 PM", "String Value"], ["3/8/22", "4:13 PM", "String Value"], ["3/9/22", "2:14 PM", "String Value"], ["3/9/22", "4:14 PM", "String Value"], ["3/10/22", "1:15 PM", "String Value"], ["3/10/22", "2:15 PM", "String Value"], ["3/10/22", "3:15 PM", "String Value"], ["3/10/22", "4:15 PM", "String Value"], ["3/12/22", "4:17 AM", "String Value"], ["3/13/22", "4:17 PM", "String Value"]]) df.columns=["Date", "Time", "String"] df["Date Time"] = pd.to_datetime(df['Date'] + ' ' + df['Time'], errors='ignore') cutoff = df["Date Time"][7] # Grab a datetime object for filtering print("Cutoff =", cutoff) df = df.loc[df["Date Time"] >= cutoff] print(df) You would remove the "Date Time" column before printing.
RE: Find and delete above a certain line in text file - snippsat - Mar-17-2022 deanhystad Pandas has own Date and Time support,so using datetime from stander library is not needed. I agree that data and time could be join together. Here example. import pandas as pd from io import StringIO data = """\ 3/8/22, 4:06 PM, String Value 3/8/22, 4:12 PM, String Value 3/8/22, 4:12 PM, String Value 3/8/22, 4:13 PM, String Value 3/8/22, 4:14 PM, String Value 3/8/22, 4:14 PM, String Value 3/8/22, 4:15 PM, String Value 3/8/22, 4:15 PM, String Value 3/8/22, 4:15 PM, String Value 3/8/22, 4:15 PM, String Value 3/8/22, 4:17 PM, String Value 3/8/22, 4:17 PM, String Value""" df = pd.read_csv(StringIO(data), sep=",", names=["Date", "Time", "Comment"]) df["DateTime"] = df["Date"] + df["Time"] #df = df.drop(columns=['Date', 'Time']) # Just keep DateTime df['DateTime'] = pd.to_datetime(df['DateTime'])So i always check dtypes to see that DateTime is now a datetime64 Pandas own datetime object.>>> df.dtypes Date object Time object Comment object DateTime datetime64[ns] dtype: objectPut together using Boolean masks to filter. import pandas as pd from io import StringIO data = """\ 3/8/22, 4:06 PM, String Value 3/8/22, 4:12 PM, String Value 3/8/22, 4:12 PM, String Value 3/8/22, 4:13 PM, String Value 3/8/22, 4:14 PM, String Value 3/8/22, 4:14 PM, String Value 3/8/22, 4:15 PM, String Value 3/8/22, 4:15 PM, String Value 3/8/22, 4:15 PM, String Value 3/8/22, 4:15 PM, String Value 3/8/22, 4:17 PM, String Value 3/8/22, 4:17 PM, String Value""" df = pd.read_csv(StringIO(data), sep=",", names=["Date", "Time", "Comment"]) df["DateTime"] = df["Date"] + df["Time"] #df = df.drop(columns=['Date', 'Time']) # njust keep DateTime df['DateTime'] = pd.to_datetime(df['DateTime']) mask = (df['DateTime'] > '2022-03-08 16:06:00') & (df['DateTime'] >= '2022-03-08 16:15:00') df_new = df.loc[mask]) print(df_new Uncomment line 20 or instead drop DateTime if want original format:
RE: Find and delete above a certain line in text file - deanhystad - Mar-17-2022 How is this: df["DateTime"] = df["Date"] + df["Time"] df['DateTime'] = pd.to_datetime(df['DateTime'])different than this? df["Date Time"] = pd.to_datetime(df['Date'] + ' ' + df['Time'], errors='ignore')Is it better to make a column of strings and then convert to datetime? I assumed that is what happens here: pd.to_datetime(df['Date'] + ' ' + df['Time'])Is it the str concatenation? RE: Find and delete above a certain line in text file - snippsat - Mar-17-2022 (Mar-17-2022, 01:32 AM)deanhystad Wrote: Is it better to make a column of strings and then convert to datetime? I assumed that is what happens here:I could have done it one step,which would have saved me a line. Looking at it again so is your post #8 fine,just grab a datetime object and use it filtering is good👍 I was more thinking about post #2 where you did use datetime from standard library. So update make code to use one step and as a example doing it the other way around dropping DateTime if want the original split in Date and Time. import pandas as pd from io import StringIO data = """\ 3/8/22, 4:06 PM, String Value 3/8/22, 4:12 PM, String Value 3/8/22, 4:12 PM, String Value 3/8/22, 4:13 PM, String Value 3/8/22, 4:14 PM, String Value 3/8/22, 4:14 PM, String Value 3/8/22, 4:15 PM, String Value 3/8/22, 4:15 PM, String Value 3/8/22, 4:15 PM, String Value 3/8/22, 4:15 PM, String Value 3/8/22, 4:17 PM, String Value 3/8/22, 4:17 PM, String Value""" df = pd.read_csv(StringIO(data), sep=",", names=["Date", "Time", "Comment"]) df['DateTime'] = pd.to_datetime(df["Date"] + df["Time"]) mask = (df['DateTime'] > '2022-03-08 16:06:00') & (df['DateTime'] >= '2022-03-08 16:15:00') df_new = df.loc[mask] df_new = df_new.drop(columns=['DateTime']) print(df_new)
RE: Find and delete above a certain line in text file - cubangt - Mar-17-2022 @snippsat can i ask why have the mask using 2 dates? mask = (df['DateTime'] > '2022-03-08 16:06:00') & (df['DateTime'] >= '2022-03-08 16:15:00')Ill be trying both suggestions above shortly after my meeting, so thank you again for the suggestions.. Also just to be clear, i can add "columns" by adding a dataframe after reading in the file like below? and if i want 2 columns, i can just define it below this one and it will add both to the end of the file columns correct? df["DateTime"] = df["Date"] + df["Time"] So i implemented the suggestions above.. but getting an error: import pandas as pd from io import StringIO df = pd.read_csv(StringIO("Dates.txt"), sep=",", names=["Date", "Time", "Comment"]) df['Date'] = pd.to_datetime(df['Date']) df['DateTime'] = pd.to_datetime(df["Date"] + df["Time"]) mask = (df['DateTime'] > '2022-03-08 15:18:00') df_new = df.loc[mask] df_new = df_new.drop(columns=['DateTime']) print(df_new)
Figured it out.. because i had left this line in place, it was causing the issue df['Date'] = pd.to_datetime(df['Date'])removed it and runs and works like a charm, thank you so much for all the suggestions and help. |