Dec-15-2021, 11:00 PM
So I have been trying to parse the export from WhatsApp chat history. I'm using it because I thought it would be a simple text format to work with, which it is up to a point, but then I found some issues in the file when I opened it in Notepad++.
For the most part, the rows in the file are formatted correctly and look like this:
9/5/21, 8:07 PM - Me: Lol romantic comedy
9/5/21, 8:38 PM - Friend: Yup
Then there are rows mixed into the file that throw off the formatting, so my Python code doesn't parse those lines well:
9/5/21, 8:07 PM - Me: Lol romantic comedy
https://music.youtube.com/
2006 present
9/5/21, 8:38 PM - Friend: Yup
If I go back to the actual chats for those lines, the message was multi-line, so this may be an issue with the export. I'm not looking to fix the export, though; I just want to parse out ONLY the rows/lines that have a date at the beginning.
Here is one set of logic I have tried. It got close, but I can't seem to exclude the lines that aren't complete. I think the end goal here is to parse the lines in a way that gives me total messages per day. But if there is a way to clear out the orphan lines that don't have a date or a sender, that may work as well.
import pandas as pd
from datetime import datetime as DateT

file2 = open("Dates.txt", "w",encoding='utf-8')
with open("Chat.txt", "r",encoding='utf-8') as file_in:
    lines = []
    for line in file_in:
        dt = line.partition('-')
        #datetime_obj = DateT.strptime(dt[0],'%m/%d/%y').date()
        print(dt[0])
        # print(line)
        # print(dt[0])
        file2.write(dt[0] + '\n')
file2.close()

I'm only using Python to learn more about what it can do. If I could use it to clean up the file, that would work, because I can then import it into Excel, create a pivot table from all the clean rows, and get my total counts per day. Then, if I want to see the messages, I can get to them from the pivot table.
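For the messages-per-day goal, the count could also be done in Python before anything goes to Excel. The following is only a sketch: the name messages_per_day and the pattern DATE_AT_START are made up for illustration, and it assumes every real message starts exactly like the sample rows ("9/5/21, 8:07 PM - ") with the date in month/day/year order, as in the commented-out strptime call above. Lines that don't match are skipped, so the orphan lines never reach the count.

```python
import re
from collections import Counter
from datetime import datetime

# Assumption: a real message starts like "9/5/21, 8:07 PM - ".
# The capture group holds only the date part.
DATE_AT_START = re.compile(r"^(\d{1,2}/\d{1,2}/\d{2}), \d{1,2}:\d{2} [AP]M - ")


def messages_per_day(lines):
    """Count lines that begin with a date/time header, grouped by date."""
    counts = Counter()
    for line in lines:
        match = DATE_AT_START.match(line)
        if match is None:
            continue  # orphan line from a multi-line message
        try:
            # Assumption: month/day/two-digit-year order.
            day = datetime.strptime(match.group(1), "%m/%d/%y").date()
        except ValueError:
            continue  # looked like a date but was not a valid one
        counts[day] += 1
    return counts


# Small in-memory demo using the rows from the post.
demo = [
    "9/5/21, 8:07 PM - Me: Lol romantic comedy\n",
    "https://music.youtube.com/\n",
    "2006 present\n",
    "9/5/21, 8:38 PM - Friend: Yup\n",
]
for day, total in sorted(messages_per_day(demo).items()):
    print(day, total)  # prints: 2021-09-05 2
```

With the real export, the open file object can be passed in directly, for example inside with open("Chat.txt", "r", encoding="utf-8") as file_in, since the function only needs something it can loop over line by line.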
In my first attempt, I was just trying to grab the dates out of the file and count them in Excel, but because some rows didn't have dates, they were still being pulled in by the dt[0] method I was using.
Can anyone suggest what I can do or focus on to clean out the rows that are not 100% complete?
Do I have to read each line and, if it doesn't start with a date, delete it or exclude it from being written to the file? If so, what functions or methods would I need to look at?
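The read-each-line-and-skip idea described above can be sketched with the standard re module. This is a minimal illustration, not a tested solution for every export: keep_dated_lines, clean_file, and the output name "Clean.txt" are hypothetical, and the pattern assumes the header looks exactly like the sample rows. It checks only for the date and time; if a sender is also required, the pattern would need to be extended.

```python
import re

# Assumption: a complete row starts like "9/5/21, 8:07 PM - ".
DATE_AT_START = re.compile(r"^\d{1,2}/\d{1,2}/\d{2}, \d{1,2}:\d{2} [AP]M - ")


def keep_dated_lines(lines):
    """Yield only the lines that begin with a date/time header."""
    for line in lines:
        if DATE_AT_START.match(line):
            yield line


def clean_file(src_path, dst_path):
    """Copy src_path to dst_path, leaving out lines without a date header."""
    with open(src_path, "r", encoding="utf-8") as src, \
         open(dst_path, "w", encoding="utf-8") as dst:
        dst.writelines(keep_dated_lines(src))
```

Calling clean_file("Chat.txt", "Clean.txt") would then write a file containing only the dated rows, which could be imported into Excel. Nothing is deleted from the original; the incomplete lines are simply never written to the new file.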