Python Forum
Trying to parse and work with the WhatsApp export of chats
#1
So I have been trying to parse the chat history export from WhatsApp. I picked it because I thought it would be a simple text format to work with, which it is up to a point, but then I found some issues in the file when I opened it in Notepad++.

For the most part, the rows in the file are formatted correctly and look like this:
9/5/21, 8:07 PM - Me: Lol romantic comedy
9/5/21, 8:38 PM - Friend: Yup

Then there are rows mixed in that break the format, so my Python code doesn't parse those lines well:

9/5/21, 8:07 PM - Me: Lol romantic comedy
https://music.youtube.com/
2006 present
9/5/21, 8:38 PM - Friend: Yup

If I go back to the actual chats, those messages were multi-line, so the export may have split them across rows. I'm not looking to fix the export; I just want to parse out ONLY the rows/lines that have a date at the beginning.

Here is one approach I have tried that got close, but I can't seem to exclude the lines that aren't complete. The end goal is to parse the lines in a way that gives me total messages per day. But if there is a way to clear out the orphan lines that don't have a date or a sender, that may work as well.
from datetime import datetime as DateT

# Copy the text before the first '-' of every line into Dates.txt.
with open("Chat.txt", "r", encoding="utf-8") as file_in, \
        open("Dates.txt", "w", encoding="utf-8") as file2:
    for line in file_in:
        dt = line.partition('-')

        # datetime_obj = DateT.strptime(dt[0], '%m/%d/%y').date()

        print(dt[0])

        # Lines without a timestamp are still written here, which is the problem.
        file2.write(dt[0] + '\n')
I'm only using Python to learn more about what it can do. If I could use it to clean up the file, that would work, because I can then import it into Excel, create a pivot table from the clean rows, and get my total counts per day. Then if I want to see the messages, I can get to them from the pivot table.

In my first attempt, I was trying to grab just the dates out of the file and do a count in Excel, but the rows without dates were still being pulled in by the dt[0] method I was using.

Can anyone suggest what I can do or focus on to clean out the rows that are not 100% complete?
Do I have to read each line and, if it doesn't start with a date, exclude it from being written to the file? If so, what functions or methods would I need to look at?
#2
My go-to here would be to use a regular expression to match lines with the formatting you want. You can ignore, store, or otherwise flag the lines that don't match.

import re

text_in = """9/5/21, 8:07 PM - Me: Lol romantic comedy
https://music.youtube.com/
2006 present
9/5/21, 8:38 PM - Friend: Yup
"""

datetime_rx = re.compile(r"(\d\d?/\d\d?/\d\d, \d\d?:\d\d [AP]M) - (.*)")
for line in text_in.splitlines():
    match = datetime_rx.match(line)
    if match:
        print(f"Found timestamp: {match.groups()[0]}, Contents: {match.groups()[1]}")
    else:
        print(f"no timestamp: {line}")
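
For the per-day totals mentioned in the first post, the same idea could be extended along these lines. This is only a sketch: the function name `count_messages_per_day` is made up for illustration, and it assumes the dates are month/day/year, as in the `'%m/%d/%y'` format from the original attempt.

```python
import re
from collections import Counter
from datetime import datetime

# Same pattern as above, but with the date captured on its own.
line_rx = re.compile(r"(\d\d?/\d\d?/\d\d), (\d\d?:\d\d [AP]M) - (.*)")


def count_messages_per_day(lines):
    """Count the lines that start with a timestamp, keyed by date."""
    counts = Counter()
    for line in lines:
        match = line_rx.match(line)
        if match is None:
            continue  # orphan line from a multi-line message: skip it
        # Assumption: dates are month/day/two-digit-year.
        day = datetime.strptime(match.group(1), "%m/%d/%y").date()
        counts[day] += 1
    return counts


sample = """9/5/21, 8:07 PM - Me: Lol romantic comedy
https://music.youtube.com/
2006 present
9/5/21, 8:38 PM - Friend: Yup
"""

for day, total in sorted(count_messages_per_day(sample.splitlines()).items()):
    print(day, total)
```

With a real export, passing the open file object (for example `open("Chat.txt", encoding="utf-8")`) instead of `sample.splitlines()` should work the same way, since the function only iterates over lines.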
#3
(Dec-15-2021, 11:35 PM)bowlofred Wrote: My go-to here would be to use a regular expression to match lines with the formatting you want. You can ignore, store, or otherwise flag the lines that don't match.

So I have a question. I'm pretty much a beginner with Python. I have used it in the past for some web scraping, but 95% of that code was repurposed from other users in the company and online samples. My background is C# Windows applications, and in a lot of cases we would use regex, but only when we were checking against small amounts of data, because we were told it was slow with large volumes of data.
Is that true with Python as well? I don't really care so much for this little project, because it's more for fun, but I would like to know as a matter of best practice.
The text file I'm reading today is just over 40k rows/messages.

I'll be giving the above a try this morning and will post back with any questions or issues.

thank you
#4
Your example works on its own, so I'm trying to implement it in my logic to see if I can get a cleaner output file, thank you.
I'll post back shortly after I try that and see what I get.
#5
Regex can be quite fast, but it's almost an entire language. It is possible to write an expression that takes a huge amount of time to process (think of a naive recursive Fibonacci program). If your expression requires a lot of backtracking, then running it on a large dataset can take forever before it fails.

The one above uses match, which only tries to match at the start of the string (here, each line). There's essentially no opportunity for backtracking, so it wouldn't be expected to be slow even on huge targets.
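
To make the difference concrete, here is a rough timing sketch. The pattern `(a+)+$` is a textbook example of a backtracking-heavy expression and is not from this thread; the helper `time_match` is likewise just for illustration. Exact timings will vary by machine.

```python
import re
import time

# The timestamp pattern from post #2: no nested quantifiers.
datetime_rx = re.compile(r"(\d\d?/\d\d?/\d\d, \d\d?:\d\d [AP]M) - (.*)")

# A deliberately bad pattern: the nested quantifier makes the engine try
# an exponential number of ways to split the run of 'a's before giving up.
nested_rx = re.compile(r"(a+)+$")


def time_match(rx, text):
    """Return (match result, elapsed seconds) for rx.match(text)."""
    start = time.perf_counter()
    result = rx.match(text)
    return result, time.perf_counter() - start


# A million-character line that does not start with a digit:
# the timestamp pattern gives up at the first character.
fast_result, fast_seconds = time_match(datetime_rx, "x" * 1_000_000)

# Only 21 characters, but the failed match triggers heavy backtracking.
slow_result, slow_seconds = time_match(nested_rx, "a" * 20 + "b")

print(f"timestamp pattern, 1,000,000 chars: {fast_seconds:.6f} s")
print(f"nested pattern, 21 chars: {slow_seconds:.6f} s")
```

Neither call finds a match; the point is how long each takes to fail. Adding a few more `a` characters to the second input roughly doubles its time with each one.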
#6
Awesome. I still haven't had a chance to test it on the actual code I'm using, but will for sure today; I had back-to-back meetings that were not all expected.

But really appreciate the explanation.

thank you
#7
OK, I just tested this against the actual complete file and it works great. It takes about 2 seconds to run, and I can pull the data into Excel and build a quick pivot table with it.
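
For anyone wanting to skip the manual import step, here is a minimal sketch of writing only the matched rows to a CSV that Excel can open. The output name `Clean.csv`, the function name `write_clean_rows`, and the extra sender/message split are assumptions, not what was actually run here; lines with a timestamp but no `Sender: ` part would be dropped by this pattern.

```python
import csv
import io
import re

# Timestamp pattern from post #2, extended (as an assumption) to split
# the rest of the line into sender and message at the first ": ".
line_rx = re.compile(r"(\d\d?/\d\d?/\d\d), (\d\d?:\d\d [AP]M) - ([^:]*): (.*)")


def write_clean_rows(lines, out_file):
    """Write date, time, sender, message for each matching line; return the count."""
    writer = csv.writer(out_file)
    writer.writerow(["date", "time", "sender", "message"])
    kept = 0
    for line in lines:
        match = line_rx.match(line)
        if match:
            writer.writerow(match.groups())
            kept += 1
    return kept


sample = """9/5/21, 8:07 PM - Me: Lol romantic comedy
https://music.youtube.com/
2006 present
9/5/21, 8:38 PM - Friend: Yup
"""

# In-memory demo; with real files this might look like:
#   with open("Chat.txt", encoding="utf-8") as src, \
#           open("Clean.csv", "w", newline="", encoding="utf-8") as dst:
#       write_clean_rows(src, dst)
buffer = io.StringIO()
kept = write_clean_rows(sample.splitlines(), buffer)
print(kept, "rows kept")
print(buffer.getvalue())
```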

Thank you

Is there a way to mark this thread/post answered and complete?