CSV file with irregular structure

***zivoni*** · (This post was last modified: May-17-2017, 04:15 PM by zivoni.)

If you only need to "compress" your file to one tweet per line, you can define a tweet as everything from a line that starts with a timestamp followed by | up to the line where the next tweet starts (or the end of the file). Iterate over the lines of your file and check whether each one starts with a timestamp followed by |: if it does, start a new tweet; otherwise append the line to the current tweet.

import re

# A new tweet starts on a line that begins with "YYYY-MM-DD HH:MM:SS.ffffff|"
pattern = re.compile(r"^\d{4}-\d\d-\d\d \d\d:\d\d:\d\d\.\d{6}\|")

with open('tweets_sample.csv') as infile, open('tweets_compress.psv', 'w') as outfile:
    outfile.write(next(infile).strip())  # first line opens the first output row
    for line in infile:
        if pattern.match(line):  # new tweet starts, previous one ends
            outfile.write('\n')
        outfile.write(line.strip())  # continuation lines are appended without a separator
    outfile.write('\n')  # terminate the last row

There is a possibility that this splits a tweet message in the wrong place (if the body of a tweet contains a newline followed by timestamp|).
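To see both the normal case and that caveat without touching any files, here is a minimal sketch of the same idea as a generator working on a list of lines. The name compress_lines and the sample rows are made up for illustration, and the "timestamp|text" layout is only an assumption about your data.

```python
import re

# Same pattern as in the snippet above: "YYYY-MM-DD HH:MM:SS.ffffff|" at line start
TIMESTAMP = re.compile(r"^\d{4}-\d\d-\d\d \d\d:\d\d:\d\d\.\d{6}\|")


def compress_lines(lines):
    """Yield one string per tweet (hypothetical helper, not part of any library)."""
    current = None
    for raw in lines:
        text = raw.strip()
        if TIMESTAMP.match(raw) and current is not None:
            yield current      # a new tweet starts, so the previous one is complete
            current = text
        elif current is None:
            current = text     # very first line (header or first tweet)
        else:
            current += text    # continuation line, appended without a separator
    if current is not None:
        yield current          # last tweet ends when the input ends


# Invented sample data: a tweet that spans two lines, then a second tweet
sample = [
    "2017-05-17 16:15:00.000001|first tweet\n",
    "continues here\n",
    "2017-05-17 16:16:00.000002|second tweet\n",
]
tweets = list(compress_lines(sample))

# Invented sample data for the caveat: the second line belongs to the body,
# but it looks like the start of a tweet, so it gets split off
tricky = [
    "2017-05-17 16:15:00.000001|quoting a log line:\n",
    "2017-05-17 16:15:30.000000|not a new tweet\n",
]
split = list(compress_lines(tricky))
```

With the first sample you get two tweets, the first one with its continuation line glued on. With the second sample you also get two items, although it was meant to be a single tweet; the pattern alone cannot tell these cases apart.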
