Python Forum
CSV file with irregular structure
Thread Rating:
  • 0 Vote(s) - 0 Average
  • 1
  • 2
  • 3
  • 4
  • 5
CSV file with irregular structure
#1
I got a CSV file with about 500.000 lines of twitter tweets. The file is structured as follows:


Timestamp1 | Topic | Timestamp2 | User | Message | RetweetCount | Location

The problem I have is that especially the messages sometimes are not in one row but in two or even three rows. Furthermore, sometimes there is even a blank row between the parts of one message. 

Here is an example:
http://imgur.com/a/R0pbz (the upper row is good, the one under it isnt)

I also uploaded a sample file:
https://mega.nz/#!rZ5GwRrC!yKsHs26ZZZtXE...mIfIbp-fqs

How can I fix this?
I thought maybe there is a way to tell python to compromize the data into one row.

Thanks in advance!
Reply
#2
Quote:I thought maybe there is a way to tell python to compromize the data into one row.
I hope you meant consolidate, don't think you want to work with compromised data.
There must be some sort of delimiter at the end of each record.
It would go a long way to have a sample that contains a few records clipped from the top of the file.

I am also wondering if you are seeing three lines, because of wrapping, and not because of end-line sequences in the data.
Really need to see a sample.
Reply
#3
(May-16-2017, 10:40 AM)ulrich48155 Wrote: I got a CSV file with about 500.000 lines of twitter tweets. The file is structured as follows:


Timestamp1 | Topic | Timestamp2 | User | Message | RetweetCount | Location

The problem I have is that especially the messages sometimes are not in one row but in two or even three rows. Furthermore, sometimes there is even a blank row between the parts of one message. 

Here is an example:
http://imgur.com/a/R0pbz (the upper row is good, the one under it isnt)

I also uploaded a sample file:
https://mega.nz/#!rZ5GwRrC!yKsHs26ZZZtXE...mIfIbp-fqs

How can I fix this?
I thought maybe there is a way to tell python to compromize the data into one row.

Thanks in advance!

We don't need a huge sample... you can certainly extract some of the problem lines to a small file?

Normally the CSV reader will not split a record on line terminators if these characters are between the defined quote characters, and CSVs are supposed to use quotes as soon as there are ambiguities.

But maybe you assume it's a CSV when it is not and would require some massaging to be transformed into a valid CSV?
Unless noted otherwise, code in my posts should be understood as "coding suggestions", and its use may require more neurones than the two necessary for Ctrl-C/Ctrl-V.
Your one-stop place for all your GIMP needs: gimp-forum.net
Reply
#4
I got this file from my professor in order to do a sentiment analysis, but I got problems due to these damaged rows.

I attached the csv file including 30rows. The columns are seperated with a '|'.

Attached Files

.csv   tweets_sample.csv (Size: 64.32 KB / Downloads: 227)
Reply
#5
Do you specify the pipe (|) as the one and only delimiter?
Unless noted otherwise, code in my posts should be understood as "coding suggestions", and its use may require more neurones than the two necessary for Ctrl-C/Ctrl-V.
Your one-stop place for all your GIMP needs: gimp-forum.net
Reply
#6
If you only need to "compress" your file to one tweet per line, then you can "define" tweet as anything starting with row with timestamp followed by | and ending when another tweet starts (or file ends). You can iterate over lines from your file and for each line check whether it starts with timestamp followed by |  and either start new tweet, or append it to actual tweet ...

import re
pattern = re.compile(r"^\d{4}-\d\d-\d\d \d\d:\d\d:\d\d\.\d{6}\|")  # timestamp| 

with open('tweets_sample.csv') as infile, open('tweets_compress.psv', 'w') as outfile:
    outfile.write(next(infile).strip())  # to start first tweet
    for line in infile:        
        if pattern.match(line):  # new tweet starts, previous one ends ...
            outfile.write('\n')
        outfile.write(line.strip())
There is possibility that it would split tweet message (if body of tweet contains new line followed by timestamp|).
Reply
#7
My prof specified '|' as the only delimiter.

Thanks for that code! That really helped me Dance

Do you think it would even be possible to really connect the parts that were splitted over different rows? Now, when opening the compressed file with excel, some of the rows are seperated with blank columns. In order to do the sentiment analysis this still would require a great deal of manual work.
Reply
#8
(May-17-2017, 06:31 PM)ulrich48155 Wrote: Do you think it would even be possible to really connect the parts that were splitted over different rows? Now, when opening the compressed file with excel, some of the rows are seperated with blank columns. In order to do the sentiment analysis this still would require a great deal of manual work.

Do you mean two tweets separated by empty space in one line? That code writes new line only before row starting with timestamp|, so if there are tweets starting say with <space>timestamp|, they will be concatenated to previous one. This could be solved by changing pattern to match such lines (while increasing risk of breaking tweet in its message). Please post small sample of your input data that leads to "empty columns".
Reply
#9
It seems like every tweet which were seperated by blank lines is now seperate by blank rows.
I attached a sample which contains tweets after using that code. 

When I try to read the file and define '|' as delimiter in order to do a sentiment analysis I get this message:

Output:
CParserError: Error tokenizing data. C error: Expected 7 fields in line 24, saw 8

Sorry, I got the mistake now! 
The problem is, that some of these tweets have an extra '|' which makes everything more complicated.

Attached Files

.csv   tweets_sample2.csv (Size: 6.39 KB / Downloads: 320)
Reply
#10
(May-22-2017, 06:11 PM)ulrich48155 Wrote: It seems like every tweet which were seperated by blank lines is now seperate by blank rows.

I am afraid that I have no idea what it does mean (and what is difference between blank line and blank row?).

Yes, it seems that your data are "dirty" and pipe ("|") beside being field separator is used in body of tweets. If (and only if) you know that pipe could be used only inside of body of tweet (fifth field?) and correct tweet should have exactly six pipes, then you can try to preprocess it and keep first four pipes and last two pipes, while replacing all other ones with some character of your choice - that way your lines will have correct number of separators and there wont be any pipes in the text of tweet. You can do such replacing with something like
splits = line.split("|")
new_line = "|".join(splits[:4] + ["+".join(splits[4:-2])] + splits[-2:])  # replaces offending pipes with +
used on lines in your "compressed" file.
Reply


Forum Jump:

User Panel Messages

Announcements
Announcement #1 8/1/2020
Announcement #2 8/2/2020
Announcement #3 8/6/2020