May-22-2017, 07:35 PM
(May-22-2017, 06:11 PM)ulrich48155 Wrote: It seems like every tweet which was separated by blank lines is now separated by blank rows.
I am afraid I have no idea what that means (and what is the difference between a blank line and a blank row?).
Yes, it seems that your data are "dirty": the pipe ("|"), besides being the field separator, is also used in the body of tweets. If (and only if) you know that extra pipes can occur only inside the body of a tweet (fifth field?) and that a correct line should have exactly six pipes, you can preprocess the data: keep the first four pipes and the last two pipes, and replace all the others with a character of your choice. That way each line will have the correct number of separators and there won't be any pipes in the text of the tweet. You can do the replacing with something like
    splits = line.split("|")
    new_line = "|".join(splits[:4] + ["+".join(splits[4:-2])] + splits[-2:])  # replaces offending pipes with +
used on the lines in your "compressed" file.
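If it helps, here is a slightly more defensive sketch of the same idea. The function name repair_line and its parameters are my own invention, and it still rests on the assumption above: seven fields per line, with stray pipes occurring only in the fifth one. Lines with too few pipes are returned as None so you can inspect them separately instead of silently mangling them.

```python
def repair_line(line, n_leading=4, n_trailing=2, replacement="+"):
    """Keep the first n_leading and last n_trailing fields as they are and
    merge everything in between into one field, replacing the extra pipes.

    Returns None when the line has too few fields to be repaired.
    """
    parts = line.rstrip("\n").split("|")
    # A valid line needs the leading fields, one body field and the trailing fields.
    if len(parts) < n_leading + 1 + n_trailing:
        return None
    head = parts[:n_leading]
    body = replacement.join(parts[n_leading:len(parts) - n_trailing])
    tail = parts[len(parts) - n_trailing:]
    return "|".join(head + [body] + tail)


# Made-up sample lines, only to show the behaviour.
samples = [
    "1|2|3|4|plain text|5|6",   # already correct, stays unchanged
    "1|2|3|4|a|b|c|5|6",        # two stray pipes inside the tweet body
    "1|2|3",                    # too short, cannot be repaired
]
repaired = [repair_line(s) for s in samples]
```

In a real run you would loop over the lines of your "compressed" file, write the repaired lines to a new file, and collect the ones that come back as None for a manual look.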