Python Forum
Read csv file with inconsistent delimiter - Printable Version

+- Python Forum (https://python-forum.io)
+-- Forum: Python Coding (https://python-forum.io/forum-7.html)
+--- Forum: General Coding Help (https://python-forum.io/forum-8.html)
+--- Thread: Read csv file with inconsistent delimiter (/thread-39690.html)



Read csv file with inconsistent delimiter - gracenz - Mar-27-2023

Hi everyone.

Can I seek your help on properly reading this csv file with incomplete data (e.g. row 14 to 23) and different delimiters?
row 23 has been incorrectly captured as far as it looks.
The data contains 8 columns i.e. Release_Date, Title, Overview, Popularity, Vote_Count, Vote_Average, Original_Language, Genre.

Thanking you in advance.


RE: Read csv file with inconsistent delimiter - DeaD_EyE - Mar-27-2023

Just skip the rows, which start with a -.

import csv
from datetime import datetime as DateTime


def read_broken_csv(file):
    with open(file, newline="") as fd:
        reader = csv.reader(fd)
        header = next(reader)
        for row in reader:
            # skipping everything where row[0] starts with a `-`
            if row[0].lstrip().startswith("-"):
                continue
            # conversion to date may fail, if the data is in the wrong format
            try:
                row[0] = DateTime.strptime(row[0], "%m/%d/%Y").date()
                row[-1] = tuple(map(str.strip, row[-1].split(",")))
            except ValueError:
                # skipping row, if the format of date was not ok
                continue

            yield row


for row in read_broken_csv("Downloads/testing.csv"):
    print(row)



RE: Read csv file with inconsistent delimiter - deanhystad - Mar-27-2023

To fix you need to identify when a linefeed is not the end of a row, but a continuation of the description. You could look " - " at the start of the "Release_Date" to find the continuations. If the Overivew is "NaN" append the description to the previous description. If the Overview is a number, use the Overview, Popularity, Vote_Count, Vote_Average, and Original_Language as the Popularity, Vote_Count, Vote_Average, Original_Language and Genre for the row. Even then you still have that odd link after genre. Quite complicated.

Or you could just fix whatever is generating these files and have it place quotes around the Title and Overview. I edited your csv file and wrapped all the titles and Overviews in quotes. I left the linefeeds in the pixie bakeoff description and deleted the link after the pixie bakeoff genre (Animaged) as well as all the commas that were added to the csv file because you did not have quotes on the Overview. The modified csv file reads fine.