Python Forum

I'm working on a project where I have a CSV file with about 25,000 rows, each holding a unique URL in the first (and only) column. I iterate through the rows, open each URL, process the data behind it, and write some extended data to a second CSV file.

The problem I'm having is that every now and then something causes my Python script to fail. I then have to restart it and manually edit the CSV file of URLs to remove the rows I've already processed, so that the script resumes with the first URL I have yet to process.

In my script, I write the number of the row that was just successfully processed to another small CSV file, so that I have a reference if/when the script fails. For example, if the last row successfully processed is row 500, the number 500 is written to that other CSV file.

When I have to restart the script, I'd like to retrieve that number 500 (I know how to do this) and have the script use it to skip rows 0-500 in the large file of 25,000 URLs, since I've already processed those, and resume processing at row 501.

What I'm trying to avoid is having the script iterate through each individual row until it reaches row 500, as that seems to be a time waster. After I read the CSV file that contains "500", is there a way to have my script open the CSV file of URLs and just skip immediately to row 501, iterating from that point forward, instead of cycling through all the rows until it gets to row 501?

I've read that I could possibly use "islice" to do this more efficiently, but I haven't found anything concrete on the best and most efficient way to accomplish this.

Thanks in advance for any suggestions you can offer.

with open('my_data', 'r') as data_csv:
    data = csv.reader(data_csv)

    for line in data[501:]:
        print(line)

Did you try something like this?

That gives me an error: "csv.reader object is not subscriptable"

Hm!

with open('my_data', 'r') as data_csv:
    data = csv.reader(data_csv)
    for _ in range(500):  # skip the first 500 rows
        next(data)

    for row in data:
        print(row)

Use islice from itertools.
https://docs.python.org/3.6/library/iter...ols.islice

Example:

import csv
from itertools import islice

with open('my_data.csv') as fd:
    # islice(reader, 500, None) skips the first 500 rows, then yields the rest
    for row in islice(csv.reader(fd), 500, None):
        print(row)
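
To tie this back to the restart scenario in the question, here is a minimal sketch of how the saved row count might be combined with islice. The file names and the helpers read_checkpoint, save_checkpoint and resume_rows are hypothetical, made up for illustration, and the sketch assumes the checkpoint file holds nothing but the count of rows already processed. Note that islice still reads past the skipped rows internally, since CSV rows have no fixed length to seek by; for 25,000 short rows that should cost very little.

```python
import csv
from itertools import islice
from pathlib import Path


def read_checkpoint(path):
    """Return the number of rows already processed (0 if there is no checkpoint yet)."""
    p = Path(path)
    if not p.exists():
        return 0
    text = p.read_text().strip()
    return int(text) if text else 0


def save_checkpoint(path, count):
    """Overwrite the checkpoint file with the count of rows processed so far."""
    Path(path).write_text(str(count))


def resume_rows(urls_path, checkpoint_path):
    """Yield (count, url) pairs, starting after the rows recorded as done."""
    done = read_checkpoint(checkpoint_path)
    with open(urls_path, newline='') as fd:
        remaining = islice(csv.reader(fd), done, None)
        # count is 1-based: after handling a row, `count` rows are done in total
        for count, row in enumerate(remaining, start=done + 1):
            if row:  # a blank line comes back as an empty list; skip it
                yield count, row[0]
```

In the main loop (not shown in the thread) you would do something like `for count, url in resume_rows('urls.csv', 'checkpoint.txt'):`, process the URL, and then call `save_checkpoint('checkpoint.txt', count)` only once that row has succeeded, so a crash never records an unprocessed row as done.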

The islice recommendation worked PERFECTLY! Thanks so very much! Really appreciate the help!

bmccollum

wavic

bmccollum

wavic

DeaD_EyE

bmccollum