Apr-03-2017, 05:46 PM
In your posted code you are converting and concatenating JSON files into one big CSV file. You repeatedly read one line from one of your JSON files, extract the four values you are interested in, and store them in your elements, so they can be converted to a dataframe and exported as a CSV at the end.
Instead of adding tweet values to elements, you could write each tweet directly to the CSV file - you won't "accumulate" all tweets in memory, you will just read one line with a tweet, parse that single JSON object, and write it into the file, so memory requirements will be low.
Your code could look something like this (untested; only the writer and the write call added):
import csv, json, os

elements_keys = ['created_at', 'text', 'lang', 'geo']

with open('outfile.csv', 'w', newline='') as csvfile:
    writer = csv.writer(csvfile)
    writer.writerow(elements_keys)  # header
    for root, subdirs, files in os.walk('/home/Dir'):
        for file in files:
            if file.endswith('.json'):
                # os.walk yields bare file names, so join with the directory
                with open(os.path.join(root, file), 'r') as input_file:
                    for line in input_file:
                        try:
                            tweet = json.loads(line)
                            row = [tweet[key] for key in elements_keys]
                            writer.writerow(row)  # write tweet straight into the file
                        except (ValueError, KeyError):
                            continue

I am not sure about the performance of csv.writer when writing line by line; maybe it would be better to accumulate rows in an auxiliary list and write them all at once every ~10000 rows with csv.writerows(). But to start I would try it one by one (and on a smaller number of files).
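If line-by-line writing turns out to be too slow, the batching idea could look roughly like this - a sketch, assuming the same '/home/Dir' tree and the same four keys as above; the 10000-row threshold is just an example value:

```python
import csv
import json
import os

elements_keys = ['created_at', 'text', 'lang', 'geo']
BATCH_SIZE = 10000  # example threshold; tune to taste

with open('outfile.csv', 'w', newline='') as csvfile:
    writer = csv.writer(csvfile)
    writer.writerow(elements_keys)  # header
    rows = []  # auxiliary list holding at most BATCH_SIZE rows
    for root, subdirs, files in os.walk('/home/Dir'):
        for file in files:
            if not file.endswith('.json'):
                continue
            with open(os.path.join(root, file), 'r') as input_file:
                for line in input_file:
                    try:
                        tweet = json.loads(line)
                        rows.append([tweet[key] for key in elements_keys])
                    except (ValueError, KeyError):
                        continue
                    if len(rows) >= BATCH_SIZE:
                        writer.writerows(rows)  # flush the whole batch at once
                        rows = []
    writer.writerows(rows)  # flush whatever is left over
```

Memory use stays bounded by BATCH_SIZE rows instead of one, which is still tiny compared to accumulating every tweet.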