Automating the code using os.walk - Printable Version (Python Forum, /thread-2359.html)
Automating the code using os.walk - kiton - Mar-09-2017

So I have a piece of code that works with a single file that is in the root folder:

import json
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

tweets = []

# Read Twitter JSON
for line in open('00.json'):  # single JSON to work with
    try:
        tweet = json.loads(line)
        tweets.append(tweet)
    except:
        continue

# Extract data of interest
tweet = tweets[0]
ids = [tweet['id_str'] for tweet in tweets if 'id_str' in tweet]  # tweets often have missing data, therefore use if
text = [tweet['text'] for tweet in tweets if 'text' in tweet]
lang = [tweet['lang'] for tweet in tweets if 'lang' in tweet]
geo = [tweet['geo'] for tweet in tweets if 'geo' in tweet]
place = [tweet['place'] for tweet in tweets if 'place' in tweet]

# Save data of interest in a pandas data frame
df = pd.DataFrame({'Ids': pd.Index(ids),
                   'Text': pd.Index(text),
                   'Lang': pd.Index(lang),
                   'Geo': pd.Index(geo),
                   'Place': pd.Index(place)})

# Create a data frame for this specific JSON excluding some data:
df00 = df[(df['Lang'] == 'en') & (df['Geo'].notna())]  # keep English tweets that have geo data

Now, I have about a thousand similar JSON files, each in a separate sub-folder. The coding this requires is difficult for me, so please bear with me here. My current goal is to (1) look into each sub-folder, (2) locate the *.json, (3) perform data extraction on it, and (4) create a data frame with the extracted data for all JSONs read.

import os
import json
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

tweets = []
rootdir = '/Users/mymac/Documents/00/23'

for subdir, dirs, files in os.walk(rootdir):
    for file in files:
        if file.endswith(".json"):
            # join the sub-folder path so the file can be opened regardless of the working directory
            for line in open(os.path.join(subdir, file)):
                try:
                    tweet = json.loads(line)
                    tweets.append(tweet)
                except:
                    continue

tweet = tweets[0]
ids = [tweet['id_str'] for tweet in tweets if 'id_str' in tweet]
text = [tweet['text'] for tweet in tweets if 'text' in tweet]
lang = [tweet['lang'] for tweet in tweets if 'lang' in tweet]
geo = [tweet['geo'] for tweet in tweets if 'geo' in tweet]
place = [tweet['place'] for tweet in tweets if 'place' in tweet]

df = pd.DataFrame({'Ids': pd.Index(ids),
                   'Text': pd.Index(text),
                   'Lang': pd.Index(lang),
                   'Geo': pd.Index(geo),
                   'Place': pd.Index(place)})
df

Thanks to suggestions provided below by wavic and zivoni, the code now works.

RE: Automating the code using os.walk - wavic - Mar-09-2017

>>> import os
>>> for root, dirs, files in os.walk('.'):
...     for file in files:
...         if file.endswith(".py"):
...             print(file)
...
animBanner.py
sound.py
python_bulk_rename_script.py
simple_mass_renamer.py
recipe-499305-1.py
2bg-adv.py
btpeer.py
ed25519.py

RE: Automating the code using os.walk - kiton - Mar-09-2017

wavic, thank you for the reply. So, I have adjusted the code based on your suggestion (see the bottom code in the first post). However, it still does not put all the extracted data in the data frame (just the column names).

RE: Automating the code using os.walk - zivoni - Mar-09-2017

Perhaps you should start to organize your code a little to make it more manageable - something like

def parse_json(filename):
    # load and parse json filename
    return dataframe

And after that you can do

df_list = []
for root, dirs, files in os.walk('.'):  # stolen from wavic
    for file in files:
        if file.endswith(".json"):
            df_list.append(parse_json(file))
df = pd.concat(df_list)

You should use with for opening files, and pass in the except clause is more common (it makes no difference in your code, but it would if the try/except were not the last statement in your for block...).
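Putting those pieces together, such a helper might look roughly like this (just a sketch - the column names come from your first post, the rest is assumed; here missing keys become None instead of being skipped):

import json
import pandas as pd

def parse_json(filename):
    """Load one line-delimited Twitter JSON file and return a DataFrame."""
    rows = []
    with open(filename) as input_file:   # with closes the file even if something goes wrong
        for line in input_file:
            try:
                tweet = json.loads(line)
            except ValueError:
                pass                      # skip malformed lines
            else:
                rows.append({'Ids': tweet.get('id_str'),
                             'Text': tweet.get('text'),
                             'Lang': tweet.get('lang'),
                             'Geo': tweet.get('geo'),
                             'Place': tweet.get('place')})
    return pd.DataFrame(rows)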
EDIT: Another reason for organizing your code is that it is much easier to debug. If you split your functionality into small pieces (like functions with limited "responsibility"), it is much easier to localize a problem than in one big monolithic block of code ...

RE: Automating the code using os.walk - wavic - Mar-09-2017

Your indentation is wrong. You have to unindent the code after the continue statement.
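In other words, everything from the extraction down should sit at the top level, after the os.walk loops have finished - roughly:

for subdir, dirs, files in os.walk(rootdir):
    for file in files:
        if file.endswith(".json"):
            for line in open(os.path.join(subdir, file)):
                try:
                    tweets.append(json.loads(line))
                except:
                    continue

# unindented: runs once, after the walk has visited every sub-folder
ids = [tweet['id_str'] for tweet in tweets if 'id_str' in tweet]
# ... same for text, lang, geo, place, then build the DataFrame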
RE: Automating the code using os.walk - kiton - Mar-09-2017

wavic, zivoni -- guys, your help is just superb! I was able to correct the errors and make the code work (the bottom code in the first post is updated). I'll now turn to learning how to make things nicer, as suggested by zivoni.

RE: Automating the code using os.walk - kiton - Apr-03-2017

Hi! With your invaluable help, I was able to successfully test the following code on a 15-gigabyte (~1,400 JSON files) data set.

# Parse a body (daily/monthly) of JSONs
import os
import json
import pandas as pd
import numpy as np
from collections import defaultdict
import timeit

tic = timeit.default_timer()

elements_keys = ['created_at', 'text', 'lang', 'geo']
elements = defaultdict(list)

for dirs, subdirs, files in os.walk('/home/Dir'):
    for file in files:
        if file.endswith('.json'):
            with open(os.path.join(dirs, file), 'r') as input_file:
                for line in input_file:
                    try:
                        tweet = json.loads(line)
                        for key in elements_keys:
                            elements[key].append(tweet[key])
                    except:
                        continue

df = pd.DataFrame({'created_at': pd.Index(elements['created_at']),
                   'Text': pd.Index(elements['text']),
                   'Lang': pd.Index(elements['lang']),
                   'Geo': pd.Index(elements['geo'])})
df
df.to_csv('month_12_01.csv')

Now, I am testing the above code on an EC2 instance (i3.8xlarge), feeding in 230 gigabytes of data (~44,000 JSON files). However, about 90 minutes into the run, the following error occurs. Please suggest a way to fix it.
RE: Automating the code using os.walk - Ofnuts - Apr-03-2017

(Apr-03-2017, 12:10 AM)kiton Wrote: Hi! With your invaluable help, I was able to successfully test the following code on a 15-gigabyte (~1,400 JSON files) data set.

Your first run and the 90-minute execution of the second run make it likely that your code is OK, so this could be a problem with your data. You should update your code to add print statements(*) that tell you what the code is doing (file loaded, processing step, etc.) to help you pinpoint the problem.

(*) Or better, go professional and use logging.

RE: Automating the code using os.walk - zivoni - Apr-03-2017

You are trying to create a pandas dataframe from columns with different lengths; that usually does not work. Your code appends to elements for each key separately, so if a tweet is missing a key from the second or higher position in elements_keys, the values for the "lower" keys have already been appended when the exception is raised. And from that point on, the rest of your data is misaligned (and worthless).

To solve it you need to ensure that you append either a complete "tweet row" or nothing. There are multiple ways to do it: check that all elements_keys are in the tweet before appending; "extract" all your keys' values in a single statement so the exception is raised before any appending; or modify the append code so it appends None for missing items. Example of the "second" approach - this should replace your innermost for loop:

items = [(key, tweet[key]) for key in elements_keys]  # should raise an error if any key is missing
for key, value in items:
    elements[key].append(value)

From your code, you are just converting your bunch of JSONs into one mega CSV. You could skip the dataframe step and do it by writing your tweet lines directly to a file (with csv writerow(s)?), and maybe use 200 MB of memory instead of 230 GB.

RE: Automating the code using os.walk - kiton - Apr-03-2017

zivoni, thank you for such an informative response. I am testing the solution you suggested. Ofnuts, thank you for the feedback. I got your point on learning "logging".
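For reference, the approach zivoni outlines above - appending only complete rows and streaming them straight to a CSV with the csv module instead of building one giant DataFrame - might look roughly like this (a sketch only; the root directory, output filename, and error handling are assumed):

import csv
import json
import os

elements_keys = ['created_at', 'text', 'lang', 'geo']
rootdir = '/home/Dir'  # assumed; point this at your own data directory

with open('tweets.csv', 'w', newline='') as out_file:
    writer = csv.writer(out_file)
    writer.writerow(elements_keys)  # header row
    for dirpath, subdirs, files in os.walk(rootdir):
        for name in files:
            if name.endswith('.json'):
                with open(os.path.join(dirpath, name)) as input_file:
                    for line in input_file:
                        try:
                            tweet = json.loads(line)
                            # raises KeyError before anything is written, so rows stay complete
                            row = [tweet[key] for key in elements_keys]
                        except (ValueError, KeyError):
                            continue  # skip malformed or incomplete tweets
                        writer.writerow(row)  # stream one row at a time; nothing accumulates in memory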