Python Forum - Automating the code using os.walk


So I have a piece of code that works with a single file that is in the root folder:

import json
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

tweets = []

# Read Twitter JSON
for line in open('00.json'): # single JSON to work with
    try:
        tweet = json.loads(line)
        tweets.append(tweet)
    except:
        continue

# Extract data of interest        
tweet = tweets[0]
ids = [tweet['id_str'] for tweet in tweets if 'id_str' in tweet] #tweets often have missing data, therefore use if
text = [tweet['text'] for tweet in tweets if 'text' in tweet]
lang = [tweet['lang'] for tweet in tweets if 'lang' in tweet]
geo = [tweet['geo'] for tweet in tweets if 'geo' in tweet]
place = [tweet['place'] for tweet in tweets if 'place' in tweet]

# Save data of interest in a pandas data frame
df=pd.DataFrame({'Ids':pd.Index(ids),
                'Text':pd.Index(text),
                'Lang':pd.Index(lang),
                'Geo':pd.Index(geo),
                'Place':pd.Index(place)})

# Create a data frame for this specific JSON excluding some data:
df00 = df[(df['Lang'] == 'en') & df['Geo'].notnull()]  # keep English tweets that have geo data

Now, I have about a thousand similar JSON files, each in a separate sub-folder. The following coding elaborations are difficult for me, so please bear with me here. My current goal is to (1) look into each sub-folder, (2) locate the *.json, (3) perform data extraction on it, (4) create a data frame with the extracted data for all JSONs read.

import os
import json
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

tweets = []

rootdir = '/Users/mymac/Documents/00/23'

for subdir, dirs, files in os.walk(rootdir):
    for file in files:
        if file.endswith(".json"):
            for line in open(os.path.join(subdir, file)):  # os.walk yields bare file names, so join with subdir
                try:
                    tweet = json.loads(line)
                    tweets.append(tweet)
                except:
                    continue
                    
tweet = tweets[0]
                        
ids = [tweet['id_str'] for tweet in tweets if 'id_str' in tweet] 
text = [tweet['text'] for tweet in tweets if 'text' in tweet]
lang = [tweet['lang'] for tweet in tweets if 'lang' in tweet]
geo = [tweet['geo'] for tweet in tweets if 'geo' in tweet]                    
place = [tweet['place'] for tweet in tweets if 'place' in tweet]
                            
df=pd.DataFrame({'Ids':pd.Index(ids),
                 'Text':pd.Index(text),
                 'Lang':pd.Index(lang),
                 'Geo':pd.Index(geo),
                 'Place':pd.Index(place)})
df

Thanks to suggestions provided below by wavic and zivoni, the code now works.

>>> import os
>>> for root, dirs, files in os.walk('.'):
...     for file in files:
...         if file.endswith(".py"):
...             print(file)
... 
animBanner.py
sound.py
python_bulk_rename_script.py
simple_mass_renamer.py
recipe-499305-1.py
2bg-adv.py
btpeer.py
ed25519.py

wavic, thank you for the reply. I have adjusted the code based on your suggestion (see the bottom code in the first post). However, it still does not put the extracted data in the data frame (I get just the column names).

Perhaps you should start to organize your code a little to make it more manageable - something like

def parse_json(filename):
    # load and parse json filename
    return dataframe

And after that you can do

df_list = []
for root, dirs, files in os.walk('.'):       # stolen from wavic
    for file in files:
        if file.endswith(".json"):
            # os.walk yields bare file names; join with root to get a usable path
            df_list.append(parse_json(os.path.join(root, file)))

df = pd.concat(df_list)
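One way the parse_json stub above could be filled in is sketched below. It is only a sketch under assumptions: the FIELDS list, the load_tree helper and the choice to store None for missing fields are illustrative and not part of the original posts.

```python
import json
import os

import pandas as pd

# Assumed column set (taken from the first post); adjust as needed.
FIELDS = ['id_str', 'text', 'lang', 'geo', 'place']


def parse_json(filename):
    """Read one line-delimited JSON file and return its tweets as a DataFrame."""
    rows = []
    with open(filename) as input_file:
        for line in input_file:
            try:
                tweet = json.loads(line)
            except ValueError:  # malformed or empty line
                continue
            if not isinstance(tweet, dict):
                continue
            # dict.get() yields None for a missing field, so every row has
            # the same columns and the frame stays aligned.
            rows.append({field: tweet.get(field) for field in FIELDS})
    return pd.DataFrame(rows, columns=FIELDS)


def load_tree(rootdir):
    """Hypothetical helper: concatenate the frames of all *.json files under rootdir."""
    frames = []
    for root, dirs, files in os.walk(rootdir):
        for name in files:
            if name.endswith('.json'):
                frames.append(parse_json(os.path.join(root, name)))
    if not frames:
        return pd.DataFrame(columns=FIELDS)
    return pd.concat(frames, ignore_index=True)
```

Calling load_tree('/Users/mymac/Documents/00/23') would then return a single frame for the whole tree; because each file is parsed on its own, a problem file is easier to isolate.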

You should use with for file opening, and pass in the except clause is maybe more common than continue (it makes no difference in your code, but it would if the try/except were not the last statement in your for block...)
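A small sketch of both points, assuming line-delimited JSON as in the first post. The read_tweets function and the lines_seen counter are made up for illustration; the counter exists only to show where pass and continue would behave differently.

```python
import json


def read_tweets(path):
    """Return (tweets, lines_seen) for a line-delimited JSON file."""
    tweets = []
    lines_seen = 0
    # The with block closes the file even if an exception escapes the loop.
    with open(path) as input_file:
        for line in input_file:
            try:
                tweets.append(json.loads(line))
            except ValueError:
                # pass does nothing and falls through to the statement below;
                # continue here would skip it and jump to the next line.
                pass
            lines_seen += 1
    return tweets, lines_seen
```

With continue in place of pass, lines_seen would count only the lines that parsed successfully.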

EDIT:
Another reason for organizing your code is that it's much easier to debug. If you split your functionality into small pieces (like functions with limited "responsibility"), it's much easier to localize a problem than in big monolithic code ...

Your indentation is wrong. You have to unindent the code after the continue statement.

wavic, zivoni -- guys, your help is just superb! I was able to correct the errors and make the code work (the bottom code in the first post is updated). I'll now turn to learning how to make things nicer, as suggested by zivoni.

Hi! With your invaluable help, I was able to successfully test the following code using a 15-gigabyte data set (~1400 JSON files).

# Parse a body (daily/monthly) of JSONs

import os
import json
import pandas as pd
import numpy as np
from collections import defaultdict
import timeit
tic=timeit.default_timer()

elements_keys = ['created_at', 'text', 'lang', 'geo']
elements = defaultdict(list)

for dirpath, subdirs, files in os.walk('/home/Dir'):
   for file in files:
       if file.endswith('.json'):
           with open(os.path.join(dirpath, file), 'r') as input_file:
               for line in input_file:
                   try:
                       tweet = json.loads(line)
                       for key in elements_keys:
                           elements[key].append(tweet[key])
                   except:
                       continue

df=pd.DataFrame({'created_at': pd.Index(elements['created_at']),
                'Text': pd.Index(elements['text']),
                'Lang': pd.Index(elements['lang']),
                'Geo': pd.Index(elements['geo'])})
df
df.to_csv('month_12_01.csv')

Now, I am testing the above code on an EC2 instance (i3.8xlarge), feeding in 230 gigabytes (~44,000 JSON files). However, about 90 minutes into the run, the following error occurs.

Please suggest a way to fix it.

Error:
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-12-d4376f526220> in <module>()
     27                  'text': pd.Index(elements['text']),
     28                  'lang': pd.Index(elements['lang']),
---> 29                  'geo': pd.Index(elements['geo'])})
     30 df
     31 df.to_csv('month_12_01.csv')

/home/ubuntu/anaconda3/lib/python3.6/site-packages/pandas/core/frame.py in __init__(self, data, index, columns, dtype, copy)
    264                                  dtype=dtype, copy=copy)
    265         elif isinstance(data, dict):
--> 266             mgr = self._init_dict(data, index, columns, dtype=dtype)
    267         elif isinstance(data, ma.MaskedArray):
    268             import numpy.ma.mrecords as mrecords

/home/ubuntu/anaconda3/lib/python3.6/site-packages/pandas/core/frame.py in _init_dict(self, data, index, columns, dtype)
    400             arrays = [data[k] for k in keys]
    401 
--> 402         return _arrays_to_mgr(arrays, data_names, index, columns, dtype=dtype)
    403 
    404     def _init_ndarray(self, values, index, columns, dtype=None, copy=False):

/home/ubuntu/anaconda3/lib/python3.6/site-packages/pandas/core/frame.py in _arrays_to_mgr(arrays, arr_names, index, columns, dtype)
   5396     # figure out the index, if necessary
   5397     if index is None:
-> 5398         index = extract_index(arrays)
   5399     else:
   5400         index = _ensure_index(index)

/home/ubuntu/anaconda3/lib/python3.6/site-packages/pandas/core/frame.py in extract_index(data)
   5444             lengths = list(set(raw_lengths))
   5445             if len(lengths) > 1:
-> 5446                 raise ValueError('arrays must all be same length')
   5447 
   5448             if have_dicts:

ValueError: arrays must all be same length

(Apr-03-2017, 12:10 AM) kiton wrote: Hi! With your invaluable help, I was able to successfully test the following code using a 15-gigabyte data set (~1400 JSON files).

Now, I am testing the above code on an EC2 instance (i3.8xlarge), feeding in 230 gigabytes (~44,000 JSON files). However, about 90 minutes into the run, the following error occurs.

Please suggest a way to fix it.

Your successful first run, and the fact that the second run went 90 minutes before failing, make it likely that your code is OK, so the problem is probably in your data. Update your code to add print statements(*) that tell you what the code is doing (file loaded, processing step, etc.) to help you pinpoint the problem.

(*) Or better, go professional and use logging
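A minimal sketch of what such logging might look like, using only the standard library. The logger name 'tweet_parser', the scan function and the per-file counters are assumptions made for illustration, not something from the original posts.

```python
import json
import logging
import os

logging.basicConfig(level=logging.INFO,
                    format='%(asctime)s %(levelname)s %(message)s')
log = logging.getLogger('tweet_parser')  # hypothetical logger name


def scan(rootdir):
    """Walk rootdir, count good and bad lines per file, and log as we go."""
    totals = {'files': 0, 'good': 0, 'bad': 0}
    for root, dirs, files in os.walk(rootdir):
        for name in files:
            if not name.endswith('.json'):
                continue
            path = os.path.join(root, name)
            good = bad = 0
            with open(path) as input_file:
                for lineno, line in enumerate(input_file, 1):
                    try:
                        json.loads(line)
                        good += 1
                    except ValueError:
                        bad += 1
                        # Only shown when the level is set to DEBUG.
                        log.debug('%s:%d is not valid JSON', path, lineno)
            log.info('%s: %d good lines, %d bad lines', path, good, bad)
            totals['files'] += 1
            totals['good'] += good
            totals['bad'] += bad
    return totals
```

On a long run, the last file named in the log before a failure is a good first place to look.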

You are trying to create a pandas dataframe from columns with different lengths, and that usually does not work.

Your code appends to elements for each key separately. If a tweet is missing a key that sits in the second or a later position in elements_keys, the values for the earlier keys have already been appended by the time the exception is raised. From that point on the lists have different lengths, and the rest of your data is misaligned (and worthless).

To solve it, you need to ensure that you append either a complete "tweet row" or nothing. There are multiple ways to do this: check that all elements_keys are in the tweet before appending; "extract" all your keys' values in a single statement, so the exception is raised before anything is appended; or modify the append code so that it appends None for missing items.

Example of "second" approach - this should replace your innermost for loop:

items = [(key, tweet[key]) for key in elements_keys] # should raise error if any key is missing
for key, value in items:
    elements[key].append(value)
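For comparison, the first and third approaches listed above might look like the sketch below. The function names append_if_complete and append_with_none are made up for illustration, and both assume tweet is a dict.

```python
from collections import defaultdict

elements_keys = ['created_at', 'text', 'lang', 'geo']


def append_if_complete(elements, tweet):
    """Approach 1: append the whole row only when every key is present."""
    if all(key in tweet for key in elements_keys):
        for key in elements_keys:
            elements[key].append(tweet[key])
        return True
    return False


def append_with_none(elements, tweet):
    """Approach 3: always append a row, using None for missing keys."""
    for key in elements_keys:
        elements[key].append(tweet.get(key))


elements = defaultdict(list)
append_with_none(elements, {'created_at': 'Mon', 'text': 'hello'})
# Every list in elements now has length 1; 'lang' and 'geo' hold None.
```

Either way, all lists grow in lockstep, which is what the DataFrame constructor needs.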

From your code, you are just converting your bunch of JSON files into one mega CSV. You could skip the dataframe step and write each tweet row directly to a file (with csv.writer and writerow(s)?), maybe with 200MB of memory instead of 230GB.
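A rough sketch of that streaming idea with the standard csv module. The tree_to_csv function and its arguments are hypothetical, and skipping incomplete tweets is just one of the possible policies discussed in this thread.

```python
import csv
import json
import os

elements_keys = ['created_at', 'text', 'lang', 'geo']


def tree_to_csv(rootdir, csv_path):
    """Stream every complete tweet under rootdir into csv_path; return row count."""
    written = 0
    # newline='' is what the csv module documentation asks for on Python 3.
    with open(csv_path, 'w', newline='') as out_file:
        writer = csv.writer(out_file)
        writer.writerow(elements_keys)  # header row
        for root, dirs, files in os.walk(rootdir):
            for name in files:
                if not name.endswith('.json'):
                    continue
                with open(os.path.join(root, name)) as input_file:
                    for line in input_file:
                        try:
                            tweet = json.loads(line)
                            row = [tweet[key] for key in elements_keys]
                        except (ValueError, KeyError, TypeError):
                            continue  # bad JSON or incomplete tweet: skip it
                        writer.writerow(row)
                        written += 1
    return written
```

Only one line is held in memory at a time. Note that a nested value such as geo is written as its Python string form, so it may need flattening first if the CSV is to be read back into structured data.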

zivoni, Thank you for such an informative response. I am testing the solution you suggested.

Ofnuts, thank you for feedback. I got your point on learning "logging".
