Posts: 70
Threads: 17
Joined: Feb 2017
Mar-09-2017, 09:48 PM
(This post was last modified: Mar-09-2017, 11:47 PM by kiton.)
So I have a piece of code that works with a single file that is in the root folder:
import json
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

tweets = []
# Read Twitter JSON
for line in open('00.json'):  # single JSON to work with
    try:
        tweet = json.loads(line)
        tweets.append(tweet)
    except:
        continue

# Extract data of interest
ids = [tweet['id_str'] for tweet in tweets if 'id_str' in tweet]  # tweets often have missing data, therefore use if
text = [tweet['text'] for tweet in tweets if 'text' in tweet]
lang = [tweet['lang'] for tweet in tweets if 'lang' in tweet]
geo = [tweet['geo'] for tweet in tweets if 'geo' in tweet]
place = [tweet['place'] for tweet in tweets if 'place' in tweet]

# Save data of interest in a pandas data frame
df = pd.DataFrame({'Ids': pd.Index(ids),
                   'Text': pd.Index(text),
                   'Lang': pd.Index(lang),
                   'Geo': pd.Index(geo),
                   'Place': pd.Index(place)})

# Create a data frame for this specific JSON: English tweets that have geo data
df00 = df[(df['Lang'] == 'en') & (df['Geo'].notnull())]

Now, I have about a thousand similar JSON files, each of which is in a separate sub-folder. The following coding elaborations are difficult for me, so please bear with me here. My current goal is to (1) look into each sub-folder, (2) locate the *.json file, (3) perform data extraction on it, and (4) create a data frame with the extracted data for all JSONs read.
import os
import json
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

tweets = []
rootdir = '/Users/mymac/Documents/00/23'
for subdir, dirs, files in os.walk(rootdir):
    for file in files:
        if file.endswith(".json"):
            with open(os.path.join(subdir, file)) as input_file:  # join sub-folder and file name
                for line in input_file:
                    try:
                        tweet = json.loads(line)
                        tweets.append(tweet)
                    except:
                        continue

ids = [tweet['id_str'] for tweet in tweets if 'id_str' in tweet]
text = [tweet['text'] for tweet in tweets if 'text' in tweet]
lang = [tweet['lang'] for tweet in tweets if 'lang' in tweet]
geo = [tweet['geo'] for tweet in tweets if 'geo' in tweet]
place = [tweet['place'] for tweet in tweets if 'place' in tweet]

df = pd.DataFrame({'Ids': pd.Index(ids),
                   'Text': pd.Index(text),
                   'Lang': pd.Index(lang),
                   'Geo': pd.Index(geo),
                   'Place': pd.Index(place)})
df

Thanks to the suggestions provided below by wavic and zivoni, the code now works.
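As an aside, a recursive glob is a more compact way to find the files; a minimal sketch, assuming the rootdir, tweets, and imports from the code above (recursive=True needs Python 3.5+):

import glob

# Recursively match every .json under rootdir and its sub-folders
for path in glob.glob(os.path.join(rootdir, '**', '*.json'), recursive=True):
    with open(path) as input_file:
        for line in input_file:
            try:
                tweets.append(json.loads(line))
            except ValueError:  # skip malformed lines
                continue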
Posts: 2,953
Threads: 48
Joined: Sep 2016
>>> import os
>>> for root, dirs, files in os.walk('.'):
...     for file in files:
...         if file.endswith(".py"):
...             print(file)
...
animBanner.py
sound.py
python_bulk_rename_script.py
simple_mass_renamer.py
recipe-499305-1.py
2bg-adv.py
btpeer.py
ed25519.py
Posts: 70
Threads: 17
Joined: Feb 2017
Mar-09-2017, 10:21 PM
(This post was last modified: Mar-09-2017, 10:21 PM by kiton.)
wavic, thank you for the reply. I have adjusted the code based on your suggestion (see the bottom code in the first post). However, it still does not put the extracted data in the data frame (just the column names).
Posts: 331
Threads: 2
Joined: Feb 2017
Mar-09-2017, 10:51 PM
(This post was last modified: Mar-09-2017, 10:53 PM by zivoni.)
Perhaps you should start to organize your code a little to make it more manageable - something like
def parse_json(filename):
    # load and parse json filename
    return dataframe

And after that you can do

df_list = []
for root, dirs, files in os.walk('.'):  # stolen from wavic
    for file in files:
        if file.endswith(".json"):
            df_list.append(parse_json(file))
df = pd.concat(df_list)

You should use with for file opening, and a pass in the except clause is more common (it makes no difference for your code, but it would if that were not the last line in your for block...).
EDIT:
Another reason for organizing your code is that it is much easier to debug. If you split your functionality into small pieces (like functions with limited "responsibility"), it is much easier to localize a problem than in big monolithic code...
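A minimal sketch of such a parse_json, assuming the line-delimited tweet files from the first post (the key list is illustrative):

import json
import pandas as pd

def parse_json(filename):
    """Load one line-delimited JSON file and return a DataFrame of tweets."""
    tweets = []
    with open(filename) as input_file:  # with closes the file even on errors
        for line in input_file:
            try:
                tweets.append(json.loads(line))
            except ValueError:  # skip malformed lines
                pass
    keys = ['id_str', 'text', 'lang', 'geo', 'place']
    # .get returns None for missing keys, so all columns stay the same length
    return pd.DataFrame({key: [tweet.get(key) for tweet in tweets] for key in keys})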
Posts: 2,953
Threads: 48
Joined: Sep 2016
Your indentation is wrong. You have to unindent the code after the continue statement.
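Schematically, using the variable names from the first post, the extraction should sit at the top level so it runs once, after the walk has finished:

for subdir, dirs, files in os.walk(rootdir):
    for file in files:
        if file.endswith(".json"):
            ...  # json.loads each line and append to tweets

# unindented: executes once, after all files have been read
ids = [tweet['id_str'] for tweet in tweets if 'id_str' in tweet]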
Posts: 70
Threads: 17
Joined: Feb 2017
wavic, zivoni -- guys, your help is just superb! I was able to correct the errors and make the code work (the bottom code in the first post is updated). I'll now turn to learning how to make things nicer, as suggested by zivoni.
Posts: 70
Threads: 17
Joined: Feb 2017
Apr-03-2017, 12:10 AM
(This post was last modified: Apr-03-2017, 12:10 AM by kiton.)
Hi! With your invaluable help, I was able to successfully test the following code on a 15-gigabyte data set (~1,400 JSON files).
# Parse a body (daily/monthly) of JSONs
import os
import json
import pandas as pd
import numpy as np
from collections import defaultdict
import timeit

tic = timeit.default_timer()

elements_keys = ['created_at', 'text', 'lang', 'geo']
elements = defaultdict(list)

for root, dirs, files in os.walk('/home/Dir'):
    for file in files:
        if file.endswith('.json'):
            with open(os.path.join(root, file), 'r') as input_file:
                for line in input_file:
                    try:
                        tweet = json.loads(line)
                        for key in elements_keys:
                            elements[key].append(tweet[key])
                    except:
                        continue

df = pd.DataFrame({'created_at': pd.Index(elements['created_at']),
                   'Text': pd.Index(elements['text']),
                   'Lang': pd.Index(elements['lang']),
                   'Geo': pd.Index(elements['geo'])})
df
df.to_csv('month_12_01.csv')

Now, I am testing the above code on an EC2 instance (i3.8xlarge), feeding in 230 gigabytes (~44,000 JSON files). However, about 90 minutes into the run, the following error occurs.
Please suggest a way to fix it.
Error:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-12-d4376f526220> in <module>()
27 'text': pd.Index(elements['text']),
28 'lang': pd.Index(elements['lang']),
---> 29 'geo': pd.Index(elements['geo'])})
30 df
31 df.to_csv('month_12_01.csv')
/home/ubuntu/anaconda3/lib/python3.6/site-packages/pandas/core/frame.py in __init__(self, data, index, columns, dtype, copy)
264 dtype=dtype, copy=copy)
265 elif isinstance(data, dict):
--> 266 mgr = self._init_dict(data, index, columns, dtype=dtype)
267 elif isinstance(data, ma.MaskedArray):
268 import numpy.ma.mrecords as mrecords
/home/ubuntu/anaconda3/lib/python3.6/site-packages/pandas/core/frame.py in _init_dict(self, data, index, columns, dtype)
400 arrays = [data[k] for k in keys]
401
--> 402 return _arrays_to_mgr(arrays, data_names, index, columns, dtype=dtype)
403
404 def _init_ndarray(self, values, index, columns, dtype=None, copy=False):
/home/ubuntu/anaconda3/lib/python3.6/site-packages/pandas/core/frame.py in _arrays_to_mgr(arrays, arr_names, index, columns, dtype)
5396 # figure out the index, if necessary
5397 if index is None:
-> 5398 index = extract_index(arrays)
5399 else:
5400 index = _ensure_index(index)
/home/ubuntu/anaconda3/lib/python3.6/site-packages/pandas/core/frame.py in extract_index(data)
5444 lengths = list(set(raw_lengths))
5445 if len(lengths) > 1:
-> 5446 raise ValueError('arrays must all be same length')
5447
5448 if have_dicts:
ValueError: arrays must all be same length
Posts: 687
Threads: 37
Joined: Sep 2016
(Apr-03-2017, 12:10 AM)kiton Wrote: Hi! With your invaluable help, I was able to successfully test the following code on a 15-gigabyte data set (~1,400 JSON files).
Now, I am testing the above code on an EC2 instance (i3.8xlarge), feeding in 230 gigabytes (~44,000 JSON files). However, about 90 minutes into the run, the following error occurs.
Please suggest a way to fix it.
Your first run and the 90-minute execution of the second run make it likely that your code is OK, so this could be a problem with your data. You have to update your code to add print statements(*) that tell you what the code is doing (file loaded, processing step, etc.) to help you pinpoint the problem.
(*) Or better, go professional and use logging
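A minimal sketch of such logging, assuming the os.walk loop from the code above (the logger name and format are illustrative):

import logging
import os

logging.basicConfig(level=logging.INFO,
                    format='%(asctime)s %(levelname)s %(message)s')
log = logging.getLogger('tweet_parser')

for root, dirs, files in os.walk('/home/Dir'):
    for file in files:
        if file.endswith('.json'):
            log.info('loading %s', os.path.join(root, file))
            # ... parse the file here ...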
Posts: 331
Threads: 2
Joined: Feb 2017
You are trying to create a pandas dataframe from columns with different lengths; that usually does not work.
Your code appends to elements for each key separately, so if a tweet is missing a key from the second or later position in elements_keys, the values for the "earlier" keys have already been appended by the time the exception is raised. From that point on, the rest of your data is misaligned (and worthless).
To solve it, you need to ensure that you append either a complete "tweet row" or nothing. There are multiple ways to do that: checking that all elements_keys are present in the tweet before appending; "extracting" all your keys' values in a single statement, so the exception is raised before any appending; or modifying the append code so that it appends None for missing items.
Example of "second" approach - this should replace your innermost for loop:
items = [(key, tweet[key]) for key in elements_keys] # should raise error if any key is missing
for key, value in items:
elements[key].append(value) From your code your are just converting your bunch of json's to one mega csv. You could skip dataframe step and do it by writing your tweet line directly to a file (with csv.writerow(s) ?) and maybe with 200MB of memory instead of 230GB.
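A minimal sketch of that streaming approach, assuming the same directory layout and elements_keys as above (the output filename is illustrative):

import csv
import json
import os

elements_keys = ['created_at', 'text', 'lang', 'geo']

with open('tweets.csv', 'w', newline='') as out:
    writer = csv.writer(out)
    writer.writerow(elements_keys)  # header row
    for root, dirs, files in os.walk('/home/Dir'):
        for file in files:
            if not file.endswith('.json'):
                continue
            with open(os.path.join(root, file)) as input_file:
                for line in input_file:
                    try:
                        tweet = json.loads(line)
                        row = [tweet[key] for key in elements_keys]  # fails before anything is written
                    except (ValueError, KeyError):
                        continue
                    writer.writerow(row)  # one row at a time, constant memory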
Posts: 70
Threads: 17
Joined: Feb 2017
Apr-03-2017, 04:34 PM
(This post was last modified: Apr-03-2017, 10:17 PM by kiton.)
zivoni, thank you for such an informative response. I am testing the solution you suggested.
Ofnuts, thank you for the feedback. I got your point about learning "logging".