Jun-16-2017, 03:39 PM
Hello dear forum members,
With your generous help I was able to successfully run the following code:
import os
import json
import pandas as pd
from collections import defaultdict

elements_keys = ['created_at', 'text', 'lang', 'geo']
elements = defaultdict(list)

for dirs, subdirs, files in os.walk('/PATH'):
    for file in files:
        if file.endswith('.json'):
            # join the directory and filename so files in subdirectories open correctly
            with open(os.path.join(dirs, file), 'r') as input_file:
                for line in input_file:
                    try:
                        tweet = json.loads(line)
                        # raises KeyError if any key is missing
                        items = [(key, tweet[key]) for key in elements_keys]
                        for key, value in items:
                            elements[key].append(value)
                    except (KeyError, json.JSONDecodeError):
                        continue

df = pd.DataFrame({'created_at': elements['created_at'],
                   'text': elements['text'],
                   'lang': elements['lang'],
                   'geo': elements['geo']})

Yet, as the volume of data has increased substantially, I am curious whether it is possible to speed up the processing. Specifically, of all the data analyzed, I am interested only in those tweets that are (1) in English ('lang' == 'en') and (2) carry a geo-tag ('geo' holds lat/lon coordinates rather than None). What is the correct way of adding those two conditions?
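My rough idea is a small filter applied right after json.loads(line), something like the sketch below (assuming the standard Twitter payload, where 'lang' is a top-level string and 'geo' is either None or a dict of coordinates):

import json

def keep_tweet(tweet):
    """Return True for English tweets that carry a geo-tag."""
    # Assumption: 'lang' is a top-level string and 'geo' is either
    # None or a dict of coordinates, as in the standard Twitter payload.
    return tweet.get('lang') == 'en' and tweet.get('geo') is not None

# Intended usage inside the existing loop:
#     tweet = json.loads(line)
#     if not keep_tweet(tweet):
#         continue

Would calling it right after parsing, before appending anything to elements, be the right place, so that non-matching tweets are skipped early instead of being stored and discarded later?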
Thank you in advance!