Python Forum

Speeding up Twitter parser
Hello dear forum members,

With your generous help I was able to successfully run the following code:

import os
import json
import pandas as pd
import numpy as np
from collections import defaultdict

elements_keys = ['created_at', 'text', 'lang', 'geo']
elements = defaultdict(list)

for dirs, subdirs, files in os.walk('/PATH'):
    for file in files:
        if file.endswith('.json'):
            # join the directory with the file name; a bare open(file)
            # only works if the file sits in the current working directory
            with open(os.path.join(dirs, file), 'r') as input_file:
                for line in input_file:
                    try:
                        tweet = json.loads(line)
                        items = [(key, tweet[key]) for key in elements_keys]  # raises KeyError if any key is missing
                        for key, value in items:
                            elements[key].append(value)
                    except (ValueError, KeyError):  # skip malformed lines and incomplete tweets
                        continue

df = pd.DataFrame({'created_at': pd.Index(elements['created_at']),
                   'text': pd.Index(elements['text']),
                   'lang': pd.Index(elements['lang']),
                   'geo': pd.Index(elements['geo'])})
Yet, as the volume of data has increased substantially, I am curious whether the processing time can be reduced. Specifically, of all the tweets analyzed I am interested only in those that (1) are in English ('lang' == 'en') and (2) carry a geo-tag ('geo' holding lat/lon coordinates rather than None). What is the correct way to add those two conditions?
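In other words, the check I have in mind boils down to something like this per parsed tweet (keep_tweet is just a placeholder name for illustration):

def keep_tweet(tweet):
    # keep only English tweets that carry a geo tag (lat/lon, not None);
    # .get() avoids a KeyError when a field is absent entirely
    return tweet.get('lang') == 'en' and tweet.get('geo') is not None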

Thank you in advance!
So, I managed to run the following version of the code:

import os
import json
import pandas as pd
import numpy as np
from collections import defaultdict

elements_keys = ['created_at', 'text', 'lang', 'geo']
elements = defaultdict(list)

for dirs, subdirs, files in os.walk('/Users/mymac/Documents/Jupyter/Twitter/00/test'):
    for file in files:
        if file.endswith('.json'):
            with open(os.path.join(dirs, file), 'r') as input_file:
                for line in input_file:
                    try:
                        tweet = json.loads(line)
                        items = [(key, tweet[key]) for key in elements_keys]  # raises KeyError if any key is missing
                        for key, value in items:
                            if tweet['lang'] == 'en':
                                if tweet['geo'] is not None:
                                    elements[key].append(value)
                    except (ValueError, KeyError):
                        continue

df = pd.DataFrame({'created_at': pd.Index(elements['created_at']),
                   'text': pd.Index(elements['text']),
                   'lang': pd.Index(elements['lang']),
                   'geo': pd.Index(elements['geo'])})
I wonder, though, whether this particular placement of the if condition is the one that actually reduces processing time.
Heck yeah, the thing is running way faster now :)
You are probably dropping the majority of your tweets, which is why processing speed increases so much. But putting the if condition inside your inner for loop is not ideal, as you check exactly the same condition four times per tweet (once for each of the elements_keys). It would be better to put it just before the for loop, so there is a single check per tweet.

for dirs, subdirs, files in os.walk('/Users/mymac/Documents/Jupyter/Twitter/00/test'):
    for file in files:
        if file.endswith('.json'):
            with open(os.path.join(dirs, file), 'r') as input_file:
                for line in input_file:
                    try:
                        tweet = json.loads(line)
                        items = [(key, tweet[key]) for key in elements_keys]  # raises KeyError if any key is missing
                        if tweet['lang'] == 'en' and tweet['geo'] is not None:  # single check per tweet
                            for key, value in items:
                                elements[key].append(value)
                    except (ValueError, KeyError):
                        continue
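If you want to squeeze out a bit more, here is a minimal sketch (assuming the same directory layout and fields as above) that also skips field extraction for dropped tweets and builds the DataFrame directly from elements:

import os
import json
import pandas as pd
from collections import defaultdict

elements_keys = ['created_at', 'text', 'lang', 'geo']
elements = defaultdict(list)

for dirs, subdirs, files in os.walk('/Users/mymac/Documents/Jupyter/Twitter/00/test'):
    for file in files:
        if file.endswith('.json'):
            with open(os.path.join(dirs, file), 'r') as input_file:
                for line in input_file:
                    try:
                        tweet = json.loads(line)
                        # filter first, so fields are only extracted for tweets we keep;
                        # a missing 'lang'/'geo' key raises KeyError and skips the tweet
                        if tweet['lang'] != 'en' or tweet['geo'] is None:
                            continue
                        items = [(key, tweet[key]) for key in elements_keys]
                    except (ValueError, KeyError):
                        continue
                    for key, value in items:
                        elements[key].append(value)

# a dict of equal-length lists converts straight into a DataFrame
df = pd.DataFrame(elements)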