Python Forum
Speeding up Twitter parser
#1
Hello dear forum members,

With your generous help I was able to successfully run the following code:

import os
import json
import pandas as pd
from collections import defaultdict

elements_keys = ['created_at', 'text', 'lang', 'geo']
elements = defaultdict(list)

for dirpath, subdirs, files in os.walk('/PATH'):
    for file in files:
        if file.endswith('.json'):
            # join with dirpath, or files outside the current directory won't open
            with open(os.path.join(dirpath, file), 'r') as input_file:
                for line in input_file:
                    try:
                        tweet = json.loads(line)
                        items = [(key, tweet[key]) for key in elements_keys]  # raises KeyError if any key is missing
                        for key, value in items:
                            elements[key].append(value)
                    except (json.JSONDecodeError, KeyError):
                        continue

df = pd.DataFrame({'created_at': pd.Index(elements['created_at']),
                   'text': pd.Index(elements['text']),
                   'lang': pd.Index(elements['lang']),
                   'geo': pd.Index(elements['geo'])})
Yet, as the volume of data has increased substantially, I am curious whether the processing can be sped up. Specifically, of all the data analyzed I am interested only in those tweets that are (1) in English ('lang' == 'en') and (2) geo-tagged ('geo' is not None, i.e. carries lat/lon coordinates). What is the correct way of adding those two conditions?

Thank you in advance!
#2
So, I managed to run the following version of the code:

import os
import json
import pandas as pd
from collections import defaultdict

elements_keys = ['created_at', 'text', 'lang', 'geo']
elements = defaultdict(list)

for dirpath, subdirs, files in os.walk('/Users/mymac/Documents/Jupyter/Twitter/00/test'):
    for file in files:
        if file.endswith('.json'):
            with open(os.path.join(dirpath, file), 'r') as input_file:
                for line in input_file:
                    try:
                        tweet = json.loads(line)
                        items = [(key, tweet[key]) for key in elements_keys]  # raises KeyError if any key is missing
                        for key, value in items:
                            if tweet['lang'] == 'en':
                                if tweet['geo'] is not None:
                                    elements[key].append(value)
                    except (json.JSONDecodeError, KeyError):
                        continue

df = pd.DataFrame({'created_at': pd.Index(elements['created_at']),
                   'text': pd.Index(elements['text']),
                   'lang': pd.Index(elements['lang']),
                   'geo': pd.Index(elements['geo'])})
I wonder, though, whether this particular placement of the if condition is the one that actually reduces processing time.
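One way to settle this empirically (a sketch of my own, not from the thread) is to time both placements on synthetic data. The sample tweets below are made up; on a real dump the gap depends on what fraction of tweets fail the filter.

```python
import json
import time
from collections import defaultdict

elements_keys = ['created_at', 'text', 'lang', 'geo']

# Synthetic data: mostly-rejected tweets, as in a raw multilingual dump.
lines = [json.dumps({'created_at': 'x', 'text': 't',
                     'lang': 'fr', 'geo': None})] * 50_000

def check_inside(lines):
    # Condition tested once per key, i.e. four times per tweet.
    elements = defaultdict(list)
    for line in lines:
        tweet = json.loads(line)
        items = [(key, tweet[key]) for key in elements_keys]
        for key, value in items:
            if tweet['lang'] == 'en' and tweet['geo'] is not None:
                elements[key].append(value)
    return elements

def check_before(lines):
    # Condition tested once per tweet, before the inner loop.
    elements = defaultdict(list)
    for line in lines:
        tweet = json.loads(line)
        if tweet['lang'] == 'en' and tweet['geo'] is not None:
            for key in elements_keys:
                elements[key].append(tweet[key])
    return elements

for fn in (check_inside, check_before):
    t0 = time.perf_counter()
    fn(lines)
    print(fn.__name__, round(time.perf_counter() - t0, 3), 's')
```

Both functions keep exactly the same tweets, so the timing difference is purely the cost of the redundant checks.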
#3
Heck yeah the thing is running now way faster :)
#4
You are probably dropping the majority of your tweets, which is why processing got so much faster. But putting the if condition inside your inner for loop is not ideal: you check the exact same condition four times per tweet (once for each of the elements_keys). Better to put it just before the for loop, so there is a single check per tweet.

for dirpath, subdirs, files in os.walk('/Users/mymac/Documents/Jupyter/Twitter/00/test'):
    for file in files:
        if file.endswith('.json'):
            with open(os.path.join(dirpath, file), 'r') as input_file:
                for line in input_file:
                    try:
                        tweet = json.loads(line)
                        items = [(key, tweet[key]) for key in elements_keys]  # raises KeyError if any key is missing
                        if tweet['lang'] == 'en' and tweet['geo'] is not None:
                            for key, value in items:
                                elements[key].append(value)
                    except (json.JSONDecodeError, KeyError):
                        continue

