Python Forum
Speeding up Twitter parser
#1
Hello dear forum members,

With your generous help I was able to successfully run the following code:

import os
import json
import pandas as pd
from collections import defaultdict

elements_keys = ['created_at', 'text', 'lang', 'geo']
elements = defaultdict(list)

for dirpath, subdirs, files in os.walk('/PATH'):
    for file in files:
        if file.endswith('.json'):
            # join with dirpath, or files outside the current directory won't open
            with open(os.path.join(dirpath, file), 'r') as input_file:
                for line in input_file:
                    try:
                        tweet = json.loads(line)
                        items = [(key, tweet[key]) for key in elements_keys]  # raises KeyError if any key is missing
                        for key, value in items:
                            elements[key].append(value)
                    except (json.JSONDecodeError, KeyError):
                        continue

df = pd.DataFrame({'created_at': pd.Index(elements['created_at']),
                   'text': pd.Index(elements['text']),
                   'lang': pd.Index(elements['lang']),
                   'geo': pd.Index(elements['geo'])})
Yet, as the volume of data has increased substantially, I am curious whether the processing can be sped up. Specifically, of all the data analyzed I am interested only in those tweets that are (1) in English ('lang' == 'en') and (2) geo-tagged ('geo' is not None, i.e. carries lat/lon coordinates). What is the correct way of adding those two conditions?

Thank you in advance!
#2
So, I managed to run the following version of the code:

import os
import json
import pandas as pd
from collections import defaultdict

elements_keys = ['created_at', 'text', 'lang', 'geo']
elements = defaultdict(list)

for dirpath, subdirs, files in os.walk('/Users/mymac/Documents/Jupyter/Twitter/00/test'):
    for file in files:
        if file.endswith('.json'):
            with open(os.path.join(dirpath, file), 'r') as input_file:
                for line in input_file:
                    try:
                        tweet = json.loads(line)
                        items = [(key, tweet[key]) for key in elements_keys]  # raises KeyError if any key is missing
                        for key, value in items:
                            if tweet['lang'] == 'en':
                                if tweet['geo'] is not None:
                                    elements[key].append(value)
                    except (json.JSONDecodeError, KeyError):
                        continue

df = pd.DataFrame({'created_at': pd.Index(elements['created_at']),
                   'text': pd.Index(elements['text']),
                   'lang': pd.Index(elements['lang']),
                   'geo': pd.Index(elements['geo'])})
I wonder, though, whether this particular placement of the if condition is the one that actually reduces processing time.
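One way to settle this empirically (a sketch of my own, not from the thread) is to time both placements on synthetic data. The sample tweets below are made up; on a real dump the gap depends on what fraction of tweets fail the filter.

```python
import json
import time
from collections import defaultdict

elements_keys = ['created_at', 'text', 'lang', 'geo']

# Synthetic data: mostly-rejected tweets, as in a raw multilingual dump.
lines = [json.dumps({'created_at': 'x', 'text': 't',
                     'lang': 'fr', 'geo': None})] * 50_000

def check_inside(lines):
    # Condition tested once per key, i.e. four times per tweet.
    elements = defaultdict(list)
    for line in lines:
        tweet = json.loads(line)
        items = [(key, tweet[key]) for key in elements_keys]
        for key, value in items:
            if tweet['lang'] == 'en' and tweet['geo'] is not None:
                elements[key].append(value)
    return elements

def check_before(lines):
    # Condition tested once per tweet, before the inner loop.
    elements = defaultdict(list)
    for line in lines:
        tweet = json.loads(line)
        if tweet['lang'] == 'en' and tweet['geo'] is not None:
            for key in elements_keys:
                elements[key].append(tweet[key])
    return elements

for fn in (check_inside, check_before):
    t0 = time.perf_counter()
    fn(lines)
    print(fn.__name__, round(time.perf_counter() - t0, 3), 's')
```

Both functions keep exactly the same tweets, so the timing difference is purely the cost of the redundant checks.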
#3
Heck yeah the thing is running now way faster :)
#4
You are probably dropping the majority of your tweets, which is why processing got so much faster. But putting the if condition inside your inner for loop is not ideal: you check the exact same condition four times per tweet (once for each of the elements_keys). Better to put it just before the for loop, so there is a single check per tweet.

for dirpath, subdirs, files in os.walk('/Users/mymac/Documents/Jupyter/Twitter/00/test'):
    for file in files:
        if file.endswith('.json'):
            with open(os.path.join(dirpath, file), 'r') as input_file:
                for line in input_file:
                    try:
                        tweet = json.loads(line)
                        items = [(key, tweet[key]) for key in elements_keys]  # raises KeyError if any key is missing
                        if tweet['lang'] == 'en' and tweet['geo'] is not None:
                            for key, value in items:
                                elements[key].append(value)
                    except (json.JSONDecodeError, KeyError):
                        continue

