Hi! So, I came up with the following code to extract Twitter data from JSON and create a data frame with several columns:
    # Import libraries
    import json
    import pandas as pd

    # Extract data from JSON
    tweets = []
    for line in open('00.json'):
        try:
            tweets.append(json.loads(line))
        except:
            pass

    # Tweets often have missing data, therefore use -if- when extracting "keys"
    tweet = tweets[0]
    ids = [tweet['id_str'] for tweet in tweets if 'id_str' in tweet]
    text = [tweet['text'] for tweet in tweets if 'text' in tweet]
    lang = [tweet['lang'] for tweet in tweets if 'lang' in tweet]
    geo = [tweet['geo'] for tweet in tweets if 'geo' in tweet]
    place = [tweet['place'] for tweet in tweets if 'place' in tweet]

    # Create a data frame (using pd.Index may be "incorrect", but I am a noob)
    df = pd.DataFrame({'Ids': pd.Index(ids),
                       'Text': pd.Index(text),
                       'Lang': pd.Index(lang),
                       'Geo': pd.Index(geo),
                       'Place': pd.Index(place)})

    # Convert "object" to "string" type
    df.Lang.apply(str)
    df.Geo.apply(str)

    # Select tweets in English and with geo tag
    df[(df['Lang'] == ('en',)) & (df['Geo'] != (None,))]

So far, everything seems more or less fine.
Now, the problem: the values in the data frame come out wrapped in extra characters.
For example:
the "Ids" value is recorded as "(396154642666913792,)";
the "Geo" value is recorded as "({'coordinates': [41.63349811, -93.65831894], 'type': 'Point'},)".
Question: How do I remove the "extra" characters, i.e., (), {}, 'coordinates':, etc., so that only the values themselves remain?
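To make the result I am after concrete, here is a minimal sketch on hand-built values copied from the examples above. It assumes each cell is a one-element tuple (which is what the trailing comma suggests) and that the id is a string, since it comes from 'id_str'. The helper name unwrap is hypothetical, and I am not sure this is the right approach for the whole data frame:

```python
# Hand-built cells copied from the examples above.
# Assumption: each cell is a one-element tuple, and the id is a string.
id_cell = ('396154642666913792',)
geo_cell = ({'coordinates': [41.63349811, -93.65831894], 'type': 'Point'},)


def unwrap(cell):
    """Hypothetical helper: return the single item of a one-element tuple."""
    if isinstance(cell, tuple) and len(cell) == 1:
        return cell[0]
    return cell


tweet_id = unwrap(id_cell)  # plain string, no parentheses
geo = unwrap(geo_cell)      # plain dict, or None when the tweet has no geo tag

# Pull the two coordinate values out of the dict when it is present.
lat, lon = geo['coordinates'] if geo else (None, None)

print(tweet_id)
print(lat, lon)
```

What I would like to end up with is columns holding values like tweet_id, lat and lon directly, instead of the wrapped tuples and dicts.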
Thank you in advance for your help!