Python Forum
Removing characters from columns in data frame
Thread Rating:
  • 3 Vote(s) - 3 Average
  • 1
  • 2
  • 3
  • 4
  • 5
Removing characters from columns in data frame
#1
Hi! So, I came up with the following code to extract Twitter data from JSON and create a data frame with several columns:

# Import libraries
import json
import pandas as pd

# Extract data from JSON
tweets = []
for line in open('00.json'):
  try: 
    tweets.append(json.loads(line))
  except:
    pass

# Tweets often have missing data, therefore use -if- when extracting "keys"
tweet = tweets[0]
ids = [tweet['id_str'] for tweet in tweets if 'id_str' in tweet] 
text = [tweet['text'] for tweet in tweets if 'text' in tweet]
lang = [tweet['lang'] for tweet in tweets if 'lang' in tweet]
geo = [tweet['geo'] for tweet in tweets if 'geo' in tweet]
place = [tweet['place'] for tweet in tweets if 'place' in tweet]

# Create a data frame (using pd.Index may be "incorrect", but I am a noob)
df=pd.DataFrame({'Ids':pd.Index(ids),
                'Text':pd.Index(text),
                'Lang':pd.Index(lang),
                'Geo':pd.Index(geo),
                'Place':pd.Index(place)})

# Convert "object" to "string" type
df.Lang.apply(str)
df.Geo.apply(str)

# Select tweets in English and with geo tag
df[(df['Lang']==('en',)) & (df['Geo'] != (None,))]
So far, everything seems more or less fine. 

Now, the problem. 

For example:

"Ids" value is recorded as "(396154642666913792,)" ;
Or "Geo" value is recorded as "({'coordinates': [41.63349811, -93.65831894], 'type': 'Point'},)"

Question: How do I remove the "extra" characters -- i.e., (), {}, 'coordinates':, etc.?

Thank you in advance for help!
Reply
#2
If your transformations are not too difficults, you can use pandas' str.replace to clean it - it works as re.sub, so

df.Geo.str.replace("[({':]", "")
to remove matches of regular expression.

If its too complicated, you can define some "filtering function" and map it to column with .apply()
Reply
#3
Thank you, zivoni! I'll try this approach and check how it works.
Reply
#4
Unfortunately, I am having difficulties with the proposed solution. So, here is the value that I have:

df.loc[1921,'Geo']
{'coordinates': [39.11890951, -84.48903638], 'type': 'Point'}
And here is what I tried:

df.Geo.str.replace("[({':]", "") ### results in NaN
# and also this:
df['Geo'] = df['Geo'].map(lambda x: x.lstrip('{'coordinates': [').rstrip('], 'type': 'Point'')) ### results in syntax error
Please advise on the correct way to obtain coordinates values only (i.e., part within square brackets).
Reply
#5
Admittedly, I messed up with understanding the data type of the returned object. The following addresses the issue properly:

df2.loc[1921,'Geo']
{'coordinates': [39.11890951, -84.48903638], 'type': 'Point'}
df2.loc[1921,'Geo']['coordinates'][0]
39.11890951
# creating new column with desired value
df2["geo_coord0"] = [x['coordinates'][0] for x in df2['Geo']]
df2["geo_coord1"] = [x['coordinates'][1] for x in df2['Geo']]
Reply
#6
Is your Geo column converted to a string? df.Geo.apply(str) does not work in place and from your output it does not look that you actually converted it (no quotes, but it could be interpreter).

This worked for me:
Output:
In [176]: df = pd.DataFrame({'Geo':["{'coordinates': [39.11890951, -84.48903638], 'type': 'Point'}"]}) In [177]: df.loc[0, "Geo"] Out[177]: "{'coordinates': [39.11890951, -84.48903638], 'type': 'Point'}" In [178]: pattern = r".*\[(-*\d+\.\d+), (-*\d+\.\d+)\].*" In [179]: df.Geo = df.Geo.str.replace(pattern, r"\1 \2") In [180]: df.Geo[0] Out[180]: '39.11890951 -84.48903638'
If you want to extract coordinates as seperate columns, you will need another split and convert to number. Or you can extract coordinates separately:
df['long'] = np.float(df.Geo.str.replace(pattern, r"\1"))
df['lat'] = np.float(df.Geo.str.replace(pattern, r"\2"))
My re knowlodge is rudimentary, so its possible that the pattern I have used isnt the right one ....

EDIT: yes, if your column consists of dict/list, its much better to do it with direct access
Reply
#7
zivoni, thank you for reply. Somehow, everything becomes "bad" once I do df.Geo.apply(str) -- the values become NaN. I'll experiment more with you code though and try to make it work.
Reply
#8
Dear zivoni, again I am seeking your help on the following issue that I face trying to run the code you previously suggested. So, I follow the steps outlined in you previous post, but unexpectedly (i.e., it did not occur before) face an error:

df_filter.loc[17765091,'geo']
"{u'type': u'Point', u'coordinates': [36.26511117, -115.10532525]}" 
#Note, the above outcome contains additional components compared to that in your example

pattern = r".*\[(-*\d+\.\d+), (-*\d+\.\d+)\].*"
df_filter.geo = df_filter.geo.str.replace(pattern, r"\1 \2")
df_filter.geo[17765091]
'36.26511117 -115.10532525' # same as yours
# Then extract coordinates to columns
df_filter['lon'] = np.float(df_filter.geo.str.replace(pattern, r"\1"))
df_filter['lat'] = np.float(df_filter.geo.str.replace(pattern, r"\2"))
Error:
--------------------------------------------------------------------------- TypeError                                 Traceback (most recent call last) <ipython-input-75-7ac8460a87be> in <module>() ----> 1 df_filter['lon'] = np.float(df_filter.geo.str.replace(pattern, r"\1"))      2 df_filter['lat'] = np.float(df_filter.geo.str.replace(pattern, r"\2")) /Users/mymac/anaconda/lib/python3.6/site-packages/pandas/core/series.py in wrapper(self)     91             return converter(self.iloc[0])     92         raise TypeError("cannot convert the series to " ---> 93                         "{0}".format(str(converter)))     94     95     return wrapper TypeError: cannot convert the series to <class 'float'>
I tried multiple ways to convert the values in geo column to "float" using different approaches, but they all failed. Please advise if there is some kind other error.
Reply
#9
(Mar-07-2017, 06:14 PM)kiton Wrote: Admittedly, I messed up with understanding the data type of the returned object. The following addresses the issue properly:

df2.loc[1921,'Geo']
{'coordinates': [39.11890951, -84.48903638], 'type': 'Point'}
df2.loc[1921,'Geo']['coordinates'][0]
39.11890951
# creating new column with desired value
df2["geo_coord0"] = [x['coordinates'][0] for x in df2['Geo']]
df2["geo_coord1"] = [x['coordinates'][1] for x in df2['Geo']]

(Apr-13-2017, 06:01 PM)kiton Wrote:
Error:
--------------------------------------------------------------------------- TypeError                                 Traceback (most recent call last) <ipython-input-75-7ac8460a87be> in <module>() ----> 1 df_filter['lon'] = np.float(df_filter.geo.str.replace(pattern, r"\1"))      2 df_filter['lat'] = np.float(df_filter.geo.str.replace(pattern, r"\2")) /Users/mymac/anaconda/lib/python3.6/site-packages/pandas/core/series.py in wrapper(self)     91             return converter(self.iloc[0])     92         raise TypeError("cannot convert the series to " ---> 93                         "{0}".format(str(converter)))     94     95     return wrapper TypeError: cannot convert the series to <class 'float'>
I tried multiple ways to convert the values in geo column to "float" using different approaches, but they all failed. Please advise if there is some kind other error.

As was stated previously, str methods work only with string column and your column seems to have object (dictionary) dtype. Instead of using str you should use native access for dictionary/list. Try something like:
df['lat'] = df.geo.apply(lambda x:x['coordinates'][0])
df['lon'] = df.geo.apply(lambda x:x['coordinates'][1])
Reply
#10
Yep, I understand that issue with string. It used to work correctly. So, I tried the code as you suggested:

df_filter['lat'] = df_filter.geo.apply(lambda x:x['coordinates'][0])
df_filter['lon'] = df_filter.geo.apply(lambda x:x['coordinates'][1])
But it still gives an error:

Error:
--------------------------------------------------------------------------- TypeError Traceback (most recent call last) <ipython-input-81-2ed1bdd10a03> in <module>() ----> 1 df_filter['lat'] = df_filter.geo.apply(lambda x:x['coordinates'][0]) 2 df_filter['lon'] = df_filter.geo.apply(lambda x:x['coordinates'][1]) /Users/mymac/anaconda/lib/python3.6/site-packages/pandas/core/series.py in apply(self, func, convert_dtype, args, **kwds) 2292 else: 2293 values = self.asobject -> 2294 mapped = lib.map_infer(values, f, convert=convert_dtype) 2295 2296 if len(mapped) and isinstance(mapped[0], Series): pandas/src/inference.pyx in pandas.lib.map_infer (pandas/lib.c:66124)() <ipython-input-81-2ed1bdd10a03> in <lambda>(x) ----> 1 df_filter['lat'] = df_filter.geo.apply(lambda x:x['coordinates'][0]) 2 df_filter['lon'] = df_filter.geo.apply(lambda x:x['coordinates'][1]) TypeError: string indices must be integers
Reply


Possibly Related Threads…
Thread Author Replies Views Last Post
  Grouping in pandas/multi-index data frame Aleqsie 3 666 Jan-06-2024, 03:55 PM
Last Post: deanhystad
  Filtering Data Frame, with another value NewBiee 9 1,396 Aug-21-2023, 10:53 AM
Last Post: NewBiee
  Deleting characters between certain characters stahorse 7 1,126 Jul-03-2023, 12:59 AM
Last Post: Pedroski55
  Exporting data frame to excel dyerlee91 0 1,626 Oct-05-2021, 11:34 AM
Last Post: dyerlee91
  Pandas Data frame column condition check based on length of the value aditi06 1 2,692 Jul-28-2021, 11:08 AM
Last Post: jefsummers
  Adding a new column to a Panda Data Frame rsherry8 2 2,123 Jun-06-2021, 06:49 PM
Last Post: jefsummers
  import columns of data from local csv file CatherineKan 2 3,333 May-10-2021, 05:10 AM
Last Post: ricslato
  pandas.to_datetime: Combine data from 2 columns ju21878436312 1 2,451 Feb-20-2021, 08:25 PM
Last Post: perfringo
  grouped data frame glitter 0 1,597 Feb-02-2021, 11:22 AM
Last Post: glitter
  how to filter data frame dynamically with the columns psahay 0 2,403 Aug-24-2020, 01:10 PM
Last Post: psahay

Forum Jump:

User Panel Messages

Announcements
Announcement #1 8/1/2020
Announcement #2 8/2/2020
Announcement #3 8/6/2020