Removing characters from columns in data frame

***zivoni*** · Apr-15-2017, 10:13 AM

Please post the result of repr(df_filter.geo[1234]) (1234 is just an example; use the index of an element with non-empty geo information).

kiton · Apr-15-2017, 02:20 PM

zivoni, please find the information you requested below.

repr(df_filter.geo[1048280])
'"{u\'type\': u\'Point\', u\'coordinates\': [43.18040111, -79.02213645]}"'

I have tried several ways to convert the string geo column in the data frame to an integer (and to a float, too), but they all failed. I can provide the code and the errors, if necessary.

***zivoni*** · Apr-15-2017, 06:23 PM

For me it works quite well.

Output:
In [4]: a = '"{u\'type\': u\'Point\', u\'coordinates\': [43.18040111, -79.02213645]}"'

In [5]: df = pd.DataFrame({"geo":[eval(a)]})

In [6]: df
Out[6]: 
                                                 geo
0  {u'type': u'Point', u'coordinates': [43.180401...

In [7]: repr(df.geo[0])
Out[7]: '"{u\'type\': u\'Point\', u\'coordinates\': [43.18040111, -79.02213645]}"'

In [8]: pattern = r".*\[(-*\d+\.\d+), (-*\d+\.\d+)\].*"

In [9]: df['long'] = np.float(df.geo.str.replace(pattern, r"\1"))

In [10]: df
Out[10]: 
                                                 geo       long
0  {u'type': u'Point', u'coordinates': [43.180401...  43.180401

It looks like your string is actually a representation of the geo dictionary; it would be easier to extract long/lat from the dictionary before converting it to a string.
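If the dictionary has already been turned into a string, one way to get it back is to parse the text instead of pattern-matching it. The following is only a minimal sketch, assuming Python 3 and a value shaped like the one posted above (including the extra pair of double quotes); `parse_geo` is a hypothetical helper name, not something from the thread.

```python
import ast

def parse_geo(raw):
    """Return (first, second) coordinate from a stringified geo dict,
    or (None, None) when the value is missing or cannot be parsed."""
    if not isinstance(raw, str) or not raw.strip():
        return (None, None)
    try:
        value = ast.literal_eval(raw)
        # The value posted in this thread is wrapped in an extra pair of
        # double quotes, so the first pass can yield another string.
        if isinstance(value, str):
            value = ast.literal_eval(value)
        first, second = value["coordinates"]
    except (ValueError, SyntaxError, KeyError, TypeError):
        return (None, None)
    return (float(first), float(second))

raw = '"{u\'type\': u\'Point\', u\'coordinates\': [43.18040111, -79.02213645]}"'
print(parse_geo(raw))
```

`ast.literal_eval` only accepts Python literals, so unlike `eval` it will not execute arbitrary code found in the data.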

kiton · Apr-16-2017, 03:23 PM

zivoni, thank you for the feedback. Indeed, your example code works fine. However, it still gives an error for my data. I'll keep working on it.

kiton · (This post was last modified: Apr-17-2017, 05:01 PM by kiton.)

Long story short, I managed to deal with the problem using Stata [yeah, sorry, I am more of a statistics guy than a programmer :)] -- I stripped off the leading and trailing stuff and extracted lon/lat as integers to new columns.

While the task is solved for now, I will have more similar tasks in the future. So, I am still curious what is wrong with the extracted -geo- coordinates. Initially (during test runs), I used the following code, which had no such issues:

# DF Parser

import os
import json
import pandas as pd
import numpy as np
from collections import defaultdict

elements_keys = ['created_at', 'text', 'lang', 'geo']
elements = defaultdict(list)

for dirs, subdirs, files in os.walk('/DIR'):
   for file in files:
       if file.endswith('.json'):
           with open(file, 'r') as input_file:
               for line in input_file:
                   try:
                       tweet = json.loads(line)
                       items = [(key, tweet[key]) for key in elements_keys] # should raise error if any key is missing
                       for key, value in items:
                           elements[key].append(value)
                   except:
                       continue

df=pd.DataFrame({'created_at': pd.Index(elements['created_at']),
                'text': pd.Index(elements['text']),
                'lang': pd.Index(elements['lang']),
                'geo': pd.Index(elements['geo'])})

df #then clean it a little bit, then save to CSV

But then for the "big" task, I adapted the following code suggested by zivoni (its performance was substantially better).

# CSV Parser

import csv, json, os

elements_keys = ['created_at', 'text', 'lang', 'geo']
with open('file.csv', 'w') as csvfile:
   writer = csv.writer(csvfile)
   writer.writerow(elements_keys)   # header
    
   for dirs, subdirs, files in os.walk('/DIR'):
       for file in files:
           if file.endswith('.json'):
               with open(file, 'r') as input_file:
                   for line in input_file:
                       try:
                           tweet = json.loads(line)
                           row = [tweet[key] for key in elements_keys]
                           writer.writerow(row)     # writing tweet into file
                       except:
                           continue

Although the code worked perfectly fine, the extracted -geo- coordinates gave me the problem that we were recently discussing. Does anybody see anything specific in the code above that could possibly lead to that problem (i.e., any possible discrepancy in the extracted coordinates)?

***zivoni*** · Apr-17-2017, 07:01 PM

Thinking about it, there could be a problem with np.float applied to a pandas Series; maybe the following code would work:

df['long'] = df.geo.str.replace(pattern, r"\1").astype('float')
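A possible variant for current library versions, offered as a sketch rather than a drop-in fix: `np.float` was removed in NumPy 1.24, and since pandas 2.0 `Series.str.replace` treats the pattern as a literal string unless `regex=True` is passed. `Series.str.extract` avoids both issues and pulls the two numbers in one pass. The sample frame below is made up for illustration, and the column names `lat` and `long` simply follow the order used for the CSV header in this thread.

```python
import pandas as pd

# Made-up sample: one stringified geo dict and one missing value.
df = pd.DataFrame({
    "geo": [
        "{u'type': u'Point', u'coordinates': [43.18040111, -79.02213645]}",
        None,
    ]
})

# Named groups become the column names of the extracted frame.
pattern = r"\[(?P<lat>-?\d+\.\d+), (?P<long>-?\d+\.\d+)\]"

# Rows that do not match (or are missing) come back as NaN.
coords = df["geo"].str.extract(pattern, expand=True).astype(float)
df = df.join(coords)
print(df[["lat", "long"]])
```

Using `-?` instead of `-*` restricts the match to at most one minus sign, which is all a coordinate should carry.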

Regardless, it's much better to extract the coordinates directly from the JSON than to convert it to a string and then extract them from the string. I have added lines 15-18 and modified lines 3, 6, 19 (and 11), so it should create a .csv with lat and long fields instead of geo.

import csv, json, os

elements_keys = ['created_at', 'text', 'lang']
with open('file.csv', 'w') as csvfile:
    writer = csv.writer(csvfile)
    writer.writerow(elements_keys + ['lat', 'long'])  # header

    for dirs, subdirs, files in os.walk('/DIR'):
        for file in files:
            if file.endswith('.json'):
                with open(os.path.join(dirs,file), 'r') as input_file:  # added os.path.join
                    for line in input_file:
                        try:
                            tweet = json.loads(line)
                            try:
                                coords = tweet['geo']['coordinates']  # trying to get lat, long
                            except (KeyError, TypeError):  # geo missing or null
                                coords = [None, None]
                            row = [tweet[key] for key in elements_keys] + coords
                            writer.writerow(row)  # writing tweet into file
                        except:
                            continue

This code is getting uglier; as you are processing tons of small files, it would probably be better to split it into a few small functions, say one to traverse the directories and a second to convert one file to .csv. That would create tons of small .csv files, which can be concatenated after the script finishes (in the shell, if you create them without a header).
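The split suggested above could look roughly like the sketch below. It assumes Python 3 and one JSON object per line, as in the earlier code; the function names (`iter_json_files`, `tweet_to_row`, `convert_file`, `convert_tree`) and the choice to write each .csv next to its source file are hypothetical, picked only for illustration.

```python
import csv
import json
import os
import tempfile

FIELDS = ['created_at', 'text', 'lang']

def iter_json_files(root):
    """Yield the full path of every .json file below root."""
    for dirpath, _subdirs, files in os.walk(root):
        for name in files:
            if name.endswith('.json'):
                yield os.path.join(dirpath, name)

def tweet_to_row(tweet):
    """Build one CSV row; both coordinates are None when geo is missing."""
    geo = tweet.get('geo') or {}
    lat, lon = geo.get('coordinates') or (None, None)
    return [tweet[key] for key in FIELDS] + [lat, lon]

def convert_file(json_path, csv_path):
    """Write one headerless CSV for one JSON-lines file; return rows written."""
    written = 0
    with open(json_path, 'r', encoding='utf-8') as src, \
         open(csv_path, 'w', newline='', encoding='utf-8') as dst:
        writer = csv.writer(dst)
        for line in src:
            try:
                row = tweet_to_row(json.loads(line))
            except (ValueError, KeyError, TypeError, AttributeError):
                continue  # skip malformed lines and tweets missing a field
            writer.writerow(row)
            written += 1
    return written

def convert_tree(root):
    """Convert every .json file below root; return the total row count."""
    total = 0
    for json_path in iter_json_files(root):
        total += convert_file(json_path, json_path[:-len('.json')] + '.csv')
    return total

# Small demonstration on a temporary directory with made-up tweets.
with tempfile.TemporaryDirectory() as root:
    os.mkdir(os.path.join(root, 'sub'))
    with open(os.path.join(root, 'sub', 'a.json'), 'w', encoding='utf-8') as f:
        f.write(json.dumps({'created_at': 't1', 'text': 'hi', 'lang': 'en',
                            'geo': {'type': 'Point',
                                    'coordinates': [43.18040111, -79.02213645]}}) + '\n')
        f.write(json.dumps({'created_at': 't2', 'text': 'yo', 'lang': 'en',
                            'geo': None}) + '\n')
        f.write('not json\n')
    total = convert_tree(root)
    with open(os.path.join(root, 'sub', 'a.csv'), newline='', encoding='utf-8') as f:
        rows = list(csv.reader(f))

print(total, rows)
```

Catching specific exceptions instead of a bare `except` keeps the "skip bad lines" behaviour while still letting unrelated problems, such as an interrupted run, surface.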
