Python Forum
Removing characters from columns in data frame
#11
Please post the result of repr(df_filter.geo[1234]) (1234 is just an example; use the index value of an element with nonempty geo information).
Reply
#12
zivoni, please find the information you requested below.

repr(df_filter.geo[1048280])
'"{u\'type\': u\'Point\', u\'coordinates\': [43.18040111, -79.02213645]}"'
I have tried several ways to convert the string geo column in the data frame to an integer (and to a float, too), but they all failed. I can provide the code and the errors if necessary.
Reply
#13
For me it works quite well.
Output:
In [4]: a = '"{u\'type\': u\'Point\', u\'coordinates\': [43.18040111, -79.02213645]}"'

In [5]: df = pd.DataFrame({"geo":[eval(a)]})

In [6]: df
Out[6]:
                                                 geo
0  {u'type': u'Point', u'coordinates': [43.180401...

In [7]: repr(df.geo[0])
Out[7]: '"{u\'type\': u\'Point\', u\'coordinates\': [43.18040111, -79.02213645]}"'

In [8]: pattern = r".*\[(-*\d+\.\d+), (-*\d+\.\d+)\].*"

In [9]: df['long'] = np.float(df.geo.str.replace(pattern, r"\1"))

In [10]: df
Out[10]:
                                                 geo       long
0  {u'type': u'Point', u'coordinates': [43.180401...  43.180401
It looks like your string is actually the string representation of the geo dictionary. It would be easier to extract long/lat from the dictionary before it gets converted to a string.
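If the column has already been saved as text, the dictionary can usually be recovered from it instead of being picked apart with a regex. Below is a minimal sketch, assuming the cell holds the Python repr of the dict (optionally wrapped in one extra pair of double quotes, as in the repr above) and that the coordinates list is ordered [lat, long]. The helper name parse_geo is made up for illustration. It uses ast.literal_eval, which only evaluates literals and is therefore safer than eval on data read from a file.

```python
import ast


def parse_geo(raw):
    """Return (lat, long) parsed from a stringified geo dict, or (None, None)."""
    if raw is None:
        return (None, None)
    text = str(raw).strip()
    # Drop one layer of surrounding double quotes if present.
    if len(text) >= 2 and text[0] == '"' and text[-1] == '"':
        text = text[1:-1]
    try:
        geo = ast.literal_eval(text)       # rebuilds the dict from its repr
        lat, lon = geo["coordinates"]      # assumed order: [lat, long]
        return (float(lat), float(lon))
    except (ValueError, SyntaxError, TypeError, KeyError):
        # Empty cells, NaN, or anything that is not a geo dict end up here.
        return (None, None)


sample = '"{u\'type\': u\'Point\', u\'coordinates\': [43.18040111, -79.02213645]}"'
print(parse_geo(sample))
```

On a data frame this could be applied element by element, for example with df.geo.apply(parse_geo), and the resulting tuples split into two columns.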
Reply
#14
zivoni, thank you for the feedback. Indeed, your example code works fine. However, it still gives an error on my data. I'll keep working on it.
Reply
#15
Long story short, I managed to deal with the problem using Stata [yeah, sorry, I am more of a statistics guy than a programmer :)] -- I stripped off the leading and trailing stuff and extracted lon/lat as integers into new columns.

While the task is solved for now, I will have more similar tasks in the future. So, I am still curious what is wrong with the extracted -geo- coordinates. Initially (during test runs), I used the following code, which had no such issues:

# DF Parser

import os
import json
import pandas as pd
import numpy as np
from collections import defaultdict

elements_keys = ['created_at', 'text', 'lang', 'geo']
elements = defaultdict(list)

for dirs, subdirs, files in os.walk('/DIR'):
   for file in files:
       if file.endswith('.json'):
           with open(file, 'r') as input_file:
               for line in input_file:
                   try:
                       tweet = json.loads(line)
                       items = [(key, tweet[key]) for key in elements_keys] # should raise error if any key is missing
                       for key, value in items:
                           elements[key].append(value)
                   except:
                       continue

df=pd.DataFrame({'created_at': pd.Index(elements['created_at']),
                'text': pd.Index(elements['text']),
                'lang': pd.Index(elements['lang']),
                'geo': pd.Index(elements['geo'])})

df #then clean it a little bit, then save to CSV
But then, for the "big" task, I adapted the following code suggested by zivoni (its performance was substantially better).

# CSV Parser

import csv, json, os

elements_keys = ['created_at', 'text', 'lang', 'geo']
with open('file.csv', 'w') as csvfile:
   writer = csv.writer(csvfile)
   writer.writerow(elements_keys)   # header
    
   for dirs, subdirs, files in os.walk('/DIR'):
       for file in files:
           if file.endswith('.json'):
               with open(file, 'r') as input_file:
                   for line in input_file:
                       try:
                           tweet = json.loads(line)
                           row = [tweet[key] for key in elements_keys]
                           writer.writerow(row)     # writing tweet into file
                       except:
                           continue
Although the code ran perfectly fine, the extracted -geo- coordinates gave me the problem that we were recently discussing. Does anybody see anything specific in the two scripts above that could lead to that problem (i.e., any possible discrepancy in the extracted coordinates)?
Reply
#16
Now that I think about it, the problem could be np.float applied to a pandas Series; maybe the following code would work:
df['long'] = df.geo.str.replace(pattern, r"\1").astype('float')
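A variant of the same idea, sketched under a few assumptions: the cells look like the repr posted earlier, the list is ordered [lat, long], and a reasonably recent pandas is installed. It uses Series.str.extract to pull both numbers in one pass instead of replacing the whole string, which also sidesteps two version issues: newer pandas treats the str.replace pattern as a literal string unless regex=True is passed, and np.float no longer exists in newer NumPy releases. The column names lat and long are chosen here for illustration.

```python
import pandas as pd

# Two capture groups: the first and second number inside the brackets.
pattern = r"\[(-?\d+\.\d+), (-?\d+\.\d+)\]"

df = pd.DataFrame({"geo": [
    "{u'type': u'Point', u'coordinates': [43.18040111, -79.02213645]}",
    None,  # a tweet without geo information
]})

# str.extract returns one column per capture group (labelled 0 and 1);
# rows that do not match, including missing values, come back as NaN.
coords = df["geo"].str.extract(pattern).astype(float)
df["lat"] = coords[0]    # assumed order: [lat, long]
df["long"] = coords[1]

print(df[["lat", "long"]])
```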
Regardless, it's much better to extract the coordinates directly from the JSON than to convert them to a string and then extract them from the string. I have added lines 15-18 and modified lines 3, 6, 19 (and 11), so it should create a .csv with lat and long fields instead of geo.

import csv, json, os

elements_keys = ['created_at', 'text', 'lang']
with open('file.csv', 'w') as csvfile:
    writer = csv.writer(csvfile)
    writer.writerow(elements_keys + ['lat', 'long'])  # header

    for dirs, subdirs, files in os.walk('/DIR'):
        for file in files:
            if file.endswith('.json'):
                with open(os.path.join(dirs,file), 'r') as input_file:  # added os.path.join
                    for line in input_file:
                        try:
                            tweet = json.loads(line)
                            try:
                                coords = tweet['geo']['coordinates']  # trying to get lat, long
                            except Exception as e:
                                coords = [None, None]
                            row = [tweet[key] for key in elements_keys] + coords
                            writer.writerow(row)  # writing tweet into file
                        except:
                            continue
This code is getting uglier. As you are processing tons of small files, it would probably be better to split it into a few small functions: say, one to traverse the directory tree and a second one to convert a single file to .csv. That would create tons of small .csv files, which can be concatenated after the script finishes (in the shell, if you create them without a header).
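A rough sketch of that split, assuming Python 3, UTF-8 input files, and one JSON tweet per line. All function names are made up for illustration, and each .csv is written next to its .json source without a header so the pieces can be concatenated later.

```python
import csv
import json
import os

ELEMENT_KEYS = ['created_at', 'text', 'lang']


def iter_json_files(root):
    """Yield the full path of every .json file below root."""
    for dirpath, _subdirs, files in os.walk(root):
        for name in files:
            if name.endswith('.json'):
                yield os.path.join(dirpath, name)


def tweet_to_row(tweet):
    """Build one CSV row; raises KeyError if a required key is missing."""
    geo = tweet.get('geo') or {}                      # geo may be absent or null
    coords = geo.get('coordinates') or [None, None]   # assumed order: [lat, long]
    return [tweet[key] for key in ELEMENT_KEYS] + list(coords)


def convert_file(json_path, csv_path):
    """Convert one file to a header-less CSV and return the number of rows written."""
    written = 0
    with open(json_path, 'r', encoding='utf-8') as src, \
            open(csv_path, 'w', newline='', encoding='utf-8') as dst:
        writer = csv.writer(dst)
        for line in src:
            try:
                writer.writerow(tweet_to_row(json.loads(line)))
                written += 1
            except (ValueError, KeyError, TypeError, AttributeError):
                continue  # skip malformed lines and tweets missing a key
    return written


def convert_tree(root):
    """Convert every .json file below root to a .csv file next to it."""
    for json_path in iter_json_files(root):
        convert_file(json_path, json_path[:-len('.json')] + '.csv')
```

Calling convert_tree('/DIR') would then process the whole tree. Catching only specific exceptions, instead of a bare except, keeps genuine problems such as an unreadable file visible.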
Reply

