Python Forum
How to update values in a pyarrow table?
Thread Rating:
  • 0 Vote(s) - 0 Average
  • 1
  • 2
  • 3
  • 4
  • 5
How to update values in a pyarrow table?
#1
I have a python script that reads in a parquet file using pyarrow. I'm trying to loop through the table to update values in it. If I try this:

for col_name in table2.column_names:
    if col_name in my_columns:
        print('updating values in column '  + col_name)
        
        col_data = pa.Table.column(table2, col_name)
        
        row_ct = 1
        for i in col_data:
            pa.Table.column(table2, col_name)[row_ct] = change_str(pa.StringScalar.as_py(i))
            row_ct += 1
I get this error:

Error:
TypeError: 'pyarrow.lib.ChunkedArray' object does not support item assignment
How can I update these values?

I tried using pandas, but it couldn't handle null values in the original table, and it also incorrectly translated the datatypes of the columns in the original table. Does pyarrow have a native way to edit the data?

Python 3.7.3
Debian 10
Reply
#2
I was able to get it working using these references:

http://arrow.apache.org/docs/python/gene...Table.html

http://arrow.apache.org/docs/python/gene...Field.html

https://github.com/apache/arrow/blob/mas...t_table.py

Basically it loops through the original table and creates new columns (pa.array) with the adjusted text that it appends to a new table. It's probably not the best way to do it, but it worked. Most importantly, it let me preserve the nulls and specify the data type of each column.

import sys, getopt
import random
import re
import math

import pyarrow.parquet as pq
import pyarrow.csv as pcsv
import numpy as np
import pandas as pd
import pyarrow as pa
import os.path

<a lot of other code here>

changed_ct = 0
all_cols_ct = 0
table3 = pa.Table.from_arrays([pa.array(range(0,863))], names=('0')) # CREATE TEMP COLUMN!!
#print(table3)
#exit()
changed_column_list = []
for col_name in table2.column_names:
    print('processing column: ' + col_name)
    new_list = []
    col_data = pa.Table.column(table2, col_name)
    col_data_type = table2.schema.field(col_name).type
    printed_changed_flag = False
    for i in col_data:
        # GET STRING REPRESENTATION OF THE COLUMN DATA
        if(col_data_type == 'string'):
            col_str = pa.StringScalar.as_py(i)
        elif(col_data_type == 'int32'):
            col_str = pa.Int32Scalar.as_py(i)
        elif(col_data_type == 'int64'):
            col_str = pa.Int64Scalar.as_py(i)
            
            
        if col_name in change_columns:
            if printed_changed_flag == False:
                print('changing values in column '  + col_name)
                changed_column_list.append(col_name)
                changed_ct += 1
                printed_changed_flag = True

            new_list.append(change_str(col_str))
        
        else:
            new_list.append(col_str)
        
    #set data type for the column
    if(col_data_type == 'string'):
        col_data_type = pa.string()
    elif(col_data_type == 'int32'):
        col_data_type = pa.int32()
    elif(col_data_type == 'int64'):
        col_data_type = pa.int64()
        
    arr = pa.array(new_list, type=col_data_type)
        
    new_field = pa.field(col_name, col_data_type)
    
    table3 = pa.Table.append_column(table3, new_field, arr)
        
    all_cols_ct += 1
    
#for i in table3:
#   print(i)

table3 = pa.Table.remove_column(table3, 0) # REMOVE TEMP COLUMN!!
#print(table2)
#print('-------------------')
#print(table3)
#exit()

print('changed ' + str(changed_ct) + ' columns:')
print(*changed_column_list, sep='\n')

# WRITE NEW PARQUET FILE
pa.parquet.write_table(table3, out_file)
Reply


Possibly Related Threads…
Thread Author Replies Views Last Post
Question Using SQLAlchemy, prevent SQLite3 table update by multiple program instances Calab 3 701 Aug-09-2023, 05:51 PM
Last Post: Calab
  pyarrow error when importing pandas sravva 1 884 Jun-06-2023, 05:09 PM
Last Post: snippsat
Photo How to select NULL and blank values from MySQL table into csv python300 9 2,328 Dec-27-2022, 09:43 PM
Last Post: deanhystad
  store all variable values into list and insert to sql_summary table mg24 3 1,096 Sep-28-2022, 09:13 AM
Last Post: Larz60+
  Sum the values in a pandas pivot table specific columns klllmmm 1 4,543 Nov-19-2021, 04:43 PM
Last Post: klllmmm
  UPDATE SQLITE TABLE - Copy a fields content to another field. andrewarles 14 4,245 May-08-2021, 04:58 PM
Last Post: ibreeden
  OSError: Unable to load libjvm when connecting to hdfs with pyarrow 3.0.0 aupres 0 3,098 Mar-22-2021, 10:25 AM
Last Post: aupres
Question Python + Google Sheet | Best way to update specific cells in a single Update()? Vokofe 1 2,626 Dec-16-2020, 05:26 AM
Last Post: Vokofe
  pyarrow throws oserror winerror 193 1 is not a valid win32 application aupres 2 3,722 Oct-21-2020, 01:04 AM
Last Post: aupres
  Unable to Update SQLite Table sambanerjee 5 2,865 Sep-30-2020, 12:21 PM
Last Post: Larz60+

Forum Jump:

User Panel Messages

Announcements
Announcement #1 8/1/2020
Announcement #2 8/2/2020
Announcement #3 8/6/2020