![]() |
How to update values in a pyarrow table? - Printable Version +- Python Forum (https://python-forum.io) +-- Forum: Python Coding (https://python-forum.io/forum-7.html) +--- Forum: General Coding Help (https://python-forum.io/forum-8.html) +--- Thread: How to update values in a pyarrow table? (/thread-32127.html) |
How to update values in a pyarrow table? - xraphael75 - Jan-22-2021 I have a python script that reads in a parquet file using pyarrow. I'm trying to loop through the table to update values in it. If I try this: for col_name in table2.column_names: if col_name in my_columns: print('updating values in column ' + col_name) col_data = pa.Table.column(table2, col_name) row_ct = 1 for i in col_data: pa.Table.column(table2, col_name)[row_ct] = change_str(pa.StringScalar.as_py(i)) row_ct += 1I get this error: How can I update these values? I tried using pandas, but it couldn't handle null values in the original table, and it also incorrectly translated the datatypes of the columns in the original table. Does pyarrow have a native way to edit the data? Python 3.7.3 Debian 10 RE: How to update values in a pyarrow table? - xraphael75 - Jan-25-2021 I was able to get it working using these references: http://arrow.apache.org/docs/python/generated/pyarrow.Table.html http://arrow.apache.org/docs/python/generated/pyarrow.Field.html https://github.com/apache/arrow/blob/master/python/pyarrow/tests/test_table.py Basically it loops through the original table and creates new columns (pa.array) with the adjusted text that it appends to a new table. It's probably not the best way to do it, but it worked. Most importantly, it let me preserve the nulls and specify the data type of each column. import sys, getopt import random import re import math import pyarrow.parquet as pq import pyarrow.csv as pcsv import numpy as np import pandas as pd import pyarrow as pa import os.path <a lot of other code here> changed_ct = 0 all_cols_ct = 0 table3 = pa.Table.from_arrays([pa.array(range(0,863))], names=('0')) # CREATE TEMP COLUMN!! #print(table3) #exit() changed_column_list = [] for col_name in table2.column_names: print('processing column: ' + col_name) new_list = [] col_data = pa.Table.column(table2, col_name) col_data_type = table2.schema.field(col_name).type printed_changed_flag = False for i in col_data: # GET STRING REPRESENTATION OF THE COLUMN DATA if(col_data_type == 'string'): col_str = pa.StringScalar.as_py(i) elif(col_data_type == 'int32'): col_str = pa.Int32Scalar.as_py(i) elif(col_data_type == 'int64'): col_str = pa.Int64Scalar.as_py(i) if col_name in change_columns: if printed_changed_flag == False: print('changing values in column ' + col_name) changed_column_list.append(col_name) changed_ct += 1 printed_changed_flag = True new_list.append(change_str(col_str)) else: new_list.append(col_str) #set data type for the column if(col_data_type == 'string'): col_data_type = pa.string() elif(col_data_type == 'int32'): col_data_type = pa.int32() elif(col_data_type == 'int64'): col_data_type = pa.int64() arr = pa.array(new_list, type=col_data_type) new_field = pa.field(col_name, col_data_type) table3 = pa.Table.append_column(table3, new_field, arr) all_cols_ct += 1 #for i in table3: # print(i) table3 = pa.Table.remove_column(table3, 0) # REMOVE TEMP COLUMN!! #print(table2) #print('-------------------') #print(table3) #exit() print('changed ' + str(changed_ct) + ' columns:') print(*changed_column_list, sep='\n') # WRITE NEW PARQUET FILE pa.parquet.write_table(table3, out_file) |