Read TXT file in Pandas and save to Parquet

zinho · Sep-14-2024, 11:20 PM

Hi.

I would like to import a TXT file, change the column types (object to date, str, float, etc.), and save the data to Parquet.

When I use pandas for this, I get errors about types.

The file is dirty: it has extra header and footer lines that need to be skipped.

The columns are separated by | (pipe), but the content has extra spaces; see column 1 in the example below.

Col-1 | Col-2 | Col-3 | Col-4 | Col-4
B |ES|0000192806|01206820002060|BLABLABLA|0000181882|

Link to the file (the file has 900k):
https://drive.google.com/file/d/175JjBY1...drive_link

Thanks

Pedroski55 · (This post was last modified: Sep-15-2024, 05:05 AM by Pedroski55.)

A useful link for you.

Just tell pandas that the separator is | (the default separator is a comma).

import pandas as pd

path2csv = 'csv/csv_files/pipe_separated.csv'
df = pd.read_csv(path2csv, sep="|")
df

Output:  Col-1   Col-2    Col-3          Col-4      Col-5    Col-6   Col-7 
0     B       ES   192806  1206820002060  BLABLABLA   181882     XYZ
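
Note that in the output above the leading zeros of the ID-like columns are gone (0000192806 became 192806), because pandas inferred integer types. A minimal sketch of one way to avoid that, assuming those fields should stay as text; the sample string below is a hypothetical row modelled on the line in the question, not the real file:

```python
import io
import pandas as pd

# Hypothetical one-row sample modelled on the line in the question.
sample = "Col-1|Col-2|Col-3|Col-4\nB |ES|0000192806|01206820002060\n"

# Default type inference turns the ID-like columns into integers.
inferred = pd.read_csv(io.StringIO(sample), sep="|")

# dtype=str keeps every field as text, exactly as written in the file.
as_text = pd.read_csv(io.StringIO(sample), sep="|", dtype=str)

print(inferred["Col-3"].iloc[0])  # integer, leading zeros lost
print(as_text["Col-3"].iloc[0])   # string, leading zeros kept
```

Reading everything as text first and converting selected columns afterwards is usually the safer order for identifiers of this kind.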

Oh, I just realised, you want to tidy up the column names:

# Rename columns: strip surrounding whitespace from every column name
df.columns = df.columns.str.strip()

print(df["Col-1"])
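
The question also mentions extra header and footer lines and padded values. A rough sketch of how those could be handled in one read, assuming the number of junk lines is known in advance; the layout below (two junk lines on top, one at the bottom) is invented for illustration and is not taken from the real file:

```python
import io
import pandas as pd

# Hypothetical layout: two junk lines on top, one junk line at the bottom.
raw = (
    "REPORT XYZ\n"
    "generated 2024-09-14\n"
    "Col-1 | Col-2 | Col-3\n"
    "B |ES|0000192806\n"
    "C |SP|0000192807\n"
    "TOTAL LINES: 2\n"
)

df = pd.read_csv(
    io.StringIO(raw),
    sep="|",
    skiprows=2,       # lines to drop before the header row
    skipfooter=1,     # lines to drop at the end; requires the python engine
    engine="python",
    dtype=str,        # keep everything as text for now
)

# Strip padding from the column names and from every string value.
df.columns = df.columns.str.strip()
df = df.apply(lambda col: col.str.strip())

print(df)
```

With everything read as text and stripped, type conversion can then be done column by column.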

zinho · Sep-15-2024, 06:14 PM

I solved it.

I needed to convert all columns with the object dtype to specific types.

import pandas as pd
import time
 
# ==============Time execution =================
# record start time
start = time.time()


# Define columns
colunas = [
    "Tipo_Lanc","UF","Fis/Jur","CNPJ","RAZAO_SOCIAL","NF","CHAVE_NFE",
    "DT_Emissão","DT_Fiscal","Data_Lan_amento","PRODUTO","DESCRICAO",
    "NUM_ITEM","Unid","LISTA","NCM","Monit/Liber","EAN","MVA_Original",
    "MVA","CFOP","CST","QUANTIDADE","PF_UNIT","PF_TOTAL","VLR_LIQ_UNIT",
    "VLR_LIQ_ITEM","VL_UNIT_NF","TOTAL_NF","VL_UNIT_LIQ_NF","TOTAL_LIQ_NF",
    "DESC._TOT","DESCONTO","REPASSE","VC","BC","ICMS","BC_N_Escriturado",
    "ICMS_N_Escriturado","ALIQ_ICMS","APROPRIA","BC_ICMS_ST","ICMS_ST",
    "ALIQ_INTERNA","DEB_ICMS","IPI","CAT_ANVISA","TIPO_PRODUTO",
    "TIPO_DESCONTO","%_DESCONTO","PMC","PMC_FCIA_POP","%_REDUTOR_ANVISA",
    "CROSS","INDICADOR_ICMSS","BC_ICMS_ST_REC","ICMS_ST_REC","MCANCER","CBASICA","-no-"]


# Define columns to convert to numeric
numeric_columns = ["MVA_Original","MVA","CFOP","CST","QUANTIDADE","PF_UNIT",
                   "PF_TOTAL","VLR_LIQ_UNIT","VLR_LIQ_ITEM","VL_UNIT_NF",
                   "TOTAL_NF","VL_UNIT_LIQ_NF","TOTAL_LIQ_NF","DESC._TOT",
                   "DESCONTO","REPASSE","VC","BC","ICMS","BC_N_Escriturado",
                   "ICMS_N_Escriturado","ALIQ_ICMS","APROPRIA","BC_ICMS_ST",
                   "ICMS_ST","ALIQ_INTERNA","DEB_ICMS","IPI","%_DESCONTO","PMC",
                   "PMC_FCIA_POP","%_REDUTOR_ANVISA","INDICADOR_ICMSS","BC_ICMS_ST_REC","ICMS_ST_REC"]

# Define columns to convert to string
string_columns = ["Tipo_Lanc", "UF", "Fis/Jur", "CNPJ",
                  "RAZAO_SOCIAL", "NF", "CHAVE_NFE", "PRODUTO",
                  "NUM_ITEM", "Unid", "LISTA", "NCM", "Monit/Liber",
                  "EAN", "CAT_ANVISA", "TIPO_PRODUTO", "TIPO_DESCONTO",
                  "CROSS", "MCANCER", "CBASICA"]


# Define columns to convert to date
date_columns = ["DT_Fiscal", "DT_Emissão","Data_Lan_amento"]

def convert_brazilian_numeric(value):
    """Convert a Brazilian formatted numeric string to a float."""
    if pd.isna(value):
        return pd.NA
    value = str(value).replace('.', '').replace(',', '.')
    try:
        return float(value)
    except ValueError:
        return pd.NA

def read_csv(file_name, chunksize):
    for chunk in pd.read_csv(file_name, encoding='ISO-8859-1', sep='|', on_bad_lines='skip', skiprows=9, chunksize=chunksize, header=None):
        chunk.columns = colunas
        yield chunk

chunk_size = 5000
file_name = "Fat.TXT" #"10Linhas_ms_file.TXT"
master_df = pd.concat(read_csv(file_name, chunksize=chunk_size), ignore_index=True)
master_df.pop("-no-")
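# Strip whitespace from every string value (DataFrame.applymap is deprecated since pandas 2.1; Series.map works across versions)
master_df = master_df.apply(lambda col: col.map(lambda x: x.strip() if isinstance(x, str) else x))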


# Convert specified columns to string
for col in string_columns:
    if col in master_df.columns:
        master_df[col] = master_df[col].astype(str)
 	
# Convert specified columns to numeric
for col in numeric_columns:
    if col in master_df.columns:
        master_df[col] = master_df[col].apply(convert_brazilian_numeric)
        master_df[col] = master_df[col].fillna(0).astype(float)

# Convert specified columns to date using pd.to_datetime directly
for col in date_columns:
    if col in master_df.columns:
        master_df[col] = pd.to_datetime(master_df[col], format="%d/%m/%Y", errors='coerce')


# Verify the DataFrame
print(master_df.head(10))  # Print the first few rows to check the data
print(master_df.dtypes)   # Check the types of columns

# Save to Excel or Parquet
#master_df.to_excel("rl_fiscal.xlsx", sheet_name="teste_fiscal", index=False)
master_df.to_parquet("fat.parquet")


# ==============Time execution =================
# record end time
end = time.time()
#print("_Total of execution...: ", (end-start) * 10**3, "ms")
print("Done...")
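
For large files, calling a Python function once per cell can be slow. The sketch below shows a vectorised alternative to `convert_brazilian_numeric`, under the assumption that the numeric columns only use a dot for thousands and a comma for decimals. The helper name `brazilian_to_float` and the sample values are hypothetical; only the column names come from the script above.

```python
import pandas as pd

# Hypothetical sample values in Brazilian notation (dot = thousands, comma = decimal).
df = pd.DataFrame({
    "PF_UNIT": ["1.234,56", "0,75", "abc", None],
    "DT_Fiscal": ["14/09/2024", "31/02/2024", "01/01/2024", None],
})

def brazilian_to_float(series: pd.Series) -> pd.Series:
    """Convert a whole column of Brazilian-formatted numbers at once."""
    cleaned = (
        series.astype(str)
        .str.replace(".", "", regex=False)   # drop thousands separators
        .str.replace(",", ".", regex=False)  # decimal comma -> decimal point
    )
    # Values that cannot be parsed become NaN instead of raising.
    return pd.to_numeric(cleaned, errors="coerce")

# Same post-processing as the script above: missing values become 0.0.
df["PF_UNIT"] = brazilian_to_float(df["PF_UNIT"]).fillna(0).astype(float)

# Invalid or missing dates become NaT.
df["DT_Fiscal"] = pd.to_datetime(df["DT_Fiscal"], format="%d/%m/%Y", errors="coerce")

print(df)
print(df.dtypes)
```

`pd.read_csv` also accepts `decimal=","` and `thousands="."`, which may remove the need for the manual replacement when the numeric columns are otherwise clean.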
