Python Forum

Full Version: doing data treatment on a file import-parsing a variable
You're currently viewing a stripped down version of our content. View the full version with proper formatting.
Pages: 1 2
Hi Dean, I inserted the code into my existing code for reading the file and deleting empty lines. I altered it some because the file is an excel file, and am getting the following error:

TypeError: read_excel() got an unexpected keyword argument 'delimiter'

Here is the whole block of code, some names are changed for confidentiality. I also changed the column names after checking my data.

# Read the Directory file and delete the empty lines
project = "V2_L"
folder = "File exports"
filename = "DIRECTORY.xlsx"
filePath = project + "/" + folder + "/" + filename
print(filePath)

s3 = boto3.client('s3')
obj = s3.get_object(Bucket=bucket, Key=ennovfPath)
data = obj['Body'].read()
directory = pd.read_excel(io.BytesIO(data), delimiter="-",names=["SAMPLEID", "VIS_ISOLATE_NUMBER"])

directory.sort_values(by=["SAMPLEID", "VIS_ISOLATE_NUMBER"], inplace=True)

directory.columns = map(lambda x: str(x).upper(), directory.sort_values.columns)

directory=directory.sort_values.columns[directory['SAMPLEID'].isna()!=True]

directory['SAMPLEID']=directory['SAMPLEID'].astype(str)

#Create a variable SITEID based on the SUBJID (run)

directory['SITEID'] = directory['SUBJID'].str.split('_').str[0]
s3 = boto3.client('s3')
obj = s3.get_object(Bucket=bucket, Key=ennovfPath)
data = obj['Body'].read()
directory = pd.read_excel(io.BytesIO(data), delimiter="-",names=["SAMPLEID", "VIS_ISOLATE_NUMBER"])
What is "boto3"? More importantly, what is data? Is it actually an excel file?

Assuming data is an excel file (or the binary data from an excel file), you need to use the different approach.

Read the excel file/data.
You do not specify a separator for excel files, but you might have to provide a sheet name.
directory = pd.read_excel(io.BytesIO(data), sheet_name="sheet1")
If you don't provide a sheet name, the first sheet is used. You can provide multiple sheet names in a list or tuple ("sheet1", "sheet3").

One of the columns in the spreadsheet contains the sampleID's ("5269-1" for example). You need to find the name of that column. Let's assume it is called "Samples".

Create "SAMPLE ID" and "ISOLATE NUMBER" columns by splitting the Samples columns.
directory[["SAMPLE ID", "ISOLATE NUMBER"]] = directory["Sample"].str.split('-', expand=True)
The new columns are both strings. You might want to convert the ISOLATE NUMBER column to ints.
directory = directory.astype({"ISOLATE NUMBER": int})
And you might want to sort the dataframe to group all the SAMPLE ID's together.
df = df.sort_values((["SAMPLE ID", "ISOLATE NUMBER"])
Let me clarify- this code has worked perfectly before adding the code for parsing. Files have been imported already earlier in the code and specified, that all works just fine. I do not need to repeat this step.
boto3.client is the official AWS SDK for pYTHON USED TO CREAT CONFIGURE AND MANAGER AWS SERVICES.
Is there a way to do this so that two new columns are not made, just the one column, VIS_ISOLATE_NUMBERS?
Sure. You could use apply() which gives you more freedom in what you can do.
dictionary["Number"] = dictionary["ID"].apply(lambda x: x.split("-")[1])
I think it is better to stay with the current solution and either ignore the extra column or drop it.
dictionary.drop(axis=1, labels=["SAMPLE ID"])
or
new_dictionary = dictionary[["names", "of", "columns", "i", "want", "to", "keep"]]
data..........................
Pages: 1 2