Python Forum
Thread Rating:
  • 0 Vote(s) - 0 Average
  • 1
  • 2
  • 3
  • 4
  • 5
Excel - Python
#1
Greetings all,
I am one of those trying to learn Python on my own. I have pretty extensive excel function and I am trying to convert a lot of unstructured files into structured ones for downstream analysis. Python has been working well to do this, but I am having trouble producing the results of calculations. Specifically around generating mean values with standard deviations associated with cells containing a target string of letters.

Defining the Problem: I have a single compiled excel file with numerous sheets. One sheet contains multiple columns with data values and very long sample name designations. Within the sample name designations column, there is always a 5 character alphanumeric string within a >20 character sample name designation. There are 8 unique 5 character alphanumeric strings that I would like obtain the average values from experiments for each unique alphanumeric string. A mock example is below (you can see the unique 5 character string in each row[std01, std02, etc]):
Name Data Value 1
this_sample_name_is_really_long_and_inconvenient_std01_date_name_information 300
this_sample_name_is_really_long_and_inconvenient_std02_date_name_information 600
this_sample_name_is_really_long_and_inconvenient_std03_date_name_information 1200
this_sample_name_is_really_long_and_inconvenient_std04_date_name_information 2400
this_sample_name_is_really_long_and_inconvenient_std05_date_name_information 4800
this_sample_name_is_really_long_and_inconvenient_std06_date_name_information 9600
this_sample_name_is_really_long_and_inconvenient_std07_date_name_information 19200
this_sample_name_is_really_long_and_inconvenient_std08_date_name_information 38400
this_sample_name_is_really_long_and_inconvenient_std01_date_name_information_A 302
this_sample_name_is_really_long_and_inconvenient_std02_date_name_information_B 602
this_sample_name_is_really_long_and_inconvenient_std03_date_name_information_C 1202
this_sample_name_is_really_long_and_inconvenient_std04_date_name_information_D 2402
this_sample_name_is_really_long_and_inconvenient_std05_date_name_information_E 4802
this_sample_name_is_really_long_and_inconvenient_std06_date_name_information_F 9602
this_sample_name_is_really_long_and_inconvenient_std07_date_name_information_G 19202
this_sample_name_is_really_long_and_inconvenient_std08_date_name_information_H 38402
this_sample_name_is_really_long_and_inconvenient_std01_date_name_information_A_I 315
this_sample_name_is_really_long_and_inconvenient_std02_date_name_information_B_J 615
this_sample_name_is_really_long_and_inconvenient_std03_date_name_information_C_K 1215
this_sample_name_is_really_long_and_inconvenient_std04_date_name_information_D_L 2415
this_sample_name_is_really_long_and_inconvenient_std05_date_name_information_E_M 4815
this_sample_name_is_really_long_and_inconvenient_std06_date_name_information_F_N 9615
this_sample_name_is_really_long_and_inconvenient_std07_date_name_information_G_O 19215
this_sample_name_is_really_long_and_inconvenient_std08_date_name_information_H_P 38415

Objective: I would like to obtain the average and standard deviation of Data Value 1 from all std01 through std08 experiments. I would also like to deposit the results in a specific worksheet within the same excel file along with specifying the cell locations to generate a results table. The results table would show only the unique 5 character name for each sample along with the mean and stdev values on a row.

Approaches to This Point: I have tried several unsuccessful approaches so far. I have tried using target_letters (1-7 for each unique 5 character string; std01, stdo02, etc) along with filtered_df (also 1-7) and average_value (also 1-7) to obtain the data. This was followed by specifying where to report the results using result_cell (also 1-7) and sheet[result_cell].value = average_value for each std01-08. Using this approach is quite tedious and generates ~ 100 lines of code. Below are the relevant lines of it:

# Specify the text to filter and calculate the average for
target_letters = re.findall(r"[std01]", column_name)
target_letters1 = re.findall(r"[std02]", column_name)
target_letters2 = re.findall(r"[std03]", column_name)
target_letters3 = re.findall(r"[std04]", column_name)
target_letters4 = re.findall(r"[std05]", column_name)
target_letters5 = re.findall(r"[std06]", column_name)
target_letters6 = re.findall(r"[std07]", column_name)
target_letters7 = re.findall(r"[std08]", column_name)

# Filter the DataFrame based on the specific text
filtered_df = df[df[excel_file].contains(target_letters)]
filtered_df1 = df[df["Name"].str.contains(target_letters1)]
filtered_df2 = df[df["Name"].str.contains(target_letters2)]
filtered_df3 = df[df["Name"].str.contains(target_letters3)]
filtered_df4 = df[df["Name"].str.contains(target_letters4)]
filtered_df5 = df[df["Name"].str.contains(target_letters5)]
filtered_df6 = df[df["Name"].str.contains(target_letters6)]
filtered_df7 = df[df["Name"].str.contains(target_letters7)]

# Calculate the average of the filtered values
average_value = filtered_df["Area"].mean()
average_value1 = filtered_df1["Area"].mean()
average_value2 = filtered_df2["Area"].mean()
average_value3 = filtered_df3["Area"].mean()
average_value4 = filtered_df4["Area"].mean()
average_value5 = filtered_df5["Area"].mean()
average_value6 = filtered_df6["Area"].mean()
average_value7 = filtered_df7["Area"].mean()

# Open the workbook
wb = load_workbook(excel_file)

# Get the specific sheet in the workbook
sheet = wb["Standard"]

# Specify the cell where the count will be reported
result_cell = "H2" # Adjust the cell reference as needed
result_cell1 = "H3" # Adjust the cell reference as needed
result_cell2 = "H4" # Adjust the cell reference as needed
result_cell3 = "H5" # Adjust the cell reference as needed
result_cell4 = "H6" # Adjust the cell reference as needed
result_cell5 = "H7" # Adjust the cell reference as needed
result_cell6 = "H8" # Adjust the cell reference as needed
result_cell7 = "H9" # Adjust the cell reference as needed

# Write the word count to the specified cell
sheet[result_cell].value = average_value
sheet[result_cell1].value = average_value1
sheet[result_cell2].value = average_value2
sheet[result_cell3].value = average_value3
sheet[result_cell4].value = average_value4
sheet[result_cell5].value = average_value5
sheet[result_cell6].value = average_value6
sheet[result_cell7].value = average_value7

I am using PyCharm 2023.1.3. Any help is appreciate.
Reply
#2
Not a mathematician, don't know about standard deviation or pandas.

This will get your data as a list of tuples. You can work on it then.

import openpyxl
import re

# open the XL
path2XL = '/home/pedro/myPython/openpyxl/xlsx_files/get_data1.xlsx'
sourceFile = openpyxl.load_workbook(path2XL)
# always good to get the sheet names, you can loop through them if you need to
sourceFilesheets = sourceFile.sheetnames
mysheet = 'Standard'
# you often need these 2 to loop along or down an XL
maxRow = sourceFile[mysheet].max_row
maxCol = sourceFile[mysheet].max_column
# if your column is H, that is column 8
# openpyxl does have a "get column number from letter" function, but H is column 8
# I assume your data starts in row 2 change that if needed
# loop through all the rows and collect your data

results = []
for rowNum in range(2, maxRow + 1):
    mystring = sourceFile[mysheet].cell(row=rowNum, column=8).value
    # split the string on space, the number you want is at the end, separated by a space I think!
    mydata = mystring.split()
    # re.search only returns 1 instance, I believe
    mystd = re.search(r'std\d\d', mydata[0])
    mynumber = int(mydata[1])
    mytup = ('Row ' + str(rowNum), mystd[0], mynumber)
    results.append(mytup)

for data in results:
    print(data)

print('Gotcha!')
Reply
#3
I wasn't able to get that to work.

If I take a step back and ask it another way - How can I filter the data below to select the numerical data in the Area column that is a part of text within the Name column that contains only the 'nte' letters? I have attempted various wild-card symbols (ie, *, ?). If I can select the data sets I want, then I can determine the mean,stdev functions.

Name column Area
this_is_a_sentence 500
the_color_purple 650
this_is_also_a_sentence 750


(Jun-24-2023, 10:55 PM)Pedroski55 Wrote: Not a mathematician, don't know about standard deviation or pandas.

This will get your data as a list of tuples. You can work on it then.

import openpyxl
import re

# open the XL
path2XL = '/home/pedro/myPython/openpyxl/xlsx_files/get_data1.xlsx'
sourceFile = openpyxl.load_workbook(path2XL)
# always good to get the sheet names, you can loop through them if you need to
sourceFilesheets = sourceFile.sheetnames
mysheet = 'Standard'
# you often need these 2 to loop along or down an XL
maxRow = sourceFile[mysheet].max_row
maxCol = sourceFile[mysheet].max_column
# if your column is H, that is column 8
# openpyxl does have a "get column number from letter" function, but H is column 8
# I assume your data starts in row 2 change that if needed
# loop through all the rows and collect your data

results = []
for rowNum in range(2, maxRow + 1):
    mystring = sourceFile[mysheet].cell(row=rowNum, column=8).value
    # split the string on space, the number you want is at the end, separated by a space I think!
    mydata = mystring.split()
    # re.search only returns 1 instance, I believe
    mystd = re.search(r'std\d\d', mydata[0])
    mynumber = int(mydata[1])
    mytup = ('Row ' + str(rowNum), mystd[0], mynumber)
    results.append(mytup)

for data in results:
    print(data)

print('Gotcha!')
Reply
#4
Oh dear. Works great for me in Idle. I never use PyCharm. I tried it, but it has too many extras for me!

Start Idle. Idle comes with Python.

Just run every line 1 by 1, (except for the for-loops), you will find the problem. Probably a path problem.

I often put all the code I am trying in a function called myApp(), then just run myApp()

def myApp():
    import openpyxl
    import re

    # open the XL
    path2XL = '/home/pedro/myPython/openpyxl/xlsx_files/get_data1.xlsx'
    sourceFile = openpyxl.load_workbook(path2XL)
    # always good to get the sheet names, you can loop through them if you need to
    sourceFilesheets = sourceFile.sheetnames
    mysheet = 'Standard'
    # you often need these 2 to loop along or down an XL
    maxRow = sourceFile[mysheet].max_row
    maxCol = sourceFile[mysheet].max_column
    # if your column is H, that is column 8
    # openpyxl does have a "get column number from letter" function, but H is column 8
    # I assume your data starts in row 2 change that if needed
    # loop through all the rows and collect your data
    # mystring = "this_sample_name_is_really_long_and_inconvenient_std01_date_name_information 300"
    results = []
    for rowNum in range(2, maxRow + 1):
        mystring = sourceFile[mysheet].cell(row=rowNum, column=8).value
        # split the string on space, the number you want is at the end, separated by a space I think!
        mydata = mystring.split()
        # re.search only returns 1 instance, I believe
        mystd = re.search(r'std\d\d', mydata[0])
        mynumber = int(mydata[1])
        mytup = ('Row ' + str(rowNum), mystd[0], mynumber)
        results.append(mytup)

    for data in results:
        print(data)

    print('Gotcha!')
That gives this:

Output:
myApp() ('Row 2', 'std01', 300) ('Row 3', 'std02', 600) ('Row 4', 'std03', 1200) ('Row 5', 'std04', 2400) ('Row 6', 'std05', 4800) ('Row 7', 'std06', 9600) ('Row 8', 'std07', 19200) ('Row 9', 'std08', 38400) ('Row 10', 'std01', 302) ('Row 11', 'std02', 602) ('Row 12', 'std03', 1202) ('Row 13', 'std04', 2402) ('Row 14', 'std05', 4802) ('Row 15', 'std06', 9602) ('Row 16', 'std07', 19202) ('Row 17', 'std08', 38402) ('Row 18', 'std01', 315) ('Row 19', 'std02', 615) ('Row 20', 'std03', 1215) ('Row 21', 'std04', 2415) ('Row 22', 'std05', 4815) ('Row 23', 'std06', 9615) ('Row 24', 'std07', 19215) ('Row 25', 'std08', 38415) Gotcha!
Reply
#5
As to your latest question, that is simple

mylist = ['this_is_a_sentence 500', 'the_color_purple 650', 'this_is_also_a_sentence 750']

for seq in mylist:
    if 'nte' in seq:
        print(seq)
        # now process this data somehow
Reply


Forum Jump:

User Panel Messages

Announcements
Announcement #1 8/1/2020
Announcement #2 8/2/2020
Announcement #3 8/6/2020