Excel - Python - Printable Version +- Python Forum (https://python-forum.io) +-- Forum: Python Coding (https://python-forum.io/forum-7.html) +--- Forum: General Coding Help (https://python-forum.io/forum-8.html) +--- Thread: Excel - Python (/thread-40229.html) |
Excel - Python - bckdw3tci6 - Jun-24-2023 Greetings all, I am one of those trying to learn Python on my own. I have pretty extensive excel function and I am trying to convert a lot of unstructured files into structured ones for downstream analysis. Python has been working well to do this, but I am having trouble producing the results of calculations. Specifically around generating mean values with standard deviations associated with cells containing a target string of letters. Defining the Problem: I have a single compiled excel file with numerous sheets. One sheet contains multiple columns with data values and very long sample name designations. Within the sample name designations column, there is always a 5 character alphanumeric string within a >20 character sample name designation. There are 8 unique 5 character alphanumeric strings that I would like obtain the average values from experiments for each unique alphanumeric string. A mock example is below (you can see the unique 5 character string in each row[std01, std02, etc]): Name Data Value 1 this_sample_name_is_really_long_and_inconvenient_std01_date_name_information 300 this_sample_name_is_really_long_and_inconvenient_std02_date_name_information 600 this_sample_name_is_really_long_and_inconvenient_std03_date_name_information 1200 this_sample_name_is_really_long_and_inconvenient_std04_date_name_information 2400 this_sample_name_is_really_long_and_inconvenient_std05_date_name_information 4800 this_sample_name_is_really_long_and_inconvenient_std06_date_name_information 9600 this_sample_name_is_really_long_and_inconvenient_std07_date_name_information 19200 this_sample_name_is_really_long_and_inconvenient_std08_date_name_information 38400 this_sample_name_is_really_long_and_inconvenient_std01_date_name_information_A 302 this_sample_name_is_really_long_and_inconvenient_std02_date_name_information_B 602 this_sample_name_is_really_long_and_inconvenient_std03_date_name_information_C 1202 this_sample_name_is_really_long_and_inconvenient_std04_date_name_information_D 2402 this_sample_name_is_really_long_and_inconvenient_std05_date_name_information_E 4802 this_sample_name_is_really_long_and_inconvenient_std06_date_name_information_F 9602 this_sample_name_is_really_long_and_inconvenient_std07_date_name_information_G 19202 this_sample_name_is_really_long_and_inconvenient_std08_date_name_information_H 38402 this_sample_name_is_really_long_and_inconvenient_std01_date_name_information_A_I 315 this_sample_name_is_really_long_and_inconvenient_std02_date_name_information_B_J 615 this_sample_name_is_really_long_and_inconvenient_std03_date_name_information_C_K 1215 this_sample_name_is_really_long_and_inconvenient_std04_date_name_information_D_L 2415 this_sample_name_is_really_long_and_inconvenient_std05_date_name_information_E_M 4815 this_sample_name_is_really_long_and_inconvenient_std06_date_name_information_F_N 9615 this_sample_name_is_really_long_and_inconvenient_std07_date_name_information_G_O 19215 this_sample_name_is_really_long_and_inconvenient_std08_date_name_information_H_P 38415 Objective: I would like to obtain the average and standard deviation of Data Value 1 from all std01 through std08 experiments. I would also like to deposit the results in a specific worksheet within the same excel file along with specifying the cell locations to generate a results table. The results table would show only the unique 5 character name for each sample along with the mean and stdev values on a row. Approaches to This Point: I have tried several unsuccessful approaches so far. I have tried using target_letters (1-7 for each unique 5 character string; std01, stdo02, etc) along with filtered_df (also 1-7) and average_value (also 1-7) to obtain the data. This was followed by specifying where to report the results using result_cell (also 1-7) and sheet[result_cell].value = average_value for each std01-08. Using this approach is quite tedious and generates ~ 100 lines of code. Below are the relevant lines of it: # Specify the text to filter and calculate the average for target_letters = re.findall(r"[std01]", column_name) target_letters1 = re.findall(r"[std02]", column_name) target_letters2 = re.findall(r"[std03]", column_name) target_letters3 = re.findall(r"[std04]", column_name) target_letters4 = re.findall(r"[std05]", column_name) target_letters5 = re.findall(r"[std06]", column_name) target_letters6 = re.findall(r"[std07]", column_name) target_letters7 = re.findall(r"[std08]", column_name) # Filter the DataFrame based on the specific text filtered_df = df[df[excel_file].contains(target_letters)] filtered_df1 = df[df["Name"].str.contains(target_letters1)] filtered_df2 = df[df["Name"].str.contains(target_letters2)] filtered_df3 = df[df["Name"].str.contains(target_letters3)] filtered_df4 = df[df["Name"].str.contains(target_letters4)] filtered_df5 = df[df["Name"].str.contains(target_letters5)] filtered_df6 = df[df["Name"].str.contains(target_letters6)] filtered_df7 = df[df["Name"].str.contains(target_letters7)] # Calculate the average of the filtered values average_value = filtered_df["Area"].mean() average_value1 = filtered_df1["Area"].mean() average_value2 = filtered_df2["Area"].mean() average_value3 = filtered_df3["Area"].mean() average_value4 = filtered_df4["Area"].mean() average_value5 = filtered_df5["Area"].mean() average_value6 = filtered_df6["Area"].mean() average_value7 = filtered_df7["Area"].mean() # Open the workbook wb = load_workbook(excel_file) # Get the specific sheet in the workbook sheet = wb["Standard"] # Specify the cell where the count will be reported result_cell = "H2" # Adjust the cell reference as needed result_cell1 = "H3" # Adjust the cell reference as needed result_cell2 = "H4" # Adjust the cell reference as needed result_cell3 = "H5" # Adjust the cell reference as needed result_cell4 = "H6" # Adjust the cell reference as needed result_cell5 = "H7" # Adjust the cell reference as needed result_cell6 = "H8" # Adjust the cell reference as needed result_cell7 = "H9" # Adjust the cell reference as needed # Write the word count to the specified cell sheet[result_cell].value = average_value sheet[result_cell1].value = average_value1 sheet[result_cell2].value = average_value2 sheet[result_cell3].value = average_value3 sheet[result_cell4].value = average_value4 sheet[result_cell5].value = average_value5 sheet[result_cell6].value = average_value6 sheet[result_cell7].value = average_value7 I am using PyCharm 2023.1.3. Any help is appreciate. RE: Excel - Python - Pedroski55 - Jun-24-2023 Not a mathematician, don't know about standard deviation or pandas. This will get your data as a list of tuples. You can work on it then. import openpyxl import re # open the XL path2XL = '/home/pedro/myPython/openpyxl/xlsx_files/get_data1.xlsx' sourceFile = openpyxl.load_workbook(path2XL) # always good to get the sheet names, you can loop through them if you need to sourceFilesheets = sourceFile.sheetnames mysheet = 'Standard' # you often need these 2 to loop along or down an XL maxRow = sourceFile[mysheet].max_row maxCol = sourceFile[mysheet].max_column # if your column is H, that is column 8 # openpyxl does have a "get column number from letter" function, but H is column 8 # I assume your data starts in row 2 change that if needed # loop through all the rows and collect your data results = [] for rowNum in range(2, maxRow + 1): mystring = sourceFile[mysheet].cell(row=rowNum, column=8).value # split the string on space, the number you want is at the end, separated by a space I think! mydata = mystring.split() # re.search only returns 1 instance, I believe mystd = re.search(r'std\d\d', mydata[0]) mynumber = int(mydata[1]) mytup = ('Row ' + str(rowNum), mystd[0], mynumber) results.append(mytup) for data in results: print(data) print('Gotcha!') RE: Excel - Python - bckdw3tci6 - Jun-25-2023 I wasn't able to get that to work. If I take a step back and ask it another way - How can I filter the data below to select the numerical data in the Area column that is a part of text within the Name column that contains only the 'nte' letters? I have attempted various wild-card symbols (ie, *, ?). If I can select the data sets I want, then I can determine the mean,stdev functions. Name column Area this_is_a_sentence 500 the_color_purple 650 this_is_also_a_sentence 750 (Jun-24-2023, 10:55 PM)Pedroski55 Wrote: Not a mathematician, don't know about standard deviation or pandas. RE: Excel - Python - Pedroski55 - Jun-25-2023 Oh dear. Works great for me in Idle. I never use PyCharm. I tried it, but it has too many extras for me! Start Idle. Idle comes with Python. Just run every line 1 by 1, (except for the for-loops), you will find the problem. Probably a path problem. I often put all the code I am trying in a function called myApp(), then just run myApp() def myApp(): import openpyxl import re # open the XL path2XL = '/home/pedro/myPython/openpyxl/xlsx_files/get_data1.xlsx' sourceFile = openpyxl.load_workbook(path2XL) # always good to get the sheet names, you can loop through them if you need to sourceFilesheets = sourceFile.sheetnames mysheet = 'Standard' # you often need these 2 to loop along or down an XL maxRow = sourceFile[mysheet].max_row maxCol = sourceFile[mysheet].max_column # if your column is H, that is column 8 # openpyxl does have a "get column number from letter" function, but H is column 8 # I assume your data starts in row 2 change that if needed # loop through all the rows and collect your data # mystring = "this_sample_name_is_really_long_and_inconvenient_std01_date_name_information 300" results = [] for rowNum in range(2, maxRow + 1): mystring = sourceFile[mysheet].cell(row=rowNum, column=8).value # split the string on space, the number you want is at the end, separated by a space I think! mydata = mystring.split() # re.search only returns 1 instance, I believe mystd = re.search(r'std\d\d', mydata[0]) mynumber = int(mydata[1]) mytup = ('Row ' + str(rowNum), mystd[0], mynumber) results.append(mytup) for data in results: print(data) print('Gotcha!')That gives this:
RE: Excel - Python - Pedroski55 - Jun-25-2023 As to your latest question, that is simple mylist = ['this_is_a_sentence 500', 'the_color_purple 650', 'this_is_also_a_sentence 750'] for seq in mylist: if 'nte' in seq: print(seq) # now process this data somehow |