Python Forum
doing data treatment on a file import-parsing a variable
#1
Hello, I need some code to parse a variable name so that merging its data with another source later will work correctly. I'm a beginner and am not quite sure where to start. I'm thinking of some sort of if/else condition. I have written out how it will probably read but am not sure how to actually execute it. Please advise as soon as possible if anyone can. Thank you in advance for your help!

Data treatment on File Export:

identify if there is a "-" symbol
    if not then add a new variable replicate number = to 1
    if present, then parse the sampleID, new sampleID without the -x and new variable isolate number = x 


File export sampleID starts out looking like this:
694-1
694-2
694-3
694-4
etc...

But need the structure to look like this:
 
SampleID    ISOLATE_NUMBER
694         1
694         2
694         3
694         4
etc...
Reply
#2
You can do that with pandas pretty easily. This example pretends to read the data from a file (using io.StringIO so I don't need to provide a file).
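import io
import pandas as pd

# Stand-in for a real file. Each line starts in column 0 so that no
# stray whitespace ends up in the SampleID field.
data = io.StringIO(
    """694-1
694-2
694-3
694-4
"""
)

# Split each line on "-" and name the two resulting columns
df = pd.read_csv(data, delimiter="-", names=["SampleID", "ISOLATE_NUMBER"])
print(df)

# If you want to write to a CSV
df.to_csv("data.csv", sep=",", index=False)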
Terminal output
Output:
   SampleID  ISOLATE_NUMBER
0       694               1
1       694               2
2       694               3
3       694               4
data.csv
Output:
SampleID,ISOLATE_NUMBER
694,1
694,2
694,3
694,4
Your post mentions "variable". If you have a list of strings that you need to split into a list of lists, you can use Pandas for that too.
import pandas as pd

data = ("693", "694-1", "694-2", "694-3", "694-4")

# Split 694-1 into 694 and 1.  List may be ragged
data = [x.split("-") for x in data]

# Pandas will fill in any missing ISOLATE_NUMBER values with None
df = pd.DataFrame(data, columns=["SampleID", "ISOLATE_NUMBER"])
print(df)
Output:
  SampleID ISOLATE_NUMBER
0      693           None
1      694              1
2      694              2
3      694              3
4      694              4
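The first post asked for IDs without a dash to get an isolate number of 1 rather than None. A minimal sketch of one way to do that follows. The helper name `split_sample_ids` is made up for illustration, and splitting on the last dash is an assumption about how the IDs are built.

```python
import pandas as pd


def split_sample_ids(sample_ids):
    """Split 'A-B' strings into SampleID and ISOLATE_NUMBER columns.

    Hypothetical helper: IDs with no dash get ISOLATE_NUMBER 1.
    """
    s = pd.Series(list(sample_ids), dtype="object")
    # Split once, on the last dash. reindex guarantees that column 1
    # exists even when none of the IDs contains a dash.
    parts = s.str.rsplit("-", n=1, expand=True).reindex(columns=[0, 1])
    # Missing isolate numbers become NaN, then 1
    isolate = pd.to_numeric(parts[1]).fillna(1).astype(int)
    return pd.DataFrame({"SampleID": parts[0], "ISOLATE_NUMBER": isolate})


df = split_sample_ids(["693", "694-1", "694-2", "CA1-3"])
print(df)
```

The SampleID column stays as text here, so IDs such as "CA1" and "693" can share a column.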
Reply
#3
Hi Dean, thanks for the reply. Do you think there is a way to do it where I don't have to list the individual sampleIDs, so it just takes any sampleID with a '-' and separates the number after it into the new column, "ISOLATE_NUMBER"? Thanks again!
------------------------------------------------------------------------------------------------------------------------
(Mar-29-2023, 05:16 PM)deanhystad Wrote: You can do that with pandas pretty easily. This example pretends to read the data from a file (using io.StringIO so I don't need to provide a file).
Reply
#4
(Mar-29-2023, 05:16 PM)deanhystad Wrote: You can do that with pandas pretty easily. This example pretends to read the data from a file (using io.StringIO so I don't need to provide a file).
I do not understand your question. Please provide more information. An example of what you want to do is best.

When replying to a post, please trim the quoted text or don't include a quote at all. There is no reason to include my entire post in your reply. Imagine what it would look like if I parroted back your message in my reply, then you did the same, back and forth until each post is several screens long. Only include a quote when one is needed.
Reply
#5
Replying automatically includes the whole thing. I thought I should leave it that way per some protocol; otherwise, why on Earth would it be designed that way? As is evident, I have not posted or replied on this or any forum before.

I don't know how to make myself more clear on this. I don't want to list all the numbers to be parsed, because that can be a long list and a waste of time. I want code that removes the dash if there is one and puts the number after the dash into the new column, ISOLATE_NUMBER.

Ex: goes from-
sampleID
694-1
694-2
etc.

to-

sampleID ISOLATE_NUMBER
694 1
694 2
etc.
Reply
#6
(Mar-29-2023, 07:05 PM)EmBeck87 Wrote: Replying automatically includes the whole thing, I thought I should leave it that way per some protocol, otherwise why on Earth would it be designed that way.... as is evident I've not posted or replied on this or any forum here before.
Use the "New Reply" at the bottom of the page to create a new post. Use "Reply" if you want to include content from another post in your reply. As I did here.

I am still fuzzy about what you mean by "Not listing all the numbers to be parsed". I provided two examples that take a list of A-B strings, one from a file (kind of) and one from a list of strings. Each splits the A-B strings into A and B parts and uses those to make a DataFrame (a table-like thing) with columns named "SampleID" and "ISOLATE_NUMBER". If that is not what you want, tell me how it is wrong.

EDIT: I just read through the posts again and am confused by this:
Quote:Do you think there is a way to do it where I don't have to list the individual sampleIDs,
I don't see where you ever list the individual sampleIDs. In my example where I passed in the sampleID strings as a list of "A-B" strings, I added an "A" string ("693") to demonstrate that Pandas can handle "ragged" data.
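For reference, the splitting itself does not depend on which IDs exist. A rough sketch in plain Python, where `split_sample_id` is an illustrative name and the default of 1 follows the rule in the first post:

```python
def split_sample_id(text, default_isolate=1):
    """Return (sample_id, isolate_number) for a string such as '694-1'.

    Hypothetical helper. It splits on the last dash, which is an
    assumption about how the IDs are built.
    """
    text = text.strip()
    sample_id, dash, isolate = text.rpartition("-")
    if not dash:
        # No dash at all: the whole string is the ID
        return text, default_isolate
    # int() raises ValueError if the part after the dash is not a number
    return sample_id, int(isolate)


# The IDs come from data, not from the function itself
rows = [split_sample_id(s) for s in ["5269-1", "CA1-2", "693"]]
print(rows)
```

The same function works on any string handed to it, whether it came from a file or a list.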
Reply
#7
694-1, 694-2, 694-3, 694-4, is what I meant by listing numbers
Reply
#8
Output:
694-1, 694-2, 694-3, 694-4, is what I meant by listing numbers
Exactly how is that supposed to provide clarity?

First off, where are these "numbers"? Are they in a file that your program reads, or are they generated by your program?

If they are read from a file, what is the format of the file?

If they are generated by your program, what is their type?

Remember that nobody but you has any context for understanding your question. To help others understand you'll have to provide a level of detail that you might find excessive. One of the best ways to provide context is to post code. Even if the code doesn't work quite right, it usually conveys what it is you are trying to accomplish. So far, the only information we have are examples of pre and post processed values. That is great, but when I provided code that produces those results you implied that the code left something to be desired.

Help us help you.
Reply
#9
There are several more like the ones below. Will the code you suggested parse sampleIDs like these without them being included in the code? I'm trying to get the code to read any sampleID with a '-' and parse the number after the dash, putting it into a new column in the dataframe, "ISOLATE_NUMBER".
5269-1
5530-1
693-1
7198-1
11407-1
12031-1
14239-1
17377-1
17438-1
CA1-1
5269-2
5530-2
693-2
7198-2
11407-2
12031-2
14239-2
17377-2
17438-2
CA1-2
5269-3
5530-3
693-3
7198-3
11407-3
12031-3
14239-3
17377-3
17438-3
CA1-3
Reply
#10
I see the confusion now (maybe). pandas.read_csv() reads a CSV file or file-like object, so the sampleIDs never need to appear in the code. Reading from a real file could be written like this:
import pandas as pd

df = pd.read_csv("test.txt", delimiter="-", names=["sampleID", "ISOLATE_NUMBER"])
print(df)
I copied some values from your last post and pasted into a file named "test.txt"

test.txt
Output:
5269-1
5530-1
693-1
7198-1
11407-1
When I run the program it prints this:
Output:
   sampleID  ISOLATE_NUMBER
0      5269               1
1      5530               1
2       693               1
3      7198               1
4     11407               1
After you make the DataFrame you can process the data. I copied all the values from your last post into my test.txt file, read the file, and sorted the rows first by sampleID and then by ISOLATE_NUMBER.
import pandas as pd

df = pd.read_csv("test.txt", delimiter="-", names=["sampleID", "ISOLATE_NUMBER"])
df.sort_values(by=["sampleID", "ISOLATE_NUMBER"], inplace=True)
print(df)
Output:
   sampleID  ISOLATE_NUMBER
4     11407               1
14    11407               2
24    11407               3
5     12031               1
15    12031               2
25    12031               3
6     14239               1
16    14239               2
26    14239               3
7     17377               1
17    17377               2
27    17377               3
8     17438               1
18    17438               2
28    17438               3
0      5269               1
10     5269               2
20     5269               3
1      5530               1
11     5530               2
21     5530               3
2       693               1
12      693               2
22      693               3
3      7198               1
13     7198               2
23     7198               3
9       CA1               1
19      CA1               2
29      CA1               3
The numbers to the left are the row index numbers from the unsorted rows.

You may have noticed that the sampleIDs don't appear to be sorted numerically. That is because the sampleIDs are not all numbers. Your data has sampleIDs like "CA1-1". This forces pandas.read_csv() to read that column as strings instead of integers, so the sampleIDs end up being sorted alphabetically.
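If numeric order matters, one possible approach is a sort key that compares all-digit IDs as numbers and puts the rest after them. This is only a sketch in plain Python. The name `sample_sort_key` and the choice to put IDs like "CA1" last are assumptions, not anything your data requires.

```python
def sample_sort_key(row):
    """Sort key for (sample_id, isolate_number) pairs.

    Hypothetical helper: all-digit IDs sort numerically and come first,
    other IDs (such as 'CA1') sort alphabetically after them.
    """
    sample_id, isolate = row
    if sample_id.isdigit():
        return (0, int(sample_id), "", isolate)
    return (1, 0, sample_id, isolate)


rows = [("5269", 1), ("CA1", 1), ("693", 2), ("11407", 1), ("693", 1)]
ordered = sorted(rows, key=sample_sort_key)
print(ordered)
```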

Does that do a better job answering your question?
Reply

