Python Forum
Thread Rating:
  • 0 Vote(s) - 0 Average
  • 1
  • 2
  • 3
  • 4
  • 5
Aligning data
#1
Hello. I'm so sorry in advance. I am super new to this. If this question isn't allowed, please discard. I have something as follows:
sample001a.com, -12, sample002b.com, -16, sample002c.com, -18
sample002a.com, -14, sample003b.com, -16, sample004c.com, -22
sample003a.com, -16, sample004b.com, -20, sample005c.com, -16
sample004a.com, -18, sample005b.com, -14
sample005a.com, -12

So far, I have had to manually sort the data in excel like this:

sample002a.com, -14, sample002b.com, -16, sample002c.com, -18
sample003a.com, -16, sample003b.com, -16, sample004c.com, -22
sample005a.com, -12, sample005b.com, -14, sample005c.com, -16
sample004a.com, -18, sample004b.com, -20,
sample005a.com, -12

As you can see the data is sorted by sample number, prioritizing matches with the last data set. I'm pretty new to python only having had a brief introduction to numpy and pandas. I was wondering if anyone had any lead on where to start here.
Reply
#2
Numpy is usually for Numbers and not for strings used.
You can do this with Python itself:

from itertools import chain
from operator import itemgetter

# your data
sample_data = [
        ('sample001a.com', -12, 'sample002b.com', -16, 'sample002c.com', -18),
        ('sample002a.com', -14, 'sample003b.com', -16, 'sample004c.com', -22),
        ('sample003a.com', -16, 'sample004b.com', -20, 'sample005c.com', -16),
        ('sample004a.com', -18, 'sample005b.com', -14),
        ('sample005a.com', -12),
    ]

def sort_column(row):
    # ('sample003a.com', -16, 'sample004b.com', -20, 'sample005c.com', -16)
    # this have to be sorted
    # grouping host and value together
    group_by_two = zip(*[iter(row)]*2)
    # you can do this instead with indexing
    sorted_groups = sorted(group_by_two, key=itemgetter(1))
    # group_by_two is the iterator
    # itemgetter get the second element from iterable (1)
    # the list is sorted by this value, the second value
    # now need to flatten the iterable
    # retunring it as imuatble tuple
    return tuple(chain.from_iterable(sorted_groups))


for row in sample_data:
    print(sort_column(row))
How you get the data, is you task. I put an example data set in the top as a nested list.
Almost dead, but too lazy to die: https://sourceserver.info
All humans together. We don't need politicians!
Reply
#3
maybe try the dataframes .groupby option ?
Reply
#4
(Jun-17-2019, 06:25 PM)Ecniv Wrote: maybe try the dataframes .groupby option ?
Hey! Thank you so much for your suggestion. When I tried the .groupby option, I ran into trouble, as python thinks all of the rows are locked together. I'm now trying to find a way to group every two columns and then try .groupby.
Reply
#5
If the rows arent connected, you should be able to import it as duel columns. Group by should work on it then... although you may need to split the sample into other columns to get sample number an letter.
Reply
#6
# you can do this instead with indexing
Hey! Thank you so much for your helpful response. When I copy and run this code, though, I get the following:
('sample002c.com', -18, 'sample002b.com', -16, 'sample001a.com', -12)
('sample004c.com', -22, 'sample003b.com', -16, 'sample002a.com', -14)
('sample004b.com', -20, 'sample003a.com', -16, 'sample005c.com', -16)
('sample004a.com', -18, 'sample005b.com', -14)
('sample005a.com', -12)
Perhaps I have messed something up. It looks a bit jumbled. So far, I've been importing the data via excel as a data frame. I am a little unclear as to what the zip function is doing in this loop. I think I'm supposed to be "zipping" columns instead of rows maybe? So far I have been importing this data as a dataframe and then modifying it to have merged columns, each containing of the name married to the value:
A B C
0 (sample0001a.com, -12) (sample0002b.com, -16.0) (sample0002c.com, -18.0)
1 (sample0002a.com, -14) (sample0002b.com, -16.0) (sample0004c.com, -22.0)
2 (sample0003a.com, -16) (sample0003b.com, -20.0) (sample0005c.com, -16.0)
3 (sample0004a.com, -18) (sample0004b.com, -14.0) (nan, nan)
4 (sample0005a.com, -12) (nan, nan) (nan, nan

The structure of the function you posted seems helpful, but I think my lack of understanding is preventing me from effectively modifying it.

Agh I realized I posted my question wrong at the very beginning.I am so sorry to anyone involved.
I'm trying to align the sample numbers like so:
sample002a.com, -14, sample002b.com, -16, sample002c.com, -18
sample004a.com, -18, sample004b.com, -20 sample004c.com, -22
sample005a.com, -12, sample005b.com, -14, sample005c.com, -16
sample003a.com, -16, sample003b.com, -16
sample001a.com, -12
I have imported the data and split the string of the sample name into an integer and a letter but am still struggling to systematically move cells based on the integer similarity.
Reply
#7
(Jun-17-2019, 03:37 PM)gat Wrote: Hello. I'm so sorry in advance. I am super new to this. If this question isn't allowed, please discard. I have something as follows:
sample001a.com, -12, sample002b.com, -16, sample002c.com, -18
sample002a.com, -14, sample003b.com, -16, sample004c.com, -22
sample003a.com, -16, sample004b.com, -20, sample005c.com, -16
sample004a.com, -18, sample005b.com, -14
sample005a.com, -12

So far, I have had to manually sort the data in excel like this:

sample002a.com, -14, sample002b.com, -16, sample002c.com, -18
sample003a.com, -16, sample003b.com, -16, sample004c.com, -22
sample005a.com, -12, sample005b.com, -14, sample005c.com, -16
sample004a.com, -18, sample004b.com, -20,
sample005a.com, -12

As you can see the data is sorted by sample number, prioritizing matches with the last data set. I'm pretty new to python only having had a brief introduction to numpy and pandas. I was wondering if anyone had any lead on where to start here.


Hi ..
I am also pretty new to python. While is continue to learn from some tutorial. I also browse some new topics in this EXCELLENT forum with a great knowledgeable gurus and learn from them as well.

I was trying to understand your problem statement but for some reason i could not quite get what you are trying to achieve.
May be dumb but some questions,
1) I see you have "sample001a.com" in the sample set, but your expected set does not have it.
2) Is your second set the expectation of the outcome ?
Reply
#8
(Jun-19-2019, 02:12 PM)vindo Wrote: 2) Is your second set the expectation of the outcome ?

I mistyped. My expected outcome is to align it by sample number. I've been able to extract the sample number using "new['Sample_number'] = data["sample_name"].str[6:10].astype(int" but I'm still stuck on the grouping. The below is the actual expected outcome:
sample002a.com, -14, sample002b.com, -16, sample002c.com, -18
sample004a.com, -18, sample004b.com, -20 sample004c.com, -22
sample005a.com, -12, sample005b.com, -14, sample005c.com, -16
sample003a.com, -16, sample003b.com, -16
sample001a.com, -12
Reply
#9
Whats your python code atm?

I'm still a really novice... but I tried the following. May be it will help or offer a different way
import pandas as pd
sample_data = [
        ('sample001a.com', -12, 'sample002b.com', -16, 'sample002c.com', -18),
        ('sample002a.com', -14, 'sample003b.com', -16, 'sample004c.com', -22),
        ('sample003a.com', -16, 'sample004b.com', -20, 'sample005c.com', -16),
        ('sample004a.com', -18, 'sample005b.com', -14),
        ('sample005a.com', -12),
    ]
tdf = pd.DataFrame(sample_data,columns=['sample','total','s1','t1','s2','t2'])
df = tdf[['sample','total']]
a = tdf[['s1','t1']].rename(columns={'s1':'sample','t1':'total'})
df = df.append(a)
a = tdf[['s2','t2']].rename(columns={'s1':'sample','t1':'total'})
df = df.append(a)
df.groupby(by='sample',sort=True).sum()
Reply


Possibly Related Threads…
Thread Author Replies Views Last Post
  Aligning excel data gat 1 2,181 Jun-17-2019, 07:05 PM
Last Post: michalmonday

Forum Jump:

User Panel Messages

Announcements
Announcement #1 8/1/2020
Announcement #2 8/2/2020
Announcement #3 8/6/2020