Python Forum
Help on String variable
#1
Hi everyone

I am new to Python 3.6 but have four years of experience with data management and data analysis in Stata. While porting code that I had already written in Stata to Python, I got stuck. Details are as follows:


In Stata 15, I have a local macro called methods which contains 8 family planning method names separated by spaces: local methods "female_condoms emergency male_condoms pill injectables iud male_sterilization female_sterilization". I also have a string variable called method_discussed. It comes from a multiple-choice survey question, so each value may be blank (no method named) or contain 1 to 8 method names from the macro above, separated by spaces. A sample of 5 observations follows, where index 3 is blank (assume the respondent did not name any method):

index method_discussed
1 iud male_condoms pill
2 male_condoms
3
4 female_sterilization male_sterilization
5 male_sterilization iud injectables
.
.
.
.
so on.

While moving from Stata to Python, I made a list, say, method_name=['female_condoms', 'emergency', 'male_condoms', 'pill', 'injectables', 'iud', 'male_sterilization', 'female_sterilization']. I want to generate 8 variables named after the items in this list (the method names), each holding yes or no (1 or 0) depending on whether that item is present in the variable method_discussed. For example, the expected output should look like this:


Data input (index, method_discussed) and expected output (remaining columns):
index | method_discussed | female_condoms | emergency | male_condoms | pill | injectables | iud | male_sterilization | female_sterilization
1 | iud male_condoms pill | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 0
2 | male_condoms | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0
3 | (blank)
4 | female_sterilization male_sterilization | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1
5 | male_sterilization iud injectables | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 0
...
and so on.

I am not able to understand how to proceed.

Anticipating help from your side

Ashish
#2
If I understand correctly, the following might be along the lines of what you are looking for.
method_names = ['female_condoms', 'emergency', 'male_condoms', 'pill',
                'injectables', 'iud', 'male_sterilization', 'female_sterilization']

methods_discussed = [['iud', 'male_condoms', 'pill'],
                     ['male_condoms'],
                     [],
                     ['female_sterilization', 'male_sterilization'],
                     ['male_sterilization', 'iud', 'injectables']]

data_points = []

for method_discussed in methods_discussed:
    # One row of 0/1 flags per respondent, in the order of method_names
    points = []
    for method_name in method_names:
        points.append(int(method_name in method_discussed))
    data_points.append(points)

print(data_points)
Output:
[[0, 0, 1, 1, 0, 1, 0, 0], [0, 0, 1, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 1, 1], [0, 0, 0, 0, 1, 1, 1, 0]]


With the additional comments taken into account (each row now starts with the index and the methods discussed):
import pprint

method_names = ['female_condoms', 'emergency', 'male_condoms', 'pill',
                'injectables', 'iud', 'male_sterilization', 'female_sterilization']

methods_discussed = [['iud', 'male_condoms', 'pill'],
                     ['male_condoms'],
                     [],
                     ['female_sterilization', 'male_sterilization'],
                     ['male_sterilization', 'iud', 'injectables']]

data_points = []

for index, method_discussed in enumerate(methods_discussed):
    # Start each row with the 1-based index and the original answer
    points = [index+1, method_discussed]
    for method_name in method_names:
        points.append(int(method_name in method_discussed))
    data_points.append(points)


pprint.pprint(data_points)
Output:
[[1, ['iud', 'male_condoms', 'pill'], 0, 0, 1, 1, 0, 1, 0, 0],
 [2, ['male_condoms'], 0, 0, 1, 0, 0, 0, 0, 0],
 [3, [], 0, 0, 0, 0, 0, 0, 0, 0],
 [4, ['female_sterilization', 'male_sterilization'], 0, 0, 0, 0, 0, 0, 1, 1],
 [5, ['male_sterilization', 'iud', 'injectables'], 0, 0, 0, 0, 1, 1, 1, 0]]
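
If the answers arrive as space-separated strings rather than ready-made lists, the same idea could be sketched roughly as follows. This is only an illustration: raw_rows is a made-up stand-in for the method_discussed values, and it assumes the names are separated by whitespace.

```python
method_names = ['female_condoms', 'emergency', 'male_condoms', 'pill',
                'injectables', 'iud', 'male_sterilization', 'female_sterilization']

# Hypothetical stand-in for the raw method_discussed strings ("" = blank answer)
raw_rows = ['iud male_condoms pill',
            'male_condoms',
            '',
            'female_sterilization male_sterilization',
            'male_sterilization iud injectables']

data_points = []
for raw in raw_rows:
    # split() with no argument splits on any whitespace; '' gives an empty set
    discussed = set(raw.split())
    data_points.append([int(name in discussed) for name in method_names])

print(data_points)
```

Because each answer is split into whole names first, a name is only counted when it appears as a complete word in the answer.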
#3
Thanks, Yoriz, for the prompt help.

Actually, the data is in a CSV file with responses from more than 5000 respondents. One of the variables is method_discussed, which has more than 5000 data points, and each data point may be any combination of items from the list
method_names = ['female_condoms', 'emergency', 'male_condoms', 'pill', 'injectables', 'iud', 'male_sterilization', 'female_sterilization'].
For example:

respondent | method_discussed
respondent1 | female_condoms injectables
respondent2 | male_sterilization pill
respondent3 | blank (no method)
...
respondent5000 | male_sterilization female_sterilization
...
and so on.

I imported pandas as pd, read the CSV file, and made a list of these 8 methods. I want to generate 8 variables named after these 8 items, whose values are 0 (the item is absent from method_discussed) or 1 (the item is present in method_discussed), as you have done, but as columns alongside the data rather than in a separate in-memory list, and then save the result to the same CSV file.

I don't want these results in a plain in-memory list as you have done, but in a DataFrame. Second, I want to bring to your notice that I don't want to assign methods_discussed by hand, as you have done for only 5 cases:

methods_discussed = [['iud', 'male_condoms', 'pill'],
                     ['male_condoms'],
                     [],
                     ['female_sterilization', 'male_sterilization'],
                     ['male_sterilization', 'iud', 'injectables']]

As I said, there are more than 5000 cases (data points); in other words, method_discussed can take any combination of items from the list above.

If you need it, I can send the CSV file with the expected outcome in Excel.

Thanks

Ashish
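
One possible way to get such columns into a DataFrame and back out to CSV is sketched below. It is a minimal illustration under assumptions, not a tested solution for the real file: csv_text and the column names respondent and method_discussed are stand-ins, and the names are assumed to be separated by single spaces. It relies on pandas' Series.str.get_dummies, which splits each cell on a separator and returns one 0/1 column per distinct item.

```python
import io

import pandas as pd

method_names = ['female_condoms', 'emergency', 'male_condoms', 'pill',
                'injectables', 'iud', 'male_sterilization', 'female_sterilization']

# Hypothetical stand-in for the real CSV file; column names are assumptions
csv_text = """respondent,method_discussed
respondent1,female_condoms injectables
respondent2,male_sterilization pill
respondent3,
respondent4,male_sterilization female_sterilization
"""

# For a real file, pass its path instead of the StringIO object
df = pd.read_csv(io.StringIO(csv_text))

# Split each cell on spaces; a blank (missing) cell becomes a row of zeros
flags = df['method_discussed'].str.get_dummies(sep=' ')

# Keep all 8 columns in a fixed order, even for methods nobody mentioned
flags = flags.reindex(columns=method_names, fill_value=0)

result = df.join(flags)

# Written to a string buffer here; pass a file path to save to disk
out = io.StringIO()
result.to_csv(out, index=False)
print(out.getvalue())
```

Note that with this approach a blank answer gets 0 in all 8 columns rather than a missing value.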
#4
Hi Yoriz

Here is a little update from my side on what I have done so far.

**********************
import pandas as pd
import numpy as np
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)

csv1="multiple_responses.csv"
df1 = pd.read_csv(csv1, index_col='id' , na_values = [' '] , low_memory=False)

method_names = ['female_condoms', 'emergency', 'male_condoms', 'pill', 'injectables', 'iud', 'male_sterilization', 'female_sterilization']
for method in method_names:
    print(method)

for method in method_names:
    df1[method] = df1["methods_discussed"].str.contains(pat=method)
df1.head(10)

output
id | methods_discussed | female_condoms | emergency | male_condoms | pill | injectables | iud | male_sterilization | female_sterilization
1 | emergency | FALSE | TRUE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE
2 | female_sterilization | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | TRUE | TRUE
3 | male_sterilization | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | TRUE | FALSE
4 | iud | FALSE | FALSE | FALSE | FALSE | FALSE | TRUE | FALSE | FALSE
5 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN
6 | injectables male_condoms | FALSE | FALSE | TRUE | FALSE | TRUE | FALSE | FALSE | FALSE
7 | male_condoms | FALSE | FALSE | TRUE | FALSE | FALSE | FALSE | FALSE | FALSE
8 | female_sterilization male_sterilization | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | TRUE | TRUE
9 | injectables | FALSE | FALSE | FALSE | FALSE | TRUE | FALSE | FALSE | FALSE
10 | iud male_condoms | FALSE | FALSE | TRUE | FALSE | FALSE | TRUE | FALSE | FALSE

Problem description
I used a CSV file (link to the CSV file: https://github.com/pandas-dev/pandas/fil...ponses.zip)
which contains two columns, "id" and "methods_discussed". After running the code above, the output shown is wrong: at index [2], the male_sterilization column shows TRUE (I have made it bold and italic). It should be FALSE, as "methods_discussed" contains only female_sterilization.

Expected Output
id | methods_discussed | female_condoms | emergency | male_condoms | pill | injectables | iud | male_sterilization | female_sterilization
1 | emergency | FALSE | TRUE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE
2 | female_sterilization | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | TRUE
3 | male_sterilization | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | TRUE | FALSE
4 | iud | FALSE | FALSE | FALSE | FALSE | FALSE | TRUE | FALSE | FALSE
5 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN
6 | injectables male_condoms | FALSE | FALSE | TRUE | FALSE | TRUE | FALSE | FALSE | FALSE
7 | male_condoms | FALSE | FALSE | TRUE | FALSE | FALSE | FALSE | FALSE | FALSE
8 | female_sterilization male_sterilization | FALSE | FALSE | FALSE | FALSE | FALSE | FALSE | TRUE | TRUE
9 | injectables | FALSE | FALSE | FALSE | FALSE | TRUE | FALSE | FALSE | FALSE
10 | iud male_condoms | FALSE | FALSE | TRUE | FALSE | FALSE | TRUE | FALSE | FALSE

I have also tried str.match, but it did not work for me.
Also, is there a way to avoid generating values when methods_discussed is NaN?

Thanks

Ashish
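
A likely cause of the wrong TRUE is that str.contains does a substring (regex) search, and "male_sterilization" is a substring of "female_sterilization". One way around it, sketched here under assumptions, is to match a name only when it is delimited by whitespace or the ends of the string, and then to blank out the rows where no answer was given. The csv_text sample is a made-up stand-in for multiple_responses.csv, and the names are assumed to be separated by whitespace.

```python
import io
import re

import pandas as pd

method_names = ['female_condoms', 'emergency', 'male_condoms', 'pill',
                'injectables', 'iud', 'male_sterilization', 'female_sterilization']

# Hypothetical stand-in for multiple_responses.csv (same two columns as in the post)
csv_text = """id,methods_discussed
1,emergency
2,female_sterilization
3,male_sterilization
4,
5,female_sterilization male_sterilization
"""

df1 = pd.read_csv(io.StringIO(csv_text), index_col='id')

# Root cause: a plain substring test also matches inside a longer name
print('male_sterilization' in 'female_sterilization')  # True

# Rows where the respondent actually named at least one method
answered = df1['methods_discussed'].notna()

for method in method_names:
    # (?<!\S) and (?!\S): no non-space character directly before or after the name
    pattern = r'(?<!\S)' + re.escape(method) + r'(?!\S)'
    found = df1['methods_discussed'].str.contains(pattern, regex=True, na=False)
    # Keep True/False for answered rows; leave unanswered rows as missing
    df1[method] = found.astype('object').where(answered)

print(df1)
```

With this sketch, the row containing only female_sterilization no longer flags male_sterilization, and the row with a missing answer stays missing in all 8 columns.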