finding items/comparison in/with a dictionary

AGC · (This post was last modified: Mar-29-2018, 09:11 PM by AGC.)

Hello,
I have file 1 with this kind of information: Name Definition for about 2,000 names --> I have created a dictionary with this file (name:definition)

I have file 2 with this kind of information: >Name \n Information about name for the same 2,000 names

The goal is to add the definition of file 1 to the correct name in file 2: >Name Definition \n Information

I wrote the code below, which is supposed to check whether the name is the same as in the dictionary:
if line(name) in dict.keys(), but it doesn't seem to work. I have printed line(name) and dict.keys() separately and they show the same characters, except that when I print the dictionary keys it adds brackets, as in ['name'].

(I also know that I make my code more complicated than it should be ... I need to learn how to generate functions!)

file1= open('c:/python27/annotation.txt', 'r')
dict={}
file2= open('c:/python27/protein_file.faa', 'r')
outfile=open('c:/python27/proteinandannotation.faa', 'w')
for line in file1:
	name=line.strip().split()
	dict={name[0]:name[1:]}
	for line in file2:
		if line.startswith('>'):
#		line=line.strip
			if line[1:28] in dict.keys():
				line=line.strip('\n')
				outfile.write(line + dict.values() + '\n')
		else:
			outfile.write(line)

woooee · (This post was last modified: Mar-29-2018, 09:18 PM by woooee.)

Print what you are comparing so you know what is happening. Also, Python already uses dict as the name of a built-in type, so name your dictionaries something else.

print("\n", line[1:28])
print(dict.keys())
if line[1:28] in dict.keys():
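To make the advice above concrete, here is a minimal sketch of why two values can look identical when printed and still fail the `in` test. The header is one of the IDs posted later in this thread; the names `annotation` and `raw_line` are only illustrative. `repr()` shows invisible characters such as a trailing newline, which a plain print hides. Note also that `key in some_dict` tests the keys directly, so `.keys()` is not needed.

```python
# Illustrative only: one entry, using an ID from the sample data in this thread.
annotation = {"fig|6666666.213038.peg.1": "Name=Leucyl-tRNA synthetase"}

# A header line as it comes out of `for line in file2:` (newline still attached).
raw_line = ">fig|6666666.213038.peg.1\n"

candidate = raw_line[1:28]       # fixed-width slice, as in the original post
print(repr(candidate))           # repr() exposes the trailing '\n'
print(candidate in annotation)   # False: the newline is part of the string

cleaned = raw_line[1:].strip()   # drop '>' and surrounding whitespace instead
print(repr(cleaned))
print(cleaned in annotation)     # True
```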

AGC · Apr-02-2018, 06:05 PM

Thanks!
I figured out that I had to strip the newline character at the end.
This is my new code:

for key in annotation:
		for line in file2:
			if line.startswith('>'):
				gene=line[1:]
				gene=gene.strip('\n')
				if gene == key:

AGC · (This post was last modified: Apr-02-2018, 07:17 PM by buran.)

Hello,
I have file 1 with this kind of information: "ID" "Definition" for about 2,000 IDs --> I have created a dictionary with this file (id:definition) --> dictionary is called annotation(name1:name2)

I have file 2 with this kind of information: ">" + "ID" + \n + "Information about ID" for the same 2,000 IDs

The goal is to add the definition of file 1 to the correct ID in file 2: ">" + "ID" + "Definition" + \n + "Information about ID"

I wrote the code below, which is supposed to check whether the ID is the same as a key in the dictionary;
however, it only checks the first entry in the dictionary.

file1= open('c:/python27/annotation.txt', 'r')  #this has the ID and definition
annotation={}  #define the dictionary
file2= open('c:/python27/protein_file.faa', 'r') #this has the ID with information
outfile=open('c:/python27/proteinandannotation.faa', 'w') 
for line in file1: #here I populate the dictionary with ID:definition
	name=line.strip().split()
	annotation={name[0]:name[1:]}
	print annotation #the entire dictionary prints, so it is all there
	for key in annotation: #here I go through every key in the dictionary, but it only seems to go through 1
		print key #all the keys print
		for line in file2: #will check this file for the first line of each entry with the ">" +"ID" format
			print key #only the first key prints now
			if line.startswith('>'): #here I make sure that I am on the line with ID 
				gene=line[1:] #here I extract ID from line
				gene=gene.strip('\n') #and get rid of line return
				if gene == key: #here I compare the ID to the key value in the dictionary; it works well for 1
					line=line.strip('\n')
					outfile.write(line + str(annotation[key]) + '\n') #here I write the ID with the definition
			else:
				outfile.write(line) #If the line doesn't have an Id and it is just the information I just write it
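One likely reason only the first key ever matches: a file object is an iterator, and once the inner `for line in file2` loop has reached the end of the file, later passes over the same object yield nothing. Below is a small sketch written for Python 3. It uses `io.StringIO` as a stand-in for the open file, which is an assumption made so that the example runs without files on disk.

```python
import io

# Stand-in for open('protein_file.faa'); behaves like a text file object.
file2 = io.StringIO(">id1\nAAAA\n>id2\nCCCC\n")

passes = []
for key in ["id1", "id2"]:
    # Each pass records how many lines the inner loop actually saw.
    passes.append(sum(1 for line in file2))

print(passes)  # the second pass sees no lines: the iterator is exhausted

file2.seek(0)  # rewinding makes the lines available again
lines_after_rewind = sum(1 for line in file2)
print(lines_after_rewind)
```

Rewinding with `seek(0)` for every key would work, but it rereads the whole file once per key. Looking each ID up in the dictionary during a single pass avoids that.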

**buran** · Apr-02-2018, 07:00 PM

On line 7 (annotation={name[0]:name[1:]}), each iteration overwrites the existing annotation dict with a new one that has a single element.
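To illustrate the overwriting described above, here is a minimal sketch with two made-up lines standing in for file 1. Assigning a fresh dict literal inside the loop discards the earlier entries, while assigning to a key keeps them.

```python
lines = ["id1 first definition", "id2 second definition"]  # made-up sample

overwritten = {}
for line in lines:
    name = line.strip().split()
    overwritten = {name[0]: name[1:]}   # rebinds the name: old entries are lost

accumulated = {}
for line in lines:
    name = line.strip().split()
    accumulated[name[0]] = name[1:]     # adds one entry per line

print(len(overwritten))   # 1
print(len(accumulated))   # 2
```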

Apart from that, your approach is inefficient. A better approach would be to loop through file 1 once and create a dict.
Then iterate over file 2, reading 2 lines at a time (there are different possible approaches to do this), parse them to extract the ID and info, then just get the respective element from the dict created from file 1 and write to a new file (file 3).

Not to mention that lines 19-20 (the else branch) will write lines that do not start with > many, many times.
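The approach described above could look roughly like the sketch below. It is written for Python 3, the function names `load_annotations` and `annotate_fasta` are made up for illustration, and it assumes the ID is everything before the first whitespace on each line of file 1. Rather than reading two lines at a time, it handles header lines as they come, so it should also work when the information under an ID spans several lines.

```python
def load_annotations(lines):
    """Map each ID to its definition (the rest of the line, as one string)."""
    annotations = {}
    for line in lines:
        parts = line.strip().split(None, 1)  # split on the first whitespace run only
        if parts:                            # skip blank lines
            annotations[parts[0]] = parts[1] if len(parts) > 1 else ""
    return annotations


def annotate_fasta(lines, annotations):
    """Yield the lines of file 2, appending the definition to each '>' header."""
    for line in lines:
        if line.startswith(">"):
            gene = line[1:].strip()
            definition = annotations.get(gene, "")  # "" if the ID is unknown
            yield (">" + gene + " " + definition).rstrip() + "\n"
        else:
            yield line


# Tiny in-memory stand-ins for the two files (made-up data).
file1_lines = ["id1 Name=first protein\n", "id2 Name=second protein\n"]
file2_lines = [">id1\n", "AAAA\n", "CCCC\n", ">id2\n", "GGGG\n"]

result = list(annotate_fasta(file2_lines, load_annotations(file1_lines)))
print("".join(result))
```

With real files, the same functions can be given open file objects instead of lists, and the yielded lines can be passed to `outfile.writelines(...)`.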

**buran** · Apr-02-2018, 07:13 PM

Can you upload a sample of the files?

AGC · Apr-02-2018, 07:25 PM

Thanks for your time and help.

Here is a bit of file 1 with ID plus definitions:
fig|6666666.213038.peg.1 Name=Leucyl-tRNA synthetase (EC 6.1.1.4) Ontology_term=KEGG_ENZYME:6.1.1.4
fig|6666666.213038.peg.2 Name=hypothetical protein
fig|6666666.213038.peg.3 Name=peptidoglycan-associated lipoprotein%2C OmpA family
fig|6666666.213038.peg.4 Name=Crossover junction endodeoxyribonuclease RuvC (EC 3.1.22.4) Ontology_term=KEGG_ENZYME:3.1.22.4
fig|6666666.213038.peg.5 Name=FIG000859: hypothetical protein YebC
fig|6666666.213038.peg.6 Name=Phage protein

A bit of file 2, with >ID plus info about ID (this is a biology file)
>fig|6666666.213038.peg.1
MQPNGVDLYVGGAEHAVLHLLYARFWHKVLYDYGYVSTPEPFYKLFHQGLILGEDGRKMS
KSLGNVVNPDEVVQKYGADSLRMFEMFLGPLQDTKPWSTTGIEGINRFLNRIWRLFIDEN
GNLNPNIEDLPLTPEQEYILHSTIKKVTEDIENLRFNTAIAQMMIFVNEFYKFEKKPKEA
LKKFLLILAPFAPHISEELWHKLGYSESIFTYSFPEFDEHKAIKKEVEIVVQINSKIRAR
INVPIDTPENEVLDIAKSEPNVQKYLAGKEIRKVIFVPNKILNIII
>fig|6666666.213038.peg.2
MIIPFQFIQNLQKYNKNFCSKNIDKHFQNKFTREMSLSLYFEIYVKFFC
>fig|6666666.213038.peg.3
MIYRNXFTIIFAVIFLSVSLLSAKPNXKSSKTENASEQYFLWGXVIEDANSPNTPNTPTQ
DQPQIFDKSKIRIINLGPVVNWKGLDYAPTISADGKTLYFVSNRPGSKINPKTDKPSHDF
WATKKNDRLDTIFFKPFNIDTTTIWGYQGVNTPENEGVASIAADGQTLYFTACSRPDGFG
DCDIYKTTIEGDKWSRPYNLGPNVNSKYFDAQPSIAPDQSRLYFVSTRPGPNSDGNNENW
DNMDIWYSDFDPETEEWLPAKNLTEINTPVQDCGPFIAADNQTLFFSSKGHQPNYGGLDF
YVTRYDPVTKKWSKPENLGIPLNTPQDDQFITLPASGDVLYFSSRRKDIPGYQGDFDIFM
AFVPSYFRAVVVKTTVIDECSGENIPAIVTIKSPIINRVVVDTLKATRTEIDFVVSNTDY
GDPRDSIKFVNLEITAENPKYGKTTKIVRVDKPKPTTDPEEAKKYADVINVVIPLGQRPV
IGAEIEEAKYVAENKKIKPEIANFRGLVMEQFQTWDLYPLLNYVFFDAGSSKIPDRYILF
KSPDDKFKKAFTDTTIRGGTLEKYYHILNIYGYRLNKYPEAKIEIVGCNDGKTPEEKRPN
LSKERAEAVFKYLRDVWGIDEKRMKITVRNQPAVVSNLNDSLGIVENRRVEILCNDWNIM
KPVFDKDPKTFPQPETMNFTLKNGIEDALVKARRIEVKRGGREWNTLKDIGVVENKYTWD
WKSSEGEYPKDEVPFTAQLIVTTINDKECSSDPIMIPVMQVTTEQKKVDIQKGAKDSTIE
RYSLILFPFDRSDAGPINERIMREYVYNRVLPTSYVEVVGHTDVVGLYEHNQALSERRAT
TVYNGIMQQTKGKVGYINKRGVGEDEPLYDNSLPEGRFYNRTVQVIIKTPVESWEQLGGG
K
>fig|6666666.213038.peg.4
MLILGIDPGSVKCGFGIVDFEGFPPKIIKVGLIRPKFKDKFHFLDKLKFIYDELNSLLDY
FDIVETAVESQFYSKNPQSLMKLTQAKTVVELAMLNRNIPVFEYSPREIKLAITGRGGAT
KKSVQYMVESIFDVNLKNKTTDISDALAVALCHISRKVNLNTKRNSPRNWREFVQMNPER
VISQ

And this is what I get with my code. It works beautifully for the first entry, but since it only looks at the first key, the later IDs never match and their header lines are not written.

>fig|6666666.213038.peg.1['Name=Leucyl-tRNA', 'synthetase', '(EC', '6.1.1.4)', 'Ontology_term=KEGG_ENZYME:6.1.1.4']
MQPNGVDLYVGGAEHAVLHLLYARFWHKVLYDYGYVSTPEPFYKLFHQGLILGEDGRKMS
KSLGNVVNPDEVVQKYGADSLRMFEMFLGPLQDTKPWSTTGIEGINRFLNRIWRLFIDEN
GNLNPNIEDLPLTPEQEYILHSTIKKVTEDIENLRFNTAIAQMMIFVNEFYKFEKKPKEA
LKKFLLILAPFAPHISEELWHKLGYSESIFTYSFPEFDEHKAIKKEVEIVVQINSKIRAR
INVPIDTPENEVLDIAKSEPNVQKYLAGKEIRKVIFVPNKILNIII
MIIPFQFIQNLQKYNKNFCSKNIDKHFQNKFTREMSLSLYFEIYVKFFC
MIYRNXFTIIFAVIFLSVSLLSAKPNXKSSKTENASEQYFLWGXVIEDANSPNTPNTPTQ
DQPQIFDKSKIRIINLGPVVNWKGLDYAPTISADGKTLYFVSNRPGSKINPKTDKPSHDF
WATKKNDRLDTIFFKPFNIDTTTIWGYQGVNTPENEGVASIAADGQTLYFTACSRPDGFG
DCDIYKTTIEGDKWSRPYNLGPNVNSKYFDAQPSIAPDQSRLYFVSTRPGPNSDGNNENW
DNMDIWYSDFDPETEEWLPAKNLTEINTPVQDCGPFIAADNQTLFFSSKGHQPNYGGLDF
YVTRYDPVTKKWSKPENLGIPLNTPQDDQFITLPASGDVLYFSSRRKDIPGYQGDFDIFM
AFVPSYFRAVVVKTTVIDECSGENIPAIVTIKSPIINRVVVDTLKATRTEIDFVVSNTDY
GDPRDSIKFVNLEITAENPKYGKTTKIVRVDKPKPTTDPEEAKKYADVINVVIPLGQRPV
IGAEIEEAKYVAENKKIKPEIANFRGLVMEQFQTWDLYPLLNYVFFDAGSSKIPDRYILF
KSPDDKFKKAFTDTTIRGGTLEKYYHILNIYGYRLNKYPEAKIEIVGCNDGKTPEEKRPN
LSKERAEAVFKYLRDVWGIDEKRMKITVRNQPAVVSNLNDSLGIVENRRVEILCNDWNIM
KPVFDKDPKTFPQPETMNFTLKNGIEDALVKARRIEVKRGGREWNTLKDIGVVENKYTWD
WKSSEGEYPKDEVPFTAQLIVTTINDKECSSDPIMIPVMQVTTEQKKVDIQKGAKDSTIE
RYSLILFPFDRSDAGPINERIMREYVYNRVLPTSYVEVVGHTDVVGLYEHNQALSERRAT
TVYNGIMQQTKGKVGYINKRGVGEDEPLYDNSLPEGRFYNRTVQVIIKTPVESWEQLGGG
K
MLILGIDPGSVKCGFGIVDFEGFPPKIIKVGLIRPKFKDKFHFLDKLKFIYDELNSLLDY
FDIVETAVESQFYSKNPQSLMKLTQAKTVVELAMLNRNIPVFEYSPREIKLAITGRGGAT
KKSVQYMVESIFDVNLKNKTTDISDALAVALCHISRKVNLNTKRNSPRNWREFVQMNPER
VISQ
MSGHSKWANIKHKKAAKDAKRGKLFTRLAKEITIAAREGGGDPEANPRLRLAIQNAKAEN
MPMENIKRAIQRGTGEIQGENYEEVIYEGYAPLGVAVILEAITDNRNRTYPLIRSEVNKL
GGSIGEPGSVMWNFTRKGVIYIDPQGLTEEQMLEHILEAGCEDMEYDEERTRVICAFEDM
VACQKYFEDKKFKILESKFEYIPKTTVKIDNIEAARKVLKFFDTLEELDDVQNVYGNYEF
TDEVLSQLEKEQN
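The brackets and quotes in the first header above are what `str()` produces for a list: `split()` with no arguments breaks the definition into words, and `str(annotation[key])` then writes the list's printed form. Here is a small sketch of joining the words back together instead (variable names are illustrative):

```python
# The value stored by `name[1:]` for the first sample line is a list of words.
words = ['Name=Leucyl-tRNA', 'synthetase', '(EC', '6.1.1.4)',
         'Ontology_term=KEGG_ENZYME:6.1.1.4']

as_list_text = str(words)       # "['Name=Leucyl-tRNA', 'synthetase', ...]"
as_sentence = " ".join(words)   # words separated by single spaces

header = ">fig|6666666.213038.peg.1"
print(header + " " + as_sentence)
```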

**buran** · Apr-02-2018, 07:28 PM

Do the files have headers?
Am I right that the delimiter in the first file is |?

AGC · Apr-02-2018, 07:32 PM

Nope, no headers.

I am not sure what you mean by delimiter. The first ID (key) would be: fig|6666666.213038.peg.1 and the first definition (value): Name=Leucyl-tRNA synthetase (EC 6.1.1.4) Ontology_term=KEGG_ENZYME:6.1.1.4
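For anyone else unsure about the term: a delimiter is the character (or run of characters) that separates one field from the next. Here is a short sketch on the first sample line, comparing `|` with the first whitespace as the delimiter. The line is shortened here to keep the example small.

```python
sample = "fig|6666666.213038.peg.1 Name=Leucyl-tRNA synthetase (EC 6.1.1.4)"

# Treating '|' as the delimiter cuts the ID itself in two.
by_pipe = sample.split("|")

# Treating the first whitespace as the delimiter separates ID from definition.
key, value = sample.split(None, 1)

print(by_pipe)
print(key)
print(value)
```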

**buran** · Apr-02-2018, 07:32 PM

Also, in the second file, is this:
MQPNGVDLYVGGAEHAVLHLLYARFWHKVLYDYGYVSTPEPFYKLFHQGLILGEDGRKMS
KSLGNVVNPDEVVQKYGADSLRMFEMFLGPLQDTKPWSTTGIEGINRFLNRIWRLFIDEN
GNLNPNIEDLPLTPEQEYILHSTIKKVTEDIENLRFNTAIAQMMIFVNEFYKFEKKPKEA
LKKFLLILAPFAPHISEELWHKLGYSESIFTYSFPEFDEHKAIKKEVEIVVQINSKIRAR
INVPIDTPENEVLDIAKSEPNVQKYLAGKEIRKVIFVPNKILNIII
a single line or multiple lines?
Can you attach the files?
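Judging by the sample of file 2 posted earlier, each record appears to be a `>` header followed by one or more sequence lines. Here is a hedged sketch of grouping lines into records without assuming a fixed number of lines per record. The function name `read_records` is made up, the data is invented, and the code is written for Python 3.

```python
def read_records(lines):
    """Yield (header, sequence) pairs; sequence lines are joined into one string."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:       # a new header closes the previous record
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line and header is not None:
            chunks.append(line)
    if header is not None:               # emit the final record
        yield header, "".join(chunks)


# Made-up records: the first spans two lines, the second spans one.
sample = [">id1\n", "AAAA\n", "CCCC\n", ">id2\n", "GGGG\n"]
records = list(read_records(sample))

for header, sequence in records:
    print(header, len(sequence))
```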
