Jul-07-2017, 03:32 AM
I have two files. The first is an HTML file with about 100k entries like the ones below:
********************************************************************************
SeqID: sp|Q72W11|ENGB_LEPIC
CELLO prediction:
(predictor location reliable-index)
Composition Cytoplasmic 0.762
Di-peptide Cytoplasmic 0.744
part-Comp. Cytoplasmic 0.813
chemo-typy Periplasmic 0.499
Neighboring Cytoplasmic 0.539
Combined SVM classifier:
Extracellular 0.357
OuterMembrane 0.426
Periplasmic 0.766
InnerMembrane 0.141
Cytoplasmic 3.310 *
*********************************************************************************
SeqID: sp|Q72RN8|F16PA_LEPIC
CELLO prediction:
(predictor location reliable-index)
Composition OuterMembrane 0.644
Di-peptide Cytoplasmic 0.612
part-Comp. Extracellular 0.438
chemo-typy Periplasmic 0.365
Neighboring Extracellular 0.531
Combined SVM classifier:
Extracellular 1.759 *
OuterMembrane 1.391 *
Periplasmic 0.738
InnerMembrane 0.045
Cytoplasmic 1.067 *
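A minimal sketch of how the starred locations might be pulled out of a report like the one above. It assumes the report has already been reduced to plain text lines in exactly the layout shown (any HTML tags stripped), and the names `LOCATIONS` and `parse_cello` are made up for illustration, not part of any library.

```python
# Hypothetical helper; assumes plain-text lines laid out as in the sample above.
LOCATIONS = {"Extracellular", "OuterMembrane", "Periplasmic",
             "InnerMembrane", "Cytoplasmic"}

def parse_cello(lines):
    """Map each SeqID to the list of locations marked with '*'."""
    starred = {}
    seq_id = None
    in_combined = False
    for raw in lines:
        line = raw.strip()
        if line.startswith("SeqID:"):
            # Start of a new record: remember the ID, reset the section flag.
            seq_id = line.split(":", 1)[1].strip()
            starred[seq_id] = []
            in_combined = False
        elif line.startswith("Combined SVM classifier"):
            # Only rows after this header carry the '*' marks we care about.
            in_combined = True
        elif in_combined and seq_id is not None:
            parts = line.split()
            # A starred row looks like: "Cytoplasmic 3.310 *"
            if len(parts) == 3 and parts[0] in LOCATIONS and parts[2] == "*":
                starred[seq_id].append(parts[0])
    return starred
```

Called with an open file handle (for example `parse_cello(open("cello.txt"))`, where the file name is only a placeholder), this would return a dictionary keyed by the full `sp|...` identifier.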
The other is a FASTA file, also containing about 100k entries, like this:
>sp|Q72Q29|HPPA_LEPIC
MNSVTIIIAMSILAIVTAVVYTLKVTSIKVGTLGGNEKETKKLLEISSAISEGAMAFLVR
EYKVISLFIAFMAVLIVLLLDNPGSEGFNDGIYTAIAFVSGALISCISGFIGMKIATAGN
VRTAEAAKSSMAKAFRVAFDSGAVMGFGLVGLAILGMIVLFLVFTGMYPGVEKHFLMESL
AGFGLGGSAVALFGRVGGGIYTKAADVGADLVGKVEKGIPEDDPRNPATIADNVGDNVGD
VAGMGADLFGSCAEATCAALVIGATASALSGSVDALLYPLLISAFGIPASILTSFLARVK
EDGNVESALKVQLWVSTLLVAGIMYFVTKTFMVDSFEIAGKTITKWDVYISMVVGLFSGM
FIGIVTEYYTSHSYKPVREVAEASNTGAATNIIYGLSLGYHSSVIPVILLVITIVTANLL
AGMYGIAIAALGMISTIAIGLTIDAYGPVSDNAGGIAEMAELGKEVRDRTDTLDAAGNTT
AAIGKGFAIGSAALTSLALFAAFITRTHTTSLEVLNAEVFGGLMFGAMLPFLFTAMTMKS
VGKAAVDMVEEVRKQFKEIPGIMEGKNKPDYKRCVDISTSAALREMILPGLLVLLTPILV
GYLFGVKTLAGVLAGALVAGVVLAISAANSGGGWDNAKKYIEKKAGGKGSDQHKAAVVGD
TVGDPFKDTSGPSINILIKLMAITSLVFAEFFVQQGGLIFKIFH
>sp|Q72QZ8|ENO_LEPIC
MSHHSQIQKIQAREIMDSRGNPTVEVDVILLDGSFGRAAVPSGASTGEYEAVELRDGDKH
RYLGKGVLKAVEHVNLKIQEVLKGENAIDQNRIDQLMLDADGTKNKGKLGANAILGTSLA
VAKAAAAHSKLPLYRYIGGNFARELPVPMMNIINGGAHADNNVDFQEFMILPVGAKSFRE
ALRMGAEIFHSLKSVLKGKKLNTAVGDEGGFAPDLTSNVEAIEVILQAIEKAGYKPEKDV
LLGLDAASSEFYDKSKKKYVLGAENNKEFSSAELVDYYANLVSKYPIITIEDGLDENDWD
GWKLLSEKLGKKIQLVGDDLFVTNIEKLSKGISSGVGNSILIKVNQIGSLSETLSSIEMA
KKAKYTNVVSHRSGETEDVTISHIAVATNAGQIKTGSLSRTDRIAKYNELLRIEEELGKS
AVYKGRETFYNL
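For the FASTA side, a small reader along these lines might be enough. It is a sketch that assumes every record starts with a `>` header line and that the identifier is the first whitespace-separated token of that header; `read_fasta` is a hypothetical name.

```python
# Hypothetical helper; assumes standard '>' headers followed by sequence lines.
def read_fasta(lines):
    """Map each header ID (without '>') to its sequence as one string."""
    sequences = {}
    header = None
    chunks = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # Store the previous record before starting a new one.
            if header is not None:
                sequences[header] = "".join(chunks)
            fields = line[1:].split()
            header = fields[0] if fields else ""
            chunks = []
        elif header is not None:
            chunks.append(line)
    # Store the last record in the file.
    if header is not None:
        sequences[header] = "".join(chunks)
    return sequences
```

Holding 100k protein sequences in a dictionary like this should fit comfortably in memory on an ordinary machine, which keeps the later lookup by ID simple.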
I need to sort these roughly 100k entries into groups.
If the "sp|" entry in the HTML file matches one in the FASTA file, the entry should go into whichever of the following groups is marked with a * sign. If several locations are marked with *, the entry should go into a separate Multi_star group. Each group should then be written to its own file.
Extracellular
OuterMembrane
Periplasmic
InnerMembrane
Cytoplasmic
Multi_star
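The grouping rule described above could be sketched roughly as follows. The code assumes two dictionaries are already available: one mapping each "sp|" ID to its list of starred locations, and one mapping each ID to its sequence. The function names, the `.fasta` file naming, and the choice to skip entries with no star or no matching sequence are all assumptions made for illustration.

```python
import os
from collections import defaultdict

# Hypothetical helpers; the inputs are plain dicts built elsewhere.
def group_ids(starred):
    """Put each ID in its single starred location, or in Multi_star."""
    groups = defaultdict(list)
    for seq_id, locations in starred.items():
        if len(locations) == 1:
            groups[locations[0]].append(seq_id)
        elif len(locations) > 1:
            groups["Multi_star"].append(seq_id)
        # IDs with no starred location are left out (an assumption).
    return groups

def write_groups(groups, sequences, out_dir, width=60):
    """Write one FASTA file per group, skipping IDs missing from sequences."""
    os.makedirs(out_dir, exist_ok=True)
    for name, ids in groups.items():
        path = os.path.join(out_dir, name + ".fasta")
        with open(path, "w") as handle:
            for seq_id in ids:
                seq = sequences.get(seq_id)
                if seq is None:
                    continue  # no matching "sp|" entry in the FASTA data
                handle.write(">" + seq_id + "\n")
                # Wrap the sequence at `width` characters per line.
                for i in range(0, len(seq), width):
                    handle.write(seq[i:i + width] + "\n")
```

With this layout, each group ends up in a file named after it (for example `Cytoplasmic.fasta` or `Multi_star.fasta`) inside `out_dir`.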
Any form of help would be appreciated, as I am new to Python and don't know much.
Thank you.
