Python Forum
Python Challenge ~ Help
#1
I have two files. One is an HTML file like the one below, with about 100k entries:
********************************************************************************
SeqID: sp|Q72W11|ENGB_LEPIC
CELLO prediction:
(predictor location   reliable-index)
Composition Cytoplasmic 0.762
Di-peptide Cytoplasmic 0.744
part-Comp. Cytoplasmic 0.813
chemo-typy Periplasmic 0.499
Neighboring Cytoplasmic 0.539

Combined SVM classifier:
Extracellular 0.357
OuterMembrane 0.426
Periplasmic 0.766
InnerMembrane 0.141
Cytoplasmic 3.310 *


*********************************************************************************
SeqID: sp|Q72RN8|F16PA_LEPIC
CELLO prediction:
(predictor location   reliable-index)
Composition OuterMembrane 0.644
Di-peptide Cytoplasmic 0.612
part-Comp. Extracellular 0.438
chemo-typy Periplasmic 0.365
Neighboring Extracellular 0.531

Combined SVM classifier:
Extracellular 1.759 *
OuterMembrane 1.391 *
Periplasmic 0.738
InnerMembrane 0.045
Cytoplasmic 1.067 *


The other file is in FASTA format and also contains about 100k entries:
>sp|Q72Q29|HPPA_LEPIC
MNSVTIIIAMSILAIVTAVVYTLKVTSIKVGTLGGNEKETKKLLEISSAISEGAMAFLVR
EYKVISLFIAFMAVLIVLLLDNPGSEGFNDGIYTAIAFVSGALISCISGFIGMKIATAGN
VRTAEAAKSSMAKAFRVAFDSGAVMGFGLVGLAILGMIVLFLVFTGMYPGVEKHFLMESL
AGFGLGGSAVALFGRVGGGIYTKAADVGADLVGKVEKGIPEDDPRNPATIADNVGDNVGD
VAGMGADLFGSCAEATCAALVIGATASALSGSVDALLYPLLISAFGIPASILTSFLARVK
EDGNVESALKVQLWVSTLLVAGIMYFVTKTFMVDSFEIAGKTITKWDVYISMVVGLFSGM
FIGIVTEYYTSHSYKPVREVAEASNTGAATNIIYGLSLGYHSSVIPVILLVITIVTANLL
AGMYGIAIAALGMISTIAIGLTIDAYGPVSDNAGGIAEMAELGKEVRDRTDTLDAAGNTT
AAIGKGFAIGSAALTSLALFAAFITRTHTTSLEVLNAEVFGGLMFGAMLPFLFTAMTMKS
VGKAAVDMVEEVRKQFKEIPGIMEGKNKPDYKRCVDISTSAALREMILPGLLVLLTPILV
GYLFGVKTLAGVLAGALVAGVVLAISAANSGGGWDNAKKYIEKKAGGKGSDQHKAAVVGD
TVGDPFKDTSGPSINILIKLMAITSLVFAEFFVQQGGLIFKIFH
>sp|Q72QZ8|ENO_LEPIC
MSHHSQIQKIQAREIMDSRGNPTVEVDVILLDGSFGRAAVPSGASTGEYEAVELRDGDKH
RYLGKGVLKAVEHVNLKIQEVLKGENAIDQNRIDQLMLDADGTKNKGKLGANAILGTSLA
VAKAAAAHSKLPLYRYIGGNFARELPVPMMNIINGGAHADNNVDFQEFMILPVGAKSFRE
ALRMGAEIFHSLKSVLKGKKLNTAVGDEGGFAPDLTSNVEAIEVILQAIEKAGYKPEKDV
LLGLDAASSEFYDKSKKKYVLGAENNKEFSSAELVDYYANLVSKYPIITIEDGLDENDWD
GWKLLSEKLGKKIQLVGDDLFVTNIEKLSKGISSGVGNSILIKVNQIGSLSETLSSIEMA
KKAKYTNVVSHRSGETEDVTISHIAVATNAGQIKTGSLSRTDRIAKYNELLRIEEELGKS
AVYKGRETFYNL
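
For illustration, a minimal sketch of reading a FASTA file like the one above into a dictionary keyed by the "sp|" identifier. The function name `read_fasta` is hypothetical, and the sketch assumes every record starts with a ">" header line whose first whitespace-separated token is the identifier.

```python
def read_fasta(lines):
    """Return {identifier: sequence} from an iterable of FASTA lines.

    Assumption: the identifier is the first token after '>' on a header
    line, e.g. 'sp|Q72Q29|HPPA_LEPIC'.
    """
    records = {}
    header = None
    chunks = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records[header] = "".join(chunks)
            header = line[1:].split()[0]
            chunks = []
        elif header is not None:
            chunks.append(line)  # sequence lines are joined without breaks
    if header is not None:
        records[header] = "".join(chunks)
    return records
```

Since an open file object is an iterable of lines, it could be used as `with open("proteins.fasta") as fh: sequences = read_fasta(fh)`, where the file name is only a placeholder.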


I need to sort these roughly 100k entries into groups. If the "sp|" entry in the HTML file matches an entry in the FASTA file, the entry should go into the group whose line is marked with a * sign. If an entry has multiple * signs, it should go into a separate Multi_star group. Each group should then be written to its own file. The groups are:
Extracellular
OuterMembrane
Periplasmic
InnerMembrane
Cytoplasmic
Multi_star
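
The grouping described above could be sketched roughly as follows. This is not a definitive solution: the names `parse_cello`, `group_ids` and `write_groups` are hypothetical, and the sketch assumes the CELLO report has already been reduced to plain text lines shaped like the sample (any HTML tags stripped), and that the FASTA sequences are available as a dictionary mapping identifier to sequence.

```python
import os
from collections import defaultdict

LOCATIONS = ("Extracellular", "OuterMembrane", "Periplasmic",
             "InnerMembrane", "Cytoplasmic")


def parse_cello(lines):
    """Return {seq_id: [locations marked with '*']} from report lines."""
    starred = {}
    seq_id = None
    in_combined = False
    for line in lines:
        line = line.strip()
        if line.startswith("SeqID:"):
            seq_id = line.split(":", 1)[1].strip()
            starred[seq_id] = []
            in_combined = False
        elif line.startswith("Combined SVM classifier"):
            in_combined = True  # only this block carries the '*' marks
        elif in_combined and seq_id is not None:
            parts = line.split()
            if parts and parts[0] in LOCATIONS and line.endswith("*"):
                starred[seq_id].append(parts[0])
    return starred


def group_ids(starred):
    """One star: that location's group. Several stars: 'Multi_star'."""
    groups = defaultdict(list)
    for seq_id, locations in starred.items():
        if len(locations) == 1:
            groups[locations[0]].append(seq_id)
        elif len(locations) > 1:
            groups["Multi_star"].append(seq_id)
    return dict(groups)


def write_groups(groups, sequences, out_dir):
    """Write one FASTA file per group; skip IDs missing from `sequences`."""
    os.makedirs(out_dir, exist_ok=True)
    written = {}
    for group, ids in groups.items():
        matched = [i for i in ids if i in sequences]
        path = os.path.join(out_dir, group + ".fasta")
        with open(path, "w") as handle:
            for seq_id in matched:
                handle.write(">%s\n%s\n" % (seq_id, sequences[seq_id]))
        written[group] = len(matched)
    return written
```

Reading the report line by line keeps memory use modest even with 100k entries; only the identifiers and their starred locations are held in the dictionary. Entries without any star are left out of every group in this sketch.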

Any form of help or ideas would be appreciated, as I am new to Python and don't know much yet.
Thank you.


Messages In This Thread
Python Challenge ~ Help - by Takshan - Jul-07-2017, 03:32 AM
RE: Python Challenge ~ Help - by Larz60+ - Jul-07-2017, 04:01 AM
RE: Python Challenge ~ Help - by Takshan - Jul-07-2017, 04:09 AM
RE: Python Challenge ~ Help - by Larz60+ - Jul-07-2017, 10:33 AM
RE: Python Challenge ~ Help - by Takshan - Jul-07-2017, 11:01 AM
