Python Forum

I have two files. The first is an HTML file like the one below, with about 100k entries:

********************************************************************************
SeqID: sp|Q72W11|ENGB_LEPIC
CELLO prediction:
(predictor location reliable-index)
Composition Cytoplasmic 0.762
Di-peptide Cytoplasmic 0.744
part-Comp. Cytoplasmic 0.813
chemo-typy Periplasmic 0.499
Neighboring Cytoplasmic 0.539

Combined SVM classifier:
Extracellular 0.357
OuterMembrane 0.426
Periplasmic 0.766
InnerMembrane 0.141
Cytoplasmic 3.310 *

*********************************************************************************
SeqID: sp|Q72RN8|F16PA_LEPIC
CELLO prediction:
(predictor location reliable-index)
Composition OuterMembrane 0.644
Di-peptide Cytoplasmic 0.612
part-Comp. Extracellular 0.438
chemo-typy Periplasmic 0.365
Neighboring Extracellular 0.531

Combined SVM classifier:
Extracellular 1.759 *
OuterMembrane 1.391 *
Periplasmic 0.738
InnerMembrane 0.045
Cytoplasmic 1.067 *

The other file is in FASTA format, also with about 100k entries:
>sp|Q72Q29|HPPA_LEPIC
MNSVTIIIAMSILAIVTAVVYTLKVTSIKVGTLGGNEKETKKLLEISSAISEGAMAFLVR
EYKVISLFIAFMAVLIVLLLDNPGSEGFNDGIYTAIAFVSGALISCISGFIGMKIATAGN
VRTAEAAKSSMAKAFRVAFDSGAVMGFGLVGLAILGMIVLFLVFTGMYPGVEKHFLMESL
AGFGLGGSAVALFGRVGGGIYTKAADVGADLVGKVEKGIPEDDPRNPATIADNVGDNVGD
VAGMGADLFGSCAEATCAALVIGATASALSGSVDALLYPLLISAFGIPASILTSFLARVK
EDGNVESALKVQLWVSTLLVAGIMYFVTKTFMVDSFEIAGKTITKWDVYISMVVGLFSGM
FIGIVTEYYTSHSYKPVREVAEASNTGAATNIIYGLSLGYHSSVIPVILLVITIVTANLL
AGMYGIAIAALGMISTIAIGLTIDAYGPVSDNAGGIAEMAELGKEVRDRTDTLDAAGNTT
AAIGKGFAIGSAALTSLALFAAFITRTHTTSLEVLNAEVFGGLMFGAMLPFLFTAMTMKS
VGKAAVDMVEEVRKQFKEIPGIMEGKNKPDYKRCVDISTSAALREMILPGLLVLLTPILV
GYLFGVKTLAGVLAGALVAGVVLAISAANSGGGWDNAKKYIEKKAGGKGSDQHKAAVVGD
TVGDPFKDTSGPSINILIKLMAITSLVFAEFFVQQGGLIFKIFH
>sp|Q72QZ8|ENO_LEPIC
MSHHSQIQKIQAREIMDSRGNPTVEVDVILLDGSFGRAAVPSGASTGEYEAVELRDGDKH
RYLGKGVLKAVEHVNLKIQEVLKGENAIDQNRIDQLMLDADGTKNKGKLGANAILGTSLA
VAKAAAAHSKLPLYRYIGGNFARELPVPMMNIINGGAHADNNVDFQEFMILPVGAKSFRE
ALRMGAEIFHSLKSVLKGKKLNTAVGDEGGFAPDLTSNVEAIEVILQAIEKAGYKPEKDV
LLGLDAASSEFYDKSKKKYVLGAENNKEFSSAELVDYYANLVSKYPIITIEDGLDENDWD
GWKLLSEKLGKKIQLVGDDLFVTNIEKLSKGISSGVGNSILIKVNQIGSLSETLSSIEMA
KKAKYTNVVSHRSGETEDVTISHIAVATNAGQIKTGSLSRTDRIAKYNELLRIEEELGKS
AVYKGRETFYNL

I need to find and sort these roughly 100k entries into groups.

If the "sp|" entry in the HTML file matches one in the FASTA file, the entry should go into whichever of the following groups carries the * sign. If an entry has multiple * signs, it should go into a separate Multi_star group. Each group should then be written to its own file.
Extracellular
OuterMembrane
Periplasmic
InnerMembrane
Cytoplasmic
Multi_star

Any form of help would be appreciated, as I am new to Python and don't know much.

Thank you.

What have you written so far?

(Jul-07-2017, 04:01 AM)Larz60+ Wrote: What have you written so far?

As I mentioned, I'm new to Python programming, so I don't have an idea where or how to start.

Any tutorials or materials on how to do this would be helpful.

Thank you

Quote:one HTML like below with 100k entries

This data is not HTML as presented; perhaps it is the browser output of an HTML file?
If it is the result of an HTML file being displayed in a browser, you need to supply the actual HTML file,
or supply the URL of one that can be used for testing.

If it is indeed an HTML file, you're going to need to learn about parsing HTML with BeautifulSoup.
The best way to do this is to use the web scraping tutorials written by snippsat:
part 1 here
part 2 here

Even if this is an internal html file, the BeautifulSoup parts of these tutorials apply.

As far as reading the file goes, there are FASTA files available online.
You can try a simple read on one like this:

filename = input('Enter file name: ')
maxcount = 20
# Print only the first maxcount lines to inspect the file's layout
with open(filename) as f:
    for count, line in enumerate(f, start=1):
        print(line.rstrip('\n'))
        if count >= maxcount:
            break

My guess is that these are newline-separated text records:
the first line is a sequence ID ('>seq0'),
and any additional lines are the sequence itself (if it is multi-line, that will show up in the snippet above).
The snippet simply reads the file line by line and quits after maxcount lines;
it can be used as the basis for other operations.
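
Building on that guess about the layout, here is a minimal sketch that collects the records into a dictionary. It assumes every header line starts with '>' and that all lines up to the next header belong to the same sequence. The function name read_fasta and the shortened sample data are made up for illustration; they are not from any library or from the real file.

```python
def read_fasta(lines):
    """Return a dict mapping each header (without '>') to its joined sequence."""
    records = {}
    header = None
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith('>'):
            # A new record starts here; remember its header
            header = line[1:]
            records[header] = []
        elif header is not None:
            # Sequence line belonging to the most recent header
            records[header].append(line)
    # Join the multi-line sequences into single strings
    return {h: ''.join(parts) for h, parts in records.items()}


# Shortened, made-up sample in the assumed layout
sample = """>sp|Q72QZ8|ENO_LEPIC
MSHHSQIQ
KIQAREIM
>seq0
ACDE
""".splitlines()

records = read_fasta(sample)
print(records['sp|Q72QZ8|ENO_LEPIC'])  # MSHHSQIQKIQAREIM
```

An open file object can be passed in place of the sample list, since the function only iterates over lines.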

It's all doable with relative ease, just need all the ingredients up front.
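
As a rough sketch of the grouping itself, the code below assumes the report is available as plain text in the layout shown in the question (if it really is HTML, the text would have to be extracted first). It also assumes that only the lines under "Combined SVM classifier:" matter, and that the goal is one FASTA file per group. The names starred_locations, group_by_star and write_groups are hypothetical, and records stands for a dict mapping each FASTA header to its sequence.

```python
import os
from collections import defaultdict


def starred_locations(lines):
    """Map each SeqID to the locations marked with '*' in its combined section."""
    starred = {}
    current = None
    in_combined = False
    for line in lines:
        line = line.strip()
        if line.startswith('SeqID:'):
            current = line.split(':', 1)[1].strip()
            starred[current] = []
            in_combined = False
        elif line.startswith('Combined SVM classifier'):
            in_combined = True
        elif in_combined and current is not None:
            parts = line.split()
            # A starred line looks like: "Cytoplasmic 3.310 *"
            # (the row-of-asterisks separator has only one token)
            if len(parts) == 3 and parts[2] == '*':
                starred[current].append(parts[0])
    return starred


def group_by_star(starred):
    """Map each group name to its SeqIDs; several stars go to Multi_star."""
    groups = defaultdict(list)
    for seq_id, locations in starred.items():
        if len(locations) == 1:
            groups[locations[0]].append(seq_id)
        elif len(locations) > 1:
            groups['Multi_star'].append(seq_id)
    return dict(groups)


def write_groups(groups, records, out_dir):
    """Write one FASTA file per group for IDs found in both inputs."""
    for name, seq_ids in groups.items():
        with open(os.path.join(out_dir, name + '.fasta'), 'w') as out:
            for seq_id in seq_ids:
                if seq_id in records:
                    out.write('>' + seq_id + '\n' + records[seq_id] + '\n')


# Shortened version of the two records shown in the question
report = """********
SeqID: sp|Q72W11|ENGB_LEPIC
Combined SVM classifier:
Extracellular 0.357
Cytoplasmic 3.310 *
********
SeqID: sp|Q72RN8|F16PA_LEPIC
Combined SVM classifier:
Extracellular 1.759 *
OuterMembrane 1.391 *
Cytoplasmic 1.067 *
""".splitlines()

starred = starred_locations(report)
groups = group_by_star(starred)
print(groups)
```

With the sample above, the first ID lands in the Cytoplasmic group and the second in Multi_star. Calling write_groups(groups, records, '.') would then create one file per group in the current directory, skipping any ID that has no matching FASTA record.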

(Jul-07-2017, 10:33 AM)Larz60+ Wrote:
If it is indeed an HTML file, you're going to need to learn about parsing HTML with BeautifulSoup

Learn parsing HTML with BeautifulSoup. OK, got it. Thanks for your time.
