Apr-28-2018, 04:20 AM
I am parsing a large text file of 56,900 book titles, each with its author(s) and an etext number, and I am trying to extract the authors from it. The file looks like this:
Quote:
TITLE and AUTHOR ETEXT NO.
Aspects of plant life; with special reference to the British flora, 56900
by Robert Lloyd Praeger
The Vicar of Morwenstow, by Sabine Baring-Gould 56899
[Subtitle: Being a Life of Robert Stephen Hawker, M.A.]
Raamatun tutkisteluja IV, mennessä Charles T. Russell 56898
[Subtitle: Harmagedonin taistelu]
[Language: Finnish]
Raamatun tutkisteluja III, mennessä Charles T. Russell 56897
[Subtitle: Tulkoon valtakuntasi]
[Language: Finnish]
Tom Thatcher's Fortune, by Horatio Alger, Jr. 56896
A Yankee Flier in the Far East, by Al Avery 56895
and George Rutherford Montgomery
[Illustrator: Paul Laune]
Nancy Brandon's Mystery, by Lillian Garis 56894
Nervous Ills, by Boris Sidis 56893
[Subtitle: Their Cause and Cure]
Pensées sans langage, par Francis Picabia 56892
[Language: French]
Helon's Pilgrimage to Jerusalem, Volume 2 of 2, by Frederick Strauss 56891
[Subtitle: A picture of Judaism, in the century
which preceded the advent of our Savior]
Fra Tommaso Campanella, Vol. 1, di Luigi Amabile 56890
[Subtitle: la sua congiura, i suoi processi e la sua pazzia]
[Language: Italian]
The Blue Star, by Fletcher Pratt 56889
Importanza e risultati degli incrociamenti in avicoltura, 56888
di Teodoro Pascal
[Language: Italian]
The Junior Classics, Volume 3: Tales from Greece and Rome, by Various 56887
~ ~ ~ ~ Posting Dates for the below eBooks: 1 Mar 2018 to 31 Mar 2018 ~ ~ ~ ~
TITLE and AUTHOR ETEXT NO.
The American Missionary, Volume 41, No. 1, January, 1887, by Various 56886
Morganin miljoonat, mennessä Sven Elvestad 56885
[Author a.k.a. Stein Riverton]
[Subtitle: Salapoliisiromaani]
[Language: Finnish]
"Trip to the Sunny South" in March, 1885, by L. S. D 56884
Balaam and His Master, by Joel Chandler Harris 56883
[Subtitle: and Other Sketches and Stories]
Susien saaliina, mennessä Jack London 56882
[Language: Finnish]
Forged Egyptian Antiquities, by T. G. Wakeling 56881
The Secret Doctrine, Vol. 3 of 4, by Helena Petrovna Blavatsky 56880
[Subtitle: Third Edition]
No Posting 56879
The author name usually starts after "by". When there is no "by" in the line, the author name starts after a comma (","). However, a comma can also be part of the title when the line does contain "by".
So I parse for "by" first, and only then for a comma.
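For reference, the rule above can be sketched for a single entry as follows. This is only a rough sketch under stated assumptions: `extract_author` is a hypothetical helper name, the etext number is assumed to have been removed already, and using the *last* ", by " or the *last* comma is an assumption about which occurrence should win.

```python
def extract_author(entry):
    """Return the author part of one title/author entry, or None if not found."""
    text = entry.strip()
    marker = ", by "
    if marker in text:
        # Assumption: the last ", by " separates title and author,
        # so commas inside the title or the author name are left alone.
        return text[text.rfind(marker) + len(marker):]
    if "," in text:
        # Fallback when there is no "by": take everything after the last comma.
        return text[text.rfind(",") + 1:].strip()
    return None


print(extract_author("Tom Thatcher's Fortune, by Horatio Alger, Jr."))
print(extract_author("Fra Tommaso Campanella, Vol. 1, di Luigi Amabile"))
print(extract_author("No Posting"))
```

With the fallback, entries in other languages keep their own connecting word (for example "di Luigi Amabile"), so that word would still need to be stripped separately.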
Here is what I tried:
def search_by_author():
    fhand = open('GUTINDEX.ALL')
    print("Search by Author:")
    for line in fhand:
        if not line.startswith(" [") and not line.startswith("TITLE"):
            if not line.startswith("~"):
                words = line.rstrip()
                words = line.lstrip()
                words = words[:-6]
                if ", by" in words:
                    words = words[words.find(', by'):]
                    words = words[5:]
                    print (words)
                else:
                    words = words[words.find(', '):]
                    words = words[5:]
                    if "," in words:
                        words = words[words.find(', '):]
                        if words.startswith(','):
                            words =words[words.find(','):]
                            print (words)
                        else:
                            print (words)
                    else:
                        print (words)
                    if " by" in words:
                        words = words[words.find('by')]
                        print(words)

search_by_author()

However, it does not seem to find the author name for entries where the author is on the following line, like:
Quote:
Aspects of plant life; with special reference to the British flora, 56900
by Robert Lloyd Praeger
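In entries like this the comma and the "by" sit on different physical lines, so a line-by-line check for ", by" cannot match. One possible way around it, as a sketch only, is to join continuation lines into one logical entry before looking for the author. The names `join_entries` and `ENTRY_START` are hypothetical, and the sketch assumes that every entry's first line ends with the etext number and that bracketed notes such as [Subtitle: ...] can be skipped, even when they span several lines.

```python
import re

# Assumption: the first line of each entry ends with the etext number.
ENTRY_START = re.compile(r"^(.*\S)\s+(\d+)\s*$")


def join_entries(lines):
    """Group physical lines into (etext_no, text) logical entries."""
    entries = []
    in_bracket = False  # True while inside a multi-line [ ... ] note
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith(("~", "TITLE and AUTHOR")):
            continue  # blank lines, date banners, column headers
        if in_bracket or line.startswith("["):
            # Skip bracketed notes; stay in the note until it closes with "]".
            in_bracket = not line.endswith("]")
            continue
        match = ENTRY_START.match(line)
        if match:
            entries.append([match.group(2), match.group(1)])
        elif entries:
            # No trailing number: treat it as a continuation of the last entry.
            entries[-1][1] += " " + line
    return [(number, text) for number, text in entries]


sample = [
    "Aspects of plant life; with special reference to the British flora, 56900",
    "by Robert Lloyd Praeger",
    "Helon's Pilgrimage to Jerusalem, Volume 2 of 2, by Frederick Strauss 56891",
    "[Subtitle: A picture of Judaism, in the century",
    "which preceded the advent of our Savior]",
]
for number, text in join_entries(sample):
    print(number, "|", text)
```

After joining, the first entry reads "... the British flora, by Robert Lloyd Praeger", so the usual ", by" check applies again. A continuation line that happens to end in digits (a year, for instance) would be mistaken for a new entry, so the heuristic would need checking against the full file.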