Python Forum
Python file parsing, can't catch strings in new line
#1
I'm parsing a large text file with 56,900 book titles, each with its authors and an etext number, and I'm trying to extract the authors. The file looks like this:
Quote:TITLE and AUTHOR ETEXT NO.

Aspects of plant life; with special reference to the British flora, 56900
by Robert Lloyd Praeger

The Vicar of Morwenstow, by Sabine Baring-Gould 56899
[Subtitle: Being a Life of Robert Stephen Hawker, M.A.]

Raamatun tutkisteluja IV, mennessä Charles T. Russell 56898
[Subtitle: Harmagedonin taistelu]
[Language: Finnish]

Raamatun tutkisteluja III, mennessä Charles T. Russell 56897
[Subtitle: Tulkoon valtakuntasi]
[Language: Finnish]

Tom Thatcher's Fortune, by Horatio Alger, Jr. 56896

A Yankee Flier in the Far East, by Al Avery 56895
and George Rutherford Montgomery
[Illustrator: Paul Laune]

Nancy Brandon's Mystery, by Lillian Garis 56894

Nervous Ills, by Boris Sidis 56893
[Subtitle: Their Cause and Cure]

Pensées sans langage, par Francis Picabia 56892
[Language: French]

Helon's Pilgrimage to Jerusalem, Volume 2 of 2, by Frederick Strauss 56891
[Subtitle: A picture of Judaism, in the century
which preceded the advent of our Savior]

Fra Tommaso Campanella, Vol. 1, di Luigi Amabile 56890
[Subtitle: la sua congiura, i suoi processi e la sua pazzia]
[Language: Italian]

The Blue Star, by Fletcher Pratt 56889

Importanza e risultati degli incrociamenti in avicoltura, 56888
di Teodoro Pascal
[Language: Italian]

The Junior Classics, Volume 3: Tales from Greece and Rome, by Various 56887


~ ~ ~ ~ Posting Dates for the below eBooks: 1 Mar 2018 to 31 Mar 2018 ~ ~ ~ ~

TITLE and AUTHOR ETEXT NO.

The American Missionary, Volume 41, No. 1, January, 1887, by Various 56886

Morganin miljoonat, mennessä Sven Elvestad 56885
[Author a.k.a. Stein Riverton]
[Subtitle: Salapoliisiromaani]
[Language: Finnish]

"Trip to the Sunny South" in March, 1885, by L. S. D 56884

Balaam and His Master, by Joel Chandler Harris 56883
[Subtitle: and Other Sketches and Stories]

Susien saaliina, mennessä Jack London 56882
[Language: Finnish]

Forged Egyptian Antiquities, by T. G. Wakeling 56881

The Secret Doctrine, Vol. 3 of 4, by Helena Petrovna Blavatsky 56880
[Subtitle: Third Edition]

No Posting 56879

The author name usually starts after "by". When there is no "by" in the line, the author name starts after a comma (","). However, the "," can be part of the title if the line has a "by".
So I parse for "by" first, then for the comma.

Here is what I tried:
def search_by_author():

    fhand = open('GUTINDEX.ALL')
    print("Search by Author:")

    for line in fhand:
        if not line.startswith(" [") and not line.startswith("TITLE"):
            if not line.startswith("~"):
                words = line.rstrip()
                words = line.lstrip()
                words = words[:-6] 
                if ", by" in words:

                    words = words[words.find(', by'):]
                    words = words[5:]
                    print (words)

                else:
                    words = words[words.find(', '):]
                    words = words[5:]
                    if "," in words:
                        words = words[words.find(', '):]
                        if words.startswith(','):
                            words =words[words.find(','):]
                            print (words)
                        else:
                            print (words)
                    else:
                        print (words)
                if " by" in words:
                    words = words[words.find('by')]
                    print(words)

search_by_author()
However, it can't seem to find the author name for records like:
Quote:Aspects of plant life; with special reference to the British flora, 56900
by Robert Lloyd Praeger
#2
if ", by" in words or words.startswith("by"):
will find that one exception, but it gives you the author only, so you will have to back up one line if you want the title. Note that you lose your rstrip() because the lstrip() on the next line is applied to line again and overwrites the variable returned by rstrip(). Just use strip() instead of the two statements.
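A minimal sketch of that suggestion, assuming the author is either on a continuation line starting with "by" or follows ", by" on the title line. The function name `author_from_line` is hypothetical, and the sketch deliberately ignores the non-English markers ("mennessä", "par", "di") and the comma fallback:

```python
import re

def author_from_line(line):
    """Return the author text found on one index line, or None."""
    words = line.strip()  # one strip() replaces the rstrip()/lstrip() pair
    if words.startswith("by "):
        # Continuation line such as " by Robert Lloyd Praeger"
        return words[3:]
    if ", by " in words:
        # Title line such as "The Blue Star, by Fletcher Pratt   56889"
        author = words.split(", by ", 1)[1]
        # Drop the trailing ETEXT number and the padding before it
        return re.sub(r"\s+\d+$", "", author)
    return None
```

Removing the number with a pattern rather than `words[:-6]` avoids cutting six characters off lines that carry no number at all.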
#3
Well, I solved it with this:

import re

def search_by_author():

    fhand = open('GUTINDEX.ALL')
    book_info = ''
    for line in fhand:
        line = line.rstrip()

        if (line.startswith('TITLE') or line.startswith('~')):
            continue
        
        if (len(line) == 0):
            # remove info in square bracket from book_info
            book_info = re.sub(r'\[.*$', '', book_info)

            if ('by ' in book_info):
                tokens = book_info.split('by ')
            else:
                tokens = book_info.split(',')

            if (len(tokens) > 1):
                authors = tokens[-1].strip()
                
                print(authors)  # line is empty at this point, so print the author text
                book_info = ''

        else:
            # remove ETEXT NO. from line
            line = re.sub(r'\d+$', '', line)
            book_info +=  ' ' + line.rstrip()
            

search_by_author()
But now I want to search by author: take an input from the user, and if that input matches an author name, show the whole record.

Like:
Quote:Enter Author name:
Praeger
Output:
Aspects of plant life; with special reference to the British flora, 56900
by Robert Lloyd Praeger
Here's what I tried:

import re

def search_by_author():

    fhand = open('GUTINDEX.ALL')
    book_info = ''
    for line in fhand:
        line = line.rstrip()

        if (line.startswith('TITLE') or line.startswith('~')):
            continue
        
        if (len(line) == 0):
            # remove info in square bracket from book_info
            book_info = re.sub(r'\[.*$', '', book_info)

            if ('by ' in book_info):
                tokens = book_info.split('by ')
            else:
                tokens = book_info.split(',')

            if (len(tokens) > 1):
                authors = tokens[-1].strip()
                x = input("Author Name:")
                if x in authors():
                    print (line)
                book_info = ''

        else:
            # remove ETEXT NO. from line
            line = re.sub(r'\d+$', '', line)
            book_info +=  ' ' + line.rstrip()
            

search_by_author()
The change:
if (len(tokens) > 1):
                authors = tokens[-1].strip()
                x = input("Author Name:")
                if x in authors():
                    print (line)
                book_info = ''
But it doesn't work...
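Three things in the changed block are likely to cause trouble: `authors` is a string, so `authors()` raises a TypeError; `input()` sits inside the loop, so the prompt repeats for every record; and `line` is empty at that point, so there is nothing left to print. Below is a minimal sketch of one way around this, assuming records are separated by blank lines. The names `iter_records`, `author_of` and `search_author` are hypothetical, and the "by"/comma rule is the same best-effort rule as above (a title containing the word "by" can still fool it):

```python
import re

def iter_records(lines):
    """Yield each record as a list of lines; blank lines separate records."""
    record = []
    for line in lines:
        line = line.rstrip()
        if line.startswith("TITLE") or line.startswith("~"):
            continue  # skip column headers and posting-date banners
        if line:
            record.append(line)
        elif record:
            yield record
            record = []
    if record:  # last record when the input does not end with a blank line
        yield record

def author_of(record):
    """Best-effort author text for one record."""
    # Strip the trailing ETEXT number from each line, then join the lines
    text = " ".join(re.sub(r"\s*\d+$", "", ln).strip() for ln in record)
    text = re.sub(r"\[.*$", "", text)  # drop bracketed notes such as [Subtitle: ...]
    if re.search(r"\bby ", text):
        return re.split(r"\bby ", text)[-1].strip()
    if "," in text:
        return text.split(",")[-1].strip()
    return ""

def search_author(lines, name):
    """Return whole records whose author text contains name (case-insensitive)."""
    name = name.lower()
    return ["\n".join(rec) for rec in iter_records(lines)
            if name in author_of(rec).lower()]

# Usage sketch: ask once, before the loop, then print every matching record.
# with open('GUTINDEX.ALL') as fhand:
#     for rec in search_author(fhand, input("Author Name:")):
#         print(rec)
```

Keeping the original lines of each record, instead of only the cleaned-up `book_info` string, is what makes it possible to print the whole record afterwards.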
#4
This isn't 100% there, but it's close enough that I think you can finish it:
class StringParse:
    def __init__(self, filename):
        self.filename = filename
        self.parse_books()

    def parse_books(self):
        skip = 0
        titlelineno = 0
        with open(self.filename, 'r') as f:
            for line in f:
                if skip < 2:
                    skip += 1
                    continue
                line = line.strip()
                if len(line) == 0:
                    continue
                print(f'line: {line}')
                if line.startswith('['):
                    titlelineno = 0
                    line = line.split(':')
                    for item in line:
                        print('{}'.format(item), end='')
                    print()
                else:
                    if titlelineno == 0:
                        print(line)
                        idx = line.rindex(' ')
                        if idx:
                            etext_no = line[idx+1:]
                            line = line.split(',')
                            title = line[0]
                            print('title: {}'.format(title))
                            print('Etext No: {}'.format(etext_no))
                            if 'by' in line:
                                idx = line.index('by')
                                author = line[idx + 1]
                        titlelineno += 1
                    else:
                        if 'by' in line:
                            idx = line.index('by')
                            author = line[idx+3:]
                            print('author: {}'.format(author))


if __name__ == '__main__':
    StringParse('GUTINDEX.ALL')
results (partial):
Output:
line: Aspects of plant life; with special reference to the British flora, 56900
Aspects of plant life; with special reference to the British flora, 56900
title: Aspects of plant life; with special reference to the British flora
Etext No: 56900
line: by Robert Lloyd Praeger
author: Robert Lloyd Praeger
line: The Vicar of Morwenstow, by Sabine Baring-Gould 56899
author: Sabine Baring-Gould 56899
line: [Subtitle: Being a Life of Robert Stephen Hawker, M.A.]
[Subtitle Being a Life of Robert Stephen Hawker, M.A.]
line: Raamatun tutkisteluja IV, mennessä Charles T. Russell 56898
Raamatun tutkisteluja IV, mennessä Charles T. Russell 56898
title: Raamatun tutkisteluja IV
Etext No: 56898
line: [Subtitle: Harmagedonin taistelu]
[Subtitle Harmagedonin taistelu]
line: [Language: Finnish]
[Language Finnish]
line: Raamatun tutkisteluja III, mennessä Charles T. Russell 56897
Raamatun tutkisteluja III, mennessä Charles T. Russell 56897
title: Raamatun tutkisteluja III
Etext No: 56897
line: [Subtitle: Tulkoon valtakuntasi]
[Subtitle Tulkoon valtakuntasi]
line: [Language: Finnish]
[Language Finnish]
line: Tom Thatcher's Fortune, by Horatio Alger, Jr. 56896
Tom Thatcher's Fortune, by Horatio Alger, Jr. 56896
title: Tom Thatcher's Fortune
Etext No: 56896
line: A Yankee Flier in the Far East, by Al Avery 56895
author: Al Avery 56895
line: and George Rutherford Montgomery
line: [Illustrator: Paul Laune]
[Illustrator Paul Laune]
line: Nancy Brandon's Mystery, by Lillian Garis 56894
Nancy Brandon's Mystery, by Lillian Garis 56894
title: Nancy Brandon's Mystery
Etext No: 56894
line: Nervous Ills, by Boris Sidis 56893
author: Boris Sidis 56893
line: [Subtitle: Their Cause and Cure]
[Subtitle Their Cause and Cure]
line: Pensées sans langage, par Francis Picabia 56892
Pensées sans langage, par Francis Picabia 56892
title: Pensées sans langage
Etext No: 56892
line: [Language: French]
[Language French]
line: Helon's Pilgrimage to Jerusalem, Volume 2 of 2, by Frederick Strauss 56891
Helon's Pilgrimage to Jerusalem, Volume 2 of 2, by Frederick Strauss 56891
title: Helon's Pilgrimage to Jerusalem
Etext No: 56891
line: [Subtitle: A picture of Judaism, in the century
[Subtitle A picture of Judaism, in the century
line: which preceded the advent of our Savior]
which preceded the advent of our Savior]
title: which preceded the advent of our Savior]
#5
Okay, I tried to search for a string in the file and show the whole record for it.

This is what I did:
import re

def search():

    fhand = open('GUTINDEX.ALL')
    x = input("Search:")
    for line in fhand:
        words = line.strip()
        words= re.sub(r"\n"," ",words)
        
        if x in words:
            print (words)
            

search()
However, when I search for the string 'flora', it gives me:
Aspects of plant life; with special reference to the British flora,      56900
But it should give
Aspects of plant life; with special reference to the British flora,      56900
 by Robert Lloyd Praeger

Okay, I made a mistake when posting the file contents in the opening post: the whitespace was lost. The file actually looks like this:
TITLE and AUTHOR                                                     ETEXT NO.

Aspects of plant life; with special reference to the British flora,      56900
 by Robert Lloyd Praeger

The Vicar of Morwenstow, by Sabine Baring-Gould                          56899
 [Subtitle: Being a Life of Robert Stephen Hawker, M.A.]

Raamatun tutkisteluja IV, mennessä Charles T. Russell                    56898
 [Subtitle: Harmagedonin taistelu]
 [Language: Finnish]

Raamatun tutkisteluja III, mennessä Charles T. Russell                   56897
 [Subtitle: Tulkoon valtakuntasi]
 [Language: Finnish]

Tom Thatcher's Fortune, by Horatio Alger, Jr.                            56896

A Yankee Flier in the Far East, by Al Avery                              56895
 and George Rutherford Montgomery
 [Illustrator: Paul Laune]

Nancy Brandon's Mystery, by Lillian Garis                                56894

Nervous Ills, by Boris Sidis                                             56893
 [Subtitle: Their Cause and Cure]

@Larz60+ @woooee
#6
Quote:However when I search for the string 'flora' it gives me
Aspects of plant life; with special reference to the British flora, 56900
But it should give
Output:
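Aspects of plant life; with special reference to the British flora, 56900
by Robert Lloyd Praeger
You have to match the preceding line as well. In the examples that follow, books is assumed to hold the whole file contents as a single string.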
>>> import re
>>> 
>>> search_word = 'Praeger'
>>> r = re.search(fr'.*\n.*(\b{search_word}\b)', books)
>>> print(r.group())
Aspects of plant life; with special reference to the British flora,      56900
 by Robert Lloyd Praeger

>>> print(r.group(1))
Praeger
This should work for most searches. If search_word is on the first line of a record, the match also picks up the blank line before it, which lstrip() cleans up.
>>> import re
>>> 
>>> search_word = 'Frederick Strauss'
>>> r = re.search(fr'.*\n.*(\b{search_word}\b)', books)
>>> print(r.group().lstrip())
Helon's Pilgrimage to Jerusalem, Volume 2 of 2, by Frederick Strauss
>>> print(r.group(1))
Frederick Strauss
>>> 
>>> search_word = 'Wakeling'
>>> r = re.search(fr'.*\n.*(\b{search_word}\b)', books)
>>> print(r.group().lstrip())
Forged Egyptian Antiquities, by T. G. Wakeling
>>> 
>>> search_word = 'Praeger'
>>> r = re.search(fr'.*\n.*(\b{search_word}\b)', books)
>>> print(r.group().lstrip())
Aspects of plant life; with special reference to the British flora,      56900
 by Robert Lloyd Praeger
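The same pattern can be wrapped in a small helper. This is only a sketch: the name `find_record` is hypothetical, and `books` is again assumed to be the file contents read into one string. It adds `re.escape()` so that characters such as "." in a name are matched literally, and it returns None instead of failing on `r.group()` when nothing matches:

```python
import re

def find_record(books, search_word):
    """Return the line containing search_word plus the line before it, or None."""
    # re.escape() keeps characters like "." or "-" in a name from acting as regex syntax
    pattern = rf".*\n.*(\b{re.escape(search_word)}\b)"
    match = re.search(pattern, books)
    if match is None:
        return None
    # lstrip() removes the blank line picked up when the hit is on a record's first line
    return match.group().lstrip()
```

Because of the trailing `\b`, a search word that ends in punctuation (for example "Jr.") may not match at the end of a line, and only the first hit in the file is returned.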

