Python Forum
"EOL While Scanning String Literal"
You need to learn how to do things like this yourself.

Examine the method parse_and_save.
It is called from __init__ with:
self.parse_and_save(getpdfs=True)
after all of the html files (based on the contents of apis.txt) have been downloaded to the xx_completions_xx directory.

Here's a breakdown of the code:
    def parse_and_save(self, getpdfs=False):
        filelist = [file for file in self.completionspath.iterdir() if file.is_file()]
        for file in filelist:
            with file.open('r') as f:
                soup = BeautifulSoup(f.read(), 'lxml')
            if getpdfs:
                links = soup.find_all('a')
                for link in links:
                    url = link['href']
                    if 'www' in url:
                        continue
                    print('downloading pdf at: {}'.format(url))
                    p = url.index('=')
                    filename = self.log_pdfpath / 'comp{}.pdf'.format(url[p+1:])
                    response = requests.get(url, stream=True, allow_redirects=False)
                    if response.status_code == 200:
                        with filename.open('wb') as f:
                            f.write(response.content)
            sfname = self.textpath / 'summary_{}.txt'.format((file.name.split('_'))[1].split('.')[0][3:10])
            tds = soup.find_all('td')
            with sfname.open('w') as f:
                for td in tds:
                    if td.text:
                        if any(field in td.text for field in self.fields):
                            f.write('{}\n'.format(td.text))

        with self.infile.open('w') as f:
            for item in self.apis:
                f.write(f'{item}\n')
  • line 1: called from __init__ with getpdfs=True
  • line 2: a list comprehension that iterates over a directory using pathlib and builds a list of all the files
    (not sub-directories) in the target directory (self.completionspath), which is set to the 'xx_completions_xx'
    directory in __init__
  • line 3: iterates over each file in filelist
  • lines 4 and 5: read in each downloaded html file and run it through BeautifulSoup using the lxml parser, creating 'soup'
  • lines 6 - 18: execute only if getpdfs is True, which is the case for this example
  • line 7: finds all 'a' html tags (which are links to web pages) in the 'soup' just created and saves them in the list 'links'; each entry is an html segment
  • line 8: extracts the 'href' value (the actual URL) from each link and saves it in 'url'
  • lines 10 and 11: check url for 'www', which is true for the link 'http://www.state.wy.us'; that link is not needed to find the pdf file, so the loop continues with the next link. The next link, 'http://wugiwus.state.wy.us/whatupcomps.cfm?nautonum=17696', contains the pdf file.
  • The unfortunate thing here is that the url doesn't contain the actual pdf name; the file is accessed by an id number, which is what I used to assign the name (lines 13 and 14)
  • The good thing is that the response has the information needed to name the pdf file appropriately in response.headers['Content-Disposition'] (code to take advantage of this will be added later). This entry looks like: response.headers['Content-Disposition'] = 'attachment; filename=906576001.pdf'
  • line 15: fetches the pdf file
  • line 16: checks that the status_code equals 200, which means success
  • lines 17 and 18: write the pdf file
  • lines 19 - 25: write the summary file
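To make the id-based naming concrete, here is a small sketch of the slice used in lines 13 and 14, run against the sample link from the breakdown above:

```python
# Sample link from the breakdown above; the pdf is reached by id, not by name
url = 'http://wugiwus.state.wy.us/whatupcomps.cfm?nautonum=17696'

p = url.index('=')                       # position of '=' in the query string
name = 'comp{}.pdf'.format(url[p + 1:])  # everything after '=' is the id
print(name)  # comp17696.pdf
```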

I have adjusted my original code to extract and use the stored file names. Please note that there are a few files that didn't provide header information. These files are noted in the display (while downloading pdf files), along with their corresponding header info; they will still be saved, with names that always begin with 'comp'.
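The header-based naming in the new code boils down to this string slicing; the header value here is the sample shown in the breakdown above:

```python
# Sample Content-Disposition value from the breakdown above
header_info = 'attachment; filename=906576001.pdf'

idx = header_info.index('filename')  # raises ValueError if 'filename' is absent
pdf_name = header_info[idx + 9:]     # skip past 'filename=' (9 characters)
print(pdf_name)  # 906576001.pdf
```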

PLEASE NOTE: Back up your code before replacing it with the new version, as I have not included any changes you might have made.

New code:
import requests
from bs4 import BeautifulSoup
from pathlib import Path
import sys

class GetCompletions:
    def __init__(self, infile):
        self.homepath = Path('.')
        self.completionspath = self.homepath / 'xx_completions_xx'
        self.completionspath.mkdir(exist_ok=True)
        self.log_pdfpath = self.homepath / 'logpdfs'
        self.log_pdfpath.mkdir(exist_ok=True)
        self.textpath = self.homepath / 'text'
        self.textpath.mkdir(exist_ok=True)

        self.infile = self.textpath / infile
        self.apis = []

        with self.infile.open() as f:
            for line in f:
                self.apis.append(line.strip())

        self.fields = ['Spud Date', 'Total Depth', 'IP Oil Bbls', 'Reservoir Class', 'Completion Date',
                       'Plug Back', 'IP Gas Mcf', 'TD Formation', 'Formation', 'IP Water Bbls']
        # self.get_all_pages()
        self.parse_and_save(getpdfs=True)

    def get_url(self):
        for entry in self.apis:
            yield (entry, "http://wogcc.state.wy.us/wyocomp.cfm?nAPI={}".format(entry[3:10]))

    def get_all_pages(self):
        for entry, url in self.get_url():
            print('Fetching main page for entry: {}'.format(entry))
            response = requests.get(url)
            if response.status_code == 200:
                filename = self.completionspath / 'api_{}.html'.format(entry)
                with filename.open('w') as f:
                    f.write(response.text)
            else:
                print('error downloading {}'.format(entry))

    def parse_and_save(self, getpdfs=False):
        filelist = [file for file in self.completionspath.iterdir() if file.is_file()]
        for file in filelist:
            with file.open('r') as f:
                soup = BeautifulSoup(f.read(), 'lxml')
            if getpdfs:
                links = soup.find_all('a')
                for link in links:
                    url = link['href']
                    if 'www' in url:
                        continue
                    print('downloading pdf at: {}'.format(url))
                    p = url.index('=')
                    response = requests.get(url, stream=True, allow_redirects=False)
                    if response.status_code == 200:
                        try:
                            # prefer the name the server supplies, e.g.
                            # 'attachment; filename=906576001.pdf'
                            header_info = response.headers['Content-Disposition']
                            idx = header_info.index('filename')
                            filename = self.log_pdfpath / header_info[idx + 9:]
                        except ValueError:
                            # header present, but no 'filename' field in it
                            filename = self.log_pdfpath / 'comp{}.pdf'.format(url[p + 1:])
                            print("couldn't locate filename for {} will use: {}".format(file, filename))
                        except KeyError:
                            # no Content-Disposition header at all
                            filename = self.log_pdfpath / 'comp{}.pdf'.format(url[p + 1:])
                            print('got KeyError on {}, response.headers = {}'.format(file, response.headers))
                            print('will use name: {}'.format(filename))
                        with filename.open('wb') as f:
                            f.write(response.content)
            sfname = self.textpath / 'summary_{}.txt'.format((file.name.split('_'))[1].split('.')[0][3:10])
            tds = soup.find_all('td')
            with sfname.open('w') as f:
                for td in tds:
                    if td.text:
                        if any(field in td.text for field in self.fields):
                            f.write('{}\n'.format(td.text))

if __name__ == '__main__':
    GetCompletions('apis.txt')
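As an aside, the sfname slicing in parse_and_save recovers the same entry[3:10] slice that get_url uses, by pulling it back out of the saved html file's name. A quick sketch with a made-up (hypothetical) file name of the form written by get_all_pages:

```python
# Hypothetical file name of the form written by get_all_pages: 'api_<entry>.html'
fname = 'api_49009123450000.html'

entry = fname.split('_')[1].split('.')[0]    # strip the 'api_' prefix and '.html' suffix
sfname = 'summary_{}.txt'.format(entry[3:10])  # same entry[3:10] slice as get_url
print(sfname)  # summary_0912345.txt
```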


Messages In This Thread
"EOL While Scanning String Literal" - by tjnichols - Apr-09-2018, 09:49 PM
RE: "EOL While Scanning String Literal" - by wavic - Apr-09-2018, 10:29 PM
So simple! - by tjnichols - Apr-09-2018, 11:14 PM
RE: "EOL While Scanning String Literal" - by nilamo - Apr-12-2018, 03:57 PM
RE: "EOL While Scanning String Literal" - by Larz60+ - May-03-2018, 10:21 PM