"EOL While Scanning String Literal"

**Larz60+** · May-03-2018, 09:23 PM

I am preparing a step by step analysis. This will take a while.
I'll post as soon as done.
After this, you should be able to change pdf names.

**Larz60+** · May-03-2018, 10:21 PM

You need to learn how to do things like this yourself.

Examine the method parse_and_save.
This method is called from __init__ with the command:

self.parse_and_save(getpdfs=True)

after all of the html files (based on the contents of apis.txt) have been downloaded to the xx_completions_xx directory.

Here's a breakdown of the code:

    def parse_and_save(self, getpdfs=False):
        filelist = [file for file in self.completionspath.iterdir() if file.is_file()]
        for file in filelist:
            with file.open('r') as f:
                soup = BeautifulSoup(f.read(), 'lxml')
            if getpdfs:
                links = soup.find_all('a')
                for link in links:
                    url = link['href']
                    if 'www' in url:
                        continue
                    print('downloading pdf at: {}'.format(url))
                    p = url.index('=')
                    filename = self.log_pdfpath / 'comp{}.pdf'.format(url[p+1:])
                    response = requests.get(url, stream=True, allow_redirects=False)
                    if response.status_code == 200:
                        with filename.open('wb') as f:
                            f.write(response.content)
            sfname = self.textpath / 'summary_{}.txt'.format((file.name.split('_'))[1].split('.')[0][3:10])
            tds = soup.find_all('td')
            with sfname.open('w') as f:
                for td in tds:
                    if td.text:
                        if any(field in td.text for field in self.fields):
                            f.write('{}\n'.format(td.text))

        with self.infile.open('w') as f:
            for item in self.apis:
                f.write(f'{item}\n')

line 1: called from __init__ with getpdfs=True.
line 2: a list comprehension that iterates over a directory using pathlib and builds a list of all the files (not sub-directories) in the target directory, self.completionspath, which __init__ sets to the 'xx_completions_xx' directory.
line 3: iterates over each file in filelist.
lines 4 and 5: read each downloaded html file and run it through BeautifulSoup using the lxml parser, creating 'soup'.
lines 6 - 18: execute only if getpdfs is True, which is the case for this example.
line 7: finds all 'a' tags (html links) in the soup and saves them in the list 'links'; each entry is an html segment.
lines 8 and 9: for each link, extract the 'href' value (the actual URL) and save it in the variable 'url'.
lines 10 and 11: check url for 'www', which will be true for the link 'http://www.state.wy.us'. That link is not needed to find the pdf file, so the loop continues with the next link. The next link is 'http://wugiwus.state.wy.us/whatupcomps.cfm?nautonum=17696'; this link contains the pdf file.
lines 12 - 14: print the url and build the pdf file name. The unfortunate thing here is that the url doesn't contain the actual pdf name; the file is accessed by an id number, which is what I used to assign the name (lines 13 and 14).
The good thing is that response has the information needed to name the pdf file appropriately, in response.headers['Content-Disposition'] (code to take advantage of this will be added later). This entry looks like: response.headers['Content-Disposition'] = 'attachment; filename=906576001.pdf'
line 15: fetches the pdf file.
line 16: checks that the status_code is equal to 200, which means success.
lines 17 and 18: write the pdf file.
lines 19 - 25: write the summary file.
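
To make the two naming steps above concrete (lines 13 and 14, and line 19), here is a small sketch that isolates just the string handling. The helper names fallback_pdf_name and summary_name are made up for illustration and are not part of the class, and the saved page name 'api_4900512345.html' is a hypothetical example.

```python
def fallback_pdf_name(url):
    """Build the 'comp<id>.pdf' name from the text after the first '=' in the url."""
    p = url.index('=')
    return 'comp{}.pdf'.format(url[p + 1:])


def summary_name(html_name):
    """Derive 'summary_<digits>.txt' from a saved page name like 'api_<number>.html'."""
    # 'api_4900512345.html' -> '4900512345' -> characters 3 through 9
    number = html_name.split('_')[1].split('.')[0]
    return 'summary_{}.txt'.format(number[3:10])


# url taken from the walkthrough above
print(fallback_pdf_name('http://wugiwus.state.wy.us/whatupcomps.cfm?nautonum=17696'))  # prints comp17696.pdf
# hypothetical saved page name
print(summary_name('api_4900512345.html'))  # prints summary_0512345.txt
```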

I have adjusted my original code to extract and use the stored file names. Please note that a few files didn't provide header information. These files are noted in the display (while downloading pdf files), along with their corresponding info, and they will be saved under names that always begin with 'comp...'

PLEASE NOTE: Backup your code before replacing with new as I have not included any changes you might have made.

New code:

import requests
from bs4 import BeautifulSoup
from pathlib import Path
import sys

class GetCompletions:
    def __init__(self, infile):
        self.homepath = Path('.')
        self.completionspath = self.homepath / 'xx_completions_xx'
        self.completionspath.mkdir(exist_ok=True)
        self.log_pdfpath = self.homepath / 'logpdfs'
        self.log_pdfpath.mkdir(exist_ok=True)
        self.textpath = self.homepath / 'text'
        self.textpath.mkdir(exist_ok=True)

        self.infile = self.textpath / infile
        self.apis = []

        with self.infile.open() as f:
            for line in f:
                self.apis.append(line.strip())

        self.fields = ['Spud Date', 'Total Depth', 'IP Oil Bbls', 'Reservoir Class', 'Completion Date',
                       'Plug Back', 'IP Gas Mcf', 'TD Formation', 'Formation', 'IP Water Bbls']
        # self.get_all_pages()
        self.parse_and_save(getpdfs=True)

    def get_url(self):
        for entry in self.apis:
            yield (entry, "http://wogcc.state.wy.us/wyocomp.cfm?nAPI={}".format(entry[3:10]))

    def get_all_pages(self):
        for entry, url in self.get_url():
            print('Fetching main page for entry: {}'.format(entry))
            response = requests.get(url)
            if response.status_code == 200:
                filename = self.completionspath / 'api_{}.html'.format(entry)
                with filename.open('w') as f:
                    f.write(response.text)
            else:
                print('error downloading {}'.format(entry))

    def parse_and_save(self, getpdfs=False):
        filelist = [file for file in self.completionspath.iterdir() if file.is_file()]
        for file in filelist:
            with file.open('r') as f:
                soup = BeautifulSoup(f.read(), 'lxml')
            if getpdfs:
                links = soup.find_all('a')
                for link in links:
                    url = link['href']
                    if 'www' in url:
                        continue
                    print('downloading pdf at: {}'.format(url))
                    p = url.index('=')
                    response = requests.get(url, stream=True, allow_redirects=False)
                    if response.status_code == 200:
                        try:
                            header_info = response.headers['Content-Disposition']
                            idx = header_info.index('filename')
                            filename = self.log_pdfpath / header_info[idx+9:]
                        except ValueError:
                            filename = self.log_pdfpath / 'comp{}.pdf'.format(url[p + 1:])
                            print("couldn't locate filename for {} will use: {}".format(file, filename))
                        except KeyError:
                            filename = self.log_pdfpath / 'comp{}.pdf'.format(url[p + 1:])
                            print('got KeyError on {}, response.headers = {}'.format(file, response.headers))
                            print('will use name: {}'.format(filename))
                            print(response.headers)
                        with filename.open('wb') as f:
                            f.write(response.content)
            sfname = self.textpath / 'summary_{}.txt'.format((file.name.split('_'))[1].split('.')[0][3:10])
            tds = soup.find_all('td')
            with sfname.open('w') as f:
                for td in tds:
                    if td.text:
                        if any(field in td.text for field in self.fields):
                            f.write('{}\n'.format(td.text))

if __name__ == '__main__':
    GetCompletions('apis.txt')
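
The file name extraction inside parse_and_save can also be tried on its own, without downloading anything. The following is only a sketch: filename_from_header is a hypothetical helper, not part of the class above, and stripping quotes and cutting at ';' are additions that the posted code does not do. It does not handle the encoded 'filename*=' form.

```python
def filename_from_header(headers, default):
    """Return the name after 'filename=' in Content-Disposition, or default."""
    try:
        header_info = headers['Content-Disposition']
        idx = header_info.index('filename=')
    except (KeyError, ValueError):
        # header missing, or present without a filename entry
        return default
    name = header_info[idx + len('filename='):]
    # assumption: drop any trailing parameters and surrounding quotes
    name = name.split(';')[0].strip().strip('"')
    return name or default


headers = {'Content-Disposition': 'attachment; filename=906576001.pdf'}
print(filename_from_header(headers, 'comp17696.pdf'))  # prints 906576001.pdf
print(filename_from_header({}, 'comp17696.pdf'))       # prints comp17696.pdf
```

A plain dict stands in for response.headers here; with requests the same lookup works on the real headers object.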

tjnichols · (This post was last modified: May-04-2018, 01:39 PM by tjnichols.)

I have been working on learning this and I have totally broken it. My hope is you can show me how it's done so I can learn from it. As I've said, I am working through David Beazley's teachings but they don't cover my exact needs. This is where I want to apply what I'm learning from you. I truly appreciate all you've done and I have not taken ownership of it here at my office. Everyone knows it comes from you and the ways in which I intend to use it.

Here's where I'm stuck. It always pulls the original APIs. I understand it's using an apis.txt from the text folder, and I have searched and changed all of these files on my hard drive (out of frustration) to reflect the new ones I want to pull. I understand this will pull from the text file in which I put the module and I have tried saving the apis.txt file there as well. It still pulls the original ones we started with.

I know it's got to be something simple but I'm not seeing it. Thoughts?

Again - I really appreciate your help.

Ok - I found the apis.txt it's pulling from. It has the new API numbers but the module is pulling the old ones. I hope this helps.

**Larz60+** · May-04-2018, 06:41 PM

The only file you change is apis.txt, and it must remain in the text directory.
Each time you need a new set of files, change this file to contain just the new ones.
A better solution is to use a command line argument to accept the file name; then you can load whatever file you want.
I can't do that now because I have to go out and do some business, but I will later today.
Perhaps one of the other moderators will pick this up.
If not, I'll be back in a few hours.
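
In the meantime, the command line idea could look roughly like the sketch below. It is not the finished change: pick_infile is a hypothetical helper, and the call to GetCompletions is left commented out because it assumes the class from the earlier post is defined in the same module.

```python
import sys


def pick_infile(argv, default='apis.txt'):
    """Return the file name given as the first command line argument, else the default."""
    return argv[1] if len(argv) > 1 else default


if __name__ == '__main__':
    infile = pick_infile(sys.argv)
    print('would load API numbers from text/{}'.format(infile))
    # GetCompletions(infile)  # assumed: class from the earlier post
```

Run with no argument it falls back to apis.txt; run with a file name after the script name it uses that file instead, which must still live in the text directory.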

tjnichols · May-04-2018, 06:41 PM

Ok - I figured it out. It's getting the API numbers from the files in the xx_completions_xx folder. How do I force it to look at the apis.txt file?

**Larz60+** · (This post was last modified: May-04-2018, 11:34 PM by Larz60+.)

It doesn't need to be forced; this is the way the code is developed.
This is what's called data driven software: whatever is in the apis.txt file will be downloaded, period. This is the only way it is set up to work!!!

Read the commented code. I explain in detail! The file name is extracted from the header stored in the response from the request!
It is extracted and used. If it can't find the proper header information, it will use a 'comp...' name built from the id number in the url, but the origin of
the download is still the apis.txt file.

Try the following:

Find the apis number (or whatever it's called) for only one that you haven't downloaded yet.
Back up the old apis.txt file by renaming it to apis_bak1.txt (or some such name).
Open a text editor like Notepad or vi.
Type in the apis number and press enter.
Save in the text directory as apis.txt.
Run the program.
You should now find all the files associated with this apis number in their respective directories.
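
If you would rather do the backup and replace steps from Python than from a text editor, a sketch like this would do the same thing. replace_api_list is a hypothetical helper, not part of the program, and the API number in the usage comment is made up.

```python
from pathlib import Path


def replace_api_list(textdir, new_apis, backup_name='apis_bak1.txt'):
    """Rename text/apis.txt to a backup name, then write a fresh apis.txt."""
    textdir = Path(textdir)
    current = textdir / 'apis.txt'
    backup = textdir / backup_name
    if current.exists():
        if backup.exists():
            # never overwrite an earlier backup silently
            raise FileExistsError('{} already exists, pick another backup name'.format(backup))
        current.rename(backup)
    with current.open('w') as f:
        for api in new_apis:
            f.write('{}\n'.format(api))
    return current


# usage (hypothetical API number):
# replace_api_list('text', ['4900512345'])
```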

That's all there is to it!

tjnichols · May-05-2018, 01:35 PM

I've done this and it still pulls the original API list we started with. I tried moving it to another file and running it from there - it does nothing.

When I either change the name of xx_completions_xx or delete the files within, it does nothing (in the original folder / location). I just have no idea where to look or what to do if I find it.

**Larz60+** · (This post was last modified: May-05-2018, 03:44 PM by Larz60+.)

Did you remove those files from apis.txt? If not, it will download whatever is in that file. That is the only way this could happen.
Are you using the code I provided in post 32?

apis.txt should only have the one! Simple as that.

This is why I instructed to save the old one as apis_bak1.txt

tjnichols · May-06-2018, 01:50 PM

I have the apis.txt file and can provide it. I can setup a GoTo Meeting if that will help.

**Larz60+** · May-06-2018, 02:20 PM

No GoTo Meeting; you are doing something wrong.
Check the modification dates on the files. I expect that you are looking at previously downloaded files, which are never removed.
Are you using the code I provided in post 32?
apis.txt should only have the one entry for the test!
Make sure there is only one apis.txt file, and display its contents.
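
These checks can be scripted. The sketch below is only a diagnostic aid under a few assumptions: it is run from the directory that holds the program, the helper names find_api_files and stale_pages are made up, and saved pages are named 'api_<number>.html' as in get_all_pages. Note that parse_and_save, as posted, loops over every file in xx_completions_xx, so pages left over from an earlier run are processed again.

```python
from datetime import datetime
from pathlib import Path


def find_api_files(root='.'):
    """Return every apis.txt found under root, sorted by path."""
    return sorted(Path(root).rglob('apis.txt'))


def stale_pages(apis, completions_dir):
    """Return saved api_*.html pages whose number is not in the current API list."""
    completions_dir = Path(completions_dir)
    if not completions_dir.is_dir():
        return []
    wanted = set(apis)
    pages = sorted(completions_dir.glob('api_*.html'))
    return [page for page in pages if page.stem[len('api_'):] not in wanted]


if __name__ == '__main__':
    # show each apis.txt with its modification date and contents
    for path in find_api_files():
        modified = datetime.fromtimestamp(path.stat().st_mtime)
        print('{} (modified {})'.format(path, modified))
        print(path.read_text())
    apis_path = Path('text') / 'apis.txt'
    if apis_path.exists():
        lines = apis_path.read_text().splitlines()
        apis = [line.strip() for line in lines if line.strip()]
        for page in stale_pages(apis, 'xx_completions_xx'):
            print('left over from an earlier run: {}'.format(page))
```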
