Python Forum
"EOL While Scanning String Literal"
"EOL While Scanning String Literal"
#31
I am preparing a step-by-step analysis. This will take a while; I'll post it as soon as it's done.
After that, you should be able to change the pdf names.
#32
You need to learn how to do things like this yourself.

Examine the method parse_and_save. This function is called from __init__ with the command:
self.parse_and_save(getpdfs=True)
after all of the html files (based on the contents of apis.txt) have been downloaded to the xx_completions_xx directory.

Here's a breakdown of the code:
    def parse_and_save(self, getpdfs=False):
        filelist = [file for file in self.completionspath.iterdir() if file.is_file()]
        for file in filelist:
            with file.open('r') as f:
                soup = BeautifulSoup(f.read(), 'lxml')
            if getpdfs:
                links = soup.find_all('a')
                for link in links:
                    url = link['href']
                    if 'www' in url:
                        continue
                    print('downloading pdf at: {}'.format(url))
                    p = url.index('=')
                    filename = self.log_pdfpath / 'comp{}.pdf'.format(url[p+1:])
                    response = requests.get(url, stream=True, allow_redirects=False)
                    if response.status_code == 200:
                        with filename.open('wb') as f:
                            f.write(response.content)
            sfname = self.textpath / 'summary_{}.txt'.format((file.name.split('_'))[1].split('.')[0][3:10])
            tds = soup.find_all('td')
            with sfname.open('w') as f:
                for td in tds:
                    if td.text:
                        if any(field in td.text for field in self.fields):
                            f.write('{}\n'.format(td.text))

        with self.infile.open('w') as f:
            for item in self.apis:
                f.write(f'{item}\n')
  • line 1: called from __init__ with getpdfs=True
  • line 2: a list comprehension that iterates over a directory using pathlib and creates a list containing all of the files (not sub-directories) in the target directory (self.completionspath), which is set to the 'xx_completions_xx' directory in __init__
  • line 3: iterates over each file in filelist
  • lines 4 and 5: open and read each file, running its contents through BeautifulSoup using the lxml parser, creating 'soup'
  • lines 6 - 18: execute only if getpdfs is True, which is the case for this example
  • line 7: gets all 'a' html tags (which are links to web pages) from the 'soup' just created and saves them in the list 'links'; each entry is an html segment
  • lines 8 and 9: for each html link, find and extract the 'href' value (the actual URL) and save it in the variable url
  • lines 10 and 11: check url for 'www', which will be true for the link 'http://www.state.wy.us'; that link is not needed to find the pdf file, so continue with the next link. The next link, 'http://wugiwus.state.wy.us/whatupcomps.cfm?nautonum=17696', contains the pdf file.
  • lines 12 - 14: the unfortunate thing here is that the url doesn't contain the actual pdf name; the file is accessed by an id number, which is what I used to assign the name
  • The good thing is that the response has the information needed to name the pdf file appropriately in response.headers['Content-Disposition'] (code to take advantage of this will be added later; see the sketch after this list). The entry looks like: response.headers['Content-Disposition'] = 'attachment; filename=906576001.pdf'
  • line 15: fetches the pdf file
  • line 16: checks that the status_code is equal to 200, which means success
  • lines 17 and 18: write the pdf file
  • lines 19 - 25: write the summary file
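To make the header trick concrete, here is a minimal sketch (separate from the class) that pulls the pdf name out of the sample Content-Disposition value shown above; the 9 is just the length of 'filename=':

header_info = 'attachment; filename=906576001.pdf'
idx = header_info.index('filename')
# skip the 9 characters of 'filename=' to keep only the bare file name
print(header_info[idx + 9:])  # prints: 906576001.pdf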

I have adjusted my original code to extract and use the stored file names. Please note that a few files didn't provide header information. These files are noted in the display (while downloading pdf files), along with their corresponding info; they will still be saved, with names always beginning with 'comp...'.

PLEASE NOTE: Back up your code before replacing it with the new version, as I have not included any changes you might have made.

New code:
import requests
from bs4 import BeautifulSoup
from pathlib import Path
import sys

class GetCompletions:
    def __init__(self, infile):
        # create the working directories if they don't already exist
        self.homepath = Path('.')
        self.completionspath = self.homepath / 'xx_completions_xx'
        self.completionspath.mkdir(exist_ok=True)
        self.log_pdfpath = self.homepath / 'logpdfs'
        self.log_pdfpath.mkdir(exist_ok=True)
        self.textpath = self.homepath / 'text'
        self.textpath.mkdir(exist_ok=True)

        # read the list of API numbers to process, one per line
        self.infile = self.textpath / infile
        self.apis = []

        with self.infile.open() as f:
            for line in f:
                self.apis.append(line.strip())

        # table fields to copy into the summary files
        self.fields = ['Spud Date', 'Total Depth', 'IP Oil Bbls', 'Reservoir Class', 'Completion Date',
                       'Plug Back', 'IP Gas Mcf', 'TD Formation', 'Formation', 'IP Water Bbls']
        # self.get_all_pages()  # uncomment to fetch fresh html pages for the entries in apis.txt
        self.parse_and_save(getpdfs=True)

    def get_url(self):
        # yield (entry, url) pairs; characters 3:10 of each entry form the 7-digit nAPI query value
        for entry in self.apis:
            yield (entry, "http://wogcc.state.wy.us/wyocomp.cfm?nAPI={}".format(entry[3:10]))

    def get_all_pages(self):
        for entry, url in self.get_url():
            print('Fetching main page for entry: {}'.format(entry))
            response = requests.get(url)
            if response.status_code == 200:
                filename = self.completionspath / 'api_{}.html'.format(entry)
                with filename.open('w') as f:
                    f.write(response.text)
            else:
                print('error downloading {}'.format(entry))

    def parse_and_save(self, getpdfs=False):
        # parse every html file previously saved in xx_completions_xx
        filelist = [file for file in self.completionspath.iterdir() if file.is_file()]
        for file in filelist:
            with file.open('r') as f:
                soup = BeautifulSoup(f.read(), 'lxml')
            if getpdfs:
                links = soup.find_all('a')
                for link in links:
                    url = link['href']
                    # skip site-navigation links like http://www.state.wy.us
                    if 'www' in url:
                        continue
                    print('downloading pdf at: {}'.format(url))
                    p = url.index('=')
                    response = requests.get(url, stream=True, allow_redirects=False)
                    if response.status_code == 200:
                        try:
                            # prefer the real pdf name from the Content-Disposition header,
                            # e.g. 'attachment; filename=906576001.pdf'; idx+9 skips 'filename='
                            header_info = response.headers['Content-Disposition']
                            idx = header_info.index('filename')
                            filename = self.log_pdfpath / header_info[idx + 9:]
                        except ValueError:
                            # header present but no 'filename' field: fall back to the url's id number
                            filename = self.log_pdfpath / 'comp{}.pdf'.format(url[p + 1:])
                            print("couldn't locate filename for {} will use: {}".format(file, filename))
                        except KeyError:
                            # no Content-Disposition header at all: same fallback name
                            filename = self.log_pdfpath / 'comp{}.pdf'.format(url[p + 1:])
                            print('got KeyError on {}, response.headers = {}'.format(file, response.headers))
                            print('will use name: {}'.format(filename))
                        with filename.open('wb') as f:
                            f.write(response.content)
            # 'api_<entry>.html' -> characters 3:10 of <entry>, the same 7-digit slice used in get_url
            sfname = self.textpath / 'summary_{}.txt'.format((file.name.split('_'))[1].split('.')[0][3:10])
            tds = soup.find_all('td')
            with sfname.open('w') as f:
                for td in tds:
                    if td.text:
                        if any(field in td.text for field in self.fields):
                            f.write('{}\n'.format(td.text))

if __name__ == '__main__':
    GetCompletions('apis.txt')
#33
I have been working on learning this and I have totally broken it. My hope is you can show me how it's done so I can learn from it. As I've said, I am working through David Beazley's teachings, but they don't cover my exact needs. This is where I want to apply what I'm learning from you. I truly appreciate all you've done, and I have not taken ownership of it here at my office. Everyone knows it comes from you, as do the ways in which I intend to use it.

Here's where I'm stuck: it always pulls the original APIs. I understand it's using an apis.txt from the text folder, and I have searched for and changed all of these files on my hard drive (out of frustration) to reflect the new ones I want to pull. I understand it will pull from the text folder in the directory where I put the module, and I have tried saving the apis.txt file there as well. It still pulls the original ones we started with.

I know it's got to be something simple but I'm not seeing it. Thoughts?

Again - I really appreciate your help.

Ok - I found the apis.txt it's pulling from. It has the new API numbers, but the module is pulling the old ones. I hope this helps.
#34
The only file you change is apis.txt, and it must remain in the text directory.
Each time you need a new set of files, change this file to contain just the new entries.
A better solution is to use a command-line argument to accept the file name, so you can load whatever file you want (a sketch follows at the end of this post).
I can't do that now because I have to go out and do some business, but I will later today.
Perhaps one of the other moderators will pick this up.
If not, I'll be back in a few hours.
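Here is a minimal sketch of that command-line idea (not the final code), assuming the GetCompletions class from post #32 is defined in the same file; the first argument is the API list's file name and defaults to apis.txt when omitted:

import sys

if __name__ == '__main__':
    # usage: python <your_script>.py [apis_file]  -- the file must live in the text directory
    infile = sys.argv[1] if len(sys.argv) > 1 else 'apis.txt'
    GetCompletions(infile)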
#35
Ok - I figured it out. It's getting the API numbers from the files in the xx_completions_xx folder. How do I force it to look at the apis.txt file?
#36
It doesn't need to be forced; this is the way the code is designed.
This is what's called data-driven software: whatever is in the apis.txt file will be downloaded, period. That is the only way it is set up to work!!!

Read the commented code. I explain in detail! The file name is extracted from the header stored in the response from requests!
It is extracted and used. If it can't find the proper header information, it falls back to a 'comp...' name built from the url's id number, but the origin of
the download is still the apis.txt file.

Try the following:
  • Find the apis number (or whatever it's called) for one well that you haven't downloaded yet.
  • Back up the old apis.txt file by renaming it to apis_bak1.txt (or some such name).
  • Open a text editor like notepad or vi.
  • Type in the apis number and press enter.
  • Save it in the text directory as apis.txt.
  • Run the program.
  • You should now find all the files associated with this apis number in their respective directories.
That's all there is to it!
#37
I've done this and it still pulls the original API list we started with. I tried moving it to another file and running it from there - it does nothing.

When I either change the name of xx_completions_xx or delete the files within, it does nothing (in the original folder / location). I just have no idea where to look or what to do if I find it.
#38
Did you remove those files from apis.txt? If not, it will download whatever is in that file. That is the only way this could happen.
Are you using the code I provided in post #32?

apis.txt should only have the one! Simple as that.

This is why I instructed you to save the old one as apis_bak1.txt.
#39
I have the apis.txt file and can provide it. I can set up a GoTo Meeting if that will help.
#40
No GoTo meeting; you are doing something wrong.
Check the modification dates on the files. I expect that you are looking at previously downloaded files, which are never removed.
Are you using the code I provided in post #32?
apis.txt should have only the one entry for the test!
Make sure there is only one apis.txt file, and display its contents.
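Here is a quick sketch that does both checks, assuming the directory layout from post #32 (text/apis.txt plus the xx_completions_xx and logpdfs download folders):

from datetime import datetime
from pathlib import Path

# show exactly which apis.txt the program will read, and what it contains
infile = Path('.') / 'text' / 'apis.txt'
print('apis.txt contents:')
print(infile.read_text())

# list every file with its modification date to spot stale downloads
for folder in ('xx_completions_xx', 'logpdfs', 'text'):
    for f in sorted((Path('.') / folder).iterdir()):
        mtime = datetime.fromtimestamp(f.stat().st_mtime)
        print('{}  modified {}'.format(f, mtime))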

