Python Forum
I have no idea
#1
Ok - I'm trying (and failing) to write my own version of a download script that obtains completion files from our oil and gas commission. I admit it's very much cobbled together: a mashup of what I've learned here (my job), YouTube videos, and several books / chapters on web scraping.

They all have their own way of doing things, and I get that. I just don't know which one is best, so guidance on that would be helpful in addition to this problem.

I am running this in a virtual environment.

from bs4 import BeautifulSoup
import requests
import re

apis = ['49005253730000','49005255270000']


def wogcc_completions_scraper():
    x = 0
    while x < len(apis):
        wogcc_url = 'http://pipeline.wyo.gov/whatups/whatupcomps.cfm?nautonum={}' + str(apis[x][3:10])
        print (str(apis[x]))
        las_only = [] 
        wogcc_request = requests.get(wogcc_url)
        soup = BeautifulSoup(wogcc_request.content, "html.parser")
        href_tags = soup.find_all('a')

    ### This section of code will data scrape the WOGCC for the completion report

        link_regex = "http://pipeline.wyo.gov/wellapi.cfm?nAPIno={}"
        link_pattern = re.compile(link_regex)
        link_file = re.findall(link_pattern, str(soup))

        new_urls = []
        y = 0

        print(link_file)
        if len(link_file) == 0:
            print (str(apis[x]) + " No")
        else:
            print (str(apis[x]) + " Yes")
            for link in link_file:
                link1 = "http://pipeline.wyo.gov/" + str(link)
                new_urls.append(link1)

            while y < len(new_urls):
                download = requests.get(new_urls[y])
                with open((str(apis[x]) + "_" + "Completion_report" + ".pdf"), "wb") as code:
                    code.write(download.content)
            y += 1
    

        
wogcc_completions_scraper()
Here's the issue: it keeps running 49005253730000 over and over. There is a completion report to download for that API, so it should say Yes and move on to the next API (rather than the No shown below). There is also another issue that I simply don't understand.

Error:
Python 3.6.5 (v3.6.5:f59c0932b4, Mar 28 2018, 17:00:18) [MSC v.1900 64 bit (AMD64)] on win32
Type "copyright", "credits" or "license()" for more information.
>>> RESTART: C:\scrapingEnv\WOGCC_Well_Completions_Lil_Scraper_12b_TJN_EDITS.py
Warning (from warnings module):
  File "C:\Python365\lib\site-packages\requests\__init__.py", line 91
    RequestsDependencyWarning)
RequestsDependencyWarning: urllib3 (dev) or chardet (3.0.4) doesn't match a supported version!
49005253730000
[]
49005253730000 No
49005253730000
[]
49005253730000 No
(... the same three lines repeat indefinitely for 49005253730000 ...)
As always - any help you can provide will be most appreciated!
#2
You are in an infinite loop because you never increment x.
Also, check https://python-forum.io/Thread-Basic-Nev...n-sequence . You use while, but it's really the same problem: learn how to iterate over the elements of a list, tuple, etc., without using an index.
There are plenty of other things that are wrong or can/should be done differently, e.g.:
'http://pipeline.wyo.gov/whatups/whatupcomps.cfm?nautonum={}' + str(apis[x][3:10]) is probably meant to be 'http://pipeline.wyo.gov/whatups/whatupcomps.cfm?nautonum=' + apis[x][3:10]
or 'http://pipeline.wyo.gov/whatups/whatupcomps.cfm?nautonum={}'.format(apis[x][3:10])
The elements of the list are already str, so there is no need to convert them everywhere in the code.
Also, using a regex to parse HTML is a really bad idea; use BeautifulSoup's own search methods instead (see the sketch below).
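Putting those points together, here is a minimal sketch of how the function could look. It is an illustration only: the 'wellapi.cfm' filter and the relative-href handling are assumptions based on your snippet, so check the actual hrefs on the live page before relying on it.

from bs4 import BeautifulSoup
import requests

apis = ['49005253730000', '49005255270000']


def wogcc_completions_scraper():
    # iterate directly over the list; there is no counter to forget
    for api in apis:
        url = 'http://pipeline.wyo.gov/whatups/whatupcomps.cfm?nautonum={}'.format(api[3:10])
        response = requests.get(url)
        soup = BeautifulSoup(response.content, 'html.parser')

        # let BeautifulSoup find the report links instead of running a
        # regex over the raw HTML; 'wellapi.cfm' is assumed from your regex
        links = [a['href'] for a in soup.find_all('a', href=True)
                 if 'wellapi.cfm' in a['href']]

        if not links:
            print(api + ' No')
            continue

        print(api + ' Yes')
        for n, link in enumerate(links):
            # assumption: the hrefs are relative; drop the prefix if they
            # turn out to be absolute URLs
            download = requests.get('http://pipeline.wyo.gov/' + link)
            # number the files so each link gets its own PDF instead of
            # every link overwriting the same file
            filename = '{}_Completion_report_{}.pdf'.format(api, n)
            with open(filename, 'wb') as f:
                f.write(download.content)


wogcc_completions_scraper()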

#3
So here is my latest draft:

from bs4 import BeautifulSoup
import requests
import re

apis = ['49005253730000','49005255270000']


def wogcc_completions_scraper():
    x = 0
    while x < len(apis):
        wogcc_url = 'http://pipeline.wyo.gov/whatups/whatupcomps.cfm?nautonum={}'.format(apis[x][3:10])
        print (apis[x])
        wogcc_request = requests.get(wogcc_url)
        soup = BeautifulSoup(wogcc_request.content, "html.parser")
        href_tags = soup.find_all('a')
    x = 1
    print('Found completions report.')

   ### This section of code will data scrape the WOGCC for the completion report

        link_regex = "http://pipeline.wyo.gov/wellapi.cfm?nAPIno={}"
        link_pattern = re.compile(link_regex)
        link_file = re.findall(link_pattern, str(soup))

        new_urls = []
        y = 0

        print(link_file)
        if len(link_file) == 0:
            print (str(apis[x]) + " No")
        else:
            print (str(apis[x]) + " Yes")
            for link in link_file:
                link1 = "http://pipeline.wyo.gov/" + str(link)
                new_urls.append(link1)

            while y < len(new_urls):
                download = requests.get(new_urls[y])
                with open((str(apis[x]) + "_" + "Completion_report" + ".pdf"), "wb") as code:
                    code.write(download.content)
            y += 1
    

        
wogcc_completions_scraper()
Here is the error:

Error:
C:\Python365\python.exe C:/scrapingEnv/WOGCC_Well_Completions_Lil_Scraper_12b_TJN_10062018.py
  File "C:/scrapingEnv/WOGCC_Well_Completions_Lil_Scraper_12b_TJN_10062018.py", line 22
    link_regex = "http://pipeline.wyo.gov/wellapi.cfm?nAPIno={}"
    ^
IndentationError: unexpected indent

Process finished with exit code 1
I hope this shows I have moved further toward solving my issues rather than falling behind. I think this has to do with the regex issue we talked about before, or it is simply an indentation issue; I just don't know what to replace it with. Can anyone point me in the direction of learning the proper way to do this (or just whatever you know works for you)?

I appreciate any / all help I can get!
#4
The problem is that lines 16-17 (x = 1 and the print() call) are not indented, so they are no longer part of the loop; the loop body ends there, and the indented lines that follow trigger the unexpected indent at line 22.
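To make the rule concrete, here is a tiny self-contained illustration (the names are placeholders, nothing to do with the scraper itself): once a line dedents, the loop body is over, and re-indenting after that without opening a new block raises exactly this error.

items = ['a', 'b']

for item in items:
    print('inside the loop:', item)   # runs once per item
print('after the loop')               # dedented: the loop body ended above
#     print('oops')                   # re-indenting here (uncommented) would
#                                     # raise IndentationError: unexpected indent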

