Python Forum
Python + request from specific website - please help
#1
I'm trying to write Python code that sends a request to this website: https://ucr.gov/enforcement using the "search" form, where I enter an input (example: 129123) and receive data back. I'm interested in the "registered" or "unregistered" status. When you go to that website and enter the number above in the query, you'll see what I'm talking about. I need to collect the results for 2019 and 2018 as either UNREGISTERED or REGISTERED. How do I do that with Python? What would the request code be?
#2
Check out the web scraping tutorials:
https://python-forum.io/Forum-Web-Scraping
#3
Got it. I got this far:

import requests
from bs4 import BeautifulSoup
url = "https://ucr.gov/enforcement/1712583"
url_get = requests.get(url)
soup = BeautifulSoup(url_get.content, "html.parser")
print(soup)
However it doesn't show me the text I want. I want to pull the "2019 UNREGISTERED" data. How do I do that? (referencing: https://ucr.gov/enforcement/1712583)
#4
It's a bit tricky because the site uses JavaScript. But you are lucky: closer inspection of the requests sent from the browser reveals that they get the data from an API in JSON format. We can then replicate the request headers as closely as possible.
Here is something to start with:

import requests
from bs4 import BeautifulSoup
import time

def get_json(query):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:65.0) Gecko/20100101 Firefox/65.0',
        'Accept': 'application/json, text/plain, */*',
        'Accept-Language': 'en-US,en;q=0.5',
        'Referer': 'https://ucr.gov/enforcement/{}'.format(query),
        'Cache-Control': 'no-cache,no-store,must-revalidate,max-age=0,private',
        'UCR-UI-Version': '19.2.1',
        'Origin': 'https://ucr.gov',
        'Connection': 'keep-alive',
    }
 
    s = requests.Session()
     
    params = (
        ('pageNumber', '0'),
        ('itemsPerPage', '15'),
    )

    url = 'https://admin.ucr.gov/api/enforcement/{}'.format(query)
    response = s.get(url, headers=headers, params=params)
    return response.json()


if __name__ == '__main__':
    dots = [192123, 1921, 192, 1712583]
    for query in dots:
        data = get_json(query=query)
        print('DOT: {}'.format(query))
        if data.get('carrier'):
            for registration in data['history']['registrations']:
                print('Year: {year}, Status: {status}'.format(**registration))
        else:
            print('Not valid DOT')
        print('\n-----------\n')
        
        # implement 0.5 delay between requests
        time.sleep(0.5)
Output:
DOT: 192123
Year: 2019, Status: unregistered
Year: 2018, Status: unregistered
Year: 2017, Status: unregistered

-----------

DOT: 1921
Year: 2019, Status: unregistered
Year: 2018, Status: unregistered
Year: 2017, Status: unregistered

-----------

DOT: 192
Not valid DOT

-----------

DOT: 1712583
Year: 2019, Status: unregistered
Year: 2018, Status: unregistered
Year: 2017, Status: unregistered

-----------
You can print the entire json to see all the information available.
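For example, here is a minimal sketch of pretty-printing the response and keeping only the years asked about in the first post. The `sample` dictionary is a made-up stand-in for what `get_json` returns: only the `carrier`, `history`, `registrations`, `year` and `status` keys come from the script above, while the carrier contents and the helper name `statuses_by_year` are assumptions for illustration.

```python
import json


def statuses_by_year(data, years):
    """Return {year: STATUS} for the requested years found in the payload."""
    wanted = {str(y) for y in years}
    history = data.get('history') or {}
    result = {}
    for registration in history.get('registrations', []):
        # compare as strings, since the API may send the year as int or str
        year = str(registration.get('year'))
        if year in wanted:
            result[year] = str(registration.get('status', '')).upper()
    return result


# Hypothetical payload shaped like the fields used in the script above
sample = {
    'carrier': {'name': 'EXAMPLE CARRIER'},  # made-up content
    'history': {'registrations': [
        {'year': 2019, 'status': 'unregistered'},
        {'year': 2018, 'status': 'unregistered'},
        {'year': 2017, 'status': 'unregistered'},
    ]},
}

# indent=2 makes the nested structure readable
print(json.dumps(sample, indent=2, sort_keys=True))
print(statuses_by_year(sample, [2018, 2019]))
```

With a real response you would pass the result of `get_json(query)` instead of `sample`.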
It's probably a good idea to implement user-agent and proxy rotation if you are going to make a lot of requests, in order to avoid detection.
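A rough sketch of what such rotation could look like. The user-agent strings other than the Firefox/Windows one are example values, the proxy URL is a placeholder, and the helper names `pick_headers` and `proxies_for` are hypothetical. The dictionary returned by `proxies_for` has the shape that the `proxies` argument of `requests` expects.

```python
import random

# Example user-agent strings; swap in whatever set you want to rotate through
USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:65.0) Gecko/20100101 Firefox/65.0',
    'Mozilla/5.0 (X11; Linux x86_64; rv:65.0) Gecko/20100101 Firefox/65.0',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:65.0) Gecko/20100101 Firefox/65.0',
]


def pick_headers(base_headers, user_agents, rng=random):
    """Return a copy of base_headers with a randomly chosen User-Agent."""
    headers = dict(base_headers)  # copy, so the caller's dict is not modified
    headers['User-Agent'] = rng.choice(user_agents)
    return headers


def proxies_for(proxy_url):
    """Build the mapping that requests accepts via its proxies argument."""
    return {'http': proxy_url, 'https': proxy_url}


base = {'Accept': 'application/json, text/plain, */*', 'User-Agent': 'placeholder'}
headers = pick_headers(base, USER_AGENTS)
proxies = proxies_for('http://proxy.example:8080')  # placeholder address
print(headers['User-Agent'])
print(proxies)
```

Inside `get_json` this would be used as `s.get(url, headers=headers, params=params, proxies=proxies)`.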
If you can't explain it to a six year old, you don't understand it yourself, Albert Einstein
How to Ask Questions The Smart Way: link and another link
Create MCV example
Debug small programs

#5
Thanks buran!

How did you figure out that the JS was using a json type api?
Can you point me to literature where I can learn more about this? I feel like I've got Python basics covered but networking/web scraping is still a mystery to me.
#6
(Feb-06-2019, 06:35 PM)hoff1022 Wrote: How did you figure out that the JS was using a json type api?
Usually the first thing you do is check for an official API. Second, inspect what requests are being sent; you need to do that in any case. Especially when you are dealing with a large database, you may find the data being retrieved in a separate request from an internal API. Check the Type column in the attached image.
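The same check can be made from Python by looking at the Content-Type header of a response, which is roughly what the Type column in the browser's network tab reflects. This is only a sketch; the helper name `is_json_content_type` is made up for illustration.

```python
def is_json_content_type(content_type):
    """True for 'application/json' and '+json' media types, ignoring parameters."""
    # drop parameters such as '; charset=utf-8' before comparing
    media_type = content_type.split(';', 1)[0].strip().lower()
    return media_type == 'application/json' or media_type.endswith('+json')


print(is_json_content_type('application/json; charset=utf-8'))
print(is_json_content_type('text/html; charset=utf-8'))
```

With `requests` you would call it as `is_json_content_type(response.headers.get('Content-Type', ''))`.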
   
#7
Another, "lazier" approach is to use Selenium.
It handles the JavaScript, so we get the whole page source back and not a warning that JavaScript must be turned on.
Example:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from bs4 import BeautifulSoup

#--| Setup
options = Options()
options.add_argument("--headless")
# Selenium 4 takes the driver path through a Service object;
# older releases used the executable_path argument of webdriver.Chrome
service = Service(executable_path=r'C:\cmder\bin\chromedriver.exe')
browser = webdriver.Chrome(service=service, options=options)
#--| Parse or automation
browser.get('https://ucr.gov/enforcement/1712583')
soup = BeautifulSoup(browser.page_source, 'lxml')
browser.quit()  # close the headless browser once the source is captured
status = soup.find('div', class_="sc-epnACN fEroud")
print(status.text)
Output:
UNREGISTERED
For more about this look at Web-scraping part-2.
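Class names like "sc-epnACN fEroud" look auto-generated and may change when the site is rebuilt, so a lookup by class can stop matching. As a fallback sketch, assuming the status appears in the rendered page as the upper-case word shown in the output above, the text can be searched directly. The helper name `find_status` and the sample HTML are made up for illustration.

```python
import re

# UNREGISTERED is listed first; the word boundaries keep REGISTERED from
# matching inside UNREGISTERED
STATUS_RE = re.compile(r'\b(UNREGISTERED|REGISTERED)\b')


def find_status(page_text):
    """Return the first upper-case status word found in the text, or None."""
    match = STATUS_RE.search(page_text)
    return match.group(1) if match else None


# Made-up fragment standing in for browser.page_source
sample_html = '<div class="some-generated-class">UNREGISTERED</div>'
print(find_status(sample_html))
```

In the Selenium script this would be called as `find_status(browser.page_source)` or `find_status(soup.get_text())`.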
#8
Guys, can you check this out now. The code you suggested throws this error:

'message': "We've detected that you are using an old version of the website!\n Please do the following to update your version\n 1. Refresh the Page (on desktop you can press Ctrl + Shift + R on your keyboard).\n 2. If you are still seeing this message. Please fully clear your browser history."}


Any way to get around that?

When I refresh my browser, it works when I look up that DOT# manually online. But can I somehow force that in either of the two scripts you suggested above? And why would their server return that error when the data is requested from the Python code you've suggested?
#9
They have changed the headers: 'UCR-UI-Version' is now '19.2.3'.
On line 12 of my code (the 'UCR-UI-Version' header), change 19.2.1 to 19.2.3 and I think it will work.
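Since the version string can change again, one option is to pass it in as a parameter and to detect the "old version" reply explicitly. This is a sketch under assumptions: the helper names `build_headers` and `is_stale_version_error` are made up, only a subset of the original headers is shown, and the check relies on the wording of the message quoted in post #8.

```python
def build_headers(query, ui_version):
    """Headers for the API call, with the UI version supplied by the caller."""
    return {
        'Accept': 'application/json, text/plain, */*',
        'Referer': 'https://ucr.gov/enforcement/{}'.format(query),
        'UCR-UI-Version': ui_version,
        'Origin': 'https://ucr.gov',
    }


def is_stale_version_error(payload):
    """True if the JSON reply is the 'old version of the website' message."""
    if not isinstance(payload, dict):
        return False
    return 'old version of the website' in str(payload.get('message', ''))


headers = build_headers(1712583, '19.2.3')
print(headers['UCR-UI-Version'])

# Reply shaped like the error quoted in post #8
stale = {'message': "We've detected that you are using an old version of the website!"}
print(is_stale_version_error(stale))
```

When `is_stale_version_error(data)` is true, the current value of the header can be read from the browser's network tab and passed in as `ui_version`.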