Python Forum

Full Version: Python + request from specific website - please help
I'm trying to write Python code that sends a request to this website: https://ucr.gov/enforcement using the "search" form, where I enter an input (example: 129123) and get the data back. I'm interested in the "registered" or "unregistered" status. If you go to that website and enter the number above in the search box, you'll see what I mean. I need to collect the result for 2019 and 2018 as either UNREGISTERED or REGISTERED. How do I do that with Python? What would the request code be?
Check out the web scraping tutorials:
https://python-forum.io/Forum-Web-Scraping
Got it. I got this far:

import requests
from bs4 import BeautifulSoup
url = "https://ucr.gov/enforcement/1712583"
url_get = requests.get(url)
soup = BeautifulSoup(url_get.content, "html.parser")
print(soup)
However, it doesn't show me the text I want. I want to pull the "2019 UNREGISTERED" data. How do I do that? (Referencing: https://ucr.gov/enforcement/1712583)
It's a bit tricky because the site uses JavaScript. But you are lucky: a closer look at the requests sent from the browser reveals that the page gets its data from an API in JSON format. So we can call that API directly and replicate the browser's request headers as closely as possible.
Here is something to start with:

import requests
from bs4 import BeautifulSoup
import time

def get_json(query):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:65.0) Gecko/20100101 Firefox/65.0',
        'Accept': 'application/json, text/plain, */*',
        'Accept-Language': 'en-US,en;q=0.5',
        'Referer': 'https://ucr.gov/enforcement/{}'.format(query),
        'Cache-Control': 'no-cache,no-store,must-revalidate,max-age=0,private',
        'UCR-UI-Version': '19.2.1',
        'Origin': 'https://ucr.gov',
        'Connection': 'keep-alive',
    }
 
    s = requests.Session()
     
    params = (
        ('pageNumber', '0'),
        ('itemsPerPage', '15'),
    )

    url = 'https://admin.ucr.gov/api/enforcement/{}'.format(query)
    response = s.get(url, headers=headers, params=params)
    return response.json()


if __name__ == '__main__':
    dots = [192123, 1921, 192, 1712583]
    for query in dots:
        data = get_json(query=query)
        print('DOT: {}'.format(query))
        if data.get('carrier'):
            for registration in data['history']['registrations']:
                print('Year: {year}, Status: {status}'.format(**registration))
        else:
            print('Not valid DOT')
        print('\n-----------\n')
        
        # implement 0.5 delay between requests
        time.sleep(0.5)
Output:
DOT: 192123
Year: 2019, Status: unregistered
Year: 2018, Status: unregistered
Year: 2017, Status: unregistered

-----------

DOT: 1921
Year: 2019, Status: unregistered
Year: 2018, Status: unregistered
Year: 2017, Status: unregistered

-----------

DOT: 192
Not valid DOT

-----------

DOT: 1712583
Year: 2019, Status: unregistered
Year: 2018, Status: unregistered
Year: 2017, Status: unregistered

-----------
You can print the entire JSON response to see all the information available.
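A minimal sketch of that, assuming the response has the shape the script above relies on (`carrier`, and `history` containing `registrations` with `year` and `status`). The `sample` dict is made-up illustration data, not a real API response, and `statuses_for` is a hypothetical helper name:

```python
import json

# Made-up sample shaped like the fields the script above reads.
# A real response will contain more keys than this.
sample = {
    "carrier": {"name": "EXAMPLE CARRIER"},  # hypothetical field, for illustration only
    "history": {
        "registrations": [
            {"year": 2019, "status": "unregistered"},
            {"year": 2018, "status": "unregistered"},
            {"year": 2017, "status": "unregistered"},
        ]
    },
}


def statuses_for(data, years):
    """Return {year: STATUS} for the requested years; missing years map to None."""
    registrations = (data.get("history") or {}).get("registrations") or []
    by_year = {int(r["year"]): str(r["status"]).upper() for r in registrations}
    return {year: by_year.get(year) for year in years}


# Dump everything to see which fields are available.
print(json.dumps(sample, indent=2, sort_keys=True))

# Keep only the years of interest.
print(statuses_for(sample, [2019, 2018]))
```

With the real script you would pass the result of `get_json(query)` instead of `sample`.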
If you are going to make a lot of requests, it's probably a good idea to implement user-agent and proxy rotation in order to avoid detection.
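A rough sketch of what rotation could look like with requests. The user-agent strings are just examples, the proxy addresses are placeholders (not working proxies), and `next_request_kwargs` is a hypothetical helper name:

```python
import itertools
import random

# Example User-Agent strings; swap in whatever current browsers send.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:65.0) Gecko/20100101 Firefox/65.0",
    "Mozilla/5.0 (X11; Linux x86_64; rv:65.0) Gecko/20100101 Firefox/65.0",
]

# Placeholder proxy addresses from a documentation-only IP range.
PROXIES = ["http://203.0.113.10:8080", "http://203.0.113.11:8080"]

# Walk through the proxies in order, starting over at the end of the list.
proxy_cycle = itertools.cycle(PROXIES)


def next_request_kwargs(rng=random):
    """Build keyword arguments for requests.get with a rotated identity."""
    proxy = next(proxy_cycle)
    return {
        "headers": {"User-Agent": rng.choice(USER_AGENTS)},
        "proxies": {"http": proxy, "https": proxy},
    }


kwargs = next_request_kwargs()
print(kwargs)
# Usage (not executed here): requests.get(url, params=params, **kwargs)
```

In the script above you would merge the rotated User-Agent into the existing `headers` dict rather than replace the other headers.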
Thanks buran!

How did you figure out that the JS was using a json type api?
Can you point me to literature where I can learn more about this? I feel like I've got Python basics covered but networking/web scraping is still a mystery to me.
(Feb-06-2019, 06:35 PM) hoff1022 wrote: How did you figure out that the JS was using a json type api?
Usually the first thing to do is check for an official API. The second is to inspect which requests are being sent, using the network tab of your browser's developer tools. You need to do that in any case. Especially when you are dealing with a large database, you may find that the data is retrieved in a separate request to an internal API. Check the Type column in the attached image.
[attachment=553]
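You can make a similar check from Python, since the Type column roughly corresponds to the response's Content-Type header. A small sketch, with `looks_like_json` as a hypothetical helper name:

```python
def looks_like_json(content_type):
    """Return True if a Content-Type header value names a JSON media type."""
    if not content_type:
        return False
    # Drop parameters such as "; charset=utf-8" before comparing.
    media_type = content_type.split(";", 1)[0].strip().lower()
    return media_type == "application/json" or media_type.endswith("+json")


print(looks_like_json("application/json; charset=utf-8"))
print(looks_like_json("text/html; charset=utf-8"))

# With requests (not executed here):
#   response = requests.get(url)
#   looks_like_json(response.headers.get("Content-Type"))
```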
Another, "lazier" approach is to use Selenium.
It handles the JavaScript, so we get the whole page source back rather than a warning that JavaScript must be turned on.
Example:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from bs4 import BeautifulSoup

#--| Setup
options = Options()
options.add_argument("--headless")
# Selenium 4 takes the driver path via Service (older versions used executable_path)
service = Service(r'C:\cmder\bin\chromedriver.exe')
browser = webdriver.Chrome(service=service, options=options)
#--| Parse or automation
browser.get('https://ucr.gov/enforcement/1712583')
soup = BeautifulSoup(browser.page_source, 'lxml')
browser.quit()
# This class name is generated by the site and may change
status = soup.find('div', class_="sc-epnACN fEroud")
print(status.text if status else 'Status element not found')
Output:
UNREGISTERED
For more about this look at Web-scraping part-2.
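One caveat: a class name like "sc-epnACN fEroud" looks auto-generated, so it may change when the site is rebuilt (that is an assumption, not something confirmed here). A sketch that searches the rendered page text for the status word instead. `find_status` is a hypothetical helper, and the HTML string is a made-up sample, not the real page:

```python
import re

# UNREGISTERED is listed first; the word boundaries also stop REGISTERED
# from matching inside UNREGISTERED.
STATUS_RE = re.compile(r"\b(UNREGISTERED|REGISTERED)\b")


def find_status(page_source):
    """Return the first REGISTERED/UNREGISTERED word in the page, or None."""
    # Crudely strip tags so the search runs against text between them.
    text = re.sub(r"<[^>]+>", " ", page_source)
    match = STATUS_RE.search(text)
    return match.group(1) if match else None


sample_html = '<div class="sc-abc123 xYz">UNREGISTERED</div>'
print(find_status(sample_html))
```

With Selenium you would call `find_status(browser.page_source)` after the page has loaded.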
Guys, can you check this out now? The code you suggested returns this error:

'message': "We've detected that you are using an old version of the website!\n Please do the following to update your version\n 1. Refresh the Page (on desktop you can press Ctrl + Shift + R on your keyboard).\n 2. If you are still seeing this message. Please fully clear your browser history."}


Is there any way to get around that?

When I refresh my browser, it works again when I look up that DOT# manually online. But can I somehow force that in either of the two scripts you suggested above? In other words, why would their server return that error when the data is requested from the Python code you've suggested?
They have changed the headers: 'UCR-UI-Version' is now '19.2.3'.
On line 12 of the script (the 'UCR-UI-Version' header), change 19.2.1 to 19.2.3 and I think it will work.
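Since the version will probably change again, it may help to keep it in one place and detect the "old version" reply. A sketch under the assumption that the error always arrives as JSON with a `message` key, as in the post above. `build_headers` and `is_stale_version` are hypothetical helper names, and only some of the original headers are repeated here:

```python
UI_VERSION = "19.2.3"  # value reported in this thread; expect it to change again


def build_headers(query, ui_version=UI_VERSION):
    """Build request headers with the UI version kept in a single place."""
    return {
        "Accept": "application/json, text/plain, */*",
        "Referer": "https://ucr.gov/enforcement/{}".format(query),
        "UCR-UI-Version": ui_version,
        "Origin": "https://ucr.gov",
    }


def is_stale_version(data):
    """Return True if the JSON body carries the 'old version' message."""
    if not isinstance(data, dict):
        return False
    message = str(data.get("message") or "")
    return "old version of the website" in message


# Shortened copy of the error body quoted above.
error_body = {"message": "We've detected that you are using an old version of the website!"}
print(build_headers(1712583)["UCR-UI-Version"])
print(is_stale_version(error_body))
```

If `is_stale_version(response.json())` is true, look up the current 'UCR-UI-Version' value in the browser's network tab and update `UI_VERSION`.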