Python Forum
Python + request from specific website - please help
#1
I'm trying to write Python code that sends a request to this website: https://ucr.gov/enforcement using the "search" form, where I enter an input (example: 129123) and receive the data back. I'm interested in the "registered" or "unregistered" status. When you go to that website and enter the number above in the query, you'll see what I'm talking about. I need to collect the result for 2019 and 2018 as either UNREGISTERED or REGISTERED; again, you'll see when you go to the website. How do I do that with Python? What would the request code be?
#2
Check out the web scraping tutorials:
https://python-forum.io/Forum-Web-Scraping
#3
Got it. I got this far:

import requests
from bs4 import BeautifulSoup
url = "https://ucr.gov/enforcement/1712583"
url_get = requests.get(url)
soup = BeautifulSoup(url_get.content, "html.parser")
print(soup)
However, it doesn't show me the text I want. I want to pull the "2019 UNREGISTERED" data. How do I do that? (referencing: https://ucr.gov/enforcement/1712583)
#4
It's a bit tricky because the site uses JavaScript. But you are lucky: closer inspection of the requests being sent from the browser reveals that it gets the data from an API in JSON format. So we can replicate the request headers as closely as possible.
Here is something to start with:

import requests
from bs4 import BeautifulSoup
import time

def get_json(query):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:65.0) Gecko/20100101 Firefox/65.0',
        'Accept': 'application/json, text/plain, */*',
        'Accept-Language': 'en-US,en;q=0.5',
        'Referer': 'https://ucr.gov/enforcement/{}'.format(query),
        'Cache-Control': 'no-cache,no-store,must-revalidate,max-age=0,private',
        'UCR-UI-Version': '19.2.1',
        'Origin': 'https://ucr.gov',
        'Connection': 'keep-alive',
    }
 
    s = requests.Session()
     
    params = (
        ('pageNumber', '0'),
        ('itemsPerPage', '15'),
    )

    url = 'https://admin.ucr.gov/api/enforcement/{}'.format(query)
    response = s.get(url, headers=headers, params=params)
    return response.json()


if __name__ == '__main__':
    dots = [192123, 1921, 192, 1712583]
    for query in dots:
        data = get_json(query=query)
        print('DOT: {}'.format(query))
        if data.get('carrier'):
            for registration in data['history']['registrations']:
                print('Year: {year}, Status: {status}'.format(**registration))
        else:
            print('Not valid DOT')
        print('\n-----------\n')
        
        # implement 0.5 delay between requests
        time.sleep(0.5)
Output:
DOT: 192123
Year: 2019, Status: unregistered
Year: 2018, Status: unregistered
Year: 2017, Status: unregistered

-----------

DOT: 1921
Year: 2019, Status: unregistered
Year: 2018, Status: unregistered
Year: 2017, Status: unregistered

-----------

DOT: 192
Not valid DOT

-----------

DOT: 1712583
Year: 2019, Status: unregistered
Year: 2018, Status: unregistered
Year: 2017, Status: unregistered

-----------
You can print the entire json to see all the information available.
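As a rough sketch of what that could look like, assuming the JSON keeps the same history/registrations shape the script above reads. The sample dict, the "name" key, and the statuses_by_year helper are made up for illustration; the real response may carry more fields or a different layout.

```python
import json

# Hypothetical sample shaped like the fields the script above reads.
# The real API response may contain more keys or differ in structure.
sample = {
    "carrier": {"name": "EXAMPLE CARRIER"},  # placeholder key, not from the real API
    "history": {
        "registrations": [
            {"year": 2019, "status": "unregistered"},
            {"year": 2018, "status": "unregistered"},
            {"year": 2017, "status": "unregistered"},
        ]
    },
}


def statuses_by_year(data, years):
    """Return {year: STATUS} for the requested years, with the status upper-cased."""
    registrations = data.get("history", {}).get("registrations", [])
    found = {}
    for registration in registrations:
        year = int(registration["year"])  # works whether the year is an int or a string
        if year in years:
            found[year] = str(registration["status"]).upper()
    return found


# Pretty-print the whole payload to see everything that is available
print(json.dumps(sample, indent=2))

# Keep only the years of interest
print(statuses_by_year(sample, {2018, 2019}))
```

With real data you would pass the result of get_json(query) instead of the sample dict.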
It's probably a good idea to implement user-agent and proxy rotation if you are going to make a lot of requests, in order to avoid detection.
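A minimal sketch of such rotation, using only the standard library to pick the next User-Agent and proxy on each call. The second User-Agent string, the proxy addresses, and the next_request_settings helper are placeholders invented for this example; you would substitute your own values.

```python
import itertools

# The first string is the one used in the script above; the second is an
# illustrative placeholder.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:65.0) Gecko/20100101 Firefox/65.0",
    "Mozilla/5.0 (X11; Linux x86_64; rv:65.0) Gecko/20100101 Firefox/65.0",
]

# Hypothetical proxy addresses; replace with proxies you are allowed to use.
PROXIES = [
    "http://proxy-one.example.invalid:8080",
    "http://proxy-two.example.invalid:8080",
]

_user_agents = itertools.cycle(USER_AGENTS)
_proxies = itertools.cycle(PROXIES)


def next_request_settings(base_headers):
    """Return (headers, proxies) for the next request without mutating base_headers."""
    headers = dict(base_headers)
    headers["User-Agent"] = next(_user_agents)
    proxy = next(_proxies)
    # requests expects a mapping of URL scheme to proxy URL
    proxies = {"http": proxy, "https": proxy}
    return headers, proxies


base = {"Accept": "application/json, text/plain, */*"}
for _ in range(3):
    headers, proxies = next_request_settings(base)
    print(headers["User-Agent"], proxies["https"])
```

In the script above, the two values would then go into the call as s.get(url, headers=headers, params=params, proxies=proxies).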
If you can't explain it to a six year old, you don't understand it yourself, Albert Einstein
How to Ask Questions The Smart Way: link and another link
Create MCV example
Debug small programs

#5
Thanks buran!

How did you figure out that the JS was using a json type api?
Can you point me to literature where I can learn more about this? I feel like I've got Python basics covered but networking/web scraping is still a mystery to me.
#6
(Feb-06-2019, 06:35 PM)hoff1022 Wrote: How did you figure out that the JS was using a json type api?
Usually the first thing you do is check for an official API. Second, inspect what requests are being sent. You need to do that in any case. Especially when you are dealing with a large database, you may find the data being retrieved in a separate request from an internal API. Check the Type column in the attached image.
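The Type column in the browser's network tab reflects the response's Content-Type header, so the same check can be done in code. This is only a sketch: is_json_content_type is a hypothetical helper name, and the header values below are examples rather than captures from the site.

```python
def is_json_content_type(content_type):
    """Return True when a Content-Type header value denotes a JSON payload."""
    # Drop parameters such as "; charset=utf-8" before comparing
    media_type = content_type.split(";", 1)[0].strip().lower()
    return media_type == "application/json" or media_type.endswith("+json")


# Example header values, as they might appear on different responses
for value in ("text/html; charset=utf-8", "application/json; charset=utf-8"):
    print(value, "->", is_json_content_type(value))
```

With requests you could feed it response.headers.get("Content-Type", "") for each URL you see in the network tab.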
   
#7
Another "lazier" approach is to use Selenium.
This deals with JavaScript, so we get the whole page source back and not a warning that JavaScript must be turned on.
Example:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from bs4 import BeautifulSoup

#--| Setup
options = Options()
options.add_argument("--headless")
# Selenium 4 takes the driver path through Service instead of executable_path
service = Service(r'C:\cmder\bin\chromedriver.exe')
browser = webdriver.Chrome(service=service, options=options)
#--| Parse or automation
browser.get('https://ucr.gov/enforcement/1712583')
soup = BeautifulSoup(browser.page_source, 'lxml')
browser.quit()
# This class name looks auto-generated, so it may change when the site is updated
status = soup.find('div', class_="sc-epnACN fEroud")
print(status.text if status else 'Status element not found')
Output:
UNREGISTERED
For more about this look at Web-scraping part-2.
#8
Guys, can you check this out now. The code you suggested throws this error:

'message': "We've detected that you are using an old version of the website!\n Please do the following to update your version\n 1. Refresh the Page (on desktop you can press Ctrl + Shift + R on your keyboard).\n 2. If you are still seeing this message. Please fully clear your browser history."}


Any way to get around that?

When I refresh my browser, it then works when I look up that DOT# manually online. But can I somehow force that in either of the two scripts you suggested above? I.e., why would their server throw that error when the data is requested from the Python code you've suggested?
#9
They have changed the headers. 'UCR-UI-Version' is now '19.2.3'.
On line 12, change 19.2.1 to 19.2.3 and I think it will work.
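Since the version value may change again, one option is to keep it in a single place and detect the "old version" message when it comes back. This is a sketch under assumptions: build_headers and is_stale_version_error are hypothetical helper names, the header set is trimmed from the script above, and the check relies on the message text quoted earlier in this thread.

```python
# Value reported in this thread; the site may change it again.
UI_VERSION = "19.2.3"


def build_headers(query, ui_version=UI_VERSION):
    """Build request headers with the UI version kept in one configurable place."""
    return {
        "Accept": "application/json, text/plain, */*",
        "Referer": "https://ucr.gov/enforcement/{}".format(query),
        "UCR-UI-Version": ui_version,
        "Origin": "https://ucr.gov",
    }


def is_stale_version_error(data):
    """Return True when the parsed JSON carries the 'old version' message."""
    if not isinstance(data, dict):
        return False
    return "old version of the website" in str(data.get("message", ""))


# Example with the start of the error message quoted earlier in the thread
error = {"message": "We've detected that you are using an old version of the website!"}
print(build_headers(1712583)["UCR-UI-Version"])
print(is_stale_version_error(error))
```

When is_stale_version_error returns True, open the site in a browser, read the current UCR-UI-Version request header from the network tab, and update UI_VERSION.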

