Python Forum
Scraping Images from Missing/ Exploited Children Site for Use with Rekognition - Printable Version


Scraping Images from Missing/ Exploited Children Site for Use with Rekognition - codytsterling - Feb-11-2020

Hello,

I'm relatively new to Python and am trying to get some help scraping images from the following site: https://api.missingkids.org/missingkids/servlet/PubCaseSearchServlet?act=usMapSearch&missState=SC (National Center for Missing and Exploited Children)

I would then like to upload them into an AWS S3 bucket for comparison against images in another bucket using Rekognition.

I've tried numerous tutorials with no luck. Any tips or advice, even just a pointer to a useful tutorial, would be very much appreciated! We're trying to locate child victims of human trafficking.

Thanks!

Cody


RE: Scraping Images from Missing/ Exploited Children Site for Use with Rekognition - snippsat - Feb-11-2020

Look at Web-Scraping part-1 and part-2

Some hints: find the name and the NCMC number.
With the NCMC number you can build the URL for the large image directly, so there is no need to follow the link to get it.
If a person has 2 images, the suffix after the NCMC number is c1 for the first image and e1 for the second.
Quick example for the first person:
import requests
from bs4 import BeautifulSoup

url = 'https://api.missingkids.org/missingkids/servlet/PubCaseSearchServlet?act=usMapSearch&missState=SC'
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')
first_pers = soup.find('td', width="40%")  # use find_all to get every person
Usage test:
>>> name = first_pers.find_all('b')[0].text
>>> name
'FRANCISCO ALBERTO ALVARADO'
>>> ncmc = first_pers.find_all('b')[1].text
>>> ncmc
'NCMC1373468'
>>> 
>>> # Make url for large image
>>> img_ncmc_url = f'http://api.missingkids.org/photographs/{ncmc}c1.jpg'
>>> img_ncmc_url
'http://api.missingkids.org/photographs/NCMC1373468c1.jpg'
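
The c1/e1 naming described above can be wrapped in a small helper. This is only a sketch: the function name build_image_urls and the check that an identifier looks like 'NCMC' followed by digits are my own assumptions, and the base URL is simply copied from the example. The e1 URL is only a candidate, since not every person has a second image, so check the response status before using it.

```python
import re

# Base URL copied from the example above; treat it as an assumption.
PHOTO_BASE_URL = 'http://api.missingkids.org/photographs'


def build_image_urls(ncmc):
    """Return candidate URLs for the first (c1) and second (e1) image."""
    ncmc = ncmc.strip()
    # Assumption: identifiers look like 'NCMC' followed by digits.
    if not re.fullmatch(r'NCMC\d+', ncmc):
        raise ValueError(f'unexpected NCMC number: {ncmc!r}')
    return [f'{PHOTO_BASE_URL}/{ncmc}{suffix}.jpg' for suffix in ('c1', 'e1')]


if __name__ == '__main__':
    for url in build_image_urls('NCMC1373468'):
        print(url)
```

When you switch from find to find_all, you can pass the ncmc text of each person to this helper in a loop.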