Posts: 14
Threads: 7
Joined: Mar 2020
I just need to take a screenshot of a webpage and store/retrieve that image in MongoDB. Below is the code I'm using to take the screenshot. How do I store the image in a MongoDB collection?
from selenium import webdriver
from PIL import Image
from io import BytesIO
fox = webdriver.Firefox()
fox.get('http://example.com/')
# now that we have the preliminary stuff out of the way time to get that image :D
element = fox.find_element_by_id('hlogo') # find part of the page you want image of
location = element.location
size = element.size
png = fox.get_screenshot_as_png() # saves screenshot of entire page
fox.quit()
im = Image.open(BytesIO(png)) # uses PIL library to open image in memory
left = location['x']
top = location['y']
right = location['x'] + size['width']
bottom = location['y'] + size['height']
im = im.crop((left, top, right, bottom)) # defines crop points
im.save('screenshot.png') # saves new cropped image
Posts: 1,838
Threads: 2
Joined: Apr 2017
Aug-16-2020, 04:28 AM
(This post was last modified: Aug-16-2020, 07:48 AM by ndc85430.)
Why do you want to store the image in the database in the first place, rather than some file storage (S3, Google Cloud Storage, ...) and storing the URL in the database? Did you look at the docs on Python drivers for MongoDB?
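For example, a rough sketch of that approach with boto3 and pymongo (the bucket, database and collection names are placeholders, and it assumes the screenshot has already been saved locally as screenshot.png):
import boto3
from pymongo import MongoClient
# upload the screenshot to S3; the bucket name is made up
s3 = boto3.client('s3')
s3.upload_file('screenshot.png', 'my-screenshot-bucket', 'screenshots/example.png')
# public URL format; assumes the object is readable (otherwise store bucket/key instead)
image_url = 'https://my-screenshot-bucket.s3.amazonaws.com/screenshots/example.png'
# store only the URL and metadata in MongoDB, not the image bytes
client = MongoClient('mongodb://localhost:27017')
db = client['scraper']
db.pages.insert_one({'url': 'http://example.com/', 'screenshot_url': image_url})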
Posts: 14
Threads: 7
Joined: Mar 2020
Actually, I need to do this with Scrapy. In my program, a Scrapy spider crawls web pages and stores some data about each page (title, description, URL, ...) in MongoDB. So when the spider crawls a page, I also need to take a screenshot of that page and store it in MongoDB, and later retrieve it from MongoDB. This is my spider:
import scrapy
from scrapy.selector import Selector
from search.models import *
import lxml
from lxml.html.clean import Cleaner
import re
from urllib.parse import urlparse
import json
class DSSpider(scrapy.Spider):
    name = "ds_spider"

    def __init__(self, recrawl='no'):
        if app.config['SPIDER_ALLOWED_DOMAINS'] != None:
            self.allowed_domains = app.config['SPIDER_ALLOWED_DOMAINS']
        self.start_urls = ['https://www.imdb.com']
        is_crawled = recrawl.lower() in ['y', 'yes', 't', 'true', '1']
        crawl_list = Crawllist \
            .objects(is_crawled=is_crawled) \
            .limit(app.config['CLOSESPIDER_PAGECOUNT']) \
            .order_by('updated_at')
        for link in crawl_list:
            self.start_urls.append(link.url)

    def parse(self, response):
        schemas = response.xpath('//script[@type="application/ld+json"]//text()').extract()
        for schema in schemas:
            data = json.loads(schema, cls=json.JSONDecoder)
            page_markup = data.get('@type')
        selector = Selector(response)
        # get page title
        page_title = selector.xpath('//title/text()').extract()[0]
        # get page content
        cleaner = Cleaner()
        cleaner.javascript = True
        cleaner.style = True
        page_html = selector.xpath('//body').extract()[0]
        # remove js and css code
        page_html = cleaner.clean_html(page_html)
        # extract text
        html_doc = lxml.html.document_fromstring(page_html)
        page_content = ' '.join(lxml.etree.XPath("//text()")(html_doc))
        page_content += ' ' + page_title
        # remove line breaks, tabs and extra spaces
        page_content = re.sub('\n', ' ', page_content)
        page_content = re.sub('\r', ' ', page_content)
        page_content = re.sub('\t', ' ', page_content)
        page_content = re.sub(' +', ' ', page_content)
        page_content = page_content.strip()
        # get page links
        page_hrefs = response.xpath('//a/@href').extract()
        page_urls = []
        # filter out links with unallowed domains
        for link in page_hrefs:
            # convert links to absolute urls
            url = response.urljoin(link)
            # extract domain from url
            parsed_url = urlparse(url)
            url_domain = parsed_url.netloc
            if url_domain in self.allowed_domains:
                page_urls.append(url)
        # log out some info
        self.log('Page: %s (%s)' % (response.url, page_title))
        # save the page
        if Page.objects(url=response.url).count() == 0:
            page = Page(url=response.url, title=page_title, content=page_content, markup=page_markup).save()
            for url in page_urls:
                page.update(add_to_set__links=Pagelink(url=url).save())
                # add url to crawl list
                if Crawllist.objects(url=url).count() == 0:
                    Crawllist(url=url).save()
            # update crawl list
            Crawllist.objects(url=response.url).update(is_crawled=True)
        else:
            page = Page.objects.get(url=response.url)
            page.update(title=page_title, content=page_content, markup=page_markup)
        for next_pages in response.css('a::attr(href)'):
            next_page = next_pages.extract()
            print(next_page)
            if next_page is not None:
                yield response.follow(next_page, callback=self.parse)
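One way the screenshot step could slot into this spider (a rough sketch only, assuming Selenium with headless Firefox is available; the take_screenshot helper below is hypothetical): re-fetch response.url in a browser from parse(), save the PNG to disk, and keep the resulting file path to attach to the Page document rather than the image bytes.
import os
from selenium import webdriver
def take_screenshot(url, out_dir='screenshots'):
    # fetch the page in headless Firefox and save a screenshot,
    # returning the path of the saved file
    os.makedirs(out_dir, exist_ok=True)
    options = webdriver.FirefoxOptions()
    options.headless = True
    driver = webdriver.Firefox(options=options)
    try:
        driver.get(url)
        # naive file name derived from the url, for illustration only
        filename = os.path.join(out_dir, url.replace('://', '_').replace('/', '_') + '.png')
        driver.save_screenshot(filename)
    finally:
        driver.quit()
    return filename
Inside parse(), something like screenshot_path = take_screenshot(response.url) could then run just before the Page(...).save() call, with the returned path stored on the document (the screenshot_path field is hypothetical).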
Posts: 1,838
Threads: 2
Joined: Apr 2017
You didn't actually answer my question: why do you think you should store the images in the database and retrieve them from there? It's probably going to be less efficient for the database for a start. What exactly is your end goal? It's also a bit strange that you switched from Selenium to Scrapy; why didn't you just say you were using Scrapy to begin with?
Posts: 14
Threads: 7
Joined: Mar 2020
Posts: 1,838
Threads: 2
Joined: Apr 2017
Aug-16-2020, 01:05 PM
(This post was last modified: Aug-16-2020, 01:19 PM by ndc85430.)
If you're just doing this locally, then store the images on the file system and the paths in the database. There's no reason to store the images in the database.
Databases aren't really meant for storing binaries like images. File systems and object storage facilities (like those I mentioned) are.
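A minimal sketch of that pattern with pymongo, assuming MongoDB is running locally and the screenshot has been saved to disk (the database, collection and field names are placeholders):
from pymongo import MongoClient
client = MongoClient('mongodb://localhost:27017')
db = client['scraper']
# store: keep the image on disk, put only its path in the document
db.pages.insert_one({'url': 'http://example.com/', 'screenshot_path': '/data/screenshots/screenshot.png'})
# retrieve: look up the document, then open the file from the stored path
doc = db.pages.find_one({'url': 'http://example.com/'})
with open(doc['screenshot_path'], 'rb') as f:
    image_bytes = f.read()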
Posts: 36
Threads: 0
Joined: May 2020
I see some options that don't involve MongoDB:
1) SQLite (the built-in sqlite3 module), which is simpler and quicker since it stores everything in a single file; a rough sketch follows below.
2) Just a file on your disk: easier, better, quicker.
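For option 1, a rough sketch with the standard sqlite3 module (the file and table names are placeholders):
import sqlite3
conn = sqlite3.connect('screenshots.db')
conn.execute('CREATE TABLE IF NOT EXISTS screenshots (url TEXT PRIMARY KEY, image BLOB)')
# store the PNG produced by the Selenium snippet above as a BLOB
with open('screenshot.png', 'rb') as f:
    png_bytes = f.read()
conn.execute('INSERT OR REPLACE INTO screenshots (url, image) VALUES (?, ?)',
             ('http://example.com/', png_bytes))
conn.commit()
# retrieve it again
row = conn.execute('SELECT image FROM screenshots WHERE url = ?',
                   ('http://example.com/',)).fetchone()
with open('restored.png', 'wb') as f:
    f.write(row[0])
conn.close()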
Posts: 1,838
Threads: 2
Joined: Apr 2017
I don't think there's an issue using MongoDB. I'd disagree that using SQLite or any relational database is easier. It depends how the data are being used. Plus, if they're already using Mongo, they'd have to do more work to switch - introducing a schema, normalising the data, etc.
Posts: 14
Threads: 7
Joined: Mar 2020
Is it possible to store and retrieve large images in MongoDB? If so, which data type should I use for the image file?
Posts: 1,838
Threads: 2
Joined: Apr 2017
Sigh. We're going round in circles here. Don't store the images in the database. Store them in something that was designed for that and store the paths in the database.