Python Forum

Store Screenshot Selenium + MongoDB
I just need to take a screenshot of a webpage and store/retrieve that image in MongoDB. Below is the code I'm using to take the screenshot of the webpage. How do I store it in a MongoDB collection?
from selenium import webdriver
from PIL import Image
from io import BytesIO

fox = webdriver.Firefox()
fox.get('http://example.com/')

# now that we have the preliminary stuff out of the way time to get that image :D
element = fox.find_element_by_id('hlogo') # find part of the page you want image of
location = element.location
size = element.size
png = fox.get_screenshot_as_png() # screenshot of the current window as PNG bytes
fox.quit()

im = Image.open(BytesIO(png)) # uses PIL library to open image in memory

left = location['x']
top = location['y']
right = location['x'] + size['width']
bottom = location['y'] + size['height']


im = im.crop((left, top, right, bottom)) # defines crop points
im.save('screenshot.png') # saves new cropped image
Why do you want to store the image in the database in the first place, rather than some file storage (S3, Google Cloud Storage, ...) and storing the URL in the database? Did you look at the docs on Python drivers for MongoDB?
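For example, a rough sketch of that approach with boto3 and pymongo (the bucket, database, and collection names below are placeholders, not anything from your code, and it assumes the screenshot.png file written by your snippet):

import boto3
from pymongo import MongoClient

# upload the screenshot produced by the Selenium snippet to S3
s3 = boto3.client('s3')
s3.upload_file('screenshot.png', 'my-screenshot-bucket', 'example.com/hlogo.png')
# virtual-hosted style URL; assumes the object is publicly readable
image_url = 'https://my-screenshot-bucket.s3.amazonaws.com/example.com/hlogo.png'

# store only the URL (plus whatever metadata you need) in MongoDB
client = MongoClient('mongodb://localhost:27017')
pages = client['scraper']['pages']  # placeholder database/collection names
pages.insert_one({'url': 'http://example.com/', 'screenshot_url': image_url})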
Actually I need to do that with Scrapy. In my program the Scrapy spider crawls webpages and stores some data about them, like the title, description, URL, etc., in MongoDB. So when the spider crawls a page I also need to take a screenshot of the page and store it in MongoDB, and afterwards I need to retrieve it from MongoDB. This is my spider:
import scrapy
from scrapy.selector import Selector
from search.models import *
import lxml
from lxml.html.clean import Cleaner
import re
from urllib.parse import urlparse
import json


class DSSpider(scrapy.Spider):
    name = "ds_spider"

    def __init__(self, recrawl='no'):
        if app.config['SPIDER_ALLOWED_DOMAINS'] is not None:
            self.allowed_domains = app.config['SPIDER_ALLOWED_DOMAINS']

        self.start_urls = ['https://www.imdb.com']
        is_crawled = recrawl.lower() in ['y', 'yes', 't', 'true', '1']
        crawl_list = Crawllist \
            .objects(is_crawled=is_crawled) \
            .limit(app.config['CLOSESPIDER_PAGECOUNT']) \
            .order_by('updated_at')
        for link in crawl_list:
            self.start_urls.append(link.url)

    def parse(self, response):
        schemas = response.xpath('//script[@type="application/ld+json"]//text()').extract()

        for schema in schemas:
            data = json.loads(schema)
            page_markup = data.get('@type')

        selector = Selector(response)
        # get page title
        page_title = selector.xpath('//title/text()').extract()[0]
        # get page content 
        cleaner = Cleaner()
        cleaner.javascript = True
        cleaner.style = True
        page_html = selector.xpath('//body').extract()[0]
        # remove js and css code
        page_html = cleaner.clean_html(page_html)
        # extract text
        html_doc = lxml.html.document_fromstring(page_html)
        page_content = ' '.join(lxml.etree.XPath("//text()")(html_doc))
        page_content += ' ' + page_title
        # remove line breaks tabs and extra spaces
        page_content = re.sub('\n', ' ', page_content)
        page_content = re.sub('\r', ' ', page_content)
        page_content = re.sub('\t', ' ', page_content)
        page_content = re.sub(' +', ' ', page_content)
        page_content = page_content.strip()
        # get page links
        page_hrefs = response.xpath('//a/@href').extract()
        page_urls = []

        # filter links with unallowed domains
        for link in page_hrefs:
            # convert links to absolute urls
            url = response.urljoin(link)
            # extract domain from url
            parsed_url = urlparse(url)
            url_domain = parsed_url.netloc
            if url_domain in self.allowed_domains:
                page_urls.append(url)
        # log out some info
        self.log('Page: %s (%s)' % (response.url, page_title))


        # save the page
        if Page.objects(url=response.url).count() == 0:
            page = Page(url=response.url, title=page_title, content=page_content, markup=page_markup).save()
            for url in page_urls:
                page.update(add_to_set__links=Pagelink(url=url).save())
                # add url to crawl list
                if Crawllist.objects(url=url).count() == 0:
                    Crawllist(url=url).save()
            # update crawl list
            Crawllist.objects(url=response.url).update(is_crawled=True)
        else:
            page = Page.objects.get(url=response.url)
            page.update(title=page_title, content=page_content, markup=page_markup)

        for next_pages in response.css('a::attr(href)'):
            next_page = next_pages.extract()
            print(next_page)

            if next_page is not None:
                yield response.follow(next_page, callback=self.parse)
You didn't actually answer my question: why do you think you should store the images in the database and retrieve them from there? It's probably going to be less efficient for the database for a start. What exactly is your end goal? It's also a bit strange that you switched from Selenium to Scrapy; why didn't you just say you were using Scrapy to begin with?
Actually, I'm developing a Scrapy spider and I need to take a screenshot of each crawled URL along with the other data. Then it should retrieve all the scraped data, together with the screenshot, when searching by title; something like a little search engine. So I looked around the internet for the screenshot part. I found that Selenium or Splash can do it, but I'm not familiar with either. Also, most of the examples I found store the screenshots on the local drive, but I want to store them in a MongoDB collection. And I'm not sure whether it's possible to use Google Cloud or S3 to achieve the final goal, because I'm already using MongoDB.
If you're just doing this locally, then store the images on the file system and the paths in the database. There's no reason to store the images in the database.

Databases aren't really meant for storing binaries like images. File systems and object storage facilities (like those I mentioned) are.
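A minimal sketch of that approach with pymongo (the directory, database, and collection names are placeholders, and screenshot.png is whatever your Selenium snippet wrote to disk):

from pathlib import Path
from pymongo import MongoClient

screenshots = Path('screenshots')
screenshots.mkdir(exist_ok=True)

# move the screenshot produced earlier into your screenshots directory
shot_path = screenshots / 'example_com.png'
Path('screenshot.png').rename(shot_path)

# keep only the path in MongoDB, next to the page data you already store
client = MongoClient('mongodb://localhost:27017')
pages = client['scraper']['pages']
pages.insert_one({
    'url': 'http://example.com/',
    'title': 'Example Domain',
    'screenshot_path': str(shot_path),
})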
I see some options that don't involve MongoDB:
1) SQLite3, which is much easier and quicker, as it stores all the info in one file.
2) Just a file on your disk: easier, better, quicker.
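For option 1, a quick sketch with the standard-library sqlite3 module (the table and column names are just an example; the PNG bytes go in as a BLOB):

import sqlite3

con = sqlite3.connect('pages.db')  # a single file holding everything
con.execute('CREATE TABLE IF NOT EXISTS pages (url TEXT, title TEXT, screenshot BLOB)')

with open('screenshot.png', 'rb') as f:  # the file from the Selenium snippet
    con.execute('INSERT INTO pages VALUES (?, ?, ?)',
                ('http://example.com/', 'Example Domain', f.read()))
con.commit()
con.close()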
I don't think there's an issue using MongoDB. I'd disagree that using SQLite or any relational database is easier. It depends how the data are being used. Plus, if they're already using Mongo, they'd have to do more work to switch - introducing a schema, normalising the data, etc.
Is it possible to store and retrieve large images in MongoDB? If so, which data type should be used for the image file?
Sigh. We're going round in circles here. Don't store the images in the database. Store them in something that was designed for that and store the paths in the database.
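If you store paths as suggested, retrieval for your title search is just a normal query plus a file read. A minimal sketch, assuming the same placeholder collection names as above:

from pymongo import MongoClient
from PIL import Image

client = MongoClient('mongodb://localhost:27017')
pages = client['scraper']['pages']  # placeholder names

# find pages whose title contains the search term, then load each screenshot from disk
for doc in pages.find({'title': {'$regex': 'example', '$options': 'i'}}):
    image = Image.open(doc['screenshot_path'])
    print(doc['url'], image.size)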