Store Screenshot Selenium + MongoDB - Printable Version

+- Python Forum (https://python-forum.io)
+-- Forum: Python Coding (https://python-forum.io/forum-7.html)
+--- Forum: Web Scraping & Web Development (https://python-forum.io/forum-13.html)
+--- Thread: Store Screenshot Selenium + MongoDB (/thread-29051.html)
Store Screenshot Selenium + MongoDB - Nuwan16 - Aug-15-2020

I just need to take a screenshot of a webpage and store/retrieve that image in MongoDB. Below is the code I'm using to take the screenshot of the webpage. How do I store it in a MongoDB collection?

    from selenium import webdriver
    from PIL import Image
    from io import BytesIO

    fox = webdriver.Firefox()
    fox.get('http://example.com/')

    # find the part of the page you want an image of
    element = fox.find_element_by_id('hlogo')
    location = element.location
    size = element.size

    # takes a screenshot of the entire page
    png = fox.get_screenshot_as_png()
    fox.quit()

    # uses PIL to open the image in memory
    im = Image.open(BytesIO(png))

    # crop to the element's bounding box
    left = location['x']
    top = location['y']
    right = location['x'] + size['width']
    bottom = location['y'] + size['height']
    im = im.crop((left, top, right, bottom))

    im.save('screenshot.png')  # saves the new cropped image
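For completeness, a minimal sketch of what storing those PNG bytes directly in MongoDB could look like, using pymongo's GridFS; the connection string and database name are assumptions, and `png` is the bytes object returned by `get_screenshot_as_png()` above. As the replies below argue, keeping the bytes out of the database is usually the better design.

    from pymongo import MongoClient
    import gridfs

    client = MongoClient('mongodb://localhost:27017/')  # assumes a local mongod
    db = client['scraper']                              # hypothetical database name
    fs = gridfs.GridFS(db)

    # store the screenshot; GridFS chunks the bytes, so the 16 MB
    # per-document limit does not apply
    file_id = fs.put(png, filename='screenshot.png')

    # retrieve it later by its id
    data = fs.get(file_id).read()
    with open('restored.png', 'wb') as f:
        f.write(data)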
RE: Store Screenshot Selenium + MongoDB - ndc85430 - Aug-16-2020

Why do you want to store the image in the database in the first place, rather than in some file storage (S3, Google Cloud Storage, ...) and storing the URL in the database? Did you look at the docs on Python drivers for MongoDB?

RE: Store Screenshot Selenium + MongoDB - Nuwan16 - Aug-16-2020

Actually, I need to do that with Scrapy. In my program, the Scrapy spider crawls webpages and stores some data about them, like the title, description, URL and so on, in MongoDB. So when the spider crawls a page, I also need to take a screenshot of it and store that in MongoDB, and afterwards retrieve it from MongoDB. This is my spider:

    import scrapy
    from scrapy.selector import Selector
    from search.models import *  # provides app, Page, Pagelink and Crawllist
    import lxml
    from lxml.html.clean import Cleaner
    import re
    from urllib.parse import urlparse
    import json

    class DSSpider(scrapy.Spider):
        name = "ds_spider"

        def __init__(self, recrawl='no'):
            if app.config['SPIDER_ALLOWED_DOMAINS'] is not None:
                self.allowed_domains = app.config['SPIDER_ALLOWED_DOMAINS']
            self.start_urls = ['https://www.imdb.com']
            is_crawled = recrawl.lower() in ['y', 'yes', 't', 'true', '1']
            crawl_list = Crawllist \
                .objects(is_crawled=is_crawled) \
                .limit(app.config['CLOSESPIDER_PAGECOUNT']) \
                .order_by('updated_at')
            for link in crawl_list:
                self.start_urls.append(link.url)

        def parse(self, response):
            schemas = response.xpath('//script[@type="application/ld+json"]//text()').extract()
            for schema in schemas:
                data = json.loads(schema, cls=json.JSONDecoder)
                page_markup = data.get('@type')
            selector = Selector(response)
            # get page title
            page_title = selector.xpath('//title/text()').extract()[0]
            # get page content
            cleaner = Cleaner()
            cleaner.javascript = True
            cleaner.style = True
            page_html = selector.xpath('//body').extract()[0]
            # remove js and css code
            page_html = cleaner.clean_html(page_html)
            # extract text
            html_doc = lxml.html.document_fromstring(page_html)
            page_content = ' '.join(lxml.etree.XPath("//text()")(html_doc))
            page_content += ' ' + page_title
            # remove line breaks, tabs and extra spaces
            page_content = re.sub('\n', ' ', page_content)
            page_content = re.sub('\r', ' ', page_content)
            page_content = re.sub('\t', ' ', page_content)
            page_content = re.sub(' +', ' ', page_content)
            page_content = page_content.strip()
            # get page links
            page_hrefs = response.xpath('//a/@href').extract()
            page_urls = []
            # filter links with unallowed domains
            for link in page_hrefs:
                # convert links to absolute urls
                url = response.urljoin(link)
                # extract domain from url
                parsed_url = urlparse(url)
                url_domain = parsed_url.netloc
                if url_domain in self.allowed_domains:
                    page_urls.append(url)
            # log out some info
            self.log('Page: %s (%s)' % (response.url, page_title))
            # save the page
            if Page.objects(url=response.url).count() == 0:
                page = Page(url=response.url, title=page_title,
                            content=page_content, markup=page_markup).save()
                for url in page_urls:
                    page.update(add_to_set__links=Pagelink(url=url).save())
                    # add url to crawl list
                    if Crawllist.objects(url=url).count() == 0:
                        Crawllist(url=url).save()
                # update crawl list
                Crawllist.objects(url=response.url).update(is_crawled=True)
            else:
                page = Page.objects.get(url=response.url)
                page.update(title=page_title, content=page_content, markup=page_markup)
            for next_pages in response.css('a::attr(href)'):
                next_page = next_pages.extract()
                print(next_page)
                if next_page is not None:
                    yield response.follow(next_page, callback=self.parse)

RE: Store Screenshot Selenium + MongoDB - ndc85430 - Aug-16-2020

You didn't actually answer my question: why do you think you should store the images in the database and retrieve them from there? It's probably going to be less efficient for the database, for a start. What exactly is your end goal? It's also a bit strange that you switched from Selenium to Scrapy; why didn't you just say you were using Scrapy to begin with?

RE: Store Screenshot Selenium + MongoDB - Nuwan16 - Aug-16-2020

Actually, I'm developing a Scrapy spider and I need to take a screenshot of each crawled URL along with the other data. It should then retrieve all the scraped data, including the screenshot, when searching by title - something like a little search engine. So I searched the internet for this screenshot purpose and found that Selenium or Splash can do it, but I'm not familiar with either. Also, most of the examples I found store the screenshots on the local drive, whereas I want to store them in a MongoDB collection. And I don't know whether it's possible to use Google Cloud or S3 to achieve the final goal, because I'm already using MongoDB.

RE: Store Screenshot Selenium + MongoDB - ndc85430 - Aug-16-2020

If you're just doing this locally, then store the images on the file system and the paths in the database. There's no reason to store the images in the database. Databases aren't really meant for storing binaries like images; file systems and object storage facilities (like those I mentioned) are.

RE: Store Screenshot Selenium + MongoDB - BitPythoner - Aug-16-2020

I see some options that don't involve MongoDB:
1) SQLite3, which is much easier and quicker, as it stores all the info in one file.
2) Just a file on your disk: easier, better, quicker.

RE: Store Screenshot Selenium + MongoDB - ndc85430 - Aug-17-2020

I don't think there's an issue with using MongoDB, and I'd disagree that using SQLite or any relational database is easier. It depends on how the data are being used. Plus, if they're already using Mongo, they'd have to do more work to switch - introducing a schema, normalising the data, etc.

RE: Store Screenshot Selenium + MongoDB - Nuwan16 - Aug-17-2020

Is it possible to store and retrieve large images in MongoDB? If we can, which data type should be used for the image file?

RE: Store Screenshot Selenium + MongoDB - ndc85430 - Aug-18-2020

Sigh. We're going round in circles here. Don't store the images in the database. Store them in something that was designed for that, and store the paths in the database.
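On the data-type question: MongoDB stores binary data as the BSON Binary type, and a single document is capped at 16 MB, so anything larger has to go through GridFS. The approach recommended in this thread, though, is the one sketched below: write the screenshot to disk and store only its path alongside the other scraped fields. This is a minimal sketch; the connection string, database, collection and field names are assumptions.

    import hashlib
    from pathlib import Path
    from pymongo import MongoClient

    client = MongoClient('mongodb://localhost:27017/')  # assumes a local mongod
    pages = client['scraper']['pages']                  # hypothetical db/collection

    shot_dir = Path('screenshots')
    shot_dir.mkdir(exist_ok=True)

    def save_page(url, title, png_bytes):
        # write the PNG to the file system under a name derived from the url
        name = hashlib.sha1(url.encode('utf-8')).hexdigest() + '.png'
        path = shot_dir / name
        path.write_bytes(png_bytes)
        # store only the path, alongside the other scraped fields
        pages.update_one({'url': url},
                         {'$set': {'title': title, 'screenshot_path': str(path)}},
                         upsert=True)

    def load_screenshot(title):
        # look the page up by title and read the image back from disk
        doc = pages.find_one({'title': title})
        if doc is None:
            return None
        return Path(doc['screenshot_path']).read_bytes()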