Python Forum
Downloading book preview
#1
The script should open the preview of an Amazon book, go through all the available pages (images), download their content, and print it.
import time 
import subprocess 
from selenium import webdriver 
from urllib.request import urlretrieve

driver = webdriver.Firefox()
driver.get("http://www.amazon.com/War-Peace-Leo-Nikolayevich-Tolstoy/dp/1427030200")
time.sleep(2)

driver.find_element_by_id("imgBlkFront").click()
imageList = set()

time.sleep(5)

while "pointer" in driver.find_element_by_id("sitbReaderRightPageTurner").get_attribute("style"):
	driver.find_element_by_id("sitbReaderRightPageTurner").click()
	time.sleep(2)
	pages = driver.find_elements_by_xpath("//div[@class='pageImage']/div/img")
	for page in pages:
		image = page.get_attribute("src")
		imageList.add(image)
driver.quit()

for image in sorted(imageList):
	urlretrieve(image, "page.jpg")
	p = subprocess.Popen(["tesseract", "page.jpg", "page"], stdout=subprocess.PIPE, stderr=subprocess.PIPE)
	p.wait()
	with open("page.txt", "r") as f:
		print(f.read())
But this is what I get:
Error:
Traceback (most recent call last):
  File "C:\Python36\kodovi\bookpreview.py", line 26, in <module>
    p = subprocess.Popen(["tesseract", "page.jpg", "page"], stdout=subprocess.PIPE, stderr=subprocess.PIPE)
  File "C:\Python36\lib\subprocess.py", line 709, in __init__
    restore_signals, start_new_session)
  File "C:\Python36\lib\subprocess.py", line 997, in _execute_child
    startupinfo)
FileNotFoundError: [WinError 2] The system cannot find the file specified
#2
The script tries to call Tesseract OCR, so Tesseract has to be installed on Windows.
subprocess needs to be able to find tesseract.exe, so first test that it works from the command line.
Output:
C:\Program Files\Tesseract-OCR
λ tesseract
Usage:
  tesseract --help | --help-extra | --version
  tesseract --list-langs
  tesseract imagename outputbase [options...] [configfile...]

OCR options:
  -l LANG[+LANG]        Specify language(s) used for OCR.
NOTE: These options must occur before any configfile.

Single options:
  --help                Show this help message.
  --help-extra          Show extra help for advanced users.
  --version             Show version information.
  --list-langs          List available languages for tesseract engine.
Here I run it from the install folder. You can also add that folder to the Windows Path so it works from anywhere in cmd/cmder.
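If changing the Path is not convenient, the script itself can check whether the executable is discoverable before calling it. A minimal sketch, assuming the default install folder shown in the output above; the helper names `find_tesseract` and `build_ocr_command` are made up for illustration, and the fallback path is an assumption that may differ on your machine:

```python
import shutil


def find_tesseract(fallback=r"C:\Program Files\Tesseract-OCR\tesseract.exe"):
    """Return the tesseract executable found on PATH, else the fallback path.

    The fallback is an assumed install location, not something verified here.
    """
    return shutil.which("tesseract") or fallback


def build_ocr_command(image_path, output_base, executable=None):
    """Build the argument list for: tesseract imagename outputbase."""
    exe = executable or find_tesseract()
    return [exe, image_path, output_base]


# Usage inside the original loop (not executed here):
#   import subprocess
#   subprocess.run(build_ocr_command("page.jpg", "page"), check=True)
```

Passing the full path to the executable avoids the FileNotFoundError when the folder is not on Path, provided the fallback actually points at tesseract.exe.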
#3
It's looking for page.jpg
If I run your code in debugger, here's the exception:
Output:
page.jpg: AttributeError("'FirefoxWebElement' object has no attribute 'jpg'")
From line 26
#4
Hm, I thought that it's going to open page.jpg file.
Any idea what is a better thing to do? I'm not so skilled with tesseract.
#5
Quote: I'm not so skilled with tesseract.
Nor am I, but I'll take a look at the docs.
#6
I couldn't find any limitations on image type, but all the examples show .png.
You can try converting to PNG (with GIMP, or with Pillow: open the JPG and save it as PNG).
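The Pillow route could look roughly like this. It is only a sketch: Pillow is a third-party package (`pip install Pillow`), the helper names `png_name` and `convert_to_png` are made up for illustration, and whether PNG input actually changes the Tesseract result here is untested.

```python
from pathlib import Path


def png_name(jpg_path):
    """Return the same file name with a .png extension."""
    return str(Path(jpg_path).with_suffix(".png"))


def convert_to_png(jpg_path):
    """Open an image with Pillow and save a PNG copy next to it."""
    from PIL import Image  # imported here so the module loads without Pillow

    out_path = png_name(jpg_path)
    with Image.open(jpg_path) as img:
        img.save(out_path)  # Pillow picks the format from the .png extension
    return out_path


# Usage (not executed here): convert_to_png("page.jpg") writes page.png
```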
#7
I'm now checking my folder with the .py files. What is interesting is that yesterday, when I tried this code for the first time, a page.jpg file was created, but it had only the first preview page. Wondering why...
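One possible explanation, assuming that run stopped at the same FileNotFoundError shown in the first post: the loop downloads the first image to page.jpg and then fails on the tesseract call before it fetches the next one. Separately, every iteration writes to the same page.jpg, so earlier pages would be overwritten anyway. A small sketch that gives each page its own file name; the helper `numbered_targets` and the `page_001.jpg` naming scheme are assumptions for illustration:

```python
def numbered_targets(image_urls, prefix="page", ext=".jpg"):
    """Pair each URL with its own file name: page_001.jpg, page_002.jpg, ..."""
    return [
        (url, f"{prefix}_{index:03d}{ext}")
        for index, url in enumerate(sorted(image_urls), start=1)
    ]


# Usage with the set collected by the original script (not executed here):
#   from urllib.request import urlretrieve
#   for url, filename in numbered_targets(imageList):
#       urlretrieve(url, filename)
```

With separate files, a failure on one page no longer hides which pages were already downloaded.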