Python Forum

Full Version: Downloading book preview
The script should open the preview of an Amazon book, list all of the available pages (images), download their content, and print it.
import time 
import subprocess 
from selenium import webdriver 
from urllib.request import urlretrieve

driver = webdriver.Firefox()
driver.get("http://www.amazon.com/War-Peace-Leo-Nikolayevich-Tolstoy/dp/1427030200")
time.sleep(2)

driver.find_element_by_id("imgBlkFront").click()
imageList = set()

time.sleep(5)

while "pointer" in driver.find_element_by_id("sitbReaderRightPageTurner").get_attribute("style"):
	driver.find_element_by_id("sitbReaderRightPageTurner").click()
	time.sleep(2)
	pages = driver.find_elements_by_xpath("//div[@class='pageImage']/div/img")
	for page in pages:
		image = page.get_attribute("src")
		imageList.add(image)
driver.quit()

for image in sorted(imageList):
	urlretrieve(image, "page.jpg")
	p = subprocess.Popen(["tesseract", "page.jpg", "page"], stdout=subprocess.PIPE, stderr=subprocess.PIPE)
	p.wait()
	f = open("page.txt", "r")
	print(f.read())
But this is what I get:
Error:
Traceback (most recent call last):
  File "C:\Python36\kodovi\bookpreview.py", line 26, in <module>
    p = subprocess.Popen(["tesseract", "page.jpg", "page"], stdout=subprocess.PIPE, stderr=subprocess.PIPE)
  File "C:\Python36\lib\subprocess.py", line 709, in __init__
    restore_signals, start_new_session)
  File "C:\Python36\lib\subprocess.py", line 997, in _execute_child
    startupinfo)
FileNotFoundError: [WinError 2] The system cannot find the file specified
The script tries to call the Tesseract OCR Windows install.
So subprocess needs to be able to find tesseract.exe. First test from the command line that it works.
Output:
C:\Program Files\Tesseract-OCR
λ tesseract
Usage:
  tesseract --help | --help-extra | --version
  tesseract --list-langs
  tesseract imagename outputbase [options...] [configfile...]

OCR options:
  -l LANG[+LANG]        Specify language(s) used for OCR.
NOTE: These options must occur before any configfile.

Single options:
  --help                Show this help message.
  --help-extra          Show extra help for advanced users.
  --version             Show version information.
  --list-langs          List available languages for tesseract engine.
Here I ran it from the install folder. You can also add that folder to the Windows Path, so that tesseract works from anywhere in cmd/cmder.
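To check the same thing from Python before calling subprocess, something like the following might help. It is only a sketch: find_tesseract is a made-up helper name, and the fallback folder is an assumption taken from the prompt shown above, so adjust it to wherever Tesseract is actually installed.

```python
import shutil

# Assumed install folder (copied from the prompt in the output above).
DEFAULT_DIR = r"C:\Program Files\Tesseract-OCR"


def find_tesseract(extra_dirs=(DEFAULT_DIR,)):
    """Return the full path to the tesseract executable, or None if not found."""
    # First look on the normal PATH, the same place subprocess would search.
    found = shutil.which("tesseract")
    if found:
        return found
    # Then try each extra folder (hypothetical fallback locations).
    for folder in extra_dirs:
        found = shutil.which("tesseract", path=folder)
        if found:
            return found
    return None


if __name__ == "__main__":
    path = find_tesseract()
    if path is None:
        print("tesseract not found; add its folder to the Windows Path")
    else:
        print("tesseract found at", path)
```

If a path comes back, it can be passed as the first item of the subprocess.Popen argument list in place of the bare "tesseract" string.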
It's looking for page.jpg
If I run your code in debugger, here's the exception:
Output:
page.jpg: AttributeError("'FirefoxWebElement' object has no attribute 'jpg'")
From line 26
Hm, I thought that it's going to open page.jpg file.
Any idea what is a better thing to do? I'm not so skilled with tesseract.
Quote:I'm not so skilled with tesseract.
Nor I, but I'll take a look at the docs.
I couldn't find any limitations on image type, but all the examples show .png.
You can try converting to PNG (with GIMP, or with Pillow: open the file as JPG and save it as PNG).
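A rough sketch of the Pillow route, assuming Pillow is installed (pip install Pillow). jpg_to_png is just an illustrative helper name, not something from the original script:

```python
import os


def jpg_to_png(jpg_path):
    """Open a JPG with Pillow and save a PNG copy next to it; return the new path."""
    # Pillow is a third-party package; it is imported here so the rest of the
    # script still loads when Pillow is missing.
    from PIL import Image

    png_path = os.path.splitext(jpg_path)[0] + ".png"
    with Image.open(jpg_path) as img:
        # Pillow picks the output format from the .png extension.
        img.save(png_path)
    return png_path
```

For example, jpg_to_png("page.jpg") would write page.png, which could then be handed to tesseract instead of the JPG.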
I'm now checking my folder with the .py files. What is interesting is that yesterday, when I tried this code for the first time, a page.jpg file was created, but it contained only the first preview page. I wonder why...
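A possible explanation, going by the traceback: the loop fails at the Popen call on its first pass, so only the first image ever gets downloaded. Separately, every pass saves to the same "page.jpg", so even a working run would keep overwriting that file. A minimal sketch of giving each page its own file name (page_paths and the naming pattern are made up for illustration):

```python
def page_paths(urls, prefix="page"):
    """Pair each image URL with its own numbered file name."""
    return [
        (url, f"{prefix}_{number:03d}.jpg")
        for number, url in enumerate(sorted(urls), start=1)
    ]


if __name__ == "__main__":
    # Placeholder URLs, only to show the resulting names.
    for url, filename in page_paths({"http://example.com/b", "http://example.com/a"}):
        print(url, "->", filename)
```

In the download loop, each (url, filename) pair could then go to urlretrieve(url, filename), with the same filename passed on to tesseract.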