Python Forum

The script should open the preview of an Amazon book, list all the available pages (images), download their content, and print it.

import time 
import subprocess 
from selenium import webdriver 
from urllib.request import urlretrieve

driver = webdriver.Firefox()
driver.get("http://www.amazon.com/War-Peace-Leo-Nikolayevich-Tolstoy/dp/1427030200")
time.sleep(2)

driver.find_element_by_id("imgBlkFront").click()
imageList = set()

time.sleep(5)

while "pointer" in driver.find_element_by_id("sitbReaderRightPageTurner").get_attribute("style"):
	driver.find_element_by_id("sitbReaderRightPageTurner").click()
	time.sleep(2)
	pages = driver.find_elements_by_xpath("//div[@class='pageImage']/div/img")
	for page in pages:
		image = page.get_attribute("src")
		imageList.add(image)
driver.quit()

for image in sorted(imageList):
	urlretrieve(image, "page.jpg")
	p = subprocess.Popen(["tesseract", "page.jpg", "page"], stdout=subprocess.PIPE, stderr=subprocess.PIPE)
	p.wait()
	f = open("page.txt", "r")
	print(f.read())

But this is what I get:

Error:
Traceback (most recent call last):
  File "C:\Python36\kodovi\bookpreview.py", line 26, in <module>
    p = subprocess.Popen(["tesseract", "page.jpg", "page"], stdout=subprocess.PIPE, stderr=subprocess.PIPE)
  File "C:\Python36\lib\subprocess.py", line 709, in __init__
    restore_signals, start_new_session)
  File "C:\Python36\lib\subprocess.py", line 997, in _execute_child
    startupinfo)
FileNotFoundError: [WinError 2] The system cannot find the file specified

The script tries to call the Tesseract OCR Windows install, so subprocess needs to be able to find tesseract.exe.
First test that it works from the command line:

Output:
C:\Program Files\Tesseract-OCR
λ tesseract
Usage:
  tesseract --help | --help-extra | --version
  tesseract --list-langs
  tesseract imagename outputbase [options...] [configfile...]

OCR options:
  -l LANG[+LANG]        Specify language(s) used for OCR.
NOTE: These options must occur before any configfile.

Single options:
  --help                Show this help message.
  --help-extra          Show extra help for advanced users.
  --version             Show version information.
  --list-langs          List available languages for tesseract engine.

Here I ran it from the install folder. You can also add that folder to the Windows Path, so that tesseract works from anywhere in cmd/cmder.
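
To check the same thing from Python, something like this might help. It's only a sketch: `find_tesseract` and `ocr_command` are helper names made up for illustration, and the fallback path assumes the install folder shown in the prompt above. `shutil.which` searches the folders on Path, which is broadly where Windows looks when subprocess is given the bare name `tesseract`.

```python
import shutil
import subprocess
from pathlib import Path

# Assumption: the install folder from the prompt above; adjust if yours differs.
DEFAULT_TESSERACT = Path(r"C:\Program Files\Tesseract-OCR") / "tesseract.exe"


def find_tesseract(name="tesseract", fallback=DEFAULT_TESSERACT):
    """Return a path to the tesseract executable, or None if it cannot be found."""
    found = shutil.which(name)  # searches the folders listed on Path
    if found:
        return found
    if Path(fallback).is_file():  # not on Path, try the assumed install folder
        return str(fallback)
    return None


def ocr_command(exe, image, outputbase):
    """Build the argument list for: tesseract imagename outputbase"""
    return [exe, image, outputbase]


# Possible usage inside the script (not run here):
# exe = find_tesseract()
# if exe is None:
#     print("tesseract not found: add its folder to Path or pass the full path")
# else:
#     subprocess.run(ocr_command(exe, "page.jpg", "page"), check=True)
```

If `find_tesseract()` returns None, the FileNotFoundError above is expected, because Windows has no way to locate tesseract.exe from the bare name.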

It's looking for page.jpg.
If I run your code in a debugger, here's the exception:

Output:
page.jpg: AttributeError("'FirefoxWebElement' object has no attribute 'jpg'")

From line 26.

Hm, I thought it was going to open the page.jpg file.
Any idea what would be a better approach? I'm not so skilled with tesseract.

Quote: I'm not so skilled with tesseract.

Nor am I, but I'll take a look at the docs.

I couldn't find any limitations on image type, but all the examples show .png.
You can try converting to png (with GIMP, or with Pillow: open the jpg and save it as png).
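
A rough sketch of the Pillow route, in case it's useful. `png_path_for` and `convert_to_png` are names made up for illustration, and Pillow is a third-party package that has to be installed separately (`pip install Pillow`). Whether tesseract really needs png here is not something I've confirmed.

```python
from pathlib import Path


def png_path_for(image_path):
    """Return the same path with a .png extension, e.g. page.jpg -> page.png."""
    return Path(image_path).with_suffix(".png")


def convert_to_png(image_path):
    """Open an image with Pillow and save a PNG copy next to it."""
    from PIL import Image  # imported here so the helper above works without Pillow

    target = png_path_for(image_path)
    with Image.open(image_path) as img:
        img.save(target, format="PNG")
    return target


# Possible usage (not run here):
# png_file = convert_to_png("page.jpg")   # writes page.png
```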

I'm now checking my folder with the .py files. What is interesting is that yesterday, when I tried this code for the first time, a page.jpg file was created, but it contained only the first preview page. I wonder why...
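
A possible explanation, going only by the script at the top: every pass of the last loop saves to the same name, page.jpg, so each download would overwrite the previous one. And since the Popen call raised FileNotFoundError on the first pass, the loop stopped after a single download, which would leave just one page on disk. A rough sketch of saving each page under its own name (`page_filename`, `download_pages` and the `pages` folder are names made up for illustration):

```python
from pathlib import Path
from urllib.request import urlretrieve


def page_filename(index, folder="pages"):
    """Return a numbered file name such as pages/page_001.jpg."""
    return Path(folder) / f"page_{index:03d}.jpg"


def download_pages(image_urls, folder="pages"):
    """Download every URL to its own numbered file and return the saved paths."""
    Path(folder).mkdir(parents=True, exist_ok=True)
    saved = []
    for index, url in enumerate(sorted(image_urls), start=1):
        target = page_filename(index, folder)
        urlretrieve(url, str(target))  # one file per page, nothing is overwritten
        saved.append(target)
    return saved


# Possible usage with the set built by the Selenium loop (not run here):
# for path in download_pages(imageList):
#     print(path)
```

Each saved path could then be passed to tesseract in turn, with a matching output base such as page_001.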

Truman

snippsat

Larz60+

Truman

Larz60+

Larz60+

Truman