Downloading book preview

Truman · (This post was last modified: May-13-2019, 11:33 PM by Truman.)

The script should open the preview of an Amazon book, list all the given pages (images), download their content, and print it.

import time
import subprocess
from selenium import webdriver
from urllib.request import urlretrieve

driver = webdriver.Firefox()
driver.get("http://www.amazon.com/War-Peace-Leo-Nikolayevich-Tolstoy/dp/1427030200")
time.sleep(2)

driver.find_element_by_id("imgBlkFront").click()
imageList = set()

time.sleep(5)

while "pointer" in driver.find_element_by_id("sitbReaderRightPageTurner").get_attribute("style"):
	driver.find_element_by_id("sitbReaderRightPageTurner").click()
	time.sleep(2)
	pages = driver.find_elements_by_xpath("//div[@class='pageImage']/div/img")
	for page in pages:
		image = page.get_attribute("src")
		imageList.add(image)
driver.quit()

for image in sorted(imageList):
	urlretrieve(image, "page.jpg")
	p = subprocess.Popen(["tesseract", "page.jpg", "page"], stdout=subprocess.PIPE, stderr=subprocess.PIPE)
	p.wait()
	with open("page.txt", "r") as f:
		print(f.read())

But this is what I get:

Error:
Traceback (most recent call last):
  File "C:\Python36\kodovi\bookpreview.py", line 26, in <module>
    p = subprocess.Popen(["tesseract", "page.jpg", "page"], stdout=subprocess.PIPE, stderr=subprocess.PIPE)
  File "C:\Python36\lib\subprocess.py", line 709, in __init__
    restore_signals, start_new_session)
  File "C:\Python36\lib\subprocess.py", line 997, in _execute_child
    startupinfo)
FileNotFoundError: [WinError 2] The system cannot find the file specified

***snippsat*** · May-14-2019, 12:16 AM

The script tries to call the Tesseract OCR Windows install, so subprocess needs to be able to find tesseract.exe.
Test that it works from the command line first.

Output:
C:\Program Files\Tesseract-OCR
λ tesseract
Usage:
  tesseract --help | --help-extra | --version
  tesseract --list-langs
  tesseract imagename outputbase [options...] [configfile...]

OCR options:
  -l LANG[+LANG]        Specify language(s) used for OCR.
NOTE: These options must occur before any configfile.

Single options:
  --help                Show this help message.
  --help-extra          Show extra help for advanced users.
  --version             Show version information.
  --list-langs          List available languages for tesseract engine.

Here I ran it from the install folder. You can also add that folder to the Windows Path so that it works from anywhere in cmd/cmder.
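To check the same thing from inside Python, here is a minimal sketch. It assumes the install folder shown in the console output above; `find_tesseract` and `ocr_image` are hypothetical helper names, not part of the original script. `shutil.which` searches the Path much as subprocess would, so a `None` result points to the same problem as the `FileNotFoundError`.

```python
import shutil
import subprocess
from pathlib import Path

# Install folder taken from the console output above (assumption: adjust if yours differs).
DEFAULT_DIR = r"C:\Program Files\Tesseract-OCR"


def find_tesseract(extra_dir=DEFAULT_DIR):
    """Return the full path to the tesseract executable, or None if it is not found."""
    # Look on the Path first, which is where subprocess would look.
    found = shutil.which("tesseract")
    if found is None:
        # Fall back to the install folder itself.
        found = shutil.which("tesseract", path=extra_dir)
    return found


def ocr_image(image_path, output_base):
    """Run tesseract on image_path and return the text written to output_base + '.txt'."""
    exe = find_tesseract()
    if exe is None:
        raise FileNotFoundError("tesseract executable not found; check the Windows Path")
    # run() waits for the process to finish; check=True raises if tesseract fails.
    subprocess.run(
        [exe, str(image_path), str(output_base)],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        check=True,
    )
    return Path(str(output_base) + ".txt").read_text(encoding="utf-8")
```

With this in place, the loop body could call `print(ocr_image("page.jpg", "page"))` instead of using `Popen` directly.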

**Larz60+** · May-14-2019, 12:33 AM

It's looking for page.jpg.
If I run your code in a debugger, here's the exception:

Output:
page.jpg: AttributeError("'FirefoxWebElement' object has no attribute 'jpg'")

From line 26

Truman · May-14-2019, 10:01 PM

Hm, I thought it was going to open the page.jpg file.
Any idea what would be a better approach? I'm not so skilled with tesseract.

**Larz60+** · (This post was last modified: May-15-2019, 05:34 AM by Larz60+.)

Quote:I'm not so skilled with tesseract.

Nor am I, but I'll take a look at the docs.

**Larz60+** · (This post was last modified: May-15-2019, 06:02 AM by Larz60+.)

I couldn't find any limitations on image type, but all the examples show .png.
You can try converting to PNG (with GIMP, or with Pillow: open the file as JPG and save it as PNG).

Truman · (This post was last modified: May-15-2019, 10:03 PM by Truman.)

Now checking my folder with .py files. What is interesting is that yesterday, when I tried this code for the first time, a page.jpg file was created, but it contained only the first preview page. Wondering why...
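One likely explanation, judging only from the script and traceback in the first post: the exception is raised at the `Popen` call during the first pass through the loop, so `urlretrieve` runs just once before the script stops. Every download also writes to the same page.jpg, so only one page could ever be on disk at a time. A minimal sketch of giving each page its own file name follows; `page_filenames` is a hypothetical helper, and the `page_001.jpg` naming scheme is an assumption.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse


def page_filenames(image_urls, prefix="page"):
    """Map each image URL to a distinct local file name such as page_001.jpg."""
    names = {}
    for number, url in enumerate(sorted(image_urls), start=1):
        # Keep the extension from the URL path, defaulting to .jpg if there is none.
        suffix = PurePosixPath(urlparse(url).path).suffix or ".jpg"
        names[url] = f"{prefix}_{number:03d}{suffix}"
    return names
```

In the download loop this could be used as `for url, name in page_filenames(imageList).items(): urlretrieve(url, name)`, so that earlier pages are not overwritten.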
