Python Forum
HTML - Save Images From Folder - PIL
#1
Hello,

Good Sunday, all.

I found this code on Stack Overflow. It saves the images referenced in the HTML files in a folder.

I keep getting this error:

soup = bs(open(os.path.join(root, f)).read())
File "C:\Python\Python36-32\lib\encodings\cp1252.py", line 23, in decode
return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x90 in position 254: character maps to <undefined>


import os, os.path
from PIL import Image
from bs4 import BeautifulSoup as bs

path = 'c:/Users/Dan/Desktop/c'

for root, dirs, files in os.walk(path):
    for f in files:
      soup = bs(open(os.path.join(root, f)).read())
      for image in soup.findAll("img"):
        print ("Image: %(src)s" % image)
        im = Image.open(image)
        im.save(path+image["src"], "png")

#https://stackoverflow.com/questions/9610728/how-do-i-extract-images-from-html-files-in-a-directory
I have researched this for days and can't work out what it wants.

Would someone be kind enough to advise on this error?

I appreciate your help.



:)


Python newbie trying to learn the ropes
Reply
#2
Remove read(); this lets BeautifulSoup handle the Unicode decoding.
Because the code is old, you should also pass a parser to BeautifulSoup, e.g. html.parser or lxml.
import os, os.path
from PIL import Image
from bs4 import BeautifulSoup as bs

path = 'C:/code/img'
for root, dirs, files in os.walk(path):
    for f in files:
        soup = bs(open(os.path.join(root, f)), 'lxml')
        for image in soup.find_all("img"):
            image = image.get('src')    
            print(image)
I left out the rest: the code does not download anything, so the image files must be local.
Reply
#3
Hello S,

thank you for the pointer.


It does print the list of images, but then I get this error:
_____________________________________________

â—¾%20%20%20%20%20WPCA1_files/image001.jpg
â—¾%20%20%20%20%20WPCA2_files/image001.jpg
Traceback (most recent call last):


soup = bs(open(os.path.join(root, f)), 'lxml')
File "C:\Python\Python36-32\lib\site-packages\bs4\__init__.py", line 191, in __init__
markup = markup.read()
File "C:\Python\Python36-32\lib\encodings\cp1252.py", line 23, in decode
return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x90 in position 254: character maps to <undefined>
_____________________________________________

any ideas?



:)


Python newbie trying to learn the ropes
Reply
#4
How are you saving the HTML files locally?
You have to be careful and keep UTF-8 all the way.
Quote:â—¾%20%20%20%20%20WPCA1_files/image001.jpg
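You can see that this is messed up; the encoding may already be wrong (saved wrongly) before you read the file in.
Python 3 uses UTF-8 as its default source encoding (Python 2 used ASCII), but open() without an encoding argument uses the locale's preferred encoding, which is cp1252 in your traceback.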
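
Keeping the encoding explicit when reading could look roughly like this. It is only a sketch: the helper name read_html_text and the fallback order (UTF-8 with an optional BOM first, then cp1252) are assumptions for illustration, not part of the original code.

```python
import os
import tempfile


def read_html_text(path, encodings=("utf-8-sig", "cp1252")):
    """Read a file as bytes and decode it with the first encoding that fits.

    Hypothetical helper: the default encoding order is an assumption.
    """
    with open(path, "rb") as handle:
        raw = handle.read()
    for name in encodings:
        try:
            return raw.decode(name)
        except UnicodeDecodeError:
            continue
    # Last resort: decode anyway and mark the undecodable bytes.
    return raw.decode(encodings[0], errors="replace")


# Small demonstration with a temporary UTF-8 file that starts with a BOM.
with tempfile.TemporaryDirectory() as folder:
    sample = os.path.join(folder, "sample.html")
    with open(sample, "wb") as handle:
        handle.write("<p>caf\u00e9</p>".encode("utf-8-sig"))
    text = read_html_text(sample)
    print(ascii(text))
```

The returned string can be passed to BeautifulSoup in place of the open file object, so Python never falls back to the locale encoding.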

You can use chardet to detect the encoding.
Here I run it on a couple of HTML files.
E:\div_code\img
λ chardetect foo.html bar.html
foo.html: UTF-8-SIG with confidence 1.0
bar.html: UTF-8-SIG with confidence 1.0

# Run the code as I posted it
E:\div_code\img
λ python img.py
img_chania.jpg
smiley.gif
The files are in bar.zip if you want to test.

Attached Files

.zip   bar.zip (Size: 436 bytes / Downloads: 174)
Reply
#5
Hello S,

So all my HTML files have to be UTF-8.

The files are just saved from a Word document as Web Page (HTML).

I will do some testing.

Thanks for the help.

Hello S,

I have been testing with the HTML files you gave.

There is no error with them.

So I have to encode all my HTML files as UTF-8.

In the final part, which saves my images, I get an error:

import os, os.path
from PIL import Image
from bs4 import BeautifulSoup as bs
 
path = 'c:/Users/Dan/Desktop/a/'
for root, dirs, files in os.walk(path):
    for f in files:
        soup = bs(open(os.path.join(root, f)), 'lxml')
        
        for image in soup.find_all("img"):
            image = image.get('src')

            
            im = Image.open(os.path.join(root, image["src"]))
            im.save(path+image["src"], "png")
            print(image)
im = Image.open(os.path.join(root, image["src"]))
TypeError: string indices must be integers

My file paths are OK and the code looks OK, but I don't understand this error.

This is the HTML:
<!DOCTYPE html>
<html>
  <body>
    <h2>HTML Image</h2>
    <img src="images/image002.jpg" alt="Flowers in Chania" width="460" height="345">
  </body>
</html>



:)


Python newbie trying to learn the ropes
Reply
#6
You have already extracted the src string from the tag with image.get('src').
image is now a plain string, so you cannot index it with image["src"]; use it directly.
>>> import os 
>>>  
>>> image = 'img_chania.jpg'
>>> root = 'E:/div_code/img'
>>> os.path.join(root, image)
'E:/div_code/img\\img_chania.jpg'
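
Building on the os.path.join call above, here is a minimal sketch of turning a src string into a source path plus a separate .png output path. The helper name image_paths and the out_dir parameter are hypothetical, and decoding %20 with urllib.parse.unquote is an assumption based on the %20 sequences in the earlier output.

```python
import os
from urllib.parse import unquote


def image_paths(root, src, out_dir):
    """Return (source_path, target_path) for an <img> src string.

    Hypothetical helper: root is the folder holding the HTML file, src is
    the string returned by tag.get('src'), out_dir is where PNG copies go.
    """
    # src values taken from HTML may be URL-encoded (e.g. %20 for a space).
    relative = unquote(src)
    source_path = os.path.normpath(os.path.join(root, relative))
    # Change the extension so the PNG does not overwrite the original file.
    stem = os.path.splitext(os.path.basename(relative))[0]
    target_path = os.path.join(out_dir, stem + ".png")
    return source_path, target_path


source, target = image_paths("pages", "images/my%20image001.jpg", "converted")
print(source)
print(target)
```

With Pillow, Image.open(source).save(target, "PNG") would then write the converted copy, provided out_dir already exists (os.makedirs(out_dir, exist_ok=True) creates it).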
Reply
#7
Hello s,


When I use this line to check the path:

print(os.path.join(root, image))

I get the correct path, e.g.

c:/Users/Dan/Desktop/a/images/image001.jpg

When I use it in the complete code, the image does not get saved.




import os, os.path
from PIL import Image
from bs4 import BeautifulSoup as bs
 
path = 'c:/Users/Dan/Desktop/a/'
for root, dirs, files in os.walk(path):
    for f in files:
        soup = bs(open(os.path.join(root, f)), 'lxml')
        
        for image in soup.find_all("img"):
            image = image.get('src')

            
            #im = Image.open(os.path.join(root, image["src"]))
            
            im = Image.open(os.path.join(root, image))    #["src"]))

            im.save(path+image, "png")   # < < < Image not saving


            # print(os.path.join(root, image))
I get this Unicode error:
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 77: character maps to <undefined>

using the HTML file below:
<!DOCTYPE html>
<html>
  <body>
    <h2>HTML Image</h2>
    <img src="images/image001.jpg" alt="Flowers in Chania" width="460" height="345">
  </body>
</html>



:)


Python newbie trying to learn the ropes
Reply
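
One thing that may be worth checking, as an assumption rather than a confirmed diagnosis: os.walk visits every file under the folder, including image files in subfolders such as images/, and opening a JPEG as text would raise this kind of UnicodeDecodeError. A minimal sketch that yields only the HTML files; the name iter_html_files and the extension set are hypothetical.

```python
import os
import tempfile

# Assumed set of file extensions that count as HTML pages.
HTML_EXTENSIONS = {".htm", ".html"}


def iter_html_files(top):
    """Yield paths of HTML files under top, skipping images and other files."""
    for root, dirs, files in os.walk(top):
        for name in files:
            extension = os.path.splitext(name)[1].lower()
            if extension in HTML_EXTENSIONS:
                yield os.path.join(root, name)


# Demonstration on a temporary folder laid out like the one in the thread.
with tempfile.TemporaryDirectory() as top:
    os.makedirs(os.path.join(top, "images"))
    with open(os.path.join(top, "page.html"), "wb") as handle:
        handle.write(b'<img src="images/image001.jpg">')
    with open(os.path.join(top, "images", "image001.jpg"), "wb") as handle:
        # Not a real JPEG, just bytes that are not valid text.
        handle.write(b"\xff\xd8\x90\x81")
    found = [os.path.relpath(path, top) for path in iter_html_files(top)]
    print(found)
```

Each path that comes back can be opened in binary mode (open(path, "rb")) and handed to BeautifulSoup, which accepts bytes and tries to detect the encoding itself.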

