Python Forum

Hello,

Good Sunday, all.

I found this code on Stack Overflow that saves images from HTML files in a folder.

I keep getting an error

soup = bs(open(os.path.join(root, f)).read())
File "C:\Python\Python36-32\lib\encodings\cp1252.py", line 23, in decode
return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x90 in position 254: character maps to <undefined>

import os, os.path
from PIL import Image
from bs4 import BeautifulSoup as bs

path = 'c:/Users/Dan/Desktop/c'

for root, dirs, files in os.walk(path):
    for f in files:
      soup = bs(open(os.path.join(root, f)).read())
      for image in soup.findAll("img"):
        print ("Image: %(src)s" % image)
        im = Image.open(image)
        im.save(path+image["src"], "png")

#https://stackoverflow.com/questions/9610728/how-do-i-extract-images-from-html-files-in-a-directory

I have researched this for days and can't work out what it wants.

Would someone be kind enough to advise on this error?

I appreciate your help

Remove read(); this will let BeautifulSoup handle Unicode.
Since the code is old, you must also pass a parser to BeautifulSoup, e.g. html.parser or lxml.

import os, os.path
from PIL import Image
from bs4 import BeautifulSoup as bs

path = 'C:/code/img'
for root, dirs, files in os.walk(path):
    for f in files:
        soup = bs(open(os.path.join(root, f)), 'lxml')
        for image in soup.find_all("img"):
            image = image.get('src')    
            print(image)

I left out the rest. The code does not download anything, so the image files must be local.

Hello S,

thank you for the pointer.

It does print the image list, but then I get this error:
_____________________________________________

â—¾%20%20%20%20%20WPCA1_files/image001.jpg
â—¾%20%20%20%20%20WPCA2_files/image001.jpg
Traceback (most recent call last):

soup = bs(open(os.path.join(root, f)), 'lxml')
File "C:\Python\Python36-32\lib\site-packages\bs4\__init__.py", line 191, in __init__
markup = markup.read()
File "C:\Python\Python36-32\lib\encodings\cp1252.py", line 23, in decode
return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x90 in position 254: character maps to <undefined>
_____________________________________________

any ideas?

How are you saving the HTML files locally?
You have to be careful and keep UTF-8 all the way.

Quote:â—¾%20%20%20%20%20WPCA1_files/image001.jpg

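You can see that this is garbled; the encoding may already be wrong (saved incorrectly) before you read the file in.
Python 3 uses UTF-8 as its default source encoding (Python 2 used ASCII), but open() without an encoding argument falls back to the platform's locale encoding, which is cp1252 in your traceback.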
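
To illustrate the point about encodings: a minimal sketch, assuming the file really is UTF-8, that passes the encoding to open() explicitly instead of relying on the platform default (cp1252 in the traceback above). The helper name read_html and the sample file are made up for this example. The returned text could then be handed to BeautifulSoup.

```python
import os
import tempfile


def read_html(path, encoding="utf-8"):
    """Read an HTML file as text using an explicit encoding."""
    with open(path, encoding=encoding) as fh:
        return fh.read()


# Hypothetical sample page: U+0410 is stored in UTF-8 as the bytes D0 90,
# and 0x90 is one of the bytes that cp1252 leaves undefined.
tmp_dir = tempfile.mkdtemp()
sample_path = os.path.join(tmp_dir, "sample.html")
with open(sample_path, "w", encoding="utf-8") as fh:
    fh.write("<p>\u0410</p>")

# Decoding as cp1252 fails in the same way as the traceback in the thread.
try:
    read_html(sample_path, encoding="cp1252")
    cp1252_failed = False
except UnicodeDecodeError:
    cp1252_failed = True

# Decoding as UTF-8 succeeds.
text = read_html(sample_path)
print(cp1252_failed, text)
```

If the files turn out not to be UTF-8, the same call with a different encoding name applies; the encoding has to match how the file was actually saved.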

You can use chardet to detect the encoding.
Running it on a couple of HTML files:

E:\div_code\img
λ chardetect foo.html bar.html
foo.html: UTF-8-SIG with confidence 1.0
bar.html: UTF-8-SIG with confidence 1.0

# Run code as i posted
E:\div_code\img
λ python img.py
img_chania.jpg
smiley.gif

The files are in bar.zip if you want to test.
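
If chardet is not available, a rough stdlib-only check can narrow things down. This is not what chardetect does (chardet uses statistical detection). The sketch below only looks for a UTF-8 BOM and then tries a short list of candidate encodings in order. The name guess_encoding and the candidate list are assumptions for illustration.

```python
import codecs


def guess_encoding(raw, candidates=("utf-8", "cp1252")):
    """Return the first candidate encoding that decodes raw, or None."""
    # A UTF-8 byte order mark corresponds to what chardet reports as UTF-8-SIG.
    if raw.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    for name in candidates:
        try:
            raw.decode(name)
        except UnicodeDecodeError:
            continue
        return name
    return None


with_bom = guess_encoding(codecs.BOM_UTF8 + b"<p>hi</p>")
as_utf8 = guess_encoding("\u00e9".encode("utf-8"))
as_cp1252 = guess_encoding("\u00e9".encode("cp1252"))
undecodable = guess_encoding(b"\x90")
print(with_bom, as_utf8, as_cp1252, undecodable)
```

A successful decode does not prove the guess is right, since cp1252 accepts almost any byte, so treat the result as a hint only.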

Hello S,

so all my html files have to be UTF-8

The files are just saved from a Word document as Web Page (HTML).

I will do some testing

thanks for the help

Hello S,

I have been testing on the html files you gave.

There is no error with them.

So I have to encode all my html files as UTF-8

In the final part, which saves my images, I get an error:

import os, os.path
from PIL import Image
from bs4 import BeautifulSoup as bs
 
path = 'c:/Users/Dan/Desktop/a/'
for root, dirs, files in os.walk(path):
    for f in files:
        soup = bs(open(os.path.join(root, f)), 'lxml')
        
        for image in soup.find_all("img"):
            image = image.get('src')

            
            im = Image.open(os.path.join(root, image["src"]))
            im.save(path+image["src"], "png")
            print(image)

im = Image.open(os.path.join(root, image["src"]))
TypeError: string indices must be integers

My file paths are OK and the code looks OK, but I don't understand this error.

This is the html

<!DOCTYPE html>
<html>
  <body>
    <h2>HTML Image</h2>
    <img src="images/image002.jpg" alt="Flowers in Chania" width="460" height="345">
  </body>
</html>

You have already taken the src string out of the tag with image.get('src').
image is now a plain string, so you cannot use image["src"] on it.

>>> import os 
>>>  
>>> image = 'img_chania.jpg'
>>> root = 'E:/div_code/img'
>>> os.path.join(root, image)
'E:/div_code/img\\img_chania.jpg'
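
The same mistake can be reproduced without BeautifulSoup. In this sketch, src stands in for the value returned by image.get('src'), which is a plain str; the variable names are chosen for the example only.

```python
import os

root = "c:/Users/Dan/Desktop/a"    # folder from the post above
src = "images/image002.jpg"        # stands in for image.get('src'): a str

# Indexing a string with a string key raises the TypeError from the thread.
try:
    src["src"]
    error_message = None
except TypeError as exc:
    error_message = str(exc)

# Use the string directly when building the path.
full_path = os.path.join(root, src)
print(error_message)
print(full_path)
```

Keeping the tag and the string in separate variables (for example tag and src) avoids reusing one name for two different types.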

Hello s,

When I use this line to check the path

print(os.path.join(root, image))

I get the correct path, e.g.

c:/Users/Dan/Desktop/a/images/image001.jpg

But when I use it in the complete code, the image does not get saved.

import os, os.path
from PIL import Image
from bs4 import BeautifulSoup as bs
 
path = 'c:/Users/Dan/Desktop/a/'
for root, dirs, files in os.walk(path):
    for f in files:
        soup = bs(open(os.path.join(root, f)), 'lxml')
        
        for image in soup.find_all("img"):
            image = image.get('src')

            
            #im = Image.open(os.path.join(root, image["src"]))
            
            im = Image.open(os.path.join(root, image))    #["src"]))

            im.save(path+image, "png")   # < < < Image not saving


	    # print(os.path.join(root, image))

I also get a Unicode error:
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 77: character maps to <undefined>

using the HTML file below:

<!DOCTYPE html>
<html>
  <body>
    <h2>HTML Image</h2>
    <img src="images/image001.jpg" alt="Flowers in Chania" width="460" height="345">
  </body>
</html>
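
One possible explanation, offered as a guess: os.walk also yields the files inside the images folder, so the loop may be trying to read image001.jpg as HTML text, which would raise a UnicodeDecodeError on its binary bytes. It also looks like path+image points at the original .jpg file, so saving there would write PNG data over it rather than create a new file. The sketch below, using only the standard library, restricts the walk to HTML files and builds a separate .png output path. The names iter_html_files, png_target and the out folder are hypothetical.

```python
import os
import tempfile

HTML_EXTENSIONS = (".html", ".htm")


def iter_html_files(top):
    """Yield paths of HTML files under top, skipping all other files."""
    for root, dirs, files in os.walk(top):
        for name in files:
            if name.lower().endswith(HTML_EXTENSIONS):
                yield os.path.join(root, name)


def png_target(out_dir, src):
    """Build an output path in out_dir with a .png extension for an img src."""
    base = os.path.splitext(os.path.basename(src))[0]
    return os.path.join(out_dir, base + ".png")


# Demo on a throwaway folder that mimics the layout described in the post.
top = tempfile.mkdtemp()
os.makedirs(os.path.join(top, "images"))
with open(os.path.join(top, "page.html"), "w", encoding="utf-8") as fh:
    fh.write('<img src="images/image001.jpg">')
with open(os.path.join(top, "images", "image001.jpg"), "wb") as fh:
    fh.write(b"\xff\xd8\x81\x90")  # placeholder bytes, not a real JPEG

found = sorted(iter_html_files(top))
target = png_target(os.path.join(top, "out"), "images/image001.jpg")
print(found)
print(target)
```

Each path from iter_html_files could then be parsed with BeautifulSoup, and each image saved with im.save(target, "png") after creating the output folder.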

dj99

snippsat
