Hello,
Good Sunday, all.
I found this code on Stack Overflow (SO) that saves the images referenced by the HTML files in a folder.
I keep getting this error:
soup = bs(open(os.path.join(root, f)).read())
File "C:\Python\Python36-32\lib\encodings\cp1252.py", line 23, in decode
return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x90 in position 254: character maps to <undefined>
import os, os.path
from PIL import Image
from bs4 import BeautifulSoup as bs
path = 'c:/Users/Dan/Desktop/c'
for root, dirs, files in os.walk(path):
    for f in files:
        soup = bs(open(os.path.join(root, f)).read())
        for image in soup.findAll("img"):
            print ("Image: %(src)s" % image)
            im = Image.open(image)
            im.save(path+image["src"], "png")
#https://stackoverflow.com/questions/9610728/how-do-i-extract-images-from-html-files-in-a-directory
I have researched this for days and can't work out what it wants.
Would someone please be kind enough to advise on this error?
I appreciate your help.
Remove read(); this will let BeautifulSoup handle Unicode.
As the code is old, you must also tell BS which parser to use, e.g. html.parser or lxml.
import os, os.path
from PIL import Image
from bs4 import BeautifulSoup as bs
path = 'C:/code/img'
for root, dirs, files in os.walk(path):
    for f in files:
        soup = bs(open(os.path.join(root, f)), 'lxml')
        for image in soup.find_all("img"):
            image = image.get('src')
            print(image)
The rest I don't care about; since the code does not download anything, the files must be local.
Hello S,
Thank you for the pointer.
It does print the list of images, but then I get this error:
_____________________________________________
â—¾%20%20%20%20%20WPCA1_files/image001.jpg
â—¾%20%20%20%20%20WPCA2_files/image001.jpg
Traceback (most recent call last):
soup = bs(open(os.path.join(root, f)), 'lxml')
File "C:\Python\Python36-32\lib\site-packages\bs4\__init__.py", line 191, in __init__
markup = markup.read()
File "C:\Python\Python36-32\lib\encodings\cp1252.py", line 23, in decode
return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x90 in position 254: character maps to <undefined>
_____________________________________________
Any ideas?
How are you saving the HTML files locally?
You have to be careful and keep utf-8 all the way.
Quote: â—¾%20%20%20%20%20WPCA1_files/image001.jpg
You can see that this is messed up, so the encoding may already be wrong (saved wrong) before you read it in.
Python 3's default string encoding is utf-8; Python 2 had ascii as its default encoding. Note that open() without an encoding argument still uses the platform's preferred encoding, which is cp1252 in the traceback above.
You can use chardet to detect the encoding.
Running it on a couple of HTML files:
E:\div_code\img
λ chardetect foo.html bar.html
foo.html: UTF-8-SIG with confidence 1.0
bar.html: UTF-8-SIG with confidence 1.0
# Run the code as I posted it
E:\div_code\img
λ python img.py
img_chania.jpg
smiley.gif
The files are in bar.zip if you want to test.
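To make the encoding point concrete, here is a minimal sketch using only the standard library. It assumes the HTML files are UTF-8 (with or without a BOM, as chardetect reported for foo.html and bar.html); the helper name read_html_text and the fallback order are illustrative assumptions, not part of the original code. The idea is to name the encoding explicitly rather than rely on the platform default.

```python
import os
import tempfile

def read_html_text(file_path, encodings=("utf-8-sig", "cp1252")):
    """Return the file's text, trying each encoding in turn (hypothetical helper)."""
    with open(file_path, "rb") as fh:
        raw = fh.read()
    for enc in encodings:
        try:
            return raw.decode(enc)
        except UnicodeDecodeError:
            continue
    # Last resort: decode anyway, replacing undecodable bytes with U+FFFD.
    return raw.decode(encodings[0], errors="replace")

# Small self-check: a UTF-8 file with a BOM and a character whose bytes
# (0xD0 0x90) include 0x90, the byte cp1252 could not decode.
with tempfile.TemporaryDirectory() as tmp:
    demo = os.path.join(tmp, "demo.html")
    with open(demo, "wb") as fh:
        fh.write("\ufeff<p>\u0410</p>".encode("utf-8"))
    text = read_html_text(demo)
    print(text == "<p>\u0410</p>")
```

The returned string could then be passed to bs(text, 'lxml') in place of the open file object.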
Hello S,
So all my HTML files have to be UTF-8?
The files are just saved from a Word document as Web Page (HTML).
I will do some testing.
Thanks for the help.
Hello S,
I have been testing with the HTML files you gave.
There is no error with them.
So I have to encode all my HTML files as UTF-8.
In the final part, which saves my images, I get an error:
import os, os.path
from PIL import Image
from bs4 import BeautifulSoup as bs
path = 'c:/Users/Dan/Desktop/a/'
for root, dirs, files in os.walk(path):
    for f in files:
        soup = bs(open(os.path.join(root, f)), 'lxml')
        for image in soup.find_all("img"):
            image = image.get('src')
            im = Image.open(os.path.join(root, image["src"]))
            im.save(path+image["src"], "png")
            print(image)
im = Image.open(os.path.join(root, image["src"]))
TypeError: string indices must be integers
My file paths are OK and the code looks OK, but I get this error and I don't know why.
This is the HTML:
<!DOCTYPE html>
<html>
<body>
<h2>HTML Image</h2>
<img src="images/image002.jpg" alt="Flowers in Chania" width="460" height="345">
</body>
</html>
image.get('src') has already taken the src value out as a plain string.
So you cannot use image["src"] on that string.
>>> import os
>>>
>>> image = 'img_chania.jpg'
>>> root = 'E:/div_code/img'
>>> os.path.join(root, image)
'E:/div_code/img\\img_chania.jpg'
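Along the same lines, the target path for im.save() can be built with os.path rather than path+image. The sketch below is only an illustration: png_target is a hypothetical helper, and it assumes you want the PNG under a chosen output folder, keeping the sub-folder from src and swapping the extension. It also undoes URL escapes such as %20 with urllib.parse.unquote, on the assumption that src values may be percent-encoded.

```python
import os
from urllib.parse import unquote

def png_target(out_dir, src):
    """Map an <img src> value to a .png path under out_dir (hypothetical helper)."""
    rel = unquote(src)                    # "my%20pic.jpg" -> "my pic.jpg"
    stem, _ext = os.path.splitext(rel)    # drop ".jpg", ".gif", ...
    return os.path.normpath(os.path.join(out_dir, stem + ".png"))

target = png_target("out", "images/image001.jpg")
print(target)

# Before saving with Pillow, the target folder has to exist, e.g.:
#   os.makedirs(os.path.dirname(target), exist_ok=True)
#   Image.open(source_path).save(target, "PNG")   # source_path: the image on disk
```

Keeping the path logic in its own function makes it easy to print and check the result before any image is written.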
Hello S,
When I use this line to check the path:
print(os.path.join(root, image))
I get the correct path,
e.g.
c:/Users/Dan/Desktop/a/images/image001.jpg
But when I use it in the complete code,
the image does not get saved:
import os, os.path
from PIL import Image
from bs4 import BeautifulSoup as bs
path = 'c:/Users/Dan/Desktop/a/'
for root, dirs, files in os.walk(path):
    for f in files:
        soup = bs(open(os.path.join(root, f)), 'lxml')
        for image in soup.find_all("img"):
            image = image.get('src')
            #im = Image.open(os.path.join(root, image["src"]))
            im = Image.open(os.path.join(root, image)) #["src"]))
            im.save(path+image, "png") # < < < Image not saving
            # print(os.path.join(root, image))
Instead I get a Unicode error:
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 77: character maps to <undefined>
when using the HTML file below:
<!DOCTYPE html>
<html>
<body>
<h2>HTML Image</h2>
<img src="images/image001.jpg" alt="Flowers in Chania" width="460" height="345">
</body>
</html>
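One possible explanation for this error, offered as a guess rather than something confirmed in this thread: os.walk() also visits the images folder, so the loop tries to open image001.jpg as text, and bytes such as 0x81 or 0x90 have no mapping in cp1252. A minimal sketch that skips everything except .htm/.html files (iter_html_files is a hypothetical helper name):

```python
import os
import tempfile

def iter_html_files(top):
    """Yield paths of .htm/.html files under top, skipping images and other files."""
    for root, dirs, files in os.walk(top):
        for name in files:
            if name.lower().endswith((".htm", ".html")):
                yield os.path.join(root, name)

# Self-check on a throwaway folder laid out like the example above:
# one HTML page plus an images/ sub-folder holding a non-text file.
with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, "images"))
    for rel in ("page.html", os.path.join("images", "image001.jpg")):
        with open(os.path.join(tmp, rel), "wb") as fh:
            fh.write(b"\x81\x90")
    found = [os.path.relpath(p, tmp) for p in iter_html_files(tmp)]
    print(found)
```

Each yielded path could then be opened with an explicit encoding, or in binary mode ('rb') so that BeautifulSoup detects the encoding itself.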