Posts: 2
Threads: 1
Joined: Dec 2021
I am reading a local copy of an HTML file that displays fine both from my disk and on the internet.
In this code an infinite loop reads each line and looks for "img" so I can create an output file with image names (crudely, at this point).
After running through the file for a while, the code crashes.
The short output lines are for debugging, so I can search the HTML file, since I don't have access to line numbers.
The error also appears at the end of the output.
Error:
presentation at request of Iow
<font face="Times New Roma
Traceback (most recent call last):
File "/Users/mike_mac/Programming/Python/read-HTM.py", line 15, in <module>
txt5=test_file.readline()
File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/codecs.py", line 322, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd7 in position 973: invalid continuation byte
>>>
test_file = open("/Users/mikefirth/Python W/LOUISE-P.htm")
out_file = open("/Users/mikefirth/Python W/LOUISE-P-out.txt","w")

tchr=int(0)
tlin=0
tint=0
tcnt=0
tinp=str(' ')
txt1=str(' ')
txt2=txt1
txt3=txt2
txt4=txt2
txt5=txt2
while 1!=0: #loop
    txt5=test_file.readline()
    tcnt=tcnt+1
    print(txt5[0:30])
    tchr=txt5.find("img")
    if(tchr>0):
        tlin=tlin+1
        print(tcnt,tlin,txt5[tchr:tchr+80],'\n')
Output:
<td valign="top" width="350">
1051 17 img src="ds5040a.jpg" alt="Louise Kelly countryside painting in Firth home abuut
<img src="ds5040b.jpg" alt
1052 18 img src="ds5040b.jpg" alt="Louise Kelly Firth french landscape image in B&W"
</tr>
<tr>
<td align="center" width="19">
<td width="350">Two paintings
University where Louise taught
</td>
<td width="350">Pencil-sketch-
presentation at request of Iow
In president's home, seen stra
</tr>
<tr>
<td align="center" width="19">
<td valign="top" width="350"><
Accession Number: </strong>U99
Object Title: </strong>End of
Date of Work: </strong>Unknown
<strong> Artist: </strong>Kell
<strong> Country: </strong>USA
<strong>Description: </strong>
green, yellow, reds. There is
off in the distance over a hil
fields of grain to left. A row
the road. In the center backgr
several smaller farm buildings
extreme background. White clou
form the sky. There is a 3 1/2
shades of green overall. There
bottom of the frame with the f
Gamble Liljidahl, presented by
<td valign="top" width="350">
<p align="center"> <a href="ds
1083 19 img src="ds4476bw.jpg" alt="Louise Kelly painting, End of the Trail, at ISU" bor
</td>
</tr>
<tr>
<td align="center" width="19">
<a name="5I">5I</a></td>
<td valign="top" width="350"><
Accession Number: </strong>UM8
Object Title: </strong>TRURO,
<strong> Artist: </strong>Kell
<strong> Date of Work: </stron
<strong> Medium: </strong>Oil
boat centered is red, the sea
blue. The boat down right and
while the bluff at left and th
green. The sky is a deeper blu
that are not defined in the sk
<td valign="top" width="350">
<p align="left">Pencil-sketch-
presentation at request of Iow
<font face="Times New Roma
Traceback (most recent call last):
File "/Users/mike_mac/Programming/Python/read-HTM.py", line 15, in <module>
txt5=test_file.readline()
File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/codecs.py", line 322, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd7 in position 973: invalid continuation byte
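For reference, byte 0xd7 is not valid at that point in a UTF-8 stream, but in an 8-bit encoding such as ISO-8859-1 it is the multiplication sign (×), so the file is almost certainly not UTF-8. A minimal sketch of the same line-by-line approach, assuming the file really is ISO-8859-1 (as later posts suggest for this site) and stopping at end of file instead of looping forever:
# Sketch only: the encoding is an assumption based on this thread.
test_file = open("/Users/mikefirth/Python W/LOUISE-P.htm", encoding="iso-8859-1")
out_file = open("/Users/mikefirth/Python W/LOUISE-P-out.txt", "w")
tcnt = 0
tlin = 0
while True:
    txt5 = test_file.readline()
    if txt5 == "":                      # readline() returns "" only at end of file
        break
    tcnt = tcnt + 1
    tchr = txt5.find("img")
    if tchr > 0:
        tlin = tlin + 1
        out_file.write(str(tcnt) + " " + txt5[tchr:tchr + 80] + "\n")
test_file.close()
out_file.close()
print(tlin, "img lines found in", tcnt, "lines")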
Posts: 7,326
Threads: 123
Joined: Sep 2016
Dec-27-2021, 09:21 PM
(This post was last modified: Dec-27-2021, 09:21 PM by snippsat.)
You should use a parser for this. Also, saving/reading an HTML file locally makes it easy to mess up the Unicode encoding; you should keep UTF-8 all the way.
Example.
from bs4 import BeautifulSoup
# Simulate a web page
html = '''\
<body>
<div id='images'>
<a href='image1.html'>My image 1<br /><img src='image1_thumb.jpg'></a>
<a href='image2.html'>My image 1<br /><img src='image2_thumb.jpg'></a>
<a href='image3.html'>My image 3<br /><img src='image3_thumb.jpg'></a>
</div>
</body>
'''
soup = BeautifulSoup(html, 'html.parser')
print([link.text for link in soup.find_all('a')])
print([link.get('src') for link in soup.find_all('img')])
Output:
['My image 1', 'My image 2', 'My image 3']
['image1_thumb.jpg', 'image2_thumb.jpg', 'image3_thumb.jpg']
Reading from a local file, it would look like this.
from bs4 import BeautifulSoup
soup = BeautifulSoup(open('page.html', encoding='utf-8'), 'html.parser')
print([link.text for link in soup.find_all('a')])
print([link.get('src') for link in soup.find_all('img')])
Output:
['My image 1', 'My image 2', 'My image 3']
['image1_thumb.jpg', 'image2_thumb.jpg', 'image3_thumb.jpg']
Posts: 2
Threads: 1
Joined: Dec 2021
I don't want to appear stupid or insulting, but ...
'My image 2' appears in your output but is not in the program simulation.
'Use a parser' is apparently jargon for something in BeautifulSoup that does mystery magic. Does it follow references to other files?
As the output sample suggests, there are over 1080 lines with 'img' in them. Outputting a bracketed list may serve some scraping purpose, but I would like to be able to have alternate/multiple responses to lines found (like also flagging missing 'alt' statements).
Although you can find the file online at https://mikegigi.com/firthg/louise.htm, I (almost) never download; I always work on my local copy and upload. Hand built, mostly. Not sure how Unicode would get in, and, outside of BeautifulSoup, how to open a file in Python with an 8-bit encoding.
Fairly sure my problems stem from old-man thinking rooted in FORTRAN+LISP learned in 1960 (then COBOL, BASIC x6, Pascal, Assembler x4, HTML), and trying to self-teach Python. I much prefer straight code to obscure libraries.
(Dec-27-2021, 09:21 PM)snippsat Wrote: You should use a parser for this. Also, saving/reading an HTML file locally makes it easy to mess up the Unicode encoding; you should keep UTF-8 all the way.
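For the two concrete questions above: open() takes an encoding argument, so an 8-bit file can be read with, for example, encoding='iso-8859-1', and BeautifulSoup exposes each tag's attributes, so a missing alt can be detected with img.get('alt') returning None. A minimal sketch, assuming the local filename LOUISE-P.htm and the ISO-8859-1 encoding that comes up later in the thread:
from bs4 import BeautifulSoup

# Sketch only: filename and encoding are assumptions taken from this thread.
with open("LOUISE-P.htm", encoding="iso-8859-1") as f:
    soup = BeautifulSoup(f, "html.parser")

with open("LOUISE-P-out.txt", "w", encoding="utf-8") as out_file:
    for img in soup.find_all("img"):
        src = img.get("src")
        alt = img.get("alt")            # None when the tag has no alt attribute
        if alt is None:
            out_file.write(f"{src}  <-- missing alt\n")
        else:
            out_file.write(f"{src}  alt: {alt}\n")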
Posts: 582
Threads: 1
Joined: Aug 2019
Dec-28-2021, 10:01 AM
(This post was last modified: Dec-28-2021, 10:01 AM by ibreeden.)
Hi @mikefirth,
(Dec-28-2021, 04:29 AM)mikefirth Wrote: I much prefer straight code to obscure libraries.
I feel a lot like you do. I also want to know what is happening, and I want to stick to a good algorithm.
But in the case of managing complex file formats, I must agree with Snippsat to use a parser for these.
Complex file formats are, for example, HTML, XML, JSON and even CSV. CSV seems the simplest of these, but there are still pitfalls: for example, when a field contains the field separator, or a newline! There are rules to manage these cases, but you would have quite some work writing a CSV reader that implements them all. So in these cases, using the csv module solves a lot of these problems.
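A small illustration of that pitfall (the file content here is made up): a naive split(',') breaks a quoted field apart, while the csv module applies the quoting rules and keeps it intact, embedded newline included.
import csv, io

# Made-up data: the second field contains both a comma and a newline, quoted per the CSV rules.
raw = 'title,description\n"End of the Trail","Oil on canvas, greens and yellows\nwith a 3 1/2 inch frame"\n'

# A naive split breaks the quoted field apart at the embedded comma.
print(raw.splitlines()[1].split(','))

# The csv module applies the quoting rules and keeps the field intact.
for row in csv.reader(io.StringIO(raw)):
    print(row)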
The same goes for more complex formats like XML and HTML. These formats may for example contain comments. A simple program reading a file in this format would not notice this.
In this case, like Snippsat said, I would recommend using an HTML parser like Beautiful Soup.
Posts: 7,326
Threads: 123
Joined: Sep 2016
Dec-28-2021, 10:26 AM
(This post was last modified: Dec-28-2021, 10:26 AM by snippsat.)
(Dec-28-2021, 04:29 AM)mikefirth Wrote: Although you can find the file online at https://mikegigi.com/firthg/louise.htm, I (almost) never download; I always work on my local copy and upload.
I can show an example of how to parse that site.
The site uses the encoding ISO-8859-1; when you use Requests/BeautifulSoup they handle this automatically.
If you save the page locally to disk, you must keep that same encoding.
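A minimal sketch of that, with an assumed local filename louise_local.htm: response.content is raw bytes, so writing it in binary mode keeps the page in its original encoding, and the same encoding is then passed to open() when reading it back.
import requests

url = 'https://mikegigi.com/firthg/louise.htm'
response = requests.get(url)

# response.content is the raw bytes, so the page is written to disk in its original encoding.
with open('louise_local.htm', 'wb') as f_out:
    f_out.write(response.content)

# Read it back with the matching encoding (the site declares ISO-8859-1).
with open('louise_local.htm', encoding='iso-8859-1') as f_in:
    html = f_in.read()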
The image info below comes from the first table; it's an old site, so everything is split up into tables.
import requests
from bs4 import BeautifulSoup

url = 'https://mikegigi.com/firthg/louise.htm'
response = requests.get(url)
soup = BeautifulSoup(response.content, 'lxml')
print(soup.find('title').text)
print('-' * 40)
first_table = soup.select_one('body > div:nth-child(5) > center > table > tbody')
for im in first_table:
    if im.find('img') == -1:
        pass
    else:
        print(im.find('img'))
Output:
<img alt="Louise and Edward Kelly in Carrington ND" border="1" height="196" src="ds4589.jpg" width="242"/>
<img alt="Picture of Louise Kelly in women's club booklet" border="1" height="273" src="louise1b.jpg" width="186"/>
<img alt="Louise Kelly while working on picture Jan 1933" border="1" height="258" src="louise2a.jpg" width="194"/>
<img alt="Louise Kelly working on painting in ad 1930's" border="1" height="256" src="louise3a.jpg" width="250"/>
<img alt="Louise Kelly portrait in Rockport Art Association, Artists of" border="1" height="384" src="lkrockport001b.jpg" width="266"/>
So you can take the info out of alt and src.
Change line 14, the final print, to:
print(im.find('img').get('alt'))
Output:
Louise Minert Kelly
----------------------------------------
Louise and Edward Kelly in Carrington ND
Picture of Louise Kelly in women's club booklet
Louise Kelly while working on picture Jan 1933
Louise Kelly working on painting in ad 1930's
Louise Kelly portrait in Rockport Art Association, Artists of
print(im.find('img').get('src'))
Output:
Louise Minert Kelly
----------------------------------------
ds4589.jpg
louise1b.jpg
louise2a.jpg
louise3a.jpg
lkrockport001b.jpg
So you get the reference link to each image; if you want to download it, the full link is the site URL + ds4589.jpg.
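One way to build that full link is urllib.parse.urljoin, which resolves a relative src against the page URL; a small sketch:
from urllib.parse import urljoin

page_url = 'https://mikegigi.com/firthg/louise.htm'
print(urljoin(page_url, 'ds4589.jpg'))
# https://mikegigi.com/firthg/ds4589.jpg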
For more info about parsing, look at part-1 and part-2.
Posts: 1,095
Threads: 143
Joined: Jul 2017
@mikefirth saurier
Listen to snippsat, unlike me, he really knows what he is talking about!
This still needs tweaking, but does the job. Some pics are not found, not sure why right now.
I think you should be using re.
#! /usr/bin/python3
import re

path2text = '/home/pedro/temp/lovely_louise.html'
with open(path2text) as f:
    lines = f.readlines()
print('lines is', len(lines), 'long')
# get lines with '<img src="'
data = []
for line in lines:
    if '<img src="' in line:
        data.append(line)
print('data is', len(data), 'long')
pattern1 = re.compile('img src="')
pattern2 = re.compile('jpg"')
pattern3 = re.compile('gif"')

def get_Image_name(line):
    start_span = pattern1.search(line)
    start_pos = pattern1.search(line).span()[1]
    # maybe not a jpg
    if not pattern2.search(line) == None:
        end_pos = pattern2.search(line).span()[0] + 3
    if not pattern3.search(line) == None:
        end_pos = pattern3.search(line).span()[0] + 3
    # add more ifs for other images
    img_name = line[start_pos:end_pos]
    return img_name

# a list to take the names
jpg_names = []
# some names are not picked up, need to look at that
for line in data:
    print(line)
    name = get_Image_name(line)
    jpg_names.append(name)
print('jpg_names is', len(jpg_names), 'long')
savename = '/home/pedro/temp/photo_names.txt'
with open(savename, 'w') as f:
    text = '\n'.join(jpg_names)
    f.write(text)
print('All done!')
Posts: 7,326
Threads: 123
Joined: Sep 2016
(Dec-29-2021, 12:06 AM)Pedroski55 Wrote: I think you should be using re.
No.
(Dec-29-2021, 12:06 AM)Pedroski55 Wrote: Some pics are not found, not sure why right now.
Regex and HTML are not best friends; take a look at this post, a funny read.
That said, both BS and lxml have some regex capability built in, and you use regex when the parser can't take you any further, e.g. after calling .text.
Using regex alone on HTML/XML is a bad idea; that's why parsers exist.
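For example, find_all() accepts a compiled regular expression as an attribute filter, so the parser finds the tags and the regex only narrows the attribute value; a small sketch with a made-up snippet of HTML:
import re
from bs4 import BeautifulSoup

html = '<img src="ds5040a.jpg" alt="painting"> <img src="pixel.gif">'
soup = BeautifulSoup(html, 'html.parser')

# The parser finds the <img> tags; the regex only filters the src attribute value.
for img in soup.find_all('img', src=re.compile(r'\.jpg$')):
    print(img.get('src'))
# ds5040a.jpg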
Posts: 1,095
Threads: 143
Joined: Jul 2017
Dec-29-2021, 04:45 AM
(This post was last modified: Dec-29-2021, 04:46 AM by Pedroski55.)
@snippsat Just trying!
Still not too sure how re jumps ahead from the start_pos in start_pos = pattern1.search(line).span()[1], but I get the result I wanted!
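For what it's worth, span() is just the (start, end) index pair of the match, so span()[1] (the same as match.end()) is the first index after the matched text; slicing from there is what jumps past the pattern. A small sketch with an assumed example line:
import re

line = '<img src="ds5040a.jpg" alt="Louise Kelly painting">'
m = re.compile('<img src="').search(line)

print(m.span())        # (0, 10): start and end index of the match
print(m.end())         # 10, the same as m.span()[1]
print(line[m.end():])  # ds5040a.jpg" alt="Louise Kelly painting">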
I believe bowlfred is very good with re, maybe he can straighten out my code!
This gets me the desired result. The problem was, the html is not uniform.
I needed another pattern to cater for the lack of alt="
#! /usr/bin/python3
import re

path2text = '/home/pedro/temp/lovely_louise.html'
with open(path2text) as f:
    lines = f.readlines()
print('lines is', len(lines), 'long')
# get lines with '<img src="'
data = []
for line in lines:
    if '<img src="' in line:
        data.append(line)
print('data is', len(data), 'long')
for d in data:
    print(d)
pattern1 = re.compile('<img src="')
#pattern2 = re.compile('img src=".jpg"')
pattern2 = re.compile('jpg" alt="')
pattern3 = re.compile('gif" alt="')
pattern4 = re.compile('jpg" border="')

def get_Image_name(line):
    start_span = pattern1.search(line).span()
    start_pos = pattern1.search(line).span()[1]
    # maybe not a jpg
    if not pattern2.search(line) == None:
        end_pos = pattern2.search(line).span()[0] + 3
    elif not pattern3.search(line) == None:
        end_pos = pattern3.search(line).span()[0] + 3
    elif not pattern4.search(line) == None:
        end_pos = pattern4.search(line).span()[0] + 3
    # add more ifs for other images
    img_name = line[start_pos:end_pos]
    return img_name

# a list to take the names
jpg_names = []
# some names are not picked up, need to look at that
for line in data:
    print(line)
    name = get_Image_name(line)
    jpg_names.append(name)
print('jpg_names is', len(jpg_names), 'long')
savename = '/home/pedro/temp/photo_names.txt'
for jpeg in jpg_names:
    print('picture is', jpeg)
with open(savename, 'w') as f:
    text = '\n'.join(jpg_names)
    f.write(text)
print('All done!')
Posts: 7,326
Threads: 123
Joined: Sep 2016
Dec-29-2021, 12:10 PM
(This post was last modified: Dec-29-2021, 12:11 PM by snippsat.)
(Dec-29-2021, 04:45 AM)Pedroski55 Wrote: I believe bowlfred is very good with re, maybe he can straighten out my code! This gets me the desired result. The problem was, the html is not uniform.
It's not about being good at regex; it's the wrong tool, so it will be a struggle every time you try to do this with HTML.
I could fix it, as my regex knowledge is good, but why struggle when a parser has already solved this and makes it so much easier?
My code was more for learning purposes, to show how the parser works; we try not to solve the whole task so the OP can have a go.
OK, here is how to download all images on the site, to show how it can be done with the right tool.
import requests, os
from bs4 import BeautifulSoup

url = 'https://mikegigi.com/firthg/louise.htm'
response = requests.get(url)
soup = BeautifulSoup(response.content, 'lxml')
base_url = 'https://mikegigi.com/firthg/'
for im in soup.select('img'):
    image_url = f"{base_url}{im.get('src')}"
    img_name = os.path.basename(image_url)
    response = requests.get(image_url)
    with open(img_name, 'wb') as f_out:
        f_out.write(response.content)
Posts: 1,095
Threads: 143
Joined: Jul 2017
Point taken! I'll do it as above!
The link to stackoverflow was very funny!
I remember reading a rant from a programmer about HTML.
He was complaining that a browser will still display HTML when it is full of mistakes and omissions. He found that very annoying!