Python Forum
HTML file crashes program
#1
I am reading my local copy of an HTML file that works fine both working from my disk and on the internet.
In this code an infinite loop reads each line and looks for "img" to create an output file with image names (crudely at this point.)
After running a while through the code, it crashes.
The short output lines are for debugging, so I can search the HTML file since I don't have line count access.
Error at end of output also
Error:
presentation at request of Iow
<font face="Times New Roma
Traceback (most recent call last):
  File "/Users/mike_mac/Programming/Python/read-HTM.py", line 15, in <module>
    txt5=test_file.readline()
  File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd7 in position 973: invalid continuation byte
>>>
test_file = open("/Users/mikefirth/Python W/LOUISE-P.htm")
out_file = open("/Users/mikefirth/Python W/LOUISE-P-out.txt","w")
tchr=int(0)
tlin=0
tint=0
tcnt=0
tinp=str(' ')
txt1=str(' ')
txt2=txt1
txt3=txt2
txt4=txt2
txt5=txt2
while 1!=0:  #loop
    txt5=test_file.readline()
    tcnt=tcnt+1
    print(txt5[0:30])
    tchr=txt5.find("img")
    if(tchr>0):
        tlin=tlin+1
        print(tcnt,tlin,txt5[tchr:tchr+80],'\n')
Output:
<td valign="top" width="350">
1051 17 img src="ds5040a.jpg" alt="Louise Kelly countryside painting in Firth home abuut
<img src="ds5040b.jpg" alt
1052 18 img src="ds5040b.jpg" alt="Louise Kelly Firth french landscape image in B&amp;W"
</tr>
<tr>
<td align="center" width="19">
<td width="350">Two paintings
University where Louise taught
&nbsp;</td>
<td width="350">Pencil-sketch-
presentation at request of Iow
In president's home, seen stra
</tr>
<tr>
<td align="center" width="19">
<td valign="top" width="350"><
Accession Number: </strong>U99
Object Title: </strong>End of
Date of Work: </strong>Unknown
<strong> Artist: </strong>Kell
<strong> Country: </strong>USA
<strong>Description: </strong>
green, yellow, reds. There is
off in the distance over a hil
fields of grain to left. A row
the road. In the center backgr
several smaller farm buildings
extreme background. White clou
form the sky. There is a 3 1/2
shades of green overall. There
bottom of the frame with the f
Gamble Liljidahl, presented by
<td valign="top" width="350">
<p align="center">
<a href="ds
1083 19 img src="ds4476bw.jpg" alt="Louise Kelly painting, End of the Trail, at ISU" bor
</td>
</tr>
<tr>
<td align="center" width="19">
<a name="5I">5I</a></td>
<td valign="top" width="350"><
Accession Number: </strong>UM8
Object Title: </strong>TRURO,
<strong> Artist: </strong>Kell
<strong> Date of Work: </stron
<strong> Medium: </strong>Oil
boat centered is red, the sea
blue. The boat down right and
while the bluff at left and th
green. The sky is a deeper blu
that are not defined in the sk
<td valign="top" width="350">
<p align="left">Pencil-sketch-
presentation at request of Iow
<font face="Times New Roma
Traceback (most recent call last):
  File "/Users/mike_mac/Programming/Python/read-HTM.py", line 15, in <module>
    txt5=test_file.readline()
  File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd7 in position 973: invalid continuation byte
#2
You should parser for this,also saving/read a html file local is easy to mess up Unicode eg should keep utf-8 all the way.
Example.
from bs4 import BeautifulSoup

# Simulate a web page
html = '''\
<body>
  <div id='images'>
    <a href='image1.html'>My image 1<br /><img src='image1_thumb.jpg'></a>
    <a href='image2.html'>My image 1<br /><img src='image2_thumb.jpg'></a>
    <a href='image3.html'>My image 3<br /><img src='image3_thumb.jpg'></a>
  </div>
</body>
'''

soup = BeautifulSoup(html, 'html.parser')
print([link.text for link in soup.find_all('a')])
print([link.get('src') for link in soup.find_all('img')])
Output:
['My image 1', 'My image 2', 'My image 3']
['image1_thumb.jpg', 'image2_thumb.jpg', 'image3_thumb.jpg']
Reading from a local file would look like this.
from bs4 import BeautifulSoup

soup = BeautifulSoup(open('page.html', encoding='utf-8'), 'html.parser')
print([link.text for link in soup.find_all('a')])
print([link.get('src') for link in soup.find_all('img')])
Output:
['My image 1', 'My image 2', 'My image 3']
['image1_thumb.jpg', 'image2_thumb.jpg', 'image3_thumb.jpg']
#3
I don't want to appear stupid or insulting but ...
'My image 2' appears in your output and is not in the program simulation
'You should parser' is apparently jargon for something in BeautifulSoup that does mystery magic. Follows references to other files?
As suggested in the output sample, there are over 1080 lines with 'img' in them. Outputting a bracketed list may serve some scraping purpose, but I would like to be able to have alternate/multiple responses to lines found (such as also flagging missing 'alt' attributes).
Although you can find the file online at https://mikegigi.com/firthg/louise.htm I (almost) never download; I always work on my local copy and upload. Hand built, mostly. I'm not sure how Unicode would get in, or, outside of BeautifulSoup, how to open a file in Python with an 8-bit encoding.
Fairly sure my problems stem from old-man thinking rooted in FORTRAN+LISP learned in 1960 (then COBOL, BASIC x6, Pascal, Assembler x4, HTML), and now trying to self-teach Python. I much prefer straight code to obscure libraries.

(Dec-27-2021, 09:21 PM)snippsat Wrote: You should parser for this,also saving/read a html file local is easy to mess up Unicode eg should keep utf-8 all the way.
#4
Hi @mikefirth ,
(Dec-28-2021, 04:29 AM)mikefirth Wrote: Much prefer straight code to obscure libraries
I feel a lot like you do. I also want to know what is happening. Want to stick to a good algorithm.
But in the case of managing complex file formats, I must agree with Snippsat to use a parser for these.
Complex file formats include, for example, HTML, XML, JSON and even CSV. CSV seems the simplest of these, yet it has pitfalls: a field may contain the field separator, or even a newline! There are rules for handling these cases, but writing a CSV reader that implements them is quite some work. In these cases, using the csv module solves a lot of problems.
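A minimal sketch of the CSV pitfall described above, using only the standard library. The record is made up for illustration: its second field contains a comma, quotes and a newline. Splitting on commas cuts the record in the wrong places, while the csv module applies the quoting rules and returns the original fields.

```python
import csv
import io

# Made-up record: the second field holds the separator, quotes and a newline.
rows = [["ds4476bw.jpg", 'Painting, "End of the Trail"\nat ISU']]

# Write the record with the csv module, which quotes the awkward field.
buffer = io.StringIO()
csv.writer(buffer).writerows(rows)
raw = buffer.getvalue()

# Naive approach: take the first physical line and split on commas.
naive = raw.split("\n")[0].split(",")

# csv.reader follows the quoting rules and rebuilds the original fields.
parsed = list(csv.reader(io.StringIO(raw, newline="")))

print("naive split gives", len(naive), "pieces:", naive)
print("csv.reader round-trips the record:", parsed == rows)
```

The naive split sees three pieces on the first physical line and never notices that the record continues on the next line.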

The same goes for more complex formats like XML and HTML. These formats may, for example, contain comments. A simple program that reads the file line by line would not notice that a tag sits inside a comment.
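To illustrate the point about comments without any third-party library, here is a small sketch built on the standard library's html.parser. The class name ImgCollector and the sample markup are invented for this example. It collects each img tag's src and alt, reports a missing alt, and skips the tag that is commented out, which a plain search for "img" would count.

```python
from html.parser import HTMLParser


class ImgCollector(HTMLParser):
    """Collect (src, alt) for each <img>; alt is None when the attribute is absent."""

    def __init__(self):
        super().__init__()
        self.images = []

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lowercased; attrs is a list of (name, value) pairs.
        if tag == "img":
            attributes = dict(attrs)
            self.images.append((attributes.get("src"), attributes.get("alt")))


# Invented sample: one normal image, one commented out, one without alt.
sample = '''\
<td><img src="ds5040a.jpg" alt="Countryside painting"></td>
<!-- <img src="old.jpg" alt="commented out"> -->
<td><img src="ds4476bw.jpg"></td>
'''

collector = ImgCollector()
collector.feed(sample)
collector.close()

for src, alt in collector.images:
    if alt is None:
        print(src, "-> missing alt")
    else:
        print(src, "->", alt)

# A line-by-line search also counts the commented-out tag.
naive_hits = [line for line in sample.splitlines() if "img" in line]
print("parser found", len(collector.images), "images; naive search found", len(naive_hits))
```

Because each image arrives as a (src, alt) pair, different responses per image (a warning, a separate output file, a counter) can be added inside the loop.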

In this case, like Snippsat said, I would recommend using an HTML parser like Beautiful Soup.
#5
(Dec-28-2021, 04:29 AM)mikefirth Wrote: Altho you can find the file online https://mikegigi.com/firthg/louise.htm I (almost) never download, always work on my local copy and upload.
I can show an example of how to parse that site.
The site uses ISO-8859-1 encoding; Requests and BeautifulSoup handle this automatically.
If you save it locally to disk, you must keep the same encoding.
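For a local copy read without BeautifulSoup, the encoding can be passed to open(). The sketch below is only an illustration: it writes a made-up one-line sample containing the byte 0xD7 to a temporary file, shows that decoding it as UTF-8 raises UnicodeDecodeError, and then reads it as ISO-8859-1. That the real file is ISO-8859-1 is taken from the statement above and not checked here.

```python
import os
import tempfile

# Made-up sample: byte 0xD7 is the multiplication sign in ISO-8859-1,
# but followed by a space it is not a valid UTF-8 sequence.
raw = b'<img src="ds5040a.jpg" alt="8 \xd7 10 print">\n'

with tempfile.NamedTemporaryFile(delete=False, suffix=".htm") as tmp:
    tmp.write(raw)
    path = tmp.name

try:
    # Reading as UTF-8 fails on the 0xD7 byte.
    try:
        with open(path, encoding="utf-8") as f:
            f.read()
        utf8_ok = True
    except UnicodeDecodeError:
        utf8_ok = False

    # Reading with the 8-bit encoding succeeds; every byte maps to a character.
    with open(path, encoding="iso-8859-1") as f:
        text = f.read()
finally:
    os.remove(path)

print("utf-8 decoded cleanly:", utf8_ok)
print(text.strip())
```

With the encoding given explicitly, a readline() loop like the one in the first post should no longer stop at that byte.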

Image info from the first table. It's an old site, so everything is split up into tables.
import requests
from bs4 import BeautifulSoup

url = 'https://mikegigi.com/firthg/louise.htm'
response = requests.get(url)
soup = BeautifulSoup(response.content, 'lxml')
print(soup.find('title').text)
print('-' * 40)
first_table = soup.select_one('body > div:nth-child(5) > center > table > tbody')
for im in first_table:
    if im.find('img') == -1:
        pass
    else:
        print(im.find('img')) 
Output:
<img alt="Louise and Edward Kelly in Carrington ND" border="1" height="196" src="ds4589.jpg" width="242"/>
<img alt="Picture of Louise Kelly in women's club booklet" border="1" height="273" src="louise1b.jpg" width="186"/>
<img alt="Louise Kelly while working on picture Jan 1933" border="1" height="258" src="louise2a.jpg" width="194"/>
<img alt="Louise Kelly working on painting in ad 1930's" border="1" height="256" src="louise3a.jpg" width="250"/>
<img alt="Louise Kelly portrait in Rockport Art Association, Artists of" border="1" height="384" src="lkrockport001b.jpg" width="266"/>
So you can pull the info out of alt and src.
Change line 14:
print(im.find('img').get('alt')) 
Output:
Louise Minert Kelly
----------------------------------------
Louise and Edward Kelly in Carrington ND
Picture of Louise Kelly in women's club booklet
Louise Kelly while working on picture Jan 1933
Louise Kelly working on painting in ad 1930's
Louise Kelly portrait in Rockport Art Association, Artists of
print(im.find('img').get('src'))
Output:
Louise Minert Kelly
----------------------------------------
ds4589.jpg
louise1b.jpg
louise2a.jpg
louise3a.jpg
lkrockport001b.jpg
This gives the relative link to each image. To download one, combine the site URL with the src value, e.g. site URL + ds4589.jpg.
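One way to build the full image address from a relative src is urllib.parse.urljoin from the standard library. This is a small sketch; the two file names are taken from the output above.

```python
from urllib.parse import urljoin

page_url = "https://mikegigi.com/firthg/louise.htm"
sources = ["ds4589.jpg", "louise1b.jpg"]

# urljoin resolves each relative src against the page address,
# replacing the last path segment (louise.htm) with the file name.
image_urls = [urljoin(page_url, src) for src in sources]

for image_url in image_urls:
    print(image_url)
```

urljoin also copes with src values that are already absolute URLs, which plain string concatenation would mangle.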

For more info about parsing, look at part-1 and part-2.
#6
@mikefirth saurier

Listen to snippsat, unlike me, he really knows what he is talking about!

This still needs tweaking, but does the job. Some pics are not found, not sure why right now.

I think you should be using re

#! /usr/bin/python3
import re
  
path2text = '/home/pedro/temp/lovely_louise.html'
with open(path2text) as f:
    lines = f.readlines()

print('lines is', len(lines), 'long')

# get lines with '<img src="'
data = []
for line in lines:
    if '<img src="' in line:
        data.append(line)

print('data is', len(data), 'long')

pattern1 = re.compile('img src="')
pattern2 = re.compile('jpg"')
pattern3 = re.compile('gif"')

def get_Image_name(line):
    start_span = pattern1.search(line)
    start_pos = pattern1.search(line).span()[1]
    # maybe not a jpg
    if not pattern2.search(line) == None:
        end_pos = pattern2.search(line).span()[0] + 3
    if not pattern3.search(line) == None:
        end_pos = pattern3.search(line).span()[0] + 3
    # add more ifs for other images
    img_name = line[start_pos:end_pos]
    return img_name

# a list to take the names
jpg_names = []

# some names are not picked up, need to look at that
for line in data:
    print(line)
    name = get_Image_name(line)
    jpg_names.append(name)

print('jpg_names is', len(jpg_names), 'long')
savename = '/home/pedro/temp/photo_names.txt'

with open(savename, 'w') as f:
    text = '\n'.join(jpg_names)
    f.write(text)

print('All done!')   
#7
(Dec-29-2021, 12:06 AM)Pedroski55 Wrote: I think you should be using re
No.
(Dec-29-2021, 12:06 AM)Pedroski55 Wrote: Some pics are not found, not sure why right now.
Regex and HTML are not best friends; take a look at this post, it's a funny read🎭

That said, both BS and lxml have some regex capability built in,
and you use regex when the parser cannot do any more, for example on the string you get after calling .text.
Using regex alone with HTML/XML is a bad idea; that's why parsers exist.
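A rough sketch of that division of labour, using the standard library's html.parser in place of BS so it runs without extra installs. The parser pulls out the alt strings, and only then does a regex look inside the plain text, here for a four-digit year. The class name AltCollector and the year pattern are assumptions made for this example; the two img tags are copied from the output earlier in the thread.

```python
import re
from html.parser import HTMLParser


class AltCollector(HTMLParser):
    """Collect the alt text of every <img> that has one."""

    def __init__(self):
        super().__init__()
        self.alts = []

    def handle_starttag(self, tag, attrs):
        # Self-closing <img ... /> tags also end up here by default.
        if tag == "img":
            alt = dict(attrs).get("alt")
            if alt:
                self.alts.append(alt)


html = (
    '<img alt="Louise and Edward Kelly in Carrington ND" src="ds4589.jpg"/>'
    '<img alt="Louise Kelly while working on picture Jan 1933" src="louise2a.jpg"/>'
)

collector = AltCollector()
collector.feed(html)
collector.close()

# Step 2: regex on the extracted plain text, not on the markup.
YEAR = re.compile(r"\b(?:19|20)\d{2}\b")
years = {alt: YEAR.findall(alt) for alt in collector.alts}

for alt, found in years.items():
    print(found, "<-", alt)
```

The regex never has to know about tags, quoting or attribute order, because the parser has already dealt with those.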
#8
@snippsatt Just trying!

Still not too sure how to jump ahead from the start_pos in this start_pos = pattern1.search(line).span()[1], but I get the result I wanted!

I believe bowlfred is very good with re, maybe he can straighten out my code!

This gets me the desired result. The problem was, the html is not uniform.

I needed another pattern to cater for the lack of alt="

#! /usr/bin/python3
import re
  
path2text = '/home/pedro/temp/lovely_louise.html'
with open(path2text) as f:
    lines = f.readlines()

print('lines is', len(lines), 'long')

# get lines with '<img src="'
data = []
for line in lines:
    if '<img src="' in line:
        data.append(line)

print('data is', len(data), 'long')

for d in data:
    print(d)

pattern1 = re.compile('<img src="')
#pattern2 = re.compile('img src=".jpg"')
pattern2 = re.compile('jpg" alt="')
pattern3 = re.compile('gif" alt="')
pattern4 = re.compile('jpg" border="')

def get_Image_name(line):
    start_span = pattern1.search(line).span()
    start_pos = pattern1.search(line).span()[1]
    # maybe not a jpg
    if not pattern2.search(line) == None:
        end_pos = pattern2.search(line).span()[0] + 3
    elif not pattern3.search(line) == None:
        end_pos = pattern3.search(line).span()[0] + 3
    elif not pattern4.search(line) == None:
        end_pos = pattern4.search(line).span()[0] + 3
    # add more ifs for other images
    img_name = line[start_pos:end_pos]
    return img_name

# a list to take the names
jpg_names = []

# some names are not picked up, need to look at that
for line in data:
    print(line)
    name = get_Image_name(line)
    jpg_names.append(name)

print('jpg_names is', len(jpg_names), 'long')
savename = '/home/pedro/temp/photo_names.txt'

for jpeg in jpg_names:
    print('picture is', jpeg)

with open(savename, 'w') as f:
    text = '\n'.join(jpg_names)
    f.write(text)

print('All done!')   
#9
(Dec-29-2021, 04:45 AM)Pedroski55 Wrote: I believe bowlfred is very good with re, maybe he can straighten out my code!

This gets me the desired result. The problem was, the html is not uniform.
It's not about being good at regex. It's the wrong tool, so it's going to be a struggle every time you try to do this with HTML.
I could fix it, as my regex knowledge is good, but why struggle when a parser has already solved this and makes it so much easier?

My code was more for learning purposes, to show how a parser works; we try not to solve the whole task so the OP can try.
OK, here is how to download all images on the site, to show how it can be done using the right tool.
import os
import requests
from bs4 import BeautifulSoup

url = 'https://mikegigi.com/firthg/louise.htm'
base_url = 'https://mikegigi.com/firthg/'
response = requests.get(url)
soup = BeautifulSoup(response.content, 'lxml')
for im in soup.select('img'):
    src = im.get('src')
    if not src:
        continue  # skip <img> tags that have no src attribute
    image_url = f"{base_url}{src}"
    img_name = os.path.basename(image_url)
    # Fetch the image and save it in the current directory under its own name
    image_response = requests.get(image_url)
    with open(img_name, 'wb') as f_out:
        f_out.write(image_response.content)
#10
Point taken! Do this as above!

The link to stackoverflow was very funny!

I remember reading a rant from a programmer about html.

He was complaining that a browser will still display html when it is full of mistakes and omissions. He found that very annoying!