Python Forum
Web scraping: os.path.basename
#1
I'm working through the tutorial Web-scraping part-2 and have a question about this code:

import requests
from bs4 import BeautifulSoup
import webbrowser
import os

url = 'http://xkcd.com/1/'
url_get = requests.get(url)
soup = BeautifulSoup(url_get.content, 'lxml')
text = soup.select_one('#ctitle').text
link = soup.find('div', id='comic').find('img').get('src')
link = link.replace('//', 'http://')

# Image title and link
print(f'{text}\n{link}')

# Download image
img_name = os.path.basename(link)
img = requests.get(link)              
with open(img_name, 'wb') as f_out:   
	f_out.write(img.content)          

# Open image in browser or default image viewer
webbrowser.open_new_tab(img_name)
Why is img_name = os.path.basename(link) added? Is that better practice for some reason?
I also ran the code with webbrowser.open_new_tab(link) and it works. The script also works without lines 17-20.
Reply
#2
Your link will be:
http://imgs.xkcd.com/comics/barrel_cropped_(1).jpg
so:
img_name = os.path.basename(link)
will get you:
Output:
'barrel_cropped_(1).jpg'
Doc:
os.path.basename(path)
Return the base name of pathname path. This is the second element of the pair returned by passing path to the function split(). Note that the result of this function is different from the Unix basename program; where basename for '/foo/bar/' returns 'bar', the basename() function returns an empty string ('').
Changed in version 3.6: Accepts a path-like object.
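To make the "second element of the pair" wording concrete, here is a small sketch using the same image link. It only uses os.path.split and os.path.basename from the standard library; the variable names head and tail are just for illustration.

```python
import os

link = 'http://imgs.xkcd.com/comics/barrel_cropped_(1).jpg'

# split() cuts the string at the last slash into a (head, tail) pair.
head, tail = os.path.split(link)

# basename() returns only the tail, i.e. the second element of that pair.
img_name = os.path.basename(link)

print(head)              # everything before the last slash
print(tail)              # the file name after the last slash
print(img_name == tail)  # basename and the tail of split() are the same thing

# With a trailing slash nothing follows the last slash, so the tail is empty.
print(repr(os.path.basename('/foo/bar/')))
```

So in the script, basename is simply a way to get a file name to save the image under.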
Reply
#3
I read that doc explanation before posting, but I didn't understand it and still don't, in particular this part: "This is the second element of the pair returned by passing path to the function split()."
Still, I'm not sure why that is better practice than passing link to webbrowser.open, which also works...
Reply
#4
(Aug-23-2018, 09:53 PM)Truman Wrote: Still, I'm not sure why that is better practice than passing link to webbrowser.open, which also works...
It's not about best practice; it's an example I made of downloading an image from the web to the local hard drive
and then opening that local image in the browser.
Of course I could have just given the link to the webbrowser module,
but then downloading the image and opening the local copy in the browser would not have made sense.
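As a side note on the two steps involved (building the full image URL and picking a local file name), here is a hedged alternative sketch using urllib.parse from the standard library instead of link.replace and os.path.basename. The helper names absolute_image_url and local_name are made up for this example and are not part of the tutorial.

```python
import posixpath
from urllib.parse import urljoin, urlsplit


def absolute_image_url(page_url, src):
    """Resolve a src such as '//imgs.xkcd.com/...' against the page URL."""
    # urljoin fills in the missing scheme from page_url.
    return urljoin(page_url, src)


def local_name(image_url):
    """Return the file name part of a URL, ignoring any query string."""
    # URLs always use forward slashes, so posixpath is used on the URL path.
    return posixpath.basename(urlsplit(image_url).path)


page = 'http://xkcd.com/1/'
src = '//imgs.xkcd.com/comics/barrel_cropped_(1).jpg'

link = absolute_image_url(page, src)
print(link)
print(local_name(link))
```

Unlike replace('//', 'http://'), this only touches the start of the address, and the file name stays clean if the URL ever carries a query string.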
Reply
#5
Thank you. Now I'm trying a more advanced download from a number of pages...
import requests 
from bs4 import BeautifulSoup
import os
import webbrowser 
browser_path = r"C:\Program Files (x86)\Mozilla Firefox\firefox.exe"
webbrowser.register('mozzila', None, webbrowser.BackgroundBrowser(browser_path))

def image_down(start_img, stop_img):
	for numb in range(start_img, stop_img):
		url = f'http://xkcd.com/{numb}'
		url_get = requests.get(url)
		soup = BeautifulSoup(url_get.content, 'html.parser')
		link = soup.find('div', id='comic').find('img').get('src')
		link = link.replace('//', 'http://')
		img_name = os.path.basename(link)
		webbrowser.get('mozzila').open_new_tab(img_name)
		#try:
			#img = requests.get(link)
			#with open(img_name, 'wb') as f_out:
				#f_out.write(img.content)
		#except:
			# Just want images don't care about errors
			#pass
			
if __name__ == '__main__':
	start_img = 1
	stop_img = 5
	image_down(start_img, stop_img)
It opens only the first image, in the first tab; the other 3 tabs say that the server was not found.

Solved it. I just changed line 16 to
webbrowser.get('mozzila').open_new_tab(link)
Ok, now it's all clear.
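For anyone who hits the same thing: with the download block commented out, img_name is only a bare file name, so the browser has nothing local to open unless a file with that name already exists in the working directory (probably why the first image, saved by the earlier script, still opened). A minimal sketch of one way to guard against that; browser_target is a hypothetical helper name, not something from the tutorial.

```python
from pathlib import Path


def browser_target(img_name, link):
    """Return a file:// URI if the image exists locally, else the web link."""
    path = Path(img_name)
    if path.is_file():
        # as_uri() needs an absolute path, hence resolve().
        return path.resolve().as_uri()
    # Nothing was downloaded under that name, so fall back to the URL.
    return link


link = 'http://imgs.xkcd.com/comics/barrel_cropped_(1).jpg'
print(browser_target('barrel_cropped_(1).jpg', link))
```

The result can then be passed to webbrowser.get('mozzila').open_new_tab(...) in place of img_name.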
Reply