Web scraping: os.path.basename

Truman · Aug-22-2018, 11:39 PM

I'm looking at tutorial Web-scraping part-2 and have a question regarding this code:

import requests
from bs4 import BeautifulSoup
import webbrowser
import os

url = 'http://xkcd.com/1/'
url_get = requests.get(url)
soup = BeautifulSoup(url_get.content, 'lxml')
text = soup.select_one('#ctitle').text
link = soup.find('div', id='comic').find('img').get('src')
link = link.replace('//', 'http://')

# Image title and link
print(f'{text}\n{link}')

# Download image
img_name = os.path.basename(link)
img = requests.get(link)
with open(img_name, 'wb') as f_out:
	f_out.write(img.content)

# Open image in browser or default image viewer
webbrowser.open_new_tab(img_name)

Why is img_name = os.path.basename(link) added? Is that better practice for some reason?
I also ran the code with webbrowser.open_new_tab(link) and it works. The script also works without lines 17-20.

**Larz60+** · Aug-23-2018, 12:58 AM

Your link will be:

http://imgs.xkcd.com/comics/barrel_cropped_(1).jpg

so:

img_name = os.path.basename(link)

will get you:

Output:
'barrel_cropped_(1).jpg'

Doc:

os.path.basename(path)
Return the base name of pathname path. This is the second element of the pair returned by passing path to the function split(). Note that the result of this function is different from the Unix basename program; where basename for '/foo/bar/' returns 'bar', the basename() function returns an empty string ('').
Changed in version 3.6: Accepts a path-like object.
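To make the split() sentence in the doc more concrete, here is a small sketch. It uses the xkcd link from above and the '/foo/bar/' path from the doc text, and it does not touch the network.

```python
import os

link = 'http://imgs.xkcd.com/comics/barrel_cropped_(1).jpg'

# split() cuts the string at the last '/' and returns a (head, tail) pair
head, tail = os.path.split(link)
print(head)  # http://imgs.xkcd.com/comics
print(tail)  # barrel_cropped_(1).jpg

# basename() is just the tail, i.e. the second element of that pair
assert os.path.basename(link) == tail

# With a trailing slash nothing follows the last '/', so the tail is empty
print(repr(os.path.basename('/foo/bar/')))  # ''
```

So for a download link, basename() is simply a convenient way to get a file name to save under.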

Truman · Aug-23-2018, 09:53 PM

I read that doc explanation before posting but I didn't understand it, and still don't, in particular this part: "This is the second element of the pair returned by passing path to the function split()."
Still, I'm not sure why that is better practice than passing link to webbrowser.open, which also works...

***snippsat*** · (This post was last modified: Aug-23-2018, 10:17 PM by snippsat.)

(Aug-23-2018, 09:53 PM)Truman Wrote: Still, I'm not sure why that is better practice than passing link to webbrowser.open, which also works...

It's not about best practice; it's an example I made of downloading an image from the web to the local hard drive
and then opening that local image in the browser.
I could of course have just given the image link to the webbrowser module,
but then downloading the image and opening the local copy in the browser would not have made sense.
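The download-then-open flow described above can be sketched roughly as follows. Note that save_image is a hypothetical helper, not part of the tutorial, and the bytes literal stands in for img.content so the sketch runs without a network connection.

```python
import os
import tempfile


def save_image(content, link, folder):
    """Write raw image bytes into folder, named after the last part of link."""
    img_name = os.path.basename(link)        # e.g. 'barrel_cropped_(1).jpg'
    path = os.path.join(folder, img_name)
    with open(path, 'wb') as f_out:
        f_out.write(content)
    return path


link = 'http://imgs.xkcd.com/comics/barrel_cropped_(1).jpg'
fake_content = b'not a real image'           # placeholder for img.content

with tempfile.TemporaryDirectory() as folder:
    saved = save_image(fake_content, link, folder)
    print(os.path.basename(saved))           # barrel_cropped_(1).jpg
    print(os.path.getsize(saved))            # 16
```

Passing the returned path to webbrowser.open_new_tab() would open the local copy, while passing link would open the remote image.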

Truman · (This post was last modified: Aug-23-2018, 11:22 PM by Truman.)

Thank you. Now I'm doing a more advanced download from a number of pages...

import requests 
from bs4 import BeautifulSoup
import os
import webbrowser 
browser_path = r"C:\Program Files (x86)\Mozilla Firefox\firefox.exe"
webbrowser.register('mozzila', None, webbrowser.BackgroundBrowser(browser_path))

def image_down(start_img, stop_img):
	for numb in range(start_img, stop_img):
		url = f'http://xkcd.com/{numb}'
		url_get = requests.get(url)
		soup = BeautifulSoup(url_get.content, 'html.parser')
		link = soup.find('div', id='comic').find('img').get('src')
		link = link.replace('//', 'http://')
		img_name = os.path.basename(link)
		webbrowser.get('mozzila').open_new_tab(img_name)
		#try:
			#img = requests.get(link)
			#with open(img_name, 'wb') as f_out:
				#f_out.write(img.content)
		#except:
			# Just want images don't care about errors
			#pass
			
if __name__ == '__main__':
	start_img = 1
	stop_img = 5
	image_down(start_img, stop_img)

It opens only the first image in the first tab; the other 3 tabs say that the server was not found.

Solved it. I just changed line 16 to

webbrowser.get('mozzila').open_new_tab(link)

Ok, now it's all clear.
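A likely explanation, which is an assumption the thread does not confirm: img_name is only a bare file name such as 'barrel_cropped_(1).jpg', so the browser can show it only if a file with that name already exists in the working directory (for example one saved by the earlier script), while link is a full URL and works either way. The sketch below illustrates that choice with two hypothetical helpers, absolute_link and browser_target. It also uses urljoin instead of link.replace('//', 'http://'), because replace() rewrites every '//' in the string and would turn an already absolute 'http://...' src into 'http:http://...'.

```python
import os
import tempfile
from pathlib import Path
from urllib.parse import urljoin


def absolute_link(page_url, src):
    """Resolve a src such as '//imgs.xkcd.com/...' against the page URL."""
    return urljoin(page_url, src)


def browser_target(link, folder):
    """Return a file:// URI if the image is already saved in folder, else the link."""
    local = Path(folder) / os.path.basename(link)
    if local.is_file():
        return local.resolve().as_uri()
    return link


link = absolute_link('http://xkcd.com/1/',
                     '//imgs.xkcd.com/comics/barrel_cropped_(1).jpg')
print(link)  # http://imgs.xkcd.com/comics/barrel_cropped_(1).jpg

with tempfile.TemporaryDirectory() as folder:
    # Nothing saved yet, so the remote link is the only usable target
    print(browser_target(link, folder) == link)  # True
    # Once a local copy exists, the local file can be opened instead
    (Path(folder) / os.path.basename(link)).write_bytes(b'placeholder')
    print(browser_target(link, folder).startswith('file://'))  # True
```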
