Python Forum
Need help with Web Parsing and Lists
#1


This is my code:
import bs4 as bs
import urllib.request

sauce = urllib.request.urlopen('https://globenewswire.com/Search/NewsSearch?lang=en&exchange=Nasdaq').read()
soup = bs.BeautifulSoup(sauce, 'lxml')
for div in soup.find_all('div', class_='results-link'):
    # Build the absolute URL (note: the name str shadows the built-in)
    str = 'https://globenewswire.com' + div.h1.a['href']
    # splitlines() returns a new one-element list on every iteration
    List = str.splitlines()
    print(List)

['https://globenewswire.com/news-release/2017/11/18/1197160/0/en/IT-Genium-INET-Successfully-Upgraded-to-5-0-0201.html']
['https://globenewswire.com/news-release/2017/11/18/1197159/0/en/IT-Genium-INET-Successfully-Upgraded-to-5-0-0201.html']
['https://globenewswire.com/news-release/2017/11/18/1197158/0/en/IT-INET-Nordic-Production-Successfully-upgraded-to-the-November-20-release-82-17.html']
['https://globenewswire.com/news-release/2017/11/18/1197157/0/en/IT-Genium-INET-Successfully-Upgraded-to-5-0-0201.html']
['https://globenewswire.com/news-release/2017/11/18/1195106/0/en/Aerojet-Rocketdyne-Supports-ULA-Delta-II-Launch-of-Joint-Polar-Satellite-System-1.html']
['https://globenewswire.com/news-release/2017/11/18/1195105/0/en/Voting-for-Stars-of-Science-Season-9-Finale-Opens.html']
['https://globenewswire.com/news-release/2017/11/18/1195104/0/en/SHAREHOLDER-ALERT-Pomerantz-Law-Firm-Reminds-Shareholders-with-Losses-on-their-Investment-in-Intercept-Pharmaceuticals-Inc-of-Class-Action-Lawsuit-and-Upcoming-Deadline-ICPT.html']
['https://globenewswire.com/news-release/2017/11/18/1195103/0/is/Hampi%C3%B0jan-l%C3%BDkur-vi%C3%B0-kaup-%C3%A1-Voot-Beitu.html']
['https://globenewswire.com/news-release/2017/11/18/1195102/0/en/Best-Fitbit-Black-Friday-Cyber-Monday-Deals-of-2017-Compared-by-Deal-Tomato.html']
['https://globenewswire.com/news-release/2017/11/18/1195101/0/en/The-Best-Canon-DSLR-Camera-Black-Friday-2017-Deals-Topic-Reviews-Publish-Round-Up-of-Top-Deals.html']
I've tried calling str.splitlines(), but that just gives me 10 separate lists, each with one URL in it. How do I put all 10 of these URLs into a single list? There would be only two brackets and nine commas, so the list would look like this:
['https://globenewswire.com/news-release/2017/11/18/1197160/0/en/IT-Genium-INET-Successfully-Upgraded-to-5-0-0201.html', 'https://globenewswire.com/news-release/2017/11/18/1197159/0/en/IT-Genium-INET-Successfully-Upgraded-to-5-0-0201.html',
'https://globenewswire.com/news-release/2017/11/18/1197158/0/en/IT-INET-Nordic-Production-Successfully-upgraded-to-the-November-20-release-82-17.html',
'https://globenewswire.com/news-release/2017/11/18/1197157/0/en/IT-Genium-INET-Successfully-Upgraded-to-5-0-0201.html',
'https://globenewswire.com/news-release/2017/11/18/1195106/0/en/Aerojet-Rocketdyne-Supports-ULA-Delta-II-Launch-of-Joint-Polar-Satellite-System-1.html',
'https://globenewswire.com/news-release/2017/11/18/1195105/0/en/Voting-for-Stars-of-Science-Season-9-Finale-Opens.html',
'https://globenewswire.com/news-release/2017/11/18/1195104/0/en/SHAREHOLDER-ALERT-Pomerantz-Law-Firm-Reminds-Shareholders-with-Losses-on-their-Investment-in-Intercept-Pharmaceuticals-Inc-of-Class-Action-Lawsuit-and-Upcoming-Deadline-ICPT.html',
'https://globenewswire.com/news-release/2017/11/18/1195103/0/is/Hampi%C3%B0jan-l%C3%BDkur-vi%C3%B0-kaup-%C3%A1-Voot-Beitu.html',
'https://globenewswire.com/news-release/2017/11/18/1195102/0/en/Best-Fitbit-Black-Friday-Cyber-Monday-Deals-of-2017-Compared-by-Deal-Tomato.html',
'https://globenewswire.com/news-release/2017/11/18/1195101/0/en/The-Best-Canon-DSLR-Camera-Black-Friday-2017-Deals-Topic-Reviews-Publish-Round-Up-of-Top-Deals.html']
I'm trying to get each url into one list so I can assign a variable to each individual url of the list:
a, b, c, d, e, f, g, h, i, j = List
Any help is appreciated. Thank you.
#2
You can just append the URL to a list on each iteration:

import bs4 as bs
import urllib.request
sauce = urllib.request.urlopen('https://globenewswire.com/Search/NewsSearch?lang=en&exchange=Nasdaq').read()
soup = bs.BeautifulSoup(sauce,'lxml')
lst = []
for div in soup.find_all('div', class_='results-link'):
    url = 'https://globenewswire.com{}'.format(div.h1.a['href'])
    lst.append(url)
    
print(lst)
Output:
['https://globenewswire.com/news-release/2017/11/18/1197161/0/en/Veritas-Pharma-Enters-Binding-Letter-of-Intent-to-Secure-ACMPR-License-and-Cannabis-Growing-Facility.html', 'https://globenewswire.com/news-release/2017/11/18/1197160/0/en/IT-Genium-INET-Successfully-Upgraded-to-5-0-0201.html', 'https://globenewswire.com/news-release/2017/11/18/1197159/0/en/IT-Genium-INET-Successfully-Upgraded-to-5-0-0201.html', 'https://globenewswire.com/news-release/2017/11/18/1197158/0/en/IT-INET-Nordic-Production-Successfully-upgraded-to-the-November-20-release-82-17.html', 'https://globenewswire.com/news-release/2017/11/18/1197157/0/en/IT-Genium-INET-Successfully-Upgraded-to-5-0-0201.html', 'https://globenewswire.com/news-release/2017/11/18/1195106/0/en/Aerojet-Rocketdyne-Supports-ULA-Delta-II-Launch-of-Joint-Polar-Satellite-System-1.html', 'https://globenewswire.com/news-release/2017/11/18/1195105/0/en/Voting-for-Stars-of-Science-Season-9-Finale-Opens.html', 'https://globenewswire.com/news-release/2017/11/18/1195104/0/en/SHAREHOLDER-ALERT-Pomerantz-Law-Firm-Reminds-Shareholders-with-Losses-on-their-Investment-in-Intercept-Pharmaceuticals-Inc-of-Class-Action-Lawsuit-and-Upcoming-Deadline-ICPT.html', 'https://globenewswire.com/news-release/2017/11/18/1195103/0/is/Hampi%C3%B0jan-l%C3%BDkur-vi%C3%B0-kaup-%C3%A1-Voot-Beitu.html', 'https://globenewswire.com/news-release/2017/11/18/1195102/0/en/Best-Fitbit-Black-Friday-Cyber-Monday-Deals-of-2017-Compared-by-Deal-Tomato.html']
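The same append loop can also be written as a list comprehension. The following is only a sketch: it assumes the relative hrefs have already been pulled out of the page, so the `hrefs` list is made-up sample data standing in for the `div.h1.a['href']` values, and it swaps the string formatting for `urllib.parse.urljoin` from the standard library.

```python
from urllib.parse import urljoin

BASE = 'https://globenewswire.com'

# Hypothetical sample data standing in for the div.h1.a['href'] values
hrefs = [
    '/news-release/2017/11/18/1197160/0/en/IT-Genium-INET-Successfully-Upgraded-to-5-0-0201.html',
    '/news-release/2017/11/18/1195105/0/en/Voting-for-Stars-of-Science-Season-9-Finale-Opens.html',
]

# One list is built in a single expression instead of appending in a loop
lst = [urljoin(BASE, href) for href in hrefs]
print(lst)
```

With the real page, the comprehension would iterate over `soup.find_all('div', class_='results-link')` and pass `div.h1.a['href']` to `urljoin` instead.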
If you want the list pretty-printed, use the pprint module:
import bs4 as bs
import urllib.request
import pprint
sauce = urllib.request.urlopen('https://globenewswire.com/Search/NewsSearch?lang=en&exchange=Nasdaq').read()
soup = bs.BeautifulSoup(sauce,'lxml')
lst = []
for div in soup.find_all('div', class_='results-link'):
    url = 'https://globenewswire.com{}'.format(div.h1.a['href'])
    lst.append(url)
    
pprint.pprint(lst)
Output:
['https://globenewswire.com/news-release/2017/11/18/1197161/0/en/Veritas-Pharma-Enters-Binding-Letter-of-Intent-to-Secure-ACMPR-License-and-Cannabis-Growing-Facility.html',
 'https://globenewswire.com/news-release/2017/11/18/1197160/0/en/IT-Genium-INET-Successfully-Upgraded-to-5-0-0201.html',
 'https://globenewswire.com/news-release/2017/11/18/1197159/0/en/IT-Genium-INET-Successfully-Upgraded-to-5-0-0201.html',
 'https://globenewswire.com/news-release/2017/11/18/1197158/0/en/IT-INET-Nordic-Production-Successfully-upgraded-to-the-November-20-release-82-17.html',
 'https://globenewswire.com/news-release/2017/11/18/1197157/0/en/IT-Genium-INET-Successfully-Upgraded-to-5-0-0201.html',
 'https://globenewswire.com/news-release/2017/11/18/1195106/0/en/Aerojet-Rocketdyne-Supports-ULA-Delta-II-Launch-of-Joint-Polar-Satellite-System-1.html',
 'https://globenewswire.com/news-release/2017/11/18/1195105/0/en/Voting-for-Stars-of-Science-Season-9-Finale-Opens.html',
 'https://globenewswire.com/news-release/2017/11/18/1195104/0/en/SHAREHOLDER-ALERT-Pomerantz-Law-Firm-Reminds-Shareholders-with-Losses-on-their-Investment-in-Intercept-Pharmaceuticals-Inc-of-Class-Action-Lawsuit-and-Upcoming-Deadline-ICPT.html',
 'https://globenewswire.com/news-release/2017/11/18/1195103/0/is/Hampi%C3%B0jan-l%C3%BDkur-vi%C3%B0-kaup-%C3%A1-Voot-Beitu.html',
 'https://globenewswire.com/news-release/2017/11/18/1195102/0/en/Best-Fitbit-Black-Friday-Cyber-Monday-Deals-of-2017-Compared-by-Deal-Tomato.html']
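On the goal of giving each URL its own variable: unpacking such as `a, b, c, d, e, f, g, h, i, j = lst` only works while the page returns exactly ten results, and raises `ValueError` otherwise. A small sketch of that behaviour, using made-up placeholder URLs (the `example.com` addresses are not from the real page):

```python
# Placeholder URLs (hypothetical), standing in for the scraped list
lst = ['https://example.com/{}'.format(n) for n in range(3)]

# Unpacking needs exactly as many names as there are items
a, b, c = lst

# With the wrong number of names, Python raises ValueError
try:
    a, b = lst
except ValueError as err:
    error_message = str(err)

# Indexing or iterating works for any list length
first = lst[0]
for position, url in enumerate(lst):
    print(position, url)
```

If the number of results can change, indexing (`lst[0]`) or looping over the list is probably the safer choice than ten separate names.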
#3
Thank you for the help again.