Python Forum
How do I avoid Beautiful Soup redirects?
#1

import bs4 as bs
import urllib.request

# Fetch the search results page and parse it
sauce = urllib.request.urlopen('https://globenewswire.com/Search/NewsSearch?lang=en&exchange=NYSE').read()
soup = bs.BeautifulSoup(sauce, 'lxml')

# Collect the first ten result links ('links' avoids shadowing the built-in 'list')
links = []
for div in soup.find_all('div', class_='results-link', limit=10):
    links.append('https://globenewswire.com' + div.h1.a['href'])

# Unpack after the loop, once all ten links have been collected
a, b, c, d, e, f, g, h, i, j = links
So far this works. The only problem is that I have the exchange set to NYSE, but when I request that URL, NYSE is stripped out because the URL is automatically redirected to:
https://globenewswire.com/NewsRoom

(If you copy and paste the original URL from the code into Chrome, it will redirect you to the main newsroom and drop any criteria you previously selected.) How can I keep this from happening?
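One way to confirm the redirect is happening on the server side is to check the final URL urllib actually landed on (a minimal sketch; geturl() reports where the request ended up after any redirects were followed):

import urllib.request

response = urllib.request.urlopen('https://globenewswire.com/Search/NewsSearch?lang=en&exchange=NYSE')
# geturl() returns the URL that was actually retrieved, so if a redirect
# happened it will differ from the URL that was requested
print(response.geturl())  # shows https://globenewswire.com/NewsRoom when redirected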
#2
What sequence of clicks gets you to the page that gives you that search URL? In other words, how would I replicate how you got it? Also, do you have to be logged in to use their search?

You should also use the requests module instead of the standard library's urllib.

When I tried this
import requests
# allow_redirects=False returns the redirect response itself instead of following it
r = requests.get('https://globenewswire.com/Search/NewsSearch?lang=en&exchange=NYSE', allow_redirects=False)
r.content
the response was
>>> r.content
b'<html><head><title>Object moved</title></head><body>\r\n<h2>Object moved to <a href="/NewsRoom">here</a>.</h2>\r\n</body></html>\r\n'
So it looks like the URL you have is stale, or requires being logged in.
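You can also let requests follow the redirect and then inspect the chain it took; r.history holds the intermediate responses:

import requests

r = requests.get('https://globenewswire.com/Search/NewsSearch?lang=en&exchange=NYSE')
# r.history lists the redirect responses that were followed,
# and r.url is where the request finally ended up
for hop in r.history:
    print(hop.status_code, hop.url)
print('final:', r.status_code, r.url)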

Try using the keyword parameter instead:
https://globenewswire.com/Search/NewsSea...d=exchange
#3
You don't need to be logged in to access that URL. All you have to do is select 'NYSE' as one of the options. I tried searching with keyword, and that isn't redirected and it works. However, searching with keyword won't give me all of the results, and it returns some extraneous results as well.

Is there any way I can keep BeautifulSoup from redirecting URLs? Or perhaps go to the main site and then select 'NYSE'?
#4
Quote:All you have to do is select 'NYSE' as one of your options.
Which option? Be specific. I don't see that option anywhere.
#5
On the left side of the webpage there is a column, under the words 'Narrow By:', for selecting categories. When you scroll to the bottom of the page you reach the end of that column, where there is an option called 'Stock Market'. Clicking it reveals many options for selecting which stock market you specifically want. If you click 'NYSE', it is added to your search criteria, the page reloads, and your URL changes.

This is the webpage URL: https://globenewswire.com/NewsRoom
#6
I am not getting redirected anymore:
https://globenewswire.com/Search/NewsSea...hange=NYSE
#7
It's not changing the address because you chose an option from the menu. That menu click just builds the request you are sending, and the page you see is the server's answer. You can build your own requests according to the web address schema: look closely at the address bar and you will see how the URL is put together.
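For example, requests can build the query string for you from a dict (a minimal sketch; 'lang' and 'exchange' are just the parameter names visible in the address bar):

import requests

# Build the search URL from its parts instead of pasting it whole
params = {'lang': 'en', 'exchange': 'NYSE'}
r = requests.get('https://globenewswire.com/Search/NewsSearch', params=params)
print(r.url)  # the full URL requests constructed from the params dict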
"As they say in Mexico 'dosvidaniya'. That makes two vidaniyas."
https://freedns.afraid.org
Reply
#8
The redirect is not happening anymore when I paste the URL into my browser's address bar. However, my script's request is still being redirected, because the results I get back do not all match the selected 'NYSE' criterion. Let me show you what I mean.

This is my input code:
>>> import bs4 as bs
>>> import urllib.request
>>> sauce = urllib.request.urlopen('http://globenewswire.com/Search/NewsSearch?exchange=NYSE').read()
>>> soup = bs.BeautifulSoup(sauce, 'lxml')
>>> links = []
>>> for div in soup.find_all('div', class_='results-link', limit=10):
    links.append('https://globenewswire.com' + div.h1.a['href'])

>>> tickers = []
>>> titles = []
>>> for url in links:
    page = bs.BeautifulSoup(urllib.request.urlopen(url).read(), 'lxml')
    meta = page.find_all(attrs={"name": "ticker"}, limit=1)
    tickers.append(meta[0]['content'])
    titles.append(page.title.text)
Then I go through the results of what I parsed. Printing each stock ticker also prints its stock exchange. They should all be listed on the NYSE, because that is my search criterion, but these are my results:
>>> print(tickers[0])
NYSE:PGH, TSX:PGF
>>> print(tickers[1])
TSX-V:TIC
>>> print(tickers[2])

>>> print(tickers[3])

>>> print(tickers[4])
NYSE:BSCI, NYSE:BSCJ, NYSE:BSCK, NYSE:BSCH, NYSE:GSY, NYSE:BSCL, NYSE:BSCM, NYSE:BSCN, NYSE:BSCO, NYSE:BSCP, NYSE:GTO, NYSE:BSCQ

>>> print(tickers[5])

>>> print(tickers[6])
Nasdaq:BMTC, Nasdaq:RBPAA
>>> print(tickers[7])
Nasdaq:VBTX
>>> print(tickers[8])
TSX:XAU, TSX-V:AGX-H.V
>>> print(tickers[9])
I know that the request is being redirected from the URL with the search criteria (http://globenewswire.com/Search/NewsSear...hange=NYSE) to the main page (http://globenewswire.com/NewsRoom). I know this because not all of the results carry 'NYSE' in the ticker: the search is returning some stocks from the Nasdaq exchange and others from the TSX and TSX-V exchanges. The redirect stopped happening in my browser, but the script is still being redirected.
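(One guess as to why the browser behaves differently: it may already hold the site's cookies. A minimal sketch of carrying cookies across requests with requests.Session(); whether this site actually keys the search off cookies is an assumption:)

import requests

# Assumption: the browser works because it already has the site's cookies.
# A Session persists cookies between requests, so visiting the newsroom
# first might let the search request keep its criteria.
with requests.Session() as s:
    s.get('https://globenewswire.com/NewsRoom')  # pick up any cookies first
    r = s.get('https://globenewswire.com/Search/NewsSearch',
              params={'exchange': 'NYSE'})
    print(r.url)  # check whether we were bounced back to /NewsRoom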
#9
Did you try to change the User-Agent?

import requests

# A browser-like User-Agent can keep the server from treating the script differently
headers = {
    'User-Agent': "Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.1b3pre) Gecko/20090109 Shiretoko/3.1b3pre"}

response = requests.get(url, headers=headers)
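If you would rather stay with urllib from the standard library, the same idea looks like this (a sketch; the URL here is just your search URL):

import urllib.request

url = 'https://globenewswire.com/Search/NewsSearch?exchange=NYSE'
# A Request object lets you attach headers before opening the URL
req = urllib.request.Request(url, headers={
    'User-Agent': "Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.1b3pre) Gecko/20090109 Shiretoko/3.1b3pre"})
sauce = urllib.request.urlopen(req).read()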
"As they say in Mexico 'dosvidaniya'. That makes two vidaniyas."
https://freedns.afraid.org
Reply
#10
response = requests.get(url, headers=headers)
I have never used curly brackets in Python before, and I don't know what response and requests are. Could you show me how that line would fit into my original code?