
Web Scraping Inquiry (Extracting content from a table in a subdomain)
Good Day,

I am very new to Python and am just now learning Beautiful Soup and web scraping methods. I wanted to practice one method by automating a very tedious process, whereby the script does the following:

1. Goes to this link: https://mattrode.com/blog/robinhood-collections-list/

2. Goes to each of the sublinks (numbered 1 to 191) (Example https://robinhood.com/collections/100-most-popular)

3. In that link, extracts the symbols and saves them to a .csv (or text) file.

It does this for every one of the links and for every symbol within each link.

I have searched online for how to do this, but I'm really not sure where to start or what to look for exactly.

Two things I do not yet know how to do: 1. navigate to the subdomains and return to the original link to go into the next one, and 2. extract elements from a table as complex as the ones in the link provided.

If anyone could assist me with this I would be grateful.
Here is some help to get you started since you are new to this; we like it most when you give it a try and post code, even if it's all wrong.
Look at Web-Scraping part-1.
This site also needs a User-Agent header (copy an agent string from a site), or you will not get in.
import requests
from bs4 import BeautifulSoup

# Get the list of collections from the main page
url = 'https://mattrode.com/blog/robinhood-collections-list/'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.125 Safari/537.36'}
response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.content, 'lxml')
# The collection links are in an ordered list (<ol>)
ol_tag = soup.find('ol')
Some tests:
>>> first_tag = ol_tag.find('li')
>>> first_tag
<li><a href="https://robinhood.com/collections/100-most-popular">100 Most Popular Stocks</a></li>
>>> first_tag.a
<a href="https://robinhood.com/collections/100-most-popular">100 Most Popular Stocks</a>
>>> first_tag.a.get('href')
'https://robinhood.com/collections/100-most-popular'
>>> first_tag.text
'100 Most Popular Stocks'
Start by testing and breaking stuff into small pieces when learning this.
As you see, I get the addresses of the sublinks; these can be opened and read the same way.
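If you want all of the sublinks and not just the first one, loop over the <li> tags. A small sketch that reuses ol_tag from the code above:
# Collect the address of every collection in the ordered list
links = [li.a.get('href') for li in ol_tag.find_all('li')]
print(len(links))  # there should be 191 links
print(links[0])    # https://robinhood.com/collections/100-most-popular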
This code gets me the first link:

import requests
from bs4 import BeautifulSoup

# Get info from main site
url = 'https://mattrode.com/blog/robinhood-collections-list/'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.125 Safari/537.36'}
response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.content, 'lxml')
ol_tag = soup.find('ol')
li_tag = ol_tag.find('li')
link = li_tag.a.get('href')
print(link)
How do I go about going into the link to extract the stock ticker symbol?
(Aug-17-2020, 03:35 AM)DustinKlent Wrote: How do I go about going into the link to extract the stock ticker symbol?
You can open that link the same way as you did with the first url.
You could now also make a function for opening a url, to avoid repeating code.
import requests
from bs4 import BeautifulSoup

# Get info from main site
url = 'https://mattrode.com/blog/robinhood-collections-list/'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.125 Safari/537.36'}
response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.content, 'lxml')
ol_tag = soup.find('ol')
li_tag = ol_tag.find('li')
link = li_tag.a.get('href')
#print(link)
response = requests.get(link, headers=headers)
new_soup = BeautifulSoup(response.content, 'lxml')
Test.
>>> new_soup.select_one('div.col-13 > section > div > table > tbody > tr:nth-child(1) > td:nth-child(2) > a')
<a class="rh-hyperlink qD5a4psv-CV7GnWdHxLvn AaXTyP3x99eRIDW0ExfYP" href="/stocks/CPRX" rel=""><div><span>CPRX</span></div></a>
>>> new_soup.select_one('div.col-13 > section > div > table > tbody > tr:nth-child(1) > td:nth-child(2) > a').text
'CPRX'
>>> new_soup.select_one('div.col-13 > section > div > table > tbody > tr:nth-child(1) > td:nth-child(3) > a').text
'$3.43'
>>> new_soup.select_one('div.col-13 > section > div > table > tbody > tr:nth-child(1) > td:nth-child(4) > a').text
'0.59%'
>>> new_soup.select_one('div.col-13 > section > div > table > tbody > tr:nth-child(1) > td:nth-child(5) > a').text
'359.02M'