Web Scraping Inquiry (Extracting content from a table in a subdomain)

DustinKlent · Aug-16-2020, 08:55 PM

Good Day,

I am very new to Python and am just now learning Beautiful Soup and web scraping methods. I wanted to practice by automating a very tedious process, with a script that does the following:

1. Goes to this link: https://mattrode.com/blog/robinhood-collections-list/

2. Goes to each of the sublinks (numbered 1 to 191) (Example https://robinhood.com/collections/100-most-popular)

3. In that link extracts the symbol and saves it to a .csv (or text) file.

and it does it for every one of the links and symbols within each link.

I have searched online for doing this but I'm really not sure where to start or what to look for exactly.

Two things I do not yet know how to do: 1. Navigate to subdomains and return back to the original link to go into the next one and 2. extract elements from a table as complex as the ones in the link provided.

If anyone could assist me with this I would be grateful.

***snippsat*** · (This post was last modified: Aug-16-2020, 09:39 PM by snippsat.)

Here is some help to get started, since you are new to this. We prefer that you give it a try and post your code, even if it's all wrong.
Look at Web-Scraping part-1.
This site also needs a User-Agent header (copy a browser's user agent string), or the request will not get in.

import requests
from bs4 import BeautifulSoup

url = 'https://mattrode.com/blog/robinhood-collections-list/'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.125 Safari/537.36'}
response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.content, 'lxml')
ol_tag = soup.find('ol')

Some test.

>>> first_tag = ol_tag.find('li')
>>> first_tag
<li><a href="https://robinhood.com/collections/100-most-popular">100 Most Popular Stocks</a></li>
>>> first_tag.a
<a href="https://robinhood.com/collections/100-most-popular">100 Most Popular Stocks</a>
>>> first_tag.a.get('href')
'https://robinhood.com/collections/100-most-popular'
>>> first_tag.text
'100 Most Popular Stocks'

Test and break things into small pieces when learning this.
As you can see, I get the address of the sublink; it can be opened and read the same way.
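
To go from the first link to all of them, one option is to loop over every li tag in the list. This is only a minimal sketch and has not been run against the live page: the function name collection_links and the small sample_html string are made up for illustration (the second URL in it is a placeholder), and it uses Python's built-in 'html.parser' instead of 'lxml' so it runs without an extra install.

```python
from bs4 import BeautifulSoup


def collection_links(html):
    """Return (title, href) pairs for every <li><a> inside the first <ol>."""
    soup = BeautifulSoup(html, "html.parser")
    ol_tag = soup.find("ol")
    if ol_tag is None:
        return []
    links = []
    for li_tag in ol_tag.find_all("li"):
        a_tag = li_tag.find("a")
        # Skip list items that have no link in them.
        if a_tag is not None and a_tag.get("href"):
            links.append((a_tag.get_text(strip=True), a_tag["href"]))
    return links


# Tiny stand-in for the real page; the second entry is a made-up placeholder.
sample_html = """
<ol>
  <li><a href="https://robinhood.com/collections/100-most-popular">100 Most Popular Stocks</a></li>
  <li><a href="https://robinhood.com/collections/placeholder">Placeholder Collection</a></li>
</ol>
"""

links = collection_links(sample_html)
for title, href in links:
    print(title, href)
```

With the real page you would pass response.content from the request above instead of sample_html.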

DustinKlent · Aug-17-2020, 03:35 AM

This code gets me the first link:

import requests
from bs4 import BeautifulSoup

# Get info from main site
url = 'https://mattrode.com/blog/robinhood-collections-list/'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.125 Safari/537.36'}
response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.content, 'lxml')
ol_tag = soup.find('ol')
li_tag = ol_tag.find('li')
link = li_tag.a.get('href')
print(link)

How do I go about going into the link to extract the stock ticker symbol?

***snippsat*** · Aug-17-2020, 10:10 AM

(Aug-17-2020, 03:35 AM)DustinKlent Wrote: How do I go about going into the link to extract the stock ticker symbol?

You can open the link the same way as you did with the first URL.
You could now also write a function for opening a URL, to avoid repeating code.

import requests
from bs4 import BeautifulSoup

# Get info from main site
url = 'https://mattrode.com/blog/robinhood-collections-list/'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.125 Safari/537.36'}
response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.content, 'lxml')
ol_tag = soup.find('ol')
li_tag = ol_tag.find('li')
link = li_tag.a.get('href')
#print(link)
response = requests.get(link, headers=headers)
new_soup = BeautifulSoup(response.content, 'lxml')
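
The two request-and-parse steps above are the same apart from the URL, so they could be folded into one helper. A rough sketch, with some assumptions: the name get_soup, the fetch parameter (there only so the function can be tried without a network connection), and the timeout value are my own choices, and 'html.parser' is used in place of 'lxml'.

```python
import requests
from bs4 import BeautifulSoup

HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/84.0.4147.125 Safari/537.36"
    )
}


def get_soup(url, fetch=requests.get):
    """Download url with the browser-like headers and return the parsed page.

    fetch is an assumption added for illustration: it defaults to
    requests.get, but any callable with the same arguments can be passed.
    """
    response = fetch(url, headers=HEADERS, timeout=30)
    response.raise_for_status()  # stop early on 403/404 and similar errors
    return BeautifulSoup(response.content, "html.parser")
```

Usage would then be soup = get_soup(url) for the main page and new_soup = get_soup(link) for each sublink.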

Test.

>>> new_soup.select_one('div.col-13 > section > div > table > tbody > tr:nth-child(1) > td:nth-child(2) > a')
<a class="rh-hyperlink qD5a4psv-CV7GnWdHxLvn AaXTyP3x99eRIDW0ExfYP" href="/stocks/CPRX" rel=""><div><span>CPRX</span></div></a>
>>> new_soup.select_one('div.col-13 > section > div > table > tbody > tr:nth-child(1) > td:nth-child(2) > a').text
'CPRX'
>>> new_soup.select_one('div.col-13 > section > div > table > tbody > tr:nth-child(1) > td:nth-child(3) > a').text
'$3.43'
>>> new_soup.select_one('div.col-13 > section > div > table > tbody > tr:nth-child(1) > td:nth-child(4) > a').text
'0.59%'
>>> new_soup.select_one('div.col-13 > section > div > table > tbody > tr:nth-child(1) > td:nth-child(5) > a').text
'359.02M'
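
To get every symbol instead of only row 1, the nth-child(1) on the row can be swapped for a loop over all rows, and the results written out with the csv module. Again just a sketch under assumptions: extract_symbols and write_symbols are hypothetical names, the sample table is a cut-down imitation of the markup shown above (the second row and the name cells are invented), the selector is shortened to 'table tbody tr', and 'html.parser' is used in place of 'lxml'.

```python
import csv
import io

from bs4 import BeautifulSoup


def extract_symbols(html):
    """Return the text of the second cell of every table body row."""
    soup = BeautifulSoup(html, "html.parser")
    symbols = []
    for row in soup.select("table tbody tr"):
        cell = row.select_one("td:nth-child(2)")
        if cell is not None:
            symbols.append(cell.get_text(strip=True))
    return symbols


def write_symbols(collection, symbols, fileobj):
    """Write one (collection, symbol) line per symbol to an open text file."""
    writer = csv.writer(fileobj)
    for symbol in symbols:
        writer.writerow([collection, symbol])


# Imitation of the table markup; the XXXX row is invented for the example.
sample_html = """
<table><tbody>
  <tr><td>Example Name</td><td><a href="/stocks/CPRX"><div><span>CPRX</span></div></a></td><td>$3.43</td></tr>
  <tr><td>Another Name</td><td><a href="/stocks/XXXX"><div><span>XXXX</span></div></a></td><td>$1.00</td></tr>
</tbody></table>
"""

symbols = extract_symbols(sample_html)
buffer = io.StringIO()  # stands in for a real file in this demo
write_symbols("100-most-popular", symbols, buffer)
print(buffer.getvalue())
```

For a real file, pass open('symbols.csv', 'w', newline='') instead of the StringIO buffer, and call extract_symbols once per sublink.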
