Python Forum
Web Scraping Inquiry (Extracting content from a table in a subdomain)
#1
Good Day,

I am very new to Python and am just now learning Beautiful Soup and web scraping methods. I wanted to practice one method by automating a very tedious process in which the script does the following:

1. Goes to this link: https://mattrode.com/blog/robinhood-collections-list/

2. Goes to each of the sub-links (numbered 1 to 191), for example https://robinhood.com/collections/100-most-popular

3. Extracts the symbols from the table on that page and saves them to a .csv (or text) file.

It should do this for every one of the links and every symbol within each link.

I have searched online for how to do this, but I'm really not sure where to start or what to look for exactly.

Two things I do not yet know how to do: 1. navigate to a sub-link and return to the original page to move on to the next one, and 2. extract elements from a table as complex as the ones in the linked pages.

If anyone could assist me with this I would be grateful.
#2
Here is some help to get you started since you are new to this; we like it most when you give it a try and post code, even if it's all wrong.
Look at the Web-Scraping part-1 tutorial.
This site also needs a User-Agent header (copy an agent string from your browser), or the request will not get through.
import requests
from bs4 import BeautifulSoup

# The site blocks requests without a browser User-Agent header
url = 'https://mattrode.com/blog/robinhood-collections-list/'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.125 Safari/537.36'}
response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.content, 'lxml')
# The collection links live in an ordered list (<ol>)
ol_tag = soup.find('ol')
Some tests in the REPL:
>>> first_tag = ol_tag.find('li')
>>> first_tag
<li><a href="https://robinhood.com/collections/100-most-popular">100 Most Popular Stocks</a></li>
>>> first_tag.a
<a href="https://robinhood.com/collections/100-most-popular">100 Most Popular Stocks</a>
>>> first_tag.a.get('href')
'https://robinhood.com/collections/100-most-popular'
>>> first_tag.text
'100 Most Popular Stocks'
Test and break stuff in small pieces when you are learning this.
As you see, I get the addresses of the sub-links; they can be opened and read the same way.
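If you want all the links and not just the first, a rough sketch (untested) could loop over every li tag in the same ol:
# Rough sketch (not tested): collect every collection link from the list
links = [li.a.get('href') for li in ol_tag.find_all('li')]
print(len(links))   # should be around 191
print(links[:3])    # first few links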
#3
This code gets me the first link:

import requests
from bs4 import BeautifulSoup

# Get info from main site
url = 'https://mattrode.com/blog/robinhood-collections-list/'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.125 Safari/537.36'}
response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.content, 'lxml')
ol_tag = soup.find('ol')
li_tag = ol_tag.find('li')
link = li_tag.a.get('href')
print(link)
How do I go about going into the link to extract the stock ticker symbol?
Reply
#4
(Aug-17-2020, 03:35 AM)DustinKlent Wrote: How do I go about going into the link to extract the stock ticker symbol?
You can open that link the same way as you did with the first URL.
You could also make a function for opening URLs to avoid repeating the code.
import requests
from bs4 import BeautifulSoup

# Get info from main site
url = 'https://mattrode.com/blog/robinhood-collections-list/'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.125 Safari/537.36'}
response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.content, 'lxml')
ol_tag = soup.find('ol')
li_tag = ol_tag.find('li')
link = li_tag.a.get('href')
#print(link)
# Open the first collection link and parse it the same way
response = requests.get(link, headers=headers)
new_soup = BeautifulSoup(response.content, 'lxml')
Testing the selectors in the REPL:
>>> new_soup.select_one('div.col-13 > section > div > table > tbody > tr:nth-child(1) > td:nth-child(2) > a')
<a class="rh-hyperlink qD5a4psv-CV7GnWdHxLvn AaXTyP3x99eRIDW0ExfYP" href="/stocks/CPRX" rel=""><div><span>CPRX</span></div></a>
>>> new_soup.select_one('div.col-13 > section > div > table > tbody > tr:nth-child(1) > td:nth-child(2) > a').text
'CPRX'
>>> new_soup.select_one('div.col-13 > section > div > table > tbody > tr:nth-child(1) > td:nth-child(3) > a').text
'$3.43'
>>> new_soup.select_one('div.col-13 > section > div > table > tbody > tr:nth-child(1) > td:nth-child(4) > a').text
'0.59%'
>>> new_soup.select_one('div.col-13 > section > div > table > tbody > tr:nth-child(1) > td:nth-child(5) > a').text
'359.02M'
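To put it all together, here is a rough sketch (untested) that wraps the fetching in a function, loops over every collection link, and writes the symbols to a .csv file. The CSS selector is taken from the test above and may need adjusting if the page layout differs between collections:
import csv
import time

import requests
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.125 Safari/537.36'}

def get_soup(url):
    # Fetch a page and return it parsed with lxml
    response = requests.get(url, headers=headers)
    return BeautifulSoup(response.content, 'lxml')

# Collect all collection links from the blog post
soup = get_soup('https://mattrode.com/blog/robinhood-collections-list/')
links = [li.a.get('href') for li in soup.find('ol').find_all('li')]

with open('symbols.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['collection', 'symbol'])
    for link in links:
        collection_soup = get_soup(link)
        # Symbols sit in the second column of the table (selector may need adjusting)
        for a_tag in collection_soup.select('table tbody tr > td:nth-child(2) > a'):
            writer.writerow([link, a_tag.text])
        time.sleep(1)  # be polite to the server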