Scraping next page of LinkedIn jobs
#1
Hi All,

I am scraping LinkedIn to get all the job postings.

Using BeautifulSoup I am able to get the first 25 jobs from the first page.

Any help on how to go to the next page until the last job is fetched?

I am not able to get the link to the next page. You have to scroll to the bottom of the page to see it.
#2
It would be most helpful to see what you have coded so far.
There is a link at the end of the ul element that looks like:
Output:
<a class="jobs-search__results-create-alert-cta" data-impression-id="guest_job_search_create-job-alert-bottom-of-results" data-tracking-control-name="guest_job_search_create-job-alert-bottom-of-results" href="https://www.linkedin.com/uas/login?session_redirect=https%3A%2F%2Fwww%2Elinkedin%2Ecom%2Fjobs%2Fsearch%3Fkeywords%3DData%2520Science%26location%3DUnited%2520Kingdom%26trk%3Dhomepage-basic_recent-search%26redirect%3Dfalse%26position%3D1%26pageNum%3D0&amp;amp;emailAddress=&amp;amp;fromSignIn=&amp;trk=guest_job_search_create-job-alert-bottom-of-results" data-tracking-will-navigate=""><li-icon class="job-search__icon job-search__icon--bell" data-delayed-url="https://static-exp1.licdn.com/sc/p/com.linkedin.jobs-guest-frontend%3Ajobs-guest-frontend-static-content%2B0.0.1229/f/%2Fjobs-guest-frontend%2Fimages%2Fcommon%2Fbell-icon-blue.svg"></li-icon>Sign in to create a job alert</a>
You need to start with the tag that you used to get the ul tag, then do something like this (assuming the top tag is named section):
nextpagelink = section.find('a', {'data-tracking-control-name': 'guest_job_search_create-job-alert-bottom-of-results'})
nextpageurl = nextpagelink.get('href')
Then fetch nextpageurl with selenium and continue from there.
#3
Below is my code.

I tried adding your piece of code, but it took me to the login page instead of fetching the next page.

from bs4 import BeautifulSoup
import requests
import csv

session = requests.Session()
job_page = session.get('https://www.linkedin.com/jobs/search?keywords=Data%20Science&location=United%20Kingdom&redirect=false&position=1&pageNum=0')
soup = BeautifulSoup(job_page.content,'html.parser')

with open("Job_page.html", "w",encoding='utf-8') as file:
    file.write(str(soup.prettify()))

header = ['Company','Title','Location','posted date','applying link']
single_job = []
csvfile = open('Jobs.csv','a', newline='')
obj = csv.writer(csvfile)
obj.writerow(header)

for job_card in soup.find_all(class_ = 'result-card job-result-card result-card--with-hover-state'):
    try:
        job_company = job_card.find(class_ = 'result-card__subtitle-link job-result-card__subtitle-link').contents[0]
    except AttributeError:
        # no company link on this card, fall back to the plain subtitle
        job_company = job_card.find(class_ = 'result-card__subtitle job-result-card__subtitle').contents[0]
    finally:
        single_job.append(job_company)
    job_title = job_card.find(class_ = 'screen-reader-text').contents[0]
    single_job.append(job_title)
    job_location = job_card.find(class_ = 'job-result-card__location').contents[0]
    single_job.append(job_location)
    job_date = job_card.find('time').contents[0]
    single_job.append(job_date)
    job_link1 = job_card.find(class_ = 'result-card__full-card-link')
    job_link = job_link1.get('href')
    single_job.append(job_link)
    obj.writerow(single_job)
    single_job.clear()

csvfile.close()

nextpagelink = soup.find('a', {'data-tracking-control-name': 'guest_job_search_create-job-alert-bottom-of-results'})
nextpageurl = nextpagelink.get('href')
next_page = session.get(nextpageurl)
other_soup = BeautifulSoup(next_page.content,'html.parser')

with open("next_page.html", "w",encoding='utf-8') as file:
    file.write(str(other_soup.prettify()))
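To confirm where that request actually lands, a quick diagnostic is to print the final URL after redirects (this is just a check, not part of the script above):

# diagnostic: requests follows redirects, so next_page.url shows where we
# really ended up; a /login URL here means the link above is the sign-in
# link, not a next-page link
print(next_page.url)
print(next_page.status_code)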
#4
You need to start the search at a different point; I didn't look carefully enough the first time.

I'm watching a football game right now. I'll take a look after the game.
#5
I attempted to load this with selenium, because the next page is created dynamically with JavaScript; if not done that way, it redirects to the same page.
This will work with a login and password, and a click on the button (with selenium).
That's more work than I'm willing to do, but you can use the code below as a starting point.
There's also an event listener involved which I haven't worked with before, but I found this link: https://stackoverflow.com/questions/3588...he-webpage

Code so far:
You will need this module first. It writes an easy-on-the-eyes copy of the page fetched by selenium to LinkedinPage1.html (in the script directory), which is useful for inspecting the JavaScript-rendered page: PrettifyPage.py
# PrettifyPage.py


class PrettifyPage:
    def __init__(self):
        pass

    def prettify(self, soup, indent):
        pretty_soup = str()
        previous_indent = 0
        for line in soup.prettify().split("\n"):
            current_indent = str(line).find("<")
            if current_indent == -1 or current_indent > previous_indent + 2:
                current_indent = previous_indent + 1
            previous_indent = current_indent
            pretty_soup += self.write_new_line(line, current_indent, indent)
        return pretty_soup

    def write_new_line(self, line, current_indent, desired_indent):
        # pad the line to the desired indentation depth
        spaces_to_add = (current_indent * desired_indent) - current_indent
        new_line = " " * max(spaces_to_add, 0)
        new_line += str(line) + "\n"
        return new_line
Start of the selenium scraper:
from selenium import webdriver
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup
from pathlib import Path
import os
import time
import PrettifyPage


class GetLinkedinJobs:
    def __init__(self):
        self.pp = PrettifyPage.PrettifyPage()
        # make the script's directory the working directory
        os.chdir(os.path.abspath(os.path.dirname(__file__)))
        self.homepath = Path('.')

    def start_browser(self):
        # marionette is required for geckodriver with modern Firefox
        caps = webdriver.DesiredCapabilities().FIREFOX
        caps["marionette"] = True
        self.browser = webdriver.Firefox(capabilities=caps)

    def stop_browser(self):
        self.browser.close()

    def save_pretty_page(self, soup):
        save_pretty_filename = self.homepath / 'LinkedinPage1.html'
        print(f'save_pretty_filename: {save_pretty_filename.resolve()}')
        with save_pretty_filename.open('w', encoding='utf-8') as fp:
            fp.write(self.pp.prettify(soup, 2))

    def get_page_info(self):
        self.start_browser()
        url = 'https://www.linkedin.com/jobs/search?keywords=Data%20Science&location=United%20Kingdom&redirect=false&position=1&pageNum=0'
        self.browser.get(url)
        time.sleep(2)
        src = self.browser.page_source
        soup = BeautifulSoup(src,"lxml")
        self.save_pretty_page(soup)        
        self.stop_browser()

if __name__ == '__main__':
    glj = GetLinkedinJobs()
    glj.get_page_info()
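If you want to go after the 'See more jobs' button directly, here is an untested sketch of a method you could add to GetLinkedinJobs (the button text and the need to scroll first are assumptions on my part):

    def load_more_jobs(self, max_clicks=5):
        # untested: scroll to the bottom so the button is rendered, then
        # click it; locating the button by its visible text is an assumption
        for _ in range(max_clicks):
            self.browser.execute_script("window.scrollTo(0, document.body.scrollHeight);")
            time.sleep(2)
            try:
                button = self.browser.find_element(By.XPATH, "//button[contains(., 'See more jobs')]")
                button.click()
            except Exception:
                # button not found: end of results, or a login wall
                break

You would call it from get_page_info() after self.browser.get(url), before grabbing page_source.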
Good luck
#6
Sorry to say, I have no expertise in selenium either.

The above code throws some errors.

I am only trying to understand how that 'See more jobs' button works, since it fetches more data without reloading the page on LinkedIn.

How can the link be formed to get more job postings beyond the first 25?
#7
Any errors that you get with the code I supplied should only be for uninstalled packages, which you can easily get with pip.
If you just don't want to work with selenium, there is perhaps an alternative.
If you are logged into LinkedIn, this should work.
What I suggest is to alter the URL to proceed to the next page.
If you examine the URL that you provided, you will notice that it ends with &pageNum=0.
You may be able to modify the page number (again, if logged in to LinkedIn) and fetch the next page with requests.
Untested code:
def get_url(pageno):
    return f"https://www.linkedin.com/jobs/search?keywords=Data%20Science&location=United%20Kingdom&redirect=false&position=1&pageNum={pageno}"
You may also have to change the position number, as I don't know whether it refers to the first item to show or to the item number in the entire list; if the latter, it would need to be incremented by 25 per page.

Again, this may possibly work, but the only sure way is to render the JavaScript, which BeautifulSoup is incapable of doing.
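For example, an untested sketch that builds on get_url above (it assumes pageNum is really what advances the results; if it turns out to be position instead, adjust accordingly):

import requests
from bs4 import BeautifulSoup

session = requests.Session()
for pageno in range(10):
    page = session.get(get_url(pageno))
    soup = BeautifulSoup(page.content, 'html.parser')
    job_cards = soup.find_all(class_='result-card job-result-card result-card--with-hover-state')
    if not job_cards:
        # no more results, or we were bounced to a login page
        break
    print(f'page {pageno}: {len(job_cards)} jobs')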

