I attempted to load this with Selenium, because the next page is created dynamically with JavaScript; if fetched any other way, it just redirects to the same page.
this will work with login and password, and click on button (with selenium)
That's more work than I'm willing to do, but you can use the code below as a starting point.
There's also an Event listener involved which I haven't worked with before, but I found this link:
https://stackoverflow.com/questions/3588...he-webpage
Code so far:
You will need this helper first: PrettifyPage.py. It creates an easy-on-the-eyes copy of the page fetched by Selenium, saved as LinkedinPage1.html in the script directory — useful for inspecting the JavaScript-rendered page.
# PrettifyPage.py
from bs4 import BeautifulSoup
import requests
import pathlib
class PrettifyPage:
    """Re-indent BeautifulSoup's prettify() output to a configurable width.

    soup.prettify() indents by a single space per nesting level; this class
    rescales that to ``indent`` spaces per level, clamping implausible jumps
    in depth so text nodes and malformed lines stay readable.
    """

    def __init__(self):
        pass

    def prettify(self, soup, indent):
        """Return ``soup.prettify()`` re-indented to ``indent`` spaces per level.

        soup:   any object with a ``prettify()`` method returning
                newline-separated markup (normally a BeautifulSoup instance).
        indent: desired number of spaces per indentation level.
        """
        # Collect lines and join once — avoids quadratic str += in a loop.
        pieces = []
        previous_indent = 0
        for line in soup.prettify().split("\n"):
            # prettify() emits one space per level, so the column of the
            # first '<' is the nesting depth of this line.
            current_indent = str(line).find("<")
            # Text nodes (no '<' found) and jumps of more than two levels
            # are clamped to one level deeper than the previous line.
            if current_indent == -1 or current_indent > previous_indent + 2:
                current_indent = previous_indent + 1
            previous_indent = current_indent
            pieces.append(self.write_new_line(line, current_indent, indent))
        return "".join(pieces)

    def write_new_line(self, line, current_indent, desired_indent):
        """Return ``line`` plus a newline, left-padded so its total leading
        space becomes ``current_indent * desired_indent`` columns (the line
        already carries ``current_indent`` leading spaces from prettify())."""
        spaces_to_add = (current_indent * desired_indent) - current_indent
        # max() guards against negative padding when desired_indent < 1.
        return " " * max(spaces_to_add, 0) + str(line) + "\n"
Start of the Selenium scraper:
from selenium import webdriver
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup
from pathlib import Path
import os
import time
import PrettifyPage
class GetLinkedinJobs:
    """Fetch a LinkedIn job-search results page with Selenium (Firefox) and
    save a prettified copy of the JS-rendered HTML next to this script."""

    def __init__(self):
        self.pp = PrettifyPage.PrettifyPage()
        # Anchor relative paths to the script's own directory so the output
        # file lands next to the script regardless of the caller's CWD.
        os.chdir(os.path.abspath(os.path.dirname(__file__)))
        self.homepath = Path('.')

    def start_browser(self):
        """Launch Firefox via geckodriver; keep the handle on self.browser."""
        # NOTE(review): DesiredCapabilities is deprecated in Selenium 4 —
        # consider webdriver.FirefoxOptions() when upgrading.
        caps = webdriver.DesiredCapabilities().FIREFOX
        caps["marionette"] = True
        self.browser = webdriver.Firefox(capabilities=caps)

    def stop_browser(self):
        """Shut down the browser session.

        Uses quit() rather than close(): close() only closes the current
        window and leaves the geckodriver process running (resource leak).
        """
        self.browser.quit()

    def save_pretty_page(self, soup):
        """Write a prettified copy of ``soup`` to LinkedinPage1.html."""
        save_pretty_filename = self.homepath / 'LinkedinPage1.html'
        print(f'self.save_pretty_filename: {save_pretty_filename.resolve()}')
        # Explicit UTF-8: the page contains non-ASCII text, and the platform
        # default encoding (e.g. cp1252 on Windows) could raise on write.
        with save_pretty_filename.open('w', encoding='utf-8') as fp:
            fp.write(self.pp.prettify(soup, 2))

    def get_page_info(self):
        """Load the job-search URL, wait for the JS render, save the page."""
        self.start_browser()
        url = 'https://www.linkedin.com/jobs/search?keywords=Data%20Science&location=United%20Kingdom&redirect=false&position=1&pageNum=0'
        self.browser.get(url)
        time.sleep(2)  # crude wait for the JavaScript-rendered content
        src = self.browser.page_source
        soup = BeautifulSoup(src, "lxml")
        self.save_pretty_page(soup)
        self.stop_browser()
if __name__ == '__main__':
    # Script entry point: run the scraper once and save the rendered page.
    GetLinkedinJobs().get_page_info()
Good luck