Hi All,
I am scraping LinkedIn to get all the job postings.
Using BeautifulSoup I am able to get the first 25 jobs from the first page.
Any help on how to go to the next page until the last job is fetched?
I am not able to get the link of the next page.
Scroll to the bottom of this page.
It would be most helpful to see what you have coded so far.
There is a link at the end of the ul element that looks like:
Output:
<a class="jobs-search__results-create-alert-cta" data-impression-id="guest_job_search_create-job-alert-bottom-of-results" data-tracking-control-name="guest_job_search_create-job-alert-bottom-of-results" href="https://www.linkedin.com/uas/login?session_redirect=https%3A%2F%2Fwww%2Elinkedin%2Ecom%2Fjobs%2Fsearch%3Fkeywords%3DData%2520Science%26location%3DUnited%2520Kingdom%26trk%3Dhomepage-basic_recent-search%26redirect%3Dfalse%26position%3D1%26pageNum%3D0&amp;emailAddress=&amp;fromSignIn=&trk=guest_job_search_create-job-alert-bottom-of-results" data-tracking-will-navigate=""><li-icon class="job-search__icon job-search__icon--bell" data-delayed-url="https://static-exp1.licdn.com/sc/p/com.linkedin.jobs-guest-frontend%3Ajobs-guest-frontend-static-content%2B0.0.1229/f/%2Fjobs-guest-frontend%2Fimages%2Fcommon%2Fbell-icon-blue.svg"></li-icon>Sign in to create a job alert</a>
You need to start with the tag that you used to get the ul tag. Then something like (assuming the top tag is named section):
nextpagelink = section.find('a', {'data-tracking-control-name': 'guest_job_search_create-job-alert-bottom-of-results'})
nextpageurl = nextpagelink.get('href')
Then fetch nextpageurl with selenium and continue from there.
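As a self-contained illustration of that lookup (the HTML here is a trimmed stand-in for the real page, so treat this as a sketch, not the actual LinkedIn markup):

```python
from bs4 import BeautifulSoup

# Trimmed stand-in for the section at the end of the results <ul>.
html = """
<section>
  <a data-tracking-control-name="guest_job_search_create-job-alert-bottom-of-results"
     href="https://www.linkedin.com/uas/login?session_redirect=...">Sign in to create a job alert</a>
</section>
"""

section = BeautifulSoup(html, "html.parser").find("section")
nextpagelink = section.find(
    "a",
    {"data-tracking-control-name": "guest_job_search_create-job-alert-bottom-of-results"},
)
nextpageurl = None
if nextpagelink is not None:  # guard: the anchor may be absent on some pages
    nextpageurl = nextpagelink.get("href")
    print(nextpageurl)
```

The `is not None` guard matters because `find` returns `None` on a miss, and calling `.get` on that raises `AttributeError`.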
Below is my code.
I tried adding your piece of code, but it took me to the login page instead of fetching the next page.
from bs4 import BeautifulSoup
import requests
import csv

session = requests.Session()
job_page = session.get('https://www.linkedin.com/jobs/search?keywords=Data%20Science&location=United%20Kingdom&redirect=false&position=1&pageNum=0')
soup = BeautifulSoup(job_page.content, 'html.parser')

with open("Job_page.html", "w", encoding='utf-8') as file:
    file.write(str(soup.prettify()))

header = ['Company', 'Title', 'Location', 'posted date', 'applying link']
single_job = []
csvfile = open('Jobs.csv', 'a', newline='')
obj = csv.writer(csvfile)
obj.writerow(header)

for job_card in soup.find_all(class_='result-card job-result-card result-card--with-hover-state'):
    try:
        job_company = job_card.find(class_='result-card__subtitle-link job-result-card__subtitle-link').contents[0]
    except:
        job_company = job_card.find(class_='result-card__subtitle job-result-card__subtitle').contents[0]
    finally:
        single_job.append(job_company)
    job_title = job_card.find(class_='screen-reader-text').contents[0]
    single_job.append(job_title)
    job_location = job_card.find(class_='job-result-card__location').contents[0]
    single_job.append(job_location)
    job_date = job_card.find('time').contents[0]
    single_job.append(job_date)
    job_link1 = job_card.find(class_='result-card__full-card-link')
    job_link = job_link1.get('href')
    single_job.append(job_link)
    obj.writerow(single_job)
    single_job.clear()

nextpagelink = soup.find('a', {'data-tracking-control-name': 'guest_job_search_create-job-alert-bottom-of-results'})
nextpageurl = nextpagelink.get('href')
next_page = session.get(nextpageurl)
other_soup = BeautifulSoup(next_page.content, 'html.parser')
with open("next_page.html", "w", encoding='utf-8') as file:
    file.write(str(other_soup.prettify()))
You need to start the search at a different point; I didn't look carefully enough the first time.
I'm watching a football game right now; I'll take a look after the game.
I attempted to load this with selenium, because the next page is created dynamically with JavaScript; if not done that way, it redirects to the same page.
This will work with a login and password and a click on the button (with selenium).
That's more work than I'm willing to do, but you can use the code below as a starting point.
There's also an event listener involved, which I haven't worked with before, but I found this link:
https://stackoverflow.com/questions/3588...he-webpage
Code so far:
You will need this module, PrettifyPage.py. It is used to create an easy-on-the-eye copy of the page fetched by selenium, saved as LinkedinPage1.html in the script directory, which is useful for inspecting the JavaScript-rendered page.
# PrettifyPage.py


class PrettifyPage:
    def __init__(self):
        pass

    def prettify(self, soup, indent):
        pretty_soup = str()
        previous_indent = 0
        for line in soup.prettify().split("\n"):
            current_indent = str(line).find("<")
            if current_indent == -1 or current_indent > previous_indent + 2:
                current_indent = previous_indent + 1
            previous_indent = current_indent
            pretty_soup += self.write_new_line(line, current_indent, indent)
        return pretty_soup

    def write_new_line(self, line, current_indent, desired_indent):
        new_line = ""
        spaces_to_add = (current_indent * desired_indent) - current_indent
        if spaces_to_add > 0:
            for i in range(spaces_to_add):
                new_line += " "
        new_line += str(line) + "\n"
        return new_line
Start of the selenium scraper:
from selenium import webdriver
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup
from pathlib import Path
import os
import time

import PrettifyPage


class GetLinkedinJobs:
    def __init__(self):
        self.pp = PrettifyPage.PrettifyPage()
        # assert starting directory
        os.chdir(os.path.abspath(os.path.dirname(__file__)))
        self.homepath = Path('.')

    def start_browser(self):
        caps = webdriver.DesiredCapabilities().FIREFOX
        caps["marionette"] = True
        self.browser = None
        self.browser = webdriver.Firefox(capabilities=caps)

    def stop_browser(self):
        self.browser.close()

    def save_pretty_page(self, soup):
        save_pretty_filename = self.homepath / 'LinkedinPage1.html'
        print(f'self.save_pretty_filename: {save_pretty_filename.resolve()}')
        with save_pretty_filename.open('w') as fp:
            fp.write(self.pp.prettify(soup, 2))

    def get_page_info(self):
        self.start_browser()
        url = 'https://www.linkedin.com/jobs/search?keywords=Data%20Science&location=United%20Kingdom&redirect=false&position=1&pageNum=0'
        self.browser.get(url)
        time.sleep(2)
        src = self.browser.page_source
        soup = BeautifulSoup(src, "lxml")
        self.save_pretty_page(soup)
        self.stop_browser()


if __name__ == '__main__':
    glj = GetLinkedinJobs()
    glj.get_page_info()
Good luck
Sorry to say, I have no expertise in selenium either.
The above code throws some errors.
I am only trying to understand how that 'See more jobs' button works, since it fetches more data without reloading the page on LinkedIn.
How can the link be formed to get more job postings beyond the first 25?
Any errors that you get with the code I supplied should only be for uninstalled packages, which you can easily get with pip.
If you just don't want to work with selenium, there is perhaps an alternative.
If you are logged into LinkedIn, this should work.
What I suggest is to alter the URL to proceed to the next page.
If you examine the URL that you provide, you will notice that it ends with: &pageNum=0
You may be able to modify the page number (again if logged in to Linkedin) and fetch the next page with requests.
Untested code:
def get_url(pageno):
    return f"https://www.linkedin.com/jobs/search?keywords=Data%20Science&location=United%20Kingdom&redirect=false&position=1&pageNum={pageno}"
You may also have to change the position number; I don't know whether it refers to the first item to show or to the item's number within the entire list. If the latter, it would increment by 25 per page.
Again, this may work, but the only sure way is to render the JavaScript, which BeautifulSoup is incapable of doing.
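Putting that together, a minimal pagination sketch. This is untested against LinkedIn: that incrementing pageNum actually advances guest results is an assumption, and a real run would still need the CSV-writing parse loop from the earlier script applied to each page.

```python
import requests


def get_url(pageno):
    # Same search URL as above, with pageNum parameterised.
    return (
        "https://www.linkedin.com/jobs/search?keywords=Data%20Science"
        f"&location=United%20Kingdom&redirect=false&position=1&pageNum={pageno}"
    )


def fetch_pages(max_pages=3):
    """Fetch successive result pages, stopping on the first non-200 response."""
    session = requests.Session()
    pages = []
    for pageno in range(max_pages):
        resp = session.get(get_url(pageno))
        if resp.status_code != 200:
            break
        pages.append(resp.text)
    return pages
```

If the position parameter turns out to be the absolute item index, `position=1` would become `position={pageno * 25 + 1}` in `get_url`.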