How to Build a LinkedIn Scraper in Python [No Headless Browser Needed]
Hi guys! Our team recently wrote this article and I hope you find it a useful tutorial! If you have any questions, please just shoot me a message.

Original article: [LinkedIn Scraper with Python](https://www.scraperapi.com/blog/linkedin...er-python/) on ScraperAPI.

LinkedIn is a huge source of data that’s publicly available to users and non-users alike and, at the time of writing, it’s legal to scrape. However, as the [2019 LinkedIn vs. HiQ case](https://jeremiahtang.medium.com/scraping...aafc93ba41) showed, that doesn’t mean LinkedIn is comfortable with it.

For that reason, in this article we’ll show you how to build a web scraper that doesn’t infringe any privacy policies and doesn’t require a headless browser to access data behind a login wall, which, although not illegal, could be considered unethical.

Instead, we’ll extract the job title, hiring company, location, and link to the job listing using Requests and Beautiful Soup, and export the data to a CSV file for later analysis or use.

**Note:** If you’re more proficient in JavaScript, we have a tutorial on [building a LinkedIn scraper using Node.js and Cheerio](https://www.scraperapi.com/uncategorized...craper/%5C) you can check.

### 1. Setting Up Our Project

We’ll start by installing all the dependencies we’ll be using for this project. Assuming you already have Python 3 installed, open VS Code (or your favorite text editor) and open a new terminal window. From there, use the following commands to install the libraries:

- Requests: pip3 install requests
- Beautiful Soup: pip3 install beautifulsoup4
- CSV: Python comes with a CSV module ready to use

With our dependencies installed, let’s create a new file, name it linkedin_python.py, and import the libraries at the top:

```python
import csv
import requests
from bs4 import BeautifulSoup
```

### 2. Using Chrome DevTools to Understand LinkedIn’s Site Structure

Now that our file is ready to go, let’s explore our target website first. Navigate to the homepage at [https://www.linkedin.com/](https://www.linkedin.com/) from an InPrivate browser window (Incognito in Chrome) and click on _jobs_ at the top of the page.


It will send us directly to the job search result page where we can create a new search. For the sake of this example, let’s say that we’re trying to build a list of product management jobs in San Francisco.


At a glance, it seems like every job’s data is inside a card-like container, and sure enough, after inspecting the page (right-click > Inspect), we can see that every job result is wrapped in a <li> tag inside a <ul> element.


So, a first approach would be to grab the <ul> element and iterate through every <li> tag inside to extract the data we’re looking for.
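
If the results were all in a single response, a rough sketch of that first approach could look like the following (this assumes a soup object like the one we build in step 4; the tag names come from the inspection above):

```python
# Rough sketch of the first approach: grab the results <ul> and walk its <li> children.
# Assumes `soup` is a Beautiful Soup object built from the search results page (see step 4).
results_list = soup.find('ul')
for item in results_list.find_all('li'):
    print(item.get_text(strip=True)[:80])  # quick peek at each job card's text
```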


But there’s a problem: to load more jobs, LinkedIn uses infinite scrolling pagination, which means there’s no “next page” button to grab the next page’s URL from, nor does the URL itself change.

In cases like this, we can use a headless browser like [Selenium](https://www.selenium.dev/) to access the site, extract the data and then scroll down to reveal the new data.

Of course, as we previously stated, we’re not doing that. Instead, let’s outsmart the website by using the _Network Tab_ inside DevTools.

### 3. Using the DevTool’s Network Tab

With DevTools open, we can navigate to the _[Network Tab](https://developer.chrome.com/docs/devtools/network/)_ from the dropdown menu at the top of the window.


To populate the report, just reload the page and you’ll be able to see all the fetch requests the browser is running to render the data on the page. After scrolling to the bottom, the browser sends a new request to LinkedIn’s jobs-guest/jobs/api/seeMoreJobPostings endpoint.


Let’s try this new URL in our browser to see where it takes us.


Perfect! This page has all the information we want, right there for the taking. An additional finding is that this URL has a structure we can manipulate really easily: just by changing the value of the start parameter, we can access new data.

To put this to the test, let’s change the value to 0, which is the starting value in the organic URL:

https://www.linkedin.com/jobs/search?key...pageNum=0


And yes, that did the trick; we can confirm it because the first job on each page is the same.

Experimenting is crucial for web scraping, so here are a few more things we tried before settling for this solution:

- Changing the pageNum parameter doesn’t change anything on the page.
- The start parameter increases by 25 for every new URL. We found this out by scrolling down the page and comparing the fetch requests the site itself sends (see the sketch after this list).
- Increasing the start parameter by 1 (so start=2, start=3, and so on) shifts the results, dropping the earlier job listings from the page – which is not what we want.
- The current last page is start=975. It returns a 404 page when hitting 1000.
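
As a quick illustration of those offsets, here’s a minimal sketch of the start values we’d expect to loop through (the 1,000 cutoff is based on the 404 we just observed):

```python
# Offsets accepted by the jobs endpoint, based on the observations above:
# 0, 25, 50, ..., 975 (requests at start=1000 returned a 404 for us).
for start in range(0, 1000, 25):
    print(start)
```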

With our initial URL in hand, we can move to the next step.

### 4. Parsing LinkedIn Using Requests and Beautiful Soup

Sending a request and parsing the returned response is super simple in Python. First, let’s create a variable containing our initial URL and pass it to the requests.get() method. Then, we’ll store the returned response in a variable called response to create our Python object. For testing, let’s print response:

```python
url = 'https://www.linkedin.com/jobs-guest/jobs...p;start=0'

response = requests.get(url)
print(response)
```


Awesome. A 200 status code indicates a successful HTTP request.

Before we can start extracting any data, we’ll need to parse the raw HTML data to make it easier to navigate using CSS selectors. To do so, all we need is to create a new Beautiful Soup object by passing response.content as the first argument, and our parser method as the second argument:

```python
soup = BeautifulSoup(response.content, 'html.parser')
```

Because testing should be part of our development process, let’s do a simple experiment with our new soup object by selecting and printing the first job title on the page, which we already know is wrapped inside an <h3> tag with the class base-search-card__title.

```python
job_title = soup.find('h3', class_='base-search-card__title').text
print(job_title)
```

soup.find does exactly what it says: it finds the first element inside our Beautiful Soup object that matches the parameters we passed. By adding .text at the end, it returns only the text inside the element, without the surrounding HTML.


To remove the whitespace around the text, all we need to do is add the .strip() method at the end of the string:
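
With that tweak, the test snippet above becomes:

```python
# Same test as before, now stripping the surrounding whitespace.
job_title = soup.find('h3', class_='base-search-card__title').text.strip()
print(job_title)
```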

### 5. Handling Multiple Pages Using Conditions

Here’s where things get a little tricky, but we’ve already done the hardest part: figuring out how to move through the pages. In simple words, all we need to do is to create some logic to change the start parameter in our URL.

In an earlier article, we talked about [scraping paginated pages in Scrapy](https://www.scraperapi.com/blog/how-to-d...full-code/), but with Beautiful Soup, we’ll do something different.

For starters, we’ll define a new function that will contain the entirety of our code, and pass webpage and page_number as arguments; we’ll use these two arguments to build the URL for the HTTP request.

```python
def linkedin_scraper(webpage, page_number):
    next_page = webpage + str(page_number)
    print(str(next_page))
    response = requests.get(str(next_page))
    soup = BeautifulSoup(response.content, 'html.parser')
```

In the next_page variable we’re combining both arguments: webpage is a string, and page_number, being a number, needs to be turned into a string before we pass the resulting URL to Requests.

For the next step to make sense, we need to understand that our scraper will:

- Create the new URL
- Send the HTTP request
- Parse the response
- Extract the data
- Send it to a CSV file
- Increase the start parameter
- Repeat until it breaks

To increase the start parameter in a loop, we’ll create an if condition:

```python
if page_number < 25:
    page_number = page_number + 25
    linkedin_scraper(webpage, page_number)
```

What we’re saying here is that as long as page_number is lower than 25 (so once it reaches 25 or higher, the recursion stops), page_number will increase by 25 and the new number gets passed back into our function.

Why 25 you ask? Because before going all in, we want to make sure that our logic works with a simple test.

```python
# Inside linkedin_scraper(), right after building the soup object:
print(response)
print(page_number)

if page_number < 25:
    page_number = page_number + 25
    linkedin_scraper(webpage, page_number)

# Outside the function, kick everything off:
linkedin_scraper('https://www.linkedin.com/jobs-guest/jobs...mp;start=', 0)
```

We’re going to print the response status code and page_number to verify that we’re accessing both pages. To run our code, we’ll need to call our function with the relevant parameters.

**Note:** In the function call, we separated the start parameter from its value. Its value needs to be a number so we can increase it in the if statement.

We’ve also added a print() statement for the new URL created, just to verify that everything is working correctly.

### 6. Testing Our Selectors

We already found the elements and classes we’re going to use for our parser. Nevertheless, it’s always a good idea to test them outside the script to avoid unnecessary requests to the server.

Right inside the DevTools console, we can use the document.querySelectorAll() method to test each CSS selector from the browser. For the job title, for example, that’s document.querySelectorAll('h3.base-search-card__title').

It returns a NodeList of 25, which matches the number of jobs on the page. We can do the same for the rest of our targets:

- Job title: 'h3', class_='base-search-card__title'
- Company: 'h4', class_='base-search-card__subtitle'
- Location: 'span', class_='job-search-card__location'
- URL: 'a', class_='base-card__full-link'

Notice that we changed the syntax to match the [.find()](https://www.w3schools.com/python/ref_string_find.asp) method. If you want to keep using CSS-style selectors, you can use the [.select()](https://www.projectpro.io/recipes/select...eb%20pages.) function instead.
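
If you’d rather keep CSS-style selectors in Python too, here’s a minimal sketch using .select() (the selector strings simply combine the tags and classes listed above):

```python
# Equivalent lookups with Beautiful Soup's .select(), which takes CSS selectors.
titles = soup.select('h3.base-search-card__title')
companies = soup.select('h4.base-search-card__subtitle')
print(len(titles), len(companies))  # expect 25 of each per page
```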

### 7. Extracting LinkedIn Job Data

Extracting the data is as simple as selecting all the parent elements that are wrapping our data and then looping through them to extract the information we want.

Inside each <li> element there’s a <div> with a class we can target.


To access the data inside, let’s create a new variable to pick all these <div>s.

```python
jobs = soup.find_all('div', class_='base-card relative w-full hover:no-underline focus:no-underline base-card--link base-search-card base-search-card--link job-search-card')
```

Now that we have a list of <div>s, we can create our [for loop](https://www.w3schools.com/python/python_for_loops.asp) using the CSS selectors we chose:

```python
for job in jobs:
    job_title = job.find('h3', class_='base-search-card__title').text.strip()
    job_company = job.find('h4', class_='base-search-card__subtitle').text.strip()
    job_location = job.find('span', class_='job-search-card__location').text.strip()
    job_link = job.find('a', class_='base-card__full-link')['href']
```

**Note:** If you feel like we’re moving too fast, we recommend reading our [Python web scraping tutorial for beginners](https://www.scraperapi.com/blog/web-scraping-python/). It goes into more detail on this process.

### 8. Sending Extracted Data to a CSV File

Outside our main function, we’ll open a new file, create a new writer and tell it to create our heading row using the .writerow() method:

```python
file = open('linkedin-jobs.csv', 'a')
writer = csv.writer(file)
writer.writerow(['Title', 'Company', 'Location', 'Apply'])
```

Since we’ll want to keep adding new rows to the file, we need to open it in _append mode_, hence the 'a' as the second argument in the open() function.
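
Here’s a quick side check (not part of the article’s script, and assuming the file already exists) to see append mode at work:

```python
# Count how many rows the CSV has accumulated across runs;
# with 'a' the count keeps growing, while 'w' would reset the file each run.
with open('linkedin-jobs.csv') as f:
    print(sum(1 for _ in f), 'rows so far')
```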

Now we’ll add a new row with the data extracted by our parser. We just need to add this snippet at the end of the for loop:

```python
writer.writerow([
    job_title.encode('utf-8'),
    job_company.encode('utf-8'),
    job_location.encode('utf-8'),
    job_link.encode('utf-8')
])
```

At the end of each iteration through the list of jobs, our scraper will append all the data into a new row.

**Note:** It’s important to make sure that we add the new data in the same order as our headings.

To finish this step, let’s add an else statement to close the file once the loop breaks.

```python
else:
    file.close()
    print('File closed')
```

### 9. Using ScraperAPI to Avoid Getting Blocked

Our last step is optional but can save you hours of work in the long run. After all, we’re not trying to scrape just one or two pages. To scale your project, you’ll need to handle IP rotations, manage a pool of proxies, handle CAPTCHAs, and send the proper headers just to avoid getting blocked or even banned for life.

ScraperAPI can handle these challenges and more with just a few changes to the base URL we’re using right now. All you need to do is [create a new free ScraperAPI account](https://www.scraperapi.com/signup) to get access to your API key.

From there, the only thing we have to do is add this string at the beginning of our URL:

```
http://api.scraperapi.com?api_key={YOUR_API_KEY}&url=
```

Resulting in the following function call:

```python
linkedin_scraper('http://api.scraperapi.com?api_key=51e43be283e4db2a5afb62660xxxxxxx&url=https://www.linkedin.com/jobs-guest/jobs/api/seeMoreJobPostings/search?keywords=Product%20Management&location=San%20Francisco%20Bay%20Area&geoId=90000084&trk=public_jobs_jobs-search-bar_search-submit&position=1&pageNum=0&start=', 0)
```

With that, our HTTP request will be processed by ScraperAPI’s server. It’ll rotate our IP after every request and choose the right Headers based on years of statistical analysis and machine learning.

In addition, ScraperAPI has ultra-premium proxies chosen specifically to handle really hard websites like LinkedIn. Using them is as simple as adding the ultra_premium=true parameter to our request.
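
As a rough sketch (the exact parameter placement is an assumption; check ScraperAPI’s documentation for the definitive format), the prefix from above would become:

```python
# Hypothetical prefix with ultra-premium proxies enabled; {YOUR_API_KEY} is a placeholder.
api_prefix = 'http://api.scraperapi.com?api_key={YOUR_API_KEY}&ultra_premium=true&url='
# Prepend it to the LinkedIn URL passed into linkedin_scraper(), exactly like the
# api_key-only prefix shown earlier.
```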

### Wrapping Up: Full Code

If you’ve followed along, here’s how your code should look:

```python
import csv
import requests
from bs4 import BeautifulSoup

file = open('linkedin-jobs.csv', 'a')
writer = csv.writer(file)
writer.writerow(['Title', 'Company', 'Location', 'Apply'])

def linkedin_scraper(webpage, page_number):
    next_page = webpage + str(page_number)
    print(str(next_page))
    response = requests.get(str(next_page))
    soup = BeautifulSoup(response.content, 'html.parser')

    jobs = soup.find_all('div', class_='base-card relative w-full hover:no-underline focus:no-underline base-card--link base-search-card base-search-card--link job-search-card')
    for job in jobs:
        job_title = job.find('h3', class_='base-search-card__title').text.strip()
        job_company = job.find('h4', class_='base-search-card__subtitle').text.strip()
        job_location = job.find('span', class_='job-search-card__location').text.strip()
        job_link = job.find('a', class_='base-card__full-link')['href']

        writer.writerow([
            job_title.encode('utf-8'),
            job_company.encode('utf-8'),
            job_location.encode('utf-8'),
            job_link.encode('utf-8')
        ])

        print('Data updated')

    if page_number < 25:
        page_number = page_number + 25
        linkedin_scraper(webpage, page_number)
    else:
        file.close()
        print('File closed')

linkedin_scraper('https://www.linkedin.com/jobs-guest/jobs...mp;start=', 0)
```

We added a few print() statements as visual feedback. After running the code, the script will create a CSV file with the scraped data.
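
To sanity-check the output, you can load the file back, for example with pandas (an optional extra dependency that the article itself doesn’t use):

```python
import pandas as pd

# Load the CSV the scraper produced and take a quick look at it.
df = pd.read_csv('linkedin-jobs.csv')
print(df.head())        # first few scraped jobs
print(len(df), 'rows')  # total rows written so far
```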


From here, you can increase the limit in the if condition to scrape more pages, change the keywords parameter to loop through different queries, and/or change the location parameter to scrape the same job title across different cities and countries (see the sketch below for the keywords idea).
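
Here’s a small sketch of that idea (the extra keyword is a made-up example, and the URL template simply mirrors the guest-API URL used above, with the location left fixed):

```python
from urllib.parse import quote

# Template mirroring the guest jobs API URL used above; the keyword gets URL-encoded.
# Note: geoId is tied to the location, so it would also need to change if you switch cities.
base = ('https://www.linkedin.com/jobs-guest/jobs/api/seeMoreJobPostings/search'
        '?keywords={kw}&location=San%20Francisco%20Bay%20Area&geoId=90000084&start=')

for kw in ['Product Management', 'Data Analyst']:  # hypothetical example queries
    print(base.format(kw=quote(kw)))
    # Each URL could then be passed to linkedin_scraper(url, 0); if you run several
    # queries, move the file-closing logic out of the function first.
```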

Remember that if the project is larger than just a couple of pages, using ScraperAPI’s integration will help you avoid roadblocks and keep your IP safe from anti-scraping systems.

Until next time, happy scraping!
Larz60+ wrote Jul-06-2022, 04:31 PM:
Please post all code, output and errors (in its entirety) between their respective tags. Refer to the BBCode help topic on how to post. Use the "Preview Post" button to make sure the code is presented as you expect before hitting the "Post Reply/Thread" button.