May-20-2019, 10:59 PM
Hi, I wrote something like this. I'm not sure how reliable it will be. It finds the "Next" (next page) button using its XPath (which changed once while I was testing, for some reason, but then went back to the value written in the code, which is '//*[@id="ember282"]'). Anyway, I tested it and it worked alright except for the one time described above; the Excel file was created.
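Since that "ember" id is auto-generated by Ember.js and can change between page loads, a locator keyed to a stable attribute might hold up better. This is only a sketch, not something I've verified against the live page; it assumes LinkedIn puts aria-label="Next" on the pagination button:

```python
# Hypothetical alternative locator for the "Next" button. The id
# "ember282" is auto-generated and unstable across page loads; an
# aria-label (if LinkedIn sets one) should be more durable.
NEXT_BUTTON_XPATH = '//button[@aria-label="Next"]'

# Would be used in the scraping loop in place of the hardcoded ember id:
# br.find_element_by_xpath(NEXT_BUTTON_XPATH).click()
```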
It uses BeautifulSoup, Selenium, and XlsxWriter, which you can install by opening your command prompt (possibly as administrator, if you have Python on your C: partition like me) and executing:
py -3 -m pip install beautifulsoup4 selenium xlsxwriter
You'll also have to install the Google Chrome driver (I have version 74).
Once you run it, it will ask you to log in; when the search results appear, press Enter as described in the output:
Output:
Login, wait for page with search to load and press enter...
''' 1st part of the code to download the data '''
from selenium import webdriver
from bs4 import BeautifulSoup
from time import sleep, time

url = "https://www.linkedin.com/people/search?firstName=Adam&lastName=Smith&trk=homepage-jobseeker_people-search-bar_search-submit"

def get_src():
    # used to refresh the content
    global br  # browser object
    # get some kind of "root" element in order to get the actual source
    return br.find_element_by_xpath("//*").get_attribute("outerHTML")

def scroll_consistently(seconds):
    s = time()
    while time() - seconds < s:
        br.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        sleep(0.2)

def separate_odd_even_elements(l):
    '''
    Input = [7,2,1,3]
    Output = [ [7,1], [2,3] ]
    '''
    a = []
    b = []
    for i in range(0, len(l), 2):
        a.append(l[i])
        b.append(l[i + 1])
    return [a, b]

def get_city_name_from_location(loc):
    if ',' in loc:
        return loc.split(',')[0]
    return loc

br = webdriver.Chrome()
br.get(url)
input("Login, wait for page with search to load and press enter...")

total_first_names = []
total_surnames = []
total_workinfos = []
total_locations = []

# search results are displayed on many pages
# loop below will get the results from 10 pages
for i in range(10):
    # scroll to the bottom (otherwise some results are not displayed)
    # repeat it multiple times during 4 seconds
    scroll_consistently(4)
    # get fresh page source
    soup = BeautifulSoup(get_src(), "html.parser")
    # find the data
    names = [e.get_text() for e in soup.find_all('span', class_='name actor-name')]
    workinfos_with_locations = [e.get_text() for e in soup.find_all('span', {'dir': 'ltr'})]
    # just to make sure that there are 2 (work info + location) per 1 name
    assert (len(names) * 2) == len(workinfos_with_locations)
    # separate workinfos and locations from 1 list into 2
    workinfos, locations = separate_odd_even_elements(workinfos_with_locations)
    # extract first names (first word of name = first name)
    total_first_names += [n.split()[0] for n in names]
    # extract surnames (last word of name = surname)
    total_surnames += [n.split()[-1] for n in names]
    # I don't know how to separate position from workplace name
    # sometimes it's separated by word "at",
    # sometimes it's separated by comma
    # I think the risk of mistake is too high so I just leave it raw
    total_workinfos += workinfos
    # extract city names from locations
    total_locations += [get_city_name_from_location(loc) for loc in locations]
    # click "next" button and scroll again
    br.find_elements_by_xpath('//*[@id="ember282"]')[0].click()

''' 2nd part of the code that will be used to create and save the excel file '''
import xlsxwriter

#total_names = ['abc', 'cde']
#total_workinfos = ['111', '222']
#total_locations = ['here', 'there']

workbook = xlsxwriter.Workbook('my_file.xlsx')
worksheet = workbook.add_worksheet("Linkedin people data")

# Write headings
worksheet.write(0, 0, "First name")
worksheet.write(0, 1, "Surname")
worksheet.write(0, 2, "Workinfo")
worksheet.write(0, 3, "Location")

row = 1
col = 0
for f_name, s_name, workinfo, location in zip(total_first_names, total_surnames, total_workinfos, total_locations):
    worksheet.write(row, col, f_name)
    worksheet.write(row, col + 1, s_name)
    worksheet.write(row, col + 2, workinfo)
    worksheet.write(row, col + 3, location)
    row += 1

workbook.close()

Excel file contents:
Output:
First name Surname Workinfo Location
Adam Smith Managing Director at SSK Recruitment Limited Sunderland
Adam Smith Founder & CEO The Real Junk Food Project Wakefield
Adam Smith Senior Sales Account Manager Reading
Adam Smith Shebang Security...…. Your Protection Guaranteed. Stevenage
Adam Smith Business Development Manager at Datapharm London
Adam Smith Business Development Manager at Orlo Birmingham
Adam Smith IT Management Professional currently seeking new opportunity - immediate availability - Call 07341 810387 to discuss Bolton
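By the way, the separate_odd_even_elements() helper in the code above can be written more compactly with slice steps; this does the same de-interleaving:

```python
def separate_odd_even_elements(l):
    # l[0::2] takes the elements at even indices, l[1::2] at odd indices
    return [l[0::2], l[1::2]]

print(separate_odd_even_elements([7, 2, 1, 3]))  # → [[7, 1], [2, 3]]
```

A side benefit: slicing tolerates an odd-length list (the loop version would raise an IndexError on l[i+1]), though an odd length would mean the scrape already went wrong.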
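For what it's worth, if you ever do want to split the workinfo strings into position and company, a best-effort heuristic could look like this. It's only a sketch: as noted in the code comments, neither " at " nor a comma is a reliable separator (see the rows above where both conventions appear), so treat the result as a guess:

```python
def split_workinfo(workinfo):
    # Best-effort heuristic: prefer " at " as the position/company
    # separator, fall back to the first comma; otherwise return the
    # whole string as the position with an empty company.
    if ' at ' in workinfo:
        position, company = workinfo.split(' at ', 1)
    elif ',' in workinfo:
        position, company = workinfo.split(',', 1)
    else:
        position, company = workinfo, ''
    return position.strip(), company.strip()

print(split_workinfo("Managing Director at SSK Recruitment Limited"))
# → ('Managing Director', 'SSK Recruitment Limited')
```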