Python Forum
Help to extract data from web
#1
Hi,

I want to export user data (first name, last name, company, job title and location) from a LinkedIn people search to an Excel sheet. Is it possible to write a script for this in Python? I am not a Python developer, so I would appreciate any help if possible; there are hundreds of users to export. Please refer to the image below for the required data. Thanks in advance.

[Image: EvxHCnh]
#2
(May-18-2019, 05:26 AM)prasadmathe Wrote: is that possible to develop a script using python as i am not a python developer can anyone help me please if possible, as there are 100's of users to export into excel sheet

We are glad to help, but we are not going to do your work for you. Please post your code in Python tags, the full traceback (if you get any errors) in error tags, and ask specific questions.
If you can't explain it to a six year old, you don't understand it yourself, Albert Einstein
How to Ask Questions The Smart Way: link and another link
Create MCV example
Debug small programs

#3
As I am not a Python developer, I cannot write the code myself, so I asked here in case someone could help me with a script. Sorry if that is against the forum rules.
#4
As buran said, we're happy to help you learn Python, but we're not going to do your work for you.
Feel like you're not getting the answers you want? Check out the help/rules for things like what to include (and not include) in a post, how to use code tags, how to ask smart questions, and more.

Pro-tip - there's an inverse correlation between the number of lines of code posted and my enthusiasm for helping with a question :)
#5
Hi, I wrote something like this. I'm not sure how reliable it will be. It finds the "Next" (next page) button by its XPath, which changed once while I was testing it (for some reason), but then went back to the value used in the code, '//*[@id="ember282"]'.

Anyway, I tested it and, apart from the one hiccup described above, it worked fine and the Excel file was created.
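Ember ids like "ember282" are auto-generated, so a locator keyed to a stable attribute is usually less brittle. A minimal sketch of the idea using the standard library's ElementTree XPath support; the aria-label="Next" attribute is an assumption about LinkedIn's markup, not taken from the real page source:

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment of the pagination markup -- the aria-label
# attribute is an assumption, not copied from the real page.
html = """<div>
  <button id="ember282" aria-label="Next"><span>Next</span></button>
</div>"""

root = ET.fromstring(html)

# Match on the stable attribute instead of the generated ember id;
# the same XPath expression can be passed to Selenium's
# find_elements_by_xpath() instead of '//*[@id="ember282"]'.
btn = root.find(".//button[@aria-label='Next']")
print(btn.get("id"))  # ember282
```

If the real button carries no stable attribute, matching on its visible text is another option, but either way the locator should be checked against the live page before relying on it.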

It uses BeautifulSoup, Selenium and XlsxWriter, which you can install by opening your command prompt (possibly as administrator, if Python is installed on your C: drive like mine) and executing:
py -3 -m pip install beautifulsoup4 selenium xlsxwriter

You'll also have to install the Google Chrome driver (ChromeDriver; I have version 74).

Once you run it, it will ask you to log in; wait for the search results to appear, then press Enter as prompted in the output:
Output:
Login, wait for page with search to load and press enter...
'''
1st part of the code to download the data
'''

from selenium import webdriver
from bs4 import BeautifulSoup
from time import sleep, time

url = "https://www.linkedin.com/people/search?firstName=Adam&lastName=Smith&trk=homepage-jobseeker_people-search-bar_search-submit"

def get_src(): # used to refresh the content
    # br is the module-level browser object; grab the root element's
    # outerHTML to get the current (JavaScript-rendered) page source
    return br.find_element_by_xpath("//*").get_attribute("outerHTML")

def scroll_consistently(seconds):
    # keep scrolling to the bottom for the given number of seconds,
    # giving lazily-loaded results time to appear
    start = time()
    while time() - start < seconds:
        br.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        sleep(0.2)
    
def separate_odd_even_elements(l):
    ''' Splits alternating elements into two lists (assumes even length).
        Input  = [7, 2, 1, 3]
        Output = [[7, 1], [2, 3]]
    '''
    a = []
    b = []
    for i in range(0, len(l), 2):
        a.append( l[ i ] )
        b.append( l[ i+1 ] )
    return [a, b]

def get_city_name_from_location(loc):
    if ',' in loc:
        return loc.split(',')[0]
    return loc

br = webdriver.Chrome()

br.get(url)

input("Login, wait for page with search to load and press enter...")

total_first_names = []
total_surnames = []
total_workinfos = []
total_locations = []

# search results are displayed on many pages
# loop below will get the results from 10 pages
for i in range(10):
    # scroll to the bottom (otherwise some results are not displayed)
    # repeat it multiple times during 4 seconds
    scroll_consistently(4)

    # get fresh page source
    soup = BeautifulSoup(get_src(), "html.parser")

    # find the data
    names = [e.get_text() for e in soup.find_all('span', class_='name actor-name')]
    workinfos_with_locations = [e.get_text() for e in soup.find_all('span', {'dir':'ltr'})]

    # just to make sure that there are 2 (work info + location) per 1 name
    assert (len(names) * 2) == len(workinfos_with_locations)

    # separate workinfos and locations from 1 list into 2
    workinfos, locations = separate_odd_even_elements(workinfos_with_locations)

    # extract first names (first word of name = first name)
    total_first_names += [n.split()[0] for n in names]
    
    # extract surnames (last word of name = surname)
    total_surnames += [n.split()[-1] for n in names]

    # I don't know how to separate position from workplace name
    # sometimes it's separated by word "at",
    # sometimes it's separated by comma
    # I think the risk of mistake is too high so I just leave it raw
    total_workinfos += workinfos
    
    # extract city names from locations
    total_locations += [get_city_name_from_location(loc) for loc in locations]

    # click the "Next" button (note: this ember id can change between sessions)
    br.find_elements_by_xpath('//*[@id="ember282"]')[0].click()



'''
2nd part of the code that will be used to create and save the excel file
'''


import xlsxwriter

#total_names = ['abc', 'cde']
#total_workinfos = ['111', '222']
#total_locations = ['here', 'there']

workbook = xlsxwriter.Workbook('my_file.xlsx')
worksheet = workbook.add_worksheet("Linkedin people data")

# Write headings
worksheet.write(0, 0, "First name")
worksheet.write(0, 1, "Surname")
worksheet.write(0, 2, "Workinfo")
worksheet.write(0, 3, "Location")

row = 1
col = 0
for f_name, s_name, workinfo, location in zip(total_first_names, total_surnames, total_workinfos, total_locations): 
    worksheet.write(row, col, f_name) 
    worksheet.write(row, col + 1, s_name)
    worksheet.write(row, col + 2, workinfo)
    worksheet.write(row, col + 3, location)
    row += 1
  
workbook.close() 
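The two pure-Python helpers in the first part can be sanity-checked without a browser or a LinkedIn account. They are restated here so the snippet runs standalone, with made-up sample strings in place of scraped data:

```python
# Standalone copies of the two helpers from the script above,
# so they can be exercised without Selenium or a browser.

def separate_odd_even_elements(l):
    # split [w0, l0, w1, l1, ...] into [[w0, w1, ...], [l0, l1, ...]]
    a, b = [], []
    for i in range(0, len(l), 2):
        a.append(l[i])
        b.append(l[i + 1])
    return [a, b]

def get_city_name_from_location(loc):
    # "London, England" -> "London"; single-part locations pass through
    return loc.split(',')[0] if ',' in loc else loc

# made-up scraped strings, alternating work info and location
scraped = ["Director at Foo Ltd", "London, England",
           "CEO at Bar", "Reading"]
workinfos, locations = separate_odd_even_elements(scraped)
print(workinfos)   # ['Director at Foo Ltd', 'CEO at Bar']
print([get_city_name_from_location(l) for l in locations])  # ['London', 'Reading']
```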
Excel file contents:

Output:
First name | Surname | Workinfo | Location
Adam | Smith | Managing Director at SSK Recruitment Limited | Sunderland
Adam | Smith | Founder & CEO The Real Junk Food Project | Wakefield
Adam | Smith | Senior Sales Account Manager | Reading
Adam | Smith | Shebang Security...…. Your Protection Guaranteed. | Stevenage
Adam | Smith | Business Development Manager at Datapharm | Londyn
Adam | Smith | Business Development Manager at Orlo | Birmingham
Adam | Smith | IT Management Professional currently seeking new opportunity - immediate availability - Call 07341 810387 to discuss | Bolton
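If a plain spreadsheet is enough, the standard library's csv module can produce a file that Excel opens directly, with no extra install. A minimal sketch, using made-up rows in place of the scraped total_* lists:

```python
import csv

# Made-up sample rows standing in for the scraped
# total_first_names / total_surnames / total_workinfos / total_locations lists.
rows = [
    ("Adam", "Smith", "Managing Director at SSK Recruitment Limited", "Sunderland"),
    ("Adam", "Smith", "Senior Sales Account Manager", "Reading"),
]

with open("linkedin_people.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["First name", "Surname", "Workinfo", "Location"])
    writer.writerows(rows)

# read it back to confirm what was written
with open("linkedin_people.csv", newline="") as f:
    print(len(list(csv.reader(f))))  # 3 (header + 2 data rows)
```

CSV loses formatting and multiple-worksheet support, so XlsxWriter remains the better fit when a real .xlsx file is required.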

