Python Forum

Full Version: facebook friends crawler
hey guys,

so recently i started getting into python again and i was thinking about taking on a bigger challenge: a facebook friends list crawler.
i've done a crawler before using 'requests' and 'beautifulsoup' modules and it was kind of ok but nothing special (you can find it in my previous posts).
i've reused some of the code just to get me started but i've gotten to a sticking point.

first i looked around someone's fb main page just to see what it would look like, and the home page is pretty straightforward:
facebook.com/[id] (where id is the person's facebook id; most of them start with 1000...)

then i went over to 'friends' section and i noticed it has the following format:
facebook.com/friends = [your id] & [some id you find in person's homepage source code] & [some id i'm assuming fb sends back to create a session] = friends
this is obviously grossly over simplified but it's just to get an idea.

i also noticed that the session id (the whole link) stays consistent throughout the entire session. meaning, if i go on a friend's page, go to 'friends' and generate the session link, then open a new tab and copy/paste that link, it takes me to the same page.
if i close all tabs and paste the link later, it shows a blank page.
and out of those 3 numbers, the first 2 stay constant for one friend, while the 3rd (the fb session id) changes every time.

so i took the link and kept the session open just to be able to access it with requests and bs4 and here is where i'm stuck right now:
every friend's name appears inside a <div> that contains an element with class="fsl fwb fcb". inside that there's an <a> tag whose href is the friend's fb homepage, plus a data-gt attribute that holds the fb id, which is what we're after in our crawler.
problem is that when python makes the request it looks like an anonymous request: it only takes you to the person's homepage and says "log into fb to continue".
i found this out by printing the response from requests.get(), because i kept getting nothing when i searched the page for data-gt.

so my question is: do you guys have any idea how to make a request from python that is not anonymous? meaning i log in with my own id, as if i were browsing from my homepage.
if you want i can post the source code; there's next to nothing in there at the moment.

thanks
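a request only looks anonymous because it carries no session cookies. a minimal sketch of an authenticated request using just the standard library -- the cookie values below are placeholders you would copy from a logged-in browser session (DevTools -> Application -> Cookies):

```python
import urllib.request

# c_user / xs are facebook's session cookies; the values below are
# placeholders -- copy the real ones from a logged-in browser session
cookies = {'c_user': '100000000000000', 'xs': 'PLACEHOLDER_TOKEN'}
cookie_header = '; '.join('%s=%s' % (k, v) for k, v in cookies.items())

req = urllib.request.Request(
    'https://www.facebook.com/100000000000000/friends',
    headers={
        'Cookie': cookie_header,
        'User-Agent': 'Mozilla/5.0',  # avoid the default python-urllib user agent
    },
)
# urllib.request.urlopen(req) would now send the request with the session attached
print(req.get_header('Cookie'))
```

this only works as long as the copied cookies are still valid; facebook expires them, so it is a stopgap rather than a real solution.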
After scraping facebook with both their API and selenium, I much prefer selenium as I am not so restricted. You will be OK as long as you imitate human delays. You might trigger a captcha if you do it a lot, but if you are not processing multiple accounts automatically, then you would most likely be there at the PC to pass the captcha and let the program move on afterwards.
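The "human delays" part can be as simple as a helper that sleeps a random interval between actions (a sketch; the default bounds are arbitrary):

```python
import random
import time

def human_delay(lo=1.5, hi=4.0):
    """Sleep a random, human-looking interval between browser actions."""
    pause = random.uniform(lo, hi)
    time.sleep(pause)
    return pause

# e.g. call human_delay() between every click / page load instead of a fixed sleep
```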
hey, thanks a lot.

i played a bit with webdriver and chromedriver and all i can say is: 'facebook api what?'. :)))
i set up a fb account and made my way to a user's home page, but i keep getting stuck at pressing the friends button.
i use the chrome.find_element_by_css_selector function, but apparently the value for the 'friends' button differs from one account to another, and even from one browser to another, so i keep running into errors and having to patch the right selector into the script.
hopefully i can make the crawler retrieve the css value on its own so i don't have to babysit it all the time. :D

and by the way, is there any way of doing this faster? like not having to load a chrome page and just having the code do the actions in the background? (not just running the browser in the background, but not running it at all and having python make the connection itself)

and also, is there any way of using just regular chrome instead of the chromedriver client? it's a bit of a pain to keep having to choose notification settings, password settings and all that stuff.
Quote: apparently the value for the 'friends' button differs from one account to another
If I go to my account and inspect the friends button on multiple accounts, the link always has the class name _6-6. Also, are you adding a delay? You need to make sure you have finished logging in before trying to click a button on the following page.

To use it headless you would use PhantomJS instead of Chrome or Firefox. It is just that though, headless: it still loads the page and takes about the same time. This is required for javascript, which python cannot really get around on its own, and facebook has plenty of it.

You can use straight python to connect, but you are going to run into a lot of hindrances. They purposely have anti-bot measures and obfuscate their code to stop you from doing this, not to mention the big one... javascript.
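On the delay point: instead of fixed sleeps, selenium also has WebDriverWait(driver, timeout).until(condition), which simply polls until the condition holds. The core idea, stripped of selenium so it is easy to see:

```python
import time

def wait_for(condition, timeout=10, poll=0.5):
    """Poll `condition` until it returns a truthy value, or raise on timeout.
    This is essentially what selenium's WebDriverWait.until() does."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        result = condition()
        if result:
            return result
        time.sleep(poll)
    raise TimeoutError('condition not met within %s seconds' % timeout)

# toy usage: the condition becomes true on the third poll
state = {'polls': 0}
def page_ready():
    state['polls'] += 1
    return state['polls'] >= 3

wait_for(page_ready, timeout=5, poll=0.01)
```

with a real driver the condition would be something like a lambda that looks up the login button and returns it once present.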
hmm, you're right. the element has a class="_6-6" that stays consistent throughout.
i'll try selecting it by that class name instead and see what happens.

say, can i at least make chromedriver save cookies, so it doesn't have to log in and press 'block notifications' every time?
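selenium exposes driver.get_cookies() and driver.add_cookie(), so one common pattern is to pickle the cookies after a successful login and load them back on the next run. A sketch, with placeholder data standing in for what get_cookies() would actually return:

```python
import os
import pickle
import tempfile

# placeholder for the list of dicts driver.get_cookies() returns after login
cookies = [{'name': 'c_user', 'value': 'PLACEHOLDER', 'domain': '.facebook.com'}]

path = os.path.join(tempfile.gettempdir(), 'fb_cookies.pkl')

# after login:  pickle.dump(driver.get_cookies(), f)
with open(path, 'wb') as f:
    pickle.dump(cookies, f)

# next run, after driver.get('https://www.facebook.com'):
#   for c in restored: driver.add_cookie(c)
with open(path, 'rb') as f:
    restored = pickle.load(f)

print(restored[0]['name'])
```

note that add_cookie() only works once you have already navigated to the cookie's domain.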
i was writing a script to show how you could select the friends button... it also gets rid of the notifications popup as well. This is more geared towards chrome, not phantomjs.

from selenium import webdriver
import time
import os

URL = 'https://www.facebook.com/'
CHROMEPATH = '/home/metulburr/chromedriver'
PHANTOMPATH = '/home/metulburr/phantomjs'
EMAIL = ''
PASSWORD = ''

class App:
	def __init__(self):
		self.setup_chrome()
		#self.setup_headless()
		self.login()
		self.to_home()
		self.to_friends()
		time.sleep(100000) #keep alive to view html
		
	def delay(self):
		time.sleep(3)
	
	def chrome_prep(self):
		'''get rid of asking to save password and notifications popup'''
		chrome_options = webdriver.ChromeOptions()
		chrome_options.add_experimental_option(
			'prefs', {
				'credentials_enable_service': False,
				"profile.default_content_setting_values.notifications" : 2,
				'profile': {
					'password_manager_enabled': False
				}
			}
		)
		return chrome_options
		
	def setup_chrome(self):
		options = self.chrome_prep()
		os.environ["webdriver.chrome.driver"] = CHROMEPATH
		self.browser = webdriver.Chrome(CHROMEPATH, chrome_options=options)
		self.browser.set_window_position(0,0)
		self.delay()
		
	def setup_headless(self):
		self.browser = webdriver.PhantomJS(PHANTOMPATH)
		self.delay()
		
	def login(self):
		self.browser.get(URL) 
		time.sleep(1) 
		username = self.browser.find_element_by_id("email")
		password = self.browser.find_element_by_id("pass")
		username.send_keys(EMAIL)
		password.send_keys(PASSWORD)
		login_attempt = self.browser.find_element_by_xpath("//*[@type='submit']")
		login_attempt.submit()
		self.delay()
		
	def to_home(self):
		self.browser.execute_script("document.getElementsByClassName('linkWrap noCount')[0].click()")
		self.delay()
	
	def to_friends(self):
		self.browser.execute_script("document.getElementsByClassName('_6-6')[2].click()")
		self.delay()
		
App()
hey man, you're not gonna believe how ridiculously easy this was, like, embarrassingly easy. xD

browser = webdriver.Chrome('PATH_TO_CHROMEDRIVER')  # path to your chromedriver binary

def facebook():
    # here goes the login part of the code
    browser.get('https://facebook.com/FRIEND ID/friends')  # get() loads the page in place (it returns None)
yep, that is it; no clicking, no selenium gymnastics, no nothing. xD
i will still want to do it with selenium though cuz it feels more pro, but you don't need to make that special session request manually; you can just access the href link and the browser builds the session for you.
also, notice that i left the browser outside of the function. that way it doesn't close after the function is done, and you don't have to specify an infinite sleep timer either. :)

also, is there any way of filtering a 'find_element_' function through multiple results?
for example: if i say browser.find_element_by_link_text('Friends'), there are like 4 or 5 matches on the page.
is there any way of storing those in a list and then applying browser.find_element_by_SOMETHING_ELSE() to get the exact button i'm looking for?

i'm still working on the chrome options portion of it, i'll post more code when i figure it out.
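on the filtering question: yes -- the plural find_elements_by_link_text() returns a plain list of WebElements, and each WebElement has its own get_attribute() / find_element_* methods scoped to it, so you can filter in ordinary Python. a sketch using a stand-in class instead of a live browser (the class names and hrefs are made up):

```python
# Stand-in for selenium WebElements, to show the filtering pattern
# without needing a running browser.
class FakeElement:
    def __init__(self, cls, href):
        self._attrs = {'class': cls, 'href': href}

    def get_attribute(self, name):
        return self._attrs[name]

# browser.find_elements_by_link_text('Friends') would return a list like this:
elements = [
    FakeElement('_42ft', '/ads'),
    FakeElement('_6-6', '/some.friend/friends'),   # the one we want
    FakeElement('sidebar', '/friends/requests'),
]

# filter the list in plain Python, exactly as you would with real WebElements
target = next(e for e in elements if '_6-6' in e.get_attribute('class'))
print(target.get_attribute('href'))  # /some.friend/friends
```

with real elements you would then call target.click().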
hey guys

so i've successfully managed to load the friends page and make it grab all the friends' names and ids, but the problem is that the window only shows 20 friends until you scroll down, which loads more.
now i've figured out how to make selenium scroll down indefinitely but, and this is kind of a stupid question, now i'm trying to make the scroll stop when it gets to the 'more about FRIEND' tag line.

i've put some code together here just to get an idea of what we're working with.

from selenium import webdriver
import time

chrome = webdriver.Chrome(r'C:\chromedriver_win32\chromedriver.exe')

def facebook():
    chrome.get('https://facebook.com/login')
    user = chrome.find_element_by_css_selector('#email')
    user.send_keys('')
    password = chrome.find_element_by_css_selector('#pass')
    password.send_keys('')
    login = chrome.find_element_by_css_selector('#loginbutton')
    login.click()

    chrome.get('https://www.facebook.com/FRIEND NAME/friends')
    time.sleep(2)
    friend_name = chrome.find_element_by_id('fb-timeline-cover-name')
    print(friend_name.text)

    more_about = 'More About ' + friend_name.text
    print(more_about)
    while True:  # scrolls forever -- still needs a stop condition
        chrome.execute_script('window.scrollTo(0, document.body.scrollHeight);')
        time.sleep(1)  # give the next batch of friends time to load
facebook()
i know i'm supposed to use a 'for' or a 'while' loop but can't get my head around either of them.
hope you guys know more than i do.
thanks a lot :)
Find something at the bottom of the page that signifies you are at the bottom (something you have to scroll to see), then constantly check for it in the loop: if it is not there, keep scrolling; otherwise stop.

The other option is to grab the total number of friends from "All Friends" as an int and compare it with the count of li tags. So if you had 33 friends, once the li tags in the friends ul add up to 33, you have hit the bottom.

EDIT:
actually this is better and faster
def scroll_to_bottom(driver):
    SCROLL_PAUSE_TIME = 0.5
    # Get scroll height
    last_height = driver.execute_script("return document.body.scrollHeight")

    while True:
        # Scroll down to bottom
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")

        # Wait to load page
        time.sleep(SCROLL_PAUSE_TIME)

        # Calculate new scroll height and compare with last scroll height
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            # height stopped changing: we have hit the bottom
            break
        last_height = new_height