Python Forum

Scraping a page with log in data (security, proxies)
Hey guys,

I have a few questions regarding the safety issues whilst scraping the web.

I'm working for an agency that sells tickets for a big flight company, and I want to scrape client data from my agency page on the flight company's website.
In order to scrape the data I have to do the following:

1. Login
2. Click on all the needed buttons to get to a page with a list of my clients (buttons/links)
3. Click on first person (button/link)
4. Scrape person nr. 1
5. Go back to people list
6. Click on the next person and scrape their data
7. Repeat 5.-6. until every person in the list is done
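
To make the steps concrete, the loop in 3.-7. could be sketched roughly like this. It is only an illustration, not my actual program: the two CSS selectors are made-up placeholders, and `driver` is assumed to be a Selenium WebDriver that is already logged in and showing the people list (steps 1.-2.). A real page would most likely also need explicit waits after each click.

```python
# Sketch only: both selectors are hypothetical placeholders, not the real site.
PERSON_LINK = "a.person"    # assumed selector for one entry in the people list
DETAIL_FIELD = ".detail"    # assumed selector for the data on a person page


def scrape_all_people(driver):
    """Visit every person in the list (steps 3-7) and collect the page text.

    `driver` is expected to behave like a Selenium WebDriver that is already
    logged in and currently showing the people list (steps 1-2).
    """
    results = []
    count = len(driver.find_elements("css selector", PERSON_LINK))
    for index in range(count):
        # Look the links up again on every pass: elements found before a
        # navigation usually go stale once the list page is loaded again.
        links = driver.find_elements("css selector", PERSON_LINK)
        links[index].click()
        fields = driver.find_elements("css selector", DETAIL_FIELD)
        results.append([field.text for field in fields])
        driver.back()  # step 5: return to the people list
    return results
```

Re-finding the links inside the loop is a deliberate choice, since references kept from before going back to the list tend to raise stale element errors.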

I already built a program, but I am hesitating to start it because I am scared the travel agency could find out. The script should basically run for about two thirds of a day, 5 days a week, since I work there and I don't want to write all that stuff down by hand. And since I am using my login information, the flight company should be able to tell where the requests are all coming from, even if I rotate through proxies and so on. I am not well read in internet security or in how the web is built in general, so I need some help.

Here are the questions I don't really have a good answer to:
1.) What would be the safest way to keep such an infrastructure alive as long as possible without being tracked or spotted by the travel company?
2.) Currently I am using Selenium and rotating through proxies and user agents, but I don't know what would be smarter. Since I am scraping while logged in, can the flight company (the web page) find out that the same login data is being used from different IP addresses, i.e. from different places? If that's the case, I think it would be counter-productive to use rotating proxies. Or at least I would need proxies from my own country, I guess.
3.) If the travel company can't find the connection between my login data and my IP addresses: do I have to log in again every time I switch proxies? Or can I stay logged in and switch my proxy?
4.) I am also imitating human behavior. I have a huge list of waiting times with probabilities, and the code randomly picks one that a human could plausibly wait at that point. So sometimes 1 second, sometimes 10-20 seconds, for example.
5.) Can the web page detect my mouse movement?
If yes, would I then have to consider mouse movement in 4.) as well?
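
To show what I mean in 4.), here is a minimal sketch of the weighted random wait. The delay table, the weights and the 20% jitter are invented for this example and are not my real numbers; `pick_delay` and `human_pause` are just names I made up for the illustration.

```python
import random
import time

# Hypothetical delay table: (seconds, relative weight). All numbers are made up.
DELAY_TABLE = [(1.0, 50), (3.0, 30), (10.0, 15), (20.0, 5)]


def pick_delay(rng=random):
    """Pick one waiting time, weighted by how likely a human is to wait that long."""
    seconds = [s for s, _ in DELAY_TABLE]
    weights = [w for _, w in DELAY_TABLE]
    base = rng.choices(seconds, weights=weights, k=1)[0]
    # Stretch the value by up to 20% so the exact same delays do not repeat.
    return base * rng.uniform(1.0, 1.2)


def human_pause(rng=random):
    """Sleep for a randomly chosen delay and return how long the pause was."""
    delay = pick_delay(rng)
    time.sleep(delay)
    return delay
```

Calling `human_pause()` between two clicks would then block for the chosen time. Passing a seeded `random.Random` as `rng` makes the sequence of delays reproducible while testing.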


I can add some code if my explanation is not clear enough, but my problem is not a lack of coding skill but a lack of knowledge about all this security and web stuff.

I would even be thankful if you could point me to some topics that I can learn or study in order to try some new approaches. Currently I am stuck and I don't even know what to look for.
Any books or documentation on the topic would also be very much appreciated.

Thanks