Python Forum
How complex is this project?
For the last month or so, I have been reading up on Python and learning as much as I can, with the hope of eventually building a web crawler to pull a lot of data from a particular website. Recently I was mapping out what the project might look like. Having only just begun with Scrapy (and still very much being a Python amateur), I realised that the item pipeline requirements for this project might be quite complicated. I am seeking an expert opinion on just how workable my project is, so I don't waste time.

In short, I am aiming to scrape the text of several thousand job advertisements from an Australian website, Seek.com.au, in order to analyse the frequency of particular linguistic expressions and how that frequency relates to a number of variables: locations (7), job classifications (30), sub-classifications (352), pay brackets (10) and work types (4).

Ideally, my spider would crawl through each of these categories and, for every job ad, feed it to a specified location for that specific set of categories (for example NSW, Accounting, book-keeping, 0-40K p/a, FT), so that when I figure out how I'm actually going to map or process the data, it's already in the right place.
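
A minimal sketch of that "one location per set of categories" idea, in plain Python rather than Scrapy. The field names, the `slugify` helper and the `ads` root folder are all hypothetical, chosen only for illustration; nothing here is taken from Seek's actual page structure.

```python
import re
from pathlib import PurePosixPath

# Hypothetical field names and order; not taken from the real site.
CATEGORY_FIELDS = (
    "location",
    "classification",
    "sub_classification",
    "pay_bracket",
    "work_type",
)


def slugify(value: str) -> str:
    """Lower-case a label and turn runs of non-alphanumerics into one hyphen."""
    return re.sub(r"[^a-z0-9]+", "-", value.lower()).strip("-")


def output_path(ad: dict, root: str = "ads") -> PurePosixPath:
    """Build a folder path from the ad's category values, in a fixed order."""
    parts = [slugify(ad[field]) for field in CATEGORY_FIELDS]
    return PurePosixPath(root, *parts)


# Made-up record using the example categories from the post.
example_ad = {
    "location": "NSW",
    "classification": "Accounting",
    "sub_classification": "Book-keeping",
    "pay_bracket": "0-40K p/a",
    "work_type": "FT",
    "text": "full ad text would go here",
}

print(output_path(example_ad))  # ads/nsw/accounting/book-keeping/0-40k-p-a/ft
```

Because the path is derived entirely from fields stored on the ad itself, the same record could just as easily be written to a single file and sorted later.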

Based on the number of variables, however, it seems my spider would have to crawl some ~300 000 unique combinations. And that's where I come to my question: how should I be thinking about structuring a way to filter the data? Should my scraper focus only on catching the data with minimal sorting, while I find a different way to sort it afterwards?
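
A rough check of how many combinations the listed category counts imply, followed by a sketch of the "catch first, sort later" alternative. The total depends heavily on whether sub-classifications nest under classifications, which is an assumption about the site rather than something confirmed here. The `tally` helper and the sample records are hypothetical.

```python
from collections import Counter
from math import prod

# Category counts as listed in the post.
CATEGORY_COUNTS = {
    "location": 7,
    "classification": 30,
    "sub_classification": 352,
    "pay_bracket": 10,
    "work_type": 4,
}

# Upper bound: every category treated as independent of the others.
all_independent = prod(CATEGORY_COUNTS.values())

# Assumption: each sub-classification sits under exactly one classification,
# so the classification adds no extra combinations.
nested = prod(n for name, n in CATEGORY_COUNTS.items() if name != "classification")

print(f"{all_independent:,}")  # 2,956,800
print(f"{nested:,}")           # 98,560


def tally(ads, *fields):
    """Count ads per combination of the given category fields."""
    return Counter(tuple(ad[field] for field in fields) for ad in ads)


# Made-up records standing in for scraped ads.
ads = [
    {"location": "NSW", "work_type": "FT", "text": "first ad"},
    {"location": "NSW", "work_type": "PT", "text": "second ad"},
    {"location": "VIC", "work_type": "FT", "text": "third ad"},
]

by_location = tally(ads, "location")
print(by_location[("NSW",)])  # 2
```

If every ad carries its categories as fields, the spider only needs to visit each ad once, and any grouping (by location, by pay bracket, or by any combination) can be computed afterwards without crawling each combination separately.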

It seemed like a fairly simple idea when I started: scrape some job ads, sure. But now that I realise I want to do it properly, I wonder how long it would take me to become proficient enough at Scrapy to actually build it.

Any feedback or thoughts would be greatly appreciated, as would redirection to a more appropriate place. I thought about posting this on the Scrapy forum at Stack Overflow, but they seem to be more code-based. This is a coding issue, but in a more roundabout way, I suppose. Cheers.

NS


Messages In This Thread
How complex is this project? - by Nested_Sunlight - Jan-24-2017, 06:29 AM
RE: How complex is this project? - by iFunKtion - Jan-26-2017, 01:15 PM
RE: How complex is this project? - by Larz60+ - Jan-26-2017, 09:30 PM
RE: How complex is this project? - by scriptso - Feb-06-2017, 11:26 PM
