Python Forum
How complex is this project?
#1
For the last month or so I have been reading up on Python, learning as much as I can, with the hope of eventually building a web crawler to pull a lot of data from a particular website. Recently I was mapping out what the project might look like and realised, having only just begun with Scrapy (and still very much being a Python amateur), that the item pipeline requirements might be quite complicated. I'm seeking an expert opinion on how workable the project is, so I don't waste time.

In short, I am aiming to scrape several thousand job advertisements for text from an Australian website, Seek.com.au, with the aim of analysing the frequency of particular linguistic expressions and the relationship between that frequency and a number of variables: locations (7), job classifications (30), sub-classifications (352), pay brackets (10) and work types (4).

Ideally, my spider would crawl through each of these categories and, for every job ad, feed it to a specified location for that particular combination of categories (NSW, Accounting, Bookkeeping, 0-40K p/a, FT, for example), so that when I figure out how I'm actually going to map or process the data, it's already in the right place.
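Something like this is what I have in mind, as a very rough sketch; the item fields and CSS selectors below are placeholders I've made up, not Seek's actual markup:

# Rough sketch only: one item per job ad, carrying its category values
# as fields. Field names and selectors are invented for illustration.
import scrapy

class JobAdItem(scrapy.Item):
    title = scrapy.Field()
    body = scrapy.Field()
    location = scrapy.Field()           # one of the 7 locations
    classification = scrapy.Field()     # one of the 30 classifications
    subclassification = scrapy.Field()  # one of the 352 sub-classifications
    pay_bracket = scrapy.Field()        # one of the 10 pay brackets
    work_type = scrapy.Field()          # one of the 4 work types

class SeekSpider(scrapy.Spider):
    name = "seek_jobs"
    # Placeholder URL; the real crawl would walk the category listings.
    start_urls = ["https://www.seek.com.au/jobs-in-accounting"]

    def parse(self, response):
        # Placeholder selectors; Seek's real markup will differ.
        for ad in response.css("article"):
            yield JobAdItem(
                title=ad.css("h2::text").get(),
                body=" ".join(ad.css("p::text").getall()),
                location="NSW",              # would be read from the page/URL
                classification="Accounting",
                subclassification="Bookkeeping",
                pay_bracket="0-40K",
                work_type="FT",
            )

That way each ad would carry its category combination with it, wherever it ends up being stored.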

Based on the number of variables, however, it seems my spider would have to crawl some ~300,000 unique combinations. And that's where my question comes in: how should I be thinking about structuring a way to filter the data? Should my scraper focus only on capturing data and do minimal sorting, leaving me to sort it some other way?

It seemed like a fairly simple idea when I started: scrape some job ads, sure. But now that I'm realising I want to do it properly, I wonder how long it would take me to become proficient enough at Scrapy to actually build it.

Any feedback or thoughts would be greatly appreciated, or a redirection to a more appropriate place. I thought about posting this in the Scrapy section of StackOverflow, but that seems to be more code-based. This is a coding issue, but in a more roundabout way, I suppose. Cheers.

NS
#2
Hi, I'm a newbie programmer myself, but I have tackled this sort of thing. It's not too complex: getting the data out is quite easy; the difficult part is conditioning it so that your script can read and filter it. The biggest issue I have had with external data is keeping as much of the data as possible. NaN, the placeholder value used to tell you that a piece of data is not a number, is really annoying (how useful is that?).

That aside, there are some really good tutorials out there. Google "python data analysis tutorial" and do the video search, and you should find a series that will give you what you are after. You will want the pandas and numpy Python libraries installed. Good luck.
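For example, this is the kind of clean-up I mean (just a sketch; the file and column names are made up):

# Sketch of the NaN clean-up I mean; file and column names are invented.
import pandas as pd

df = pd.read_csv("scraped_ads.csv")  # hypothetical output of the crawl

# Ads with no body text are useless for frequency analysis, so drop them.
df = df.dropna(subset=["body"])

# For categorical columns an explicit "unknown" label is easier to
# filter on than NaN.
df["pay_bracket"] = df["pay_bracket"].fillna("unknown")

print(df.isna().sum())  # how many missing values remain per column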
#3
You might want to start here: https://python-forum.io/Thread-Web-scrap...ht=scraper
#4
Sounds like you're talking about natural language processing... NLP. Scrapy will get the content but won't natively perform any analysis...
(Jan-24-2017, 06:29 AM)Nested_Sunlight Wrote: In short, I am aiming to scrape several thousand job advertisements for text from an Australian website, Seek.com.au, with the aim of analysing the frequency of particular linguistic expressions and the relationship between that frequency and a number of variables: locations (7), job classifications (30), sub-classifications (352), pay brackets (10) and work types (4).

NLTK would give you the frequency of expressions, with the ability to customise the search with logic for misspellings etc. Off the top of my head, my first design flow would be something like...
Scrape: title, post content, plus the variables you described, per post.
Using NLTK, reduce the total content by removing all stop words... then tokenize sentences (words as well?), chunk it all, then chink all that.... LMAO, you still reading?
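Very roughly, the flow I mean looks like this (the chunk grammar is just an example, and you'd need the nltk.download() calls in the comment first):

# Rough sketch of the NLTK flow above. Requires nltk.download("punkt"),
# nltk.download("stopwords") and nltk.download("averaged_perceptron_tagger").
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

text = "Seeking an experienced bookkeeper for a fast-paced accounting team."

# Tokenize and strip stop words.
stops = set(stopwords.words("english"))
tokens = [t for t in word_tokenize(text.lower())
          if t.isalpha() and t not in stops]

# Frequency of expressions.
freq = nltk.FreqDist(tokens)
print(freq.most_common(5))

# Chunk noun phrases, then chink the determiners back out of them.
grammar = r"""
  NP: {<DT>?<JJ>*<NN.*>+}  # chunk: optional determiner, adjectives, nouns
      }<DT>{               # chink: carve determiners out of the chunk
"""
tagged = nltk.pos_tag(word_tokenize(text))
tree = nltk.RegexpParser(grammar).parse(tagged)
print(tree)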

As far as analysis for the process I'm describing goes, it would depend on what OTHER variables I'd like to incorporate, such as date or whatever else I can think of...

But my point is, Scrapy alone won't get the job done...

PS: Using natural language processing or neural networks to perform data analysis is THE most complex type of development. Not because of the tools used in Python... if anything they are the simplest part, scikit-learn and NLTK being the more popular... but because of the amount of data parsed.