Python Forum

Full Version: Capstone Project - Room for Python?
Hello All, very newbie general Python question here.

I am a grad student in BI/BA and starting my capstone in the Fall. Quick summary: it's just an 8-month-long project to implement a BI environment (the entire framework, database, project plan, data governance, BI tools, ETL, etc.). The teams have essentially open access to choose the question being asked and what technologies to employ.

In addition to getting a good grade, I want to learn something new and use this as a portfolio to show potential employers. Looking around at job boards, I see a lot of Python (much more than R) listed in many data-centric roles.

So my question is, what would I use Python for in a project like this? Could I integrate Python into part of this project? I always thought Python was similar to R but the more I read/look into it, the more I'm not so sure.

Just for reference, our team project is going to build a BI system around Fantasy Football statistics. Nothing too crazy or difficult, but lots of stats and both a trend and a real-time component. Looking at using MySQL for the database, PostgreSQL for the data warehouse, Jaspersoft ETL, and the Pentaho BI suite.


Thanks
(Aug-04-2017, 02:02 AM)QueenBee Wrote: So my question is, what would I use Python for in a project like this?

Um, everything? If your sample database is expected to be small to modest, you could use Python's built-in "sqlite3" module; if it is expected to be larger, or you want more flexibility, I would go with PostgreSQL, though Python will work with most compliant databases. I don't really see the need to have two separate types. For number crunching, you could use "numpy"; if you want some snappy graphs and plots, use "matplotlib"; to create a pretty user interface, toss in wxPython. Python is also cross-platform, so you could write your program to work on Windows, Linux, *nix and Mac operating systems. Python is easy to learn (even for an old geezer like me :-) ). With thousands of 3rd party modules, I think you would be hard pressed to not find a solution to a particular need.
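
To make the "sqlite3" suggestion a bit more concrete, here is a minimal sketch using only the standard library. The table name, column names and sample rows are made up for illustration (they are not real statistics), and an in-memory database is assumed so nothing is written to disk.

```python
import sqlite3

# Hypothetical sample rows: (player, week, fantasy points). Not real statistics.
SAMPLE_ROWS = [
    ("Player A", 1, 18.5),
    ("Player A", 2, 22.0),
    ("Player B", 1, 9.5),
    ("Player B", 2, 14.5),
]


def build_db(rows):
    """Create an in-memory database and load the sample rows into it."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE weekly_points (player TEXT, week INTEGER, points REAL)"
    )
    # Parameterized inserts; sqlite3 uses '?' placeholders.
    conn.executemany("INSERT INTO weekly_points VALUES (?, ?, ?)", rows)
    conn.commit()
    return conn


def average_points(conn):
    """Return a dict mapping each player to their average weekly points."""
    cursor = conn.execute(
        "SELECT player, AVG(points) FROM weekly_points "
        "GROUP BY player ORDER BY player"
    )
    return dict(cursor.fetchall())


conn = build_db(SAMPLE_ROWS)
averages = average_points(conn)
print(averages)
```

Swapping in a file path instead of ":memory:" would persist the data. Moving to PostgreSQL or MySQL later would mean a different driver module and possibly a different placeholder style, but the overall connect / execute / fetch pattern stays much the same.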
Sorry for the delay in responding and thank you for responding to this.

I had a feeling you were going to say everything and I mean that positively. One of my friends is an R guru and anytime I have almost any data issue his response is "R can do that". Scaling this response down, so you would recommend PostgreSQL over MySQL or MariaDB because of the integration with Python? I was going to use MySQL as the database but since I'll be mining the data warehouse maybe I'll use PostgreSQL instead. I've just never used PostgreSQL before and am not sure of its functionality.

Thanks all.

QB
Your friend is right that any time you have a data issue R can do that. That's because R was designed from the start to be a statistics language. Python is a general purpose language. But more and more statistics and data packages have been added to Python, and it can now handle a lot of the stuff that R can. If you are doing really serious statistics (it doesn't sound like you are) I would go with R. If you want to do things beyond the data analysis, like have an interface, I would go with Python.

In the long run, as a student in an analysis field, I would learn both if you can.
Yea, it went way beyond Data issues with my friend. It was almost like anything I said, "ya know, R can do that", guy was passionate about R.

Well, that's the context, I'm not necessarily going into a Data Science role. Although I see the value of learning R, my 'learning time' is finite and I feel going into a Data Analyst/Data Engineering role, Python would serve me well in addition to Database Skills and ETL skills as the top 3 skills to master first.

I've already had roles as a BI Developer and Operations Analyst and I've seen a ton of postings lately with Python as a required or recommended skill, which is why I brought it up, but if I'm looking at this wrong, do tell.

Still not clear how to integrate it into my project though.

QB
(Aug-07-2017, 12:55 PM)QueenBee Wrote: so you would recommend Postgre over MySQL or MariaDB because of the integration with Python?

No, I only mention it because several members have voiced positive remarks about it. This site, ODBC Drivers, lists the DBs which can then be used with the appropriate Python ODBC interface. So, if you are more comfortable using MySQL (or another DB), by all means use it.

As to the actual data, rather than entering it all by hand, you might rather use web scraping to gather the information automatically from an existing source like this site, fftoday stats (note: this is not a recommendation nor promotion for this site, it's just the first one that came up in a web search). We have some very good tutorials on how to do this here, Web Scraping, and here, Web scraping with Scrapy. (Yes, those are both blatant promotions.) You could collect the data as often as you like: every hour, every week, every month (or until they block your IP address if you do it too often).
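
As a rough idea of what the parsing half of scraping looks like, here is a small sketch using only the standard library's html.parser. The HTML below is a made-up stand-in for a page you have already downloaded (the players and numbers are hypothetical), and a real stats page will almost certainly have a messier layout. The tutorials mentioned above use third-party libraries that make this easier.

```python
from html.parser import HTMLParser

# Hypothetical HTML, standing in for a page that has already been downloaded.
SAMPLE_HTML = """
<table>
  <tr><th>Player</th><th>Yards</th><th>TD</th></tr>
  <tr><td>Player A</td><td>312</td><td>3</td></tr>
  <tr><td>Player B</td><td>275</td><td>2</td></tr>
</table>
"""


class TableParser(HTMLParser):
    """Collect the text of every table cell, one list per table row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None       # cells of the row currently being read
        self._in_cell = False  # True while inside a <td> or <th>

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_data(self, data):
        # Only keep text that sits inside a cell; skip whitespace between tags.
        if self._in_cell:
            self._row[-1] += data

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False
        elif tag == "tr" and self._row is not None:
            self.rows.append([cell.strip() for cell in self._row])
            self._row = None


def parse_stats(html):
    """Return one dict per data row, keyed by the header row's cell text."""
    parser = TableParser()
    parser.feed(html)
    header, *body = parser.rows
    return [dict(zip(header, row)) for row in body]


stats = parse_stats(SAMPLE_HTML)
print(stats)
```

Every value comes back as a string, so you would convert the numeric columns before loading them into your database.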

Finally, once you have your data, you will want to present it to the end user, either as a standalone GUI or perhaps even an interactive web page.

If, as ichabod801 points out, that R is better at data manipulation, then use it for that. If Python is better at presenting that information, then use it for that. Perhaps your 'team' could be divided in a way where one group is focused on data collection, another on data manipulation and another on presentation.

Whatever you decide to do, I, for one, would be interested in your final decision and why you chose to go that route.
(Aug-07-2017, 08:58 PM)sparkz_alot Wrote: If, as ichabod801 points out, that R is better at data manipulation

I didn't say that. When I want data manipulation, I go to SAS. R is better at statistics than data manipulation. I am not up to speed on the newer Python packages to compare them to SAS or R on data manipulation. But R is made for statistics, and is big in academia, so the newer statistical methods tend to get into R first. But neither R nor SAS is a good general purpose language.
R is better at statistics than data manipulation
My apologies for misquoting, ichabod.
(Aug-08-2017, 11:41 AM)sparkz_alot Wrote: My apologies for misquoting, ichabod.

No problem, sparkz. I don't expect everyone to automatically clue in on my picky distinctions between statistical programming languages.