 An architecture choice to build reports on top of Django
Hi everyone,
I am quite new to this forum, but I believe there are a lot of nice Python fellows here, so I came with a problem.

Here's the deal:
I have a typical Django app with about 30 models, so quite a small one. The issue I'm facing is that customers need a lot of different reports based on data stored in PostgreSQL. Each report pulls from many tables, which makes it heavy enough to slow down the app, and a 10-second wait for a report response is not acceptable. Even after optimizing the Django ORM queries, I would rather not drop down to plain SQL if I can avoid it.

The idea I have in mind is that we can duplicate some data and store pieces of the reports in separate tables, so the API for fetching those reports will be very fast, with a minimal number of read-only queries to the DB.

Are there any other ways of achieving my goal? Once again, the reports involve data from multiple tables and operations over it, and so far I can't see any other way than processing it at night. Where can I find more information about the architecture or best practices of systems like that?
Can you view the raw SQL that the ORM generates in any way? Some DB clients let you view currently running queries, or you can query the statistics tables to find long-running ones. That sort of thing should be your first step, to see if simply adding an index or two solves your problem. 10-second report times are completely unacceptable, so if indexes can't solve the problem, some sort of caching should definitely be used.
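(For the Django side specifically: `str(queryset.query)` shows the SQL a queryset will run, and `django.db.connection.queries` lists executed queries when `DEBUG` is on.) As a rough sketch of the index check itself, here is how `EXPLAIN` reveals whether a query scans the whole table or uses an index. This uses stdlib SQLite just so it runs anywhere; the table and column names are made up, and Postgres's `EXPLAIN` output looks different but tells you the same thing:

```python
# Sketch: does adding an index change the query plan?
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL)"
)

def plan(sql):
    # EXPLAIN QUERY PLAN rows are (id, parent, notused, detail); detail is the text
    return [row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]

query = "SELECT SUM(total) FROM orders WHERE customer_id = 42"
before = plan(query)   # something like ['SCAN orders']: a full table scan

conn.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")
after = plan(query)    # now mentions idx_orders_customer: an index search

print(before, after)
```

If the plan still says it scans the whole table after you add the index you expected it to use, that is the thing to investigate first.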

If all your reports are historical data (i.e. yesterday, last month, etc.) rather than the right-NOW, Today, The-Past-Five-Minutes sort of thing, you could fake an OLAP cube by pre-compiling sums/averages using whatever groupings you use, and storing those in a new table. That way, when you run a report, all you do is grab the pre-calculated values, which would be lightning fast. Just don't tell users you do that, because they have a tendency to freak out when you tell them you're doing all the math ahead of time and just showing them numbers that might not actually reflect the data anymore.
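A minimal sketch of that "fake OLAP cube" idea: pre-compute the grouped sums once (e.g. in a nightly job) and serve reports from the rollup table. All table and column names here are invented, and stdlib SQLite stands in for Postgres:

```python
# Sketch: pre-aggregate once, then report-time queries hit only the rollup.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sales (day TEXT, region TEXT, amount REAL);
INSERT INTO sales VALUES
  ('2023-01-01', 'north', 10.0),
  ('2023-01-01', 'north', 5.0),
  ('2023-01-01', 'south', 7.5);

-- The "cube": one row per (day, region) grouping, built ahead of time.
CREATE TABLE sales_rollup AS
  SELECT day, region, SUM(amount) AS total, AVG(amount) AS average
  FROM sales GROUP BY day, region;
""")

# Report time: a single cheap lookup instead of scanning the big table.
row = conn.execute(
    "SELECT total, average FROM sales_rollup WHERE day = ? AND region = ?",
    ("2023-01-01", "north"),
).fetchone()
print(row)  # (15.0, 7.5)
```

The trade-off is exactly the one described above: the rollup is only as fresh as the last time you rebuilt it.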

To summarize:
  • Check if indexes can save the day.
  • Eliminate everything you can from the ORM (extra fields/groupings/math/anything). The less work you make the database do, the better.
  • Consider doing things asynchronously. If you can run 5 queries at the same time, instead of having 4 of them waiting for the first to finish, you'll get the same data, the same way, but much faster. ...I don't actually know if Django's ORM is thread safe, but coroutines are an equally valid option.
  • Cache some (or all) of the results, or parts of how you get those results (join a pre-calculated table instead of a massive Transactions table, for example).
  • Rewrite the query in SQL. You might not like it, but sometimes when it comes to "complicated" things, an ORM just won't cut it.
Hi @nilamo,
Thank you for reply, I appreciate it.

Optimizing queries, raw SQL, and caching are the way to go, but I would like to explore more options just in case.

You mentioned something interesting: an OLAP cube. I would like to understand it a little better. Let's say I want to do some analysis on the data that is stored in SQL. These are not online reports in terms of minutes or hours; one day old is the maximum staleness we need. Would you use any OLAP servers in this case, and do you have any recommendations?
There are free demos available so you can try one out. I've never used a cube in a real production environment, but I've created a few and messed around with them enough to be aware of the idea behind them. The thing with a cube is that it's not possible to get information about individual things. EVERYTHING is rolled up in various ways, depending on how you set it up (grouped by time period, product type, salesperson, that sort of thing). "Creating" a cube is really just running a wizard that runs a bunch of sum()/avg()/group-by queries and stores the results, so later queries are very fast.

...They're mainly used to hand to end users (who access the data via something like Excel) so they can build their own reports. Sort of for when they have a different idea of how they want to look at the data each day, instead of a set report that they always go to.

It's neat, but with a little work you can display any information in a multidimensional array like a cube does. The one I played with (Microsoft's version) was basically everything rolled up into one number, and then you add columns and rows to "drill down" into that number, to break it apart by month/week/day of the week/hour, and then again by department/salesperson/product type/product.

I was unimpressed, but that was probably because when someone says "OLAP cube" I expect it to do a whole lot more than just pre-calculate summarized data and make it available to Excel, lol.
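That "everything rolled up into one number, then drill down" behavior can be sketched in a few lines of plain Python. This is just an illustration with made-up rows, not anything a real OLAP server does internally:

```python
# Sketch: one grand total, then the same rows re-grouped along whichever
# dimensions you want to "drill down" into.
from collections import defaultdict

rows = [
    {"month": "Jan", "dept": "toys",  "amount": 100},
    {"month": "Jan", "dept": "books", "amount": 40},
    {"month": "Feb", "dept": "toys",  "amount": 60},
]

def rollup(rows, *dims):
    totals = defaultdict(float)
    for r in rows:
        key = tuple(r[d] for d in dims)
        totals[key] += r["amount"]
    return dict(totals)

print(rollup(rows))                   # {(): 200.0} -- everything in one number
print(rollup(rows, "month"))          # {('Jan',): 140.0, ('Feb',): 60.0}
print(rollup(rows, "month", "dept"))  # Jan broken apart by department too
```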
Reading the table, or part of it, into memory is one way to shorten the response time.

But first of all, are you sure the queries are causing the problem? Try it locally: run a bunch of queries in the terminal and see how it does. A short bash script should do the job well enough. Time it. Bad Django code may slow down the response time too. Try a different template engine, or try to optimize your use of the current one.
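A tiny sketch of that "time it outside Django" step, so you can tell whether the database or the Python/template layer is the bottleneck (stdlib SQLite and a made-up table, just to show the timing pattern):

```python
# Sketch: time a single query in isolation, with no web framework involved.
import sqlite3
import time

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (n INTEGER)")
conn.executemany("INSERT INTO t VALUES (?)", [(i,) for i in range(100_000)])

start = time.perf_counter()
total = conn.execute("SELECT SUM(n) FROM t").fetchone()[0]
elapsed = time.perf_counter() - start
print(f"SUM over 100k rows: {total} in {elapsed:.4f}s")
```

If the raw query is fast here but the page is slow, the time is going somewhere in the app, not the database.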
