Understand order of magnitude performance gap between python and C++
#1
**Summary:** I observe a ~1000x performance gap between a Python code and a C++ code doing the same job, despite the use of parallelization, vectorization, and just-in-time compilation to machine code with Numba, in the context of scientific computing. The CPU is not fully used, and I don't understand why.

Hello everybody,

I just started in a laboratory that simulates various materials, including the growth of biological-like tissues. To do that, we create a 3D version of the tissue (a collection of vertices stored in a numpy array) and apply different functions to it to mimic the physics/biology.

We have a C++ code doing just that, which takes approximately 10 seconds to run. Someone converted that code to Python, but this version takes about 2.5 hours to process. We tried every trick in the book we knew of to accelerate it. We used Numba to speed up numpy where appropriate, parallelized the code as much as we could, and vectorized what could be vectorized, but the gap remains. In fact, an earlier version of the code took days to run.
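To give an idea of how we use Numba, here is a minimal sketch; the function name and the per-vertex update are made up for illustration, not our actual physics:

```python
import numpy as np
from numba import njit, prange

@njit(parallel=True, fastmath=True)
def apply_forces(vertices, forces, dt):
    # Hypothetical per-vertex update; prange lets Numba spread
    # the outer loop across all available cores.
    for i in prange(vertices.shape[0]):
        for k in range(3):
            vertices[i, k] += dt * forces[i, k]

vertices = np.random.rand(100_000, 3)
forces = np.random.rand(100_000, 3)
apply_forces(vertices, forces, 1e-3)  # first call JIT-compiles, later calls are fast
```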

When the code executes, multiple cores are used, as monitored with the built-in system monitor. However, they are not used at full capacity, and in fact deactivating cores manually does not seem to hurt performance much. At first I thought it could be due to the GIL, but releasing it had no effect on performance either. This makes me suspect a bottleneck in memory transfer between the CPU and the RAM, but I cannot understand why the C++ version would not have the same problem. I also have the feeling that there is a cost to calling functions: one of my earlier tasks was to refactor the code, decomposing complicated functions into smaller ones, and since then I have seen a small performance degradation compared to the earlier version.

I am really wondering where the bottleneck is and how it could be identified and removed. Any idea would be very welcome.

I am aware my question is a complicated one, so let me know if you need additional information; I would be happy to provide it.
#2
Compiled languages are typically faster than interpreted or hybrid ones, depending on factors such as the efficiency of the compiler, the bytecode interpreter, etc. As you learn Python, you find that doing things the "pythonic" way can be quite fast: list comprehensions, Pandas operations, and so on.
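To illustrate (a toy example of my own, not the OP's workload), the same computation can differ enormously in speed depending on how it is expressed:

```python
import numpy as np

data = np.random.rand(1_000_000)

# Slow: an explicit Python loop, one interpreted iteration per element
total = 0.0
for x in data:
    total += x * x

# Fast: the same reduction vectorized in NumPy, running in compiled C
total = np.dot(data, data)
```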

You started with a program optimized for C++. Translation of that to any other language is almost guaranteed to be less efficient.

So the first thing I would do is take the Python version, put it up on Google Colab, and run it with GPU or TPU hardware acceleration to see if that perks things up. In one online course, that ran faster than using IBM's Watson. Meanwhile, I would start a rewrite in Python from scratch that uses pythonic code.
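If you do try GPU acceleration, CuPy mirrors much of the NumPy API, so a first experiment can be fairly mechanical. A minimal sketch, assuming a CUDA-capable GPU and the cupy package are available (the update itself is a placeholder):

```python
import numpy as np
import cupy as cp  # requires a CUDA-capable GPU

vertices = np.random.rand(1_000_000, 3).astype(np.float32)

v_gpu = cp.asarray(vertices)      # copy the array to GPU memory
v_gpu += 1e-3 * cp.sin(v_gpu)     # element-wise placeholder work, runs on the GPU
vertices = cp.asnumpy(v_gpu)      # copy back only when the result is needed
```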
#3
Have you tried PyPy? https://www.pypy.org/features.html

Another option is to determine whether a calculation-intensive section of the code is causing a bottleneck, and to call your fast C++ code from Python through a DLL (shared library).
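A minimal ctypes sketch of that approach (the library and function names here are hypothetical; the C++ side has to be compiled with a C-compatible interface, e.g. extern "C"):

```python
import ctypes
import numpy as np

# Hypothetical compiled C++ library exposing:
#     extern "C" void step(double* vertices, int n, double dt);
lib = ctypes.CDLL("./libtissue.so")  # e.g. "tissue.dll" on Windows

lib.step.argtypes = (ctypes.POINTER(ctypes.c_double),
                     ctypes.c_int, ctypes.c_double)
lib.step.restype = None

vertices = np.random.rand(100_000, 3)  # must be C-contiguous
ptr = vertices.ctypes.data_as(ctypes.POINTER(ctypes.c_double))
lib.step(ptr, vertices.shape[0], 1e-3)  # the hot loop runs in C++
```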
#4
Interesting. I would expect a 10x to 100x performance difference with no optimization at all. 1000x is surprising, and leads me to think that your optimizations are making matters worse, or that you have a nasty logic bug in your Python program.
#5
Thanks for all your inputs.

I also expected a 100x performance degradation at most. The current code is partially optimized using Numba; the original version took days to compute. The code is a simulation originally written by a physicist, so time is simulated in very small steps over the whole simulation function, which I know is not where Python shines. Still, the code is fairly pythonic where it can be (heavy use of numpy built-in functions and vectorization).
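Schematically, the structure is a long loop of tiny steps. One pattern I try to follow is keeping the entire time loop inside a single jitted function, so the interpreter overhead is not paid on every step. A toy sketch (the update is a stand-in, not the real physics):

```python
import numpy as np
from numba import njit

@njit
def run_simulation(state, dt, n_steps):
    # The whole time loop is compiled, so the interpreter is
    # only involved once, not once per tiny step.
    for _ in range(n_steps):
        state += dt * np.sin(state)  # stand-in for the real update
    return state

state = np.random.rand(50_000)
state = run_simulation(state, 1e-4, 10_000)
```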

I have not had a go at PyPy, but I did try Cython. The optimisation gain was similar to the one I got from Numba, still an order of magnitude less than what I would need.

Maybe the question could be rephrased more generally: is it possible for Python to achieve performance similar to C++ (within 10x) for simulations with small time steps? And if so, what should one be aware of?

I profiled the code and identified the main functions responsible for the lack of performance. However, I could not yet understand why these functions don't make use of the available CPU power despite parallelization.
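For what it is worth, this is roughly how I profiled, using only the standard library (main() is a placeholder for the simulation entry point). One caveat: cProfile only sees Python-level calls, so a Numba-compiled function shows up as a single opaque entry.

```python
import cProfile
import pstats

def main():
    ...  # placeholder for the simulation entry point

cProfile.run("main()", "profile.out")
stats = pstats.Stats("profile.out")
stats.sort_stats("cumulative").print_stats(20)  # 20 most expensive calls
```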