Python Forum
Performance options for sys.stdout.writelines
#1
Pythonistas,

I am trying to improve the performance of a simple script that reads and compares two 20KB .dat files. The diff assignment takes 42 seconds and the sys.stdout.writelines call 82 seconds. Is there an approach to shaving down these times using standard library features? Would concurrent.futures work? It would be nice to have a settable worker pool like the one in the multiprocessing module, e.g.:

with multiprocessing.Pool(processes=2) as pool:
    ...
Is there a way to use multiple threads or processes that can be "set" depending on the size of the .dat files?

import difflib
import sys

fromfile = "some1.dat"
tofile = "some2.dat"

with open(fromfile, "r") as f:
    fromlines = f.readlines()
with open(tofile, "r") as f:
    tolines = f.readlines()

# takes 42 secs
diff = difflib.HtmlDiff().make_file(fromlines, tolines, fromfile, tofile)

# takes 82 secs
sys.stdout.writelines(diff)
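To spell out what I mean by "settable", something like the sketch below, where the number of workers is picked from the file size. This is untested; the numbers (5 MB per worker, cap of 8) and the work() function are placeholders I made up just to show the shape of it.

import os
from concurrent.futures import ProcessPoolExecutor

def pick_workers(path, mb_per_worker=5, cap=8):
    # one worker per ~5 MB of input, at least 1, at most `cap` (numbers are arbitrary)
    size_mb = os.path.getsize(path) / 2**20
    return max(1, min(cap, int(size_mb // mb_per_worker) + 1))

def work(chunk):
    # placeholder for whatever per-chunk processing would be done
    return len(chunk)

if __name__ == "__main__":
    chunks = [["line\n"] * 1000 for _ in range(4)]   # stand-in data
    with ProcessPoolExecutor(max_workers=pick_workers("some1.dat")) as executor:
        results = list(executor.map(work, chunks))
    print(results)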
Best,
Dave
#2
I don't know if threading will help; 20 KB files are very small, and I can't imagine it taking 42 seconds to read them unless you are using a very slow processor. 20 KB should read in a fraction of a second.
Take a look at: https://stackoverflow.com/a/19007888
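Before reaching for threads, it is worth timing each stage separately so you know where the time actually goes. A rough sketch, reusing the file names from your script (timings go to stderr so they don't get mixed into the diff output):

import difflib
import sys
import time

def timed(label, func, *args):
    # run one stage and report how long it took on stderr
    start = time.perf_counter()
    result = func(*args)
    print(f"{label}: {time.perf_counter() - start:.2f}s", file=sys.stderr)
    return result

with open("some1.dat") as f1, open("some2.dat") as f2:
    fromlines = timed("read file 1", f1.readlines)
    tolines = timed("read file 2", f2.readlines)

diff = timed("make_file", difflib.HtmlDiff().make_file,
             fromlines, tolines, "some1.dat", "some2.dat")
timed("writelines", sys.stdout.writelines, diff)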
#3
These 2 PHP files are < 8 KB and they only have 5 lines which are not the same.
This takes zero-point-something seconds to complete, about as long as it takes to press Enter.
Maybe buy a new computer? I recommend Ryzen R9 processors!

path1 = '/var/www/html/20BE1cw/20BE1sW3.html.php'
path2 = '/var/www/html/20BE2cw/20BE2sW3.html.php'

with open(path1) as f1, open(path2) as f2:
    lines1 = f1.readlines()
    lines2 = f2.readlines()

# make data1 the shorter (or equal) list, or you will get an index error
# when you get to the end of the shorter file.
if len(lines1) <= len(lines2):
    data1 = lines1
    data2 = lines2
else:
    data1 = lines2
    data2 = lines1

diff = []

for d in range(len(data1)):
    if data1[d] != data2[d]:
        diff.append(data2[d])
Output:
>>> len(diff)
5
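The same comparison can also be written with zip, which stops at the end of the shorter list by itself, so the length juggling above isn't needed. Just a variation on the same idea, using the lines1 and lines2 lists from the snippet above:

# zip stops at the shorter list, so there is no index bookkeeping to do
diff = [line2 for line1, line2 in zip(lines1, lines2) if line1 != line2]
print(len(diff))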
#4
(Aug-20-2022, 07:52 AM)Pedroski55 Wrote: Maybe buy a new computer? I recommend Ryzen R9 processors!
Python has a built-in profiler (the profile module). Strengthen your arguments by producing profiler output!
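A minimal pattern, where the sum(...) line is just a stand-in for whatever code you want to measure:

import cProfile
import pstats

profiler = cProfile.Profile()
profiler.enable()
sum(range(1_000_000))        # stand-in workload: replace with the code being measured
profiler.disable()
pstats.Stats(profiler).sort_stats("cumulative").print_stats(10)   # ten most expensive calls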
#5
According to cProfile no time passes, exactly the same as for objects travelling at the speed of light!

But I got an error, don't know why!

import cProfile

def myApp():    
    path1 = '/var/www/html/20BE1cw/20BE1sW3.html.php'
    path2 = '/var/www/html/20BE2cw/20BE2sW3.html.php'

    with open(path1) as f1, open(path2) as f2:
        lines1 = f1.readlines()
        lines2 = f2.readlines()

    if len(lines1) == len(lines2):
        data1 = lines1
        data2 = lines2
    elif len(lines1) > len(lines2):
        data1 = lines1
        data2 = lines2
    else:
        data1 = lines2
        data2 = lines1

    diff = []

    for d in range(len(data1)):
        if not data1[d] == data2[d]:
            diff.append(data2[d])

    for line in diff:
        print(line)
                    
cProfile.run(myApp())
Output:
>>> cProfile.run(myApp())
This is for <b> 20BE2 students </b> , Summer Term 2022. <br>
This webpage will switch off on March 09 2022 17:10:00. <br>
<form action="php/getcw20BE2sW3.php" method="POST" name="myForm" onsubmit="return checkForExpiration();" >
var terminate = new Date("March 09 2022 17:10:00");
expMsg.innerHTML = "Sorry, but I said send the CLASSWORK before March 09 2022 17:10:00 today. You are too late.";
         2 function calls in 0.000 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    0.000    0.000 {built-in method builtins.exec}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
Error:
Traceback (most recent call last):
  File "/usr/lib/python3.8/idlelib/run.py", line 559, in runcode
    exec(code, self.locals)
  File "<pyshell#3>", line 1, in <module>
  File "/usr/lib/python3.8/cProfile.py", line 16, in run
    return _pyprofile._Utils(Profile).run(statement, filename, sort)
  File "/usr/lib/python3.8/profile.py", line 53, in run
    prof.run(statement)
  File "/usr/lib/python3.8/cProfile.py", line 95, in run
    return self.runctx(cmd, dict, dict)
  File "/usr/lib/python3.8/cProfile.py", line 100, in runctx
    exec(cmd, globals, locals)
TypeError: exec() arg 1 must be a string, bytes or code object
My laptop only has an R7 processor, but it's pretty fast! Ubuntu boots in about 10 seconds!
#6
Try it with
cProfile.run(myApp())
changed to
cProfile.run("myApp()")
#7
Sorry guys, the files are 20 MB (not KB). Anyway, there may be even larger files, so I wanted to see if there was a simple way to speed up the sys.stdout.writelines portion of the original code.

All the Best,
David
#8
(Aug-22-2022, 04:52 PM)dgrunwal Wrote: I wanted to see if there was a simple way to expedite processing the sys.stdout.writelines portion
If you send 20 MB to sys.stdout, the time that it takes depends on what sys.stdout actually is. If you are writing to a terminal, it will take a lot of time. On the other hand, if sys.stdout is a file in RAM, it will take very little time. There are many other options, such as a pipe, a socket, or whatever. I'm not sure the performance issue is on the Python side.
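A quick way to see this for yourself: time the exact same writelines call against different targets. This is only a sketch; the diff list here is fake data standing in for the ~20 MB of HTML output:

import io
import sys
import time

def time_write(target, lines, label):
    # time writelines() against a given file-like object
    start = time.perf_counter()
    target.writelines(lines)
    print(f"{label}: {time.perf_counter() - start:.2f}s", file=sys.stderr)

diff = ["x" * 80 + "\n"] * 200_000        # roughly 16 MB of fake lines

time_write(io.StringIO(), diff, "in-memory buffer")
with open("diff_output.html", "w") as f:
    time_write(f, diff, "file on disk")
time_write(sys.stdout, diff, "terminal (when stdout really is a terminal)")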
#9
All I was able to get was about a 20% improvement in the script; it seems sys.stdout.writelines takes the longest. Not sure if it can be made faster in case larger files come along.

Dave

import difflib
import concurrent.futures
import sys
import time
import os

fromfile = "some20MB.dat"
tofile = "another20MB.dat"

with open(fromfile, "r") as f:
    fromlines = f.readlines()
with open(tofile, "r") as f:
    tolines = f.readlines()

# note: the executor is created but never given any work,
# so make_file still runs in the main process
with concurrent.futures.ProcessPoolExecutor() as executor:
    diff = difflib.HtmlDiff().make_file(fromlines, tolines, fromfile, tofile)

# same here: writelines runs in the main process
with concurrent.futures.ProcessPoolExecutor() as executor:
    sys.stdout.writelines(diff)
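For reference, actually handing the diff job to the executor would look more like the sketch below. I have not measured it, and since all the lines have to be pickled over to the worker process and back, it may not help at all.

# on Windows/macOS this needs to live under an  if __name__ == "__main__":  guard
with concurrent.futures.ProcessPoolExecutor(max_workers=2) as executor:
    future = executor.submit(difflib.HtmlDiff().make_file,
                             fromlines, tolines, fromfile, tofile)
    diff = future.result()     # blocks until the worker process finishes

sys.stdout.writelines(diff)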
#10
(Aug-22-2022, 05:57 PM)dgrunwal Wrote: seems the sys.stdout.writelines takes the longest.
See if it runs faster if you redirect stdout to a file, for example with the command
Output:
python myprogram.py > output.txt
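Or skip stdout entirely and write the diff straight to a file from inside the script (the file name here is just an example):

with open("output.html", "w") as out:
    out.writelines(diff)     # same diff list, written to a file instead of the terminal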

