Python Forum
Performance options for sys.stdout.writelines
#1
Pythonistas,

I am trying to improve the performance of a simple script that reads and compares two 20KB .dat files. The diff assignment takes 42 seconds and the sys.stdout.writelines call 82 seconds. Is there an approach to shaving down these times using standard library features? Would concurrent.futures work? It would be nice to have a settable worker pool like the one in the multiprocessing module, e.g.:

with multiprocessing.Pool(processes=2) as pool:
    ...
Is there a way to use multiple threads or processes that can be "set" depending on the size of the .dat files?

import difflib
import sys

fromfile = "some1.dat"
tofile = "some2.dat"

with open(fromfile, "r") as f:
    fromlines = f.readlines()
with open(tofile, "r") as f:
    tolines = f.readlines()

# takes 42 secs
diff = difflib.HtmlDiff().make_file(fromlines, tolines, fromfile, tofile)

# takes 82 secs
sys.stdout.writelines(diff)
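To spell out what I mean by "settable", something like the sketch below, where the number of workers is picked from the file size. This is untested; the numbers (5 MB per worker, cap of 8) and the work() function are placeholders I made up just to show the shape of it.

import os
from concurrent.futures import ProcessPoolExecutor

def pick_workers(path, mb_per_worker=5, cap=8):
    # one worker per ~5 MB of input, at least 1, at most `cap` (numbers are arbitrary)
    size_mb = os.path.getsize(path) / 2**20
    return max(1, min(cap, int(size_mb // mb_per_worker) + 1))

def work(chunk):
    # placeholder for whatever per-chunk processing would be done
    return len(chunk)

if __name__ == "__main__":
    chunks = [["line\n"] * 1000 for _ in range(4)]   # stand-in data
    with ProcessPoolExecutor(max_workers=pick_workers("some1.dat")) as executor:
        results = list(executor.map(work, chunks))
    print(results)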
Best,
Dave
#2
I don't know if threading will help; 20 KB files are very small, and I can't imagine it taking 42 seconds to read them unless you are using a very slow processor. 20 KB should read in a fraction of a second.
Take a look at: https://stackoverflow.com/a/19007888
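Before reaching for threads, it is worth timing each stage separately so you know where the time actually goes. A rough sketch, reusing the file names from your script (timings go to stderr so they don't get mixed into the diff output):

import difflib
import sys
import time

def timed(label, func, *args):
    # run one stage and report how long it took on stderr
    start = time.perf_counter()
    result = func(*args)
    print(f"{label}: {time.perf_counter() - start:.2f}s", file=sys.stderr)
    return result

with open("some1.dat") as f1, open("some2.dat") as f2:
    fromlines = timed("read file 1", f1.readlines)
    tolines = timed("read file 2", f2.readlines)

diff = timed("make_file", difflib.HtmlDiff().make_file,
             fromlines, tolines, "some1.dat", "some2.dat")
timed("writelines", sys.stdout.writelines, diff)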
#3
These 2 PHP files are < 8 KB and they only have 5 lines which are not the same.
This takes zero-point-something seconds to complete, about as long as it takes to press Enter.
Maybe buy a new computer? I recommend Ryzen R9 processors!

path1 = '/var/www/html/20BE1cw/20BE1sW3.html.php'
path2 = '/var/www/html/20BE2cw/20BE2sW3.html.php'

with open(path1) as f1, open(path2) as f2:
    lines1 = f1.readlines()
    lines2 = f2.readlines()

# make data1 the shorter (or equal) list, or you will get an index error
# when you get to the end of the shorter file.
if len(lines1) <= len(lines2):
    data1 = lines1
    data2 = lines2
else:
    data1 = lines2
    data2 = lines1

diff = []

for d in range(len(data1)):
    if data1[d] != data2[d]:
        diff.append(data2[d])
Output:
>>> len(diff)
5
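The same comparison can also be written with zip, which stops at the end of the shorter list by itself, so the length juggling above isn't needed. Just a variation on the same idea, using the lines1 and lines2 lists from the snippet above:

# zip stops at the shorter list, so there is no index bookkeeping to do
diff = [line2 for line1, line2 in zip(lines1, lines2) if line1 != line2]
print(len(diff))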
#4
(Aug-20-2022, 07:52 AM)Pedroski55 Wrote: Maybe buy a new computer? I recommend Ryzen R9 processors!
Python has a built-in profiler (the profile module). Strengthen your arguments by producing profiler output!
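A minimal pattern, where the sum(...) line is just a stand-in for whatever code you want to measure:

import cProfile
import pstats

profiler = cProfile.Profile()
profiler.enable()
sum(range(1_000_000))        # stand-in workload: replace with the code being measured
profiler.disable()
pstats.Stats(profiler).sort_stats("cumulative").print_stats(10)   # ten most expensive calls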
#5
According to cProfile no time passes, exactly the same as for objects travelling at the speed of light!

But I got an error, don't know why!

import cProfile

def myApp():    
    path1 = '/var/www/html/20BE1cw/20BE1sW3.html.php'
    path2 = '/var/www/html/20BE2cw/20BE2sW3.html.php'

    with open(path1) as f1, open(path2) as f2:
        lines1 = f1.readlines()
        lines2 = f2.readlines()

    if len(lines1) == len(lines2):
        data1 = lines1
        data2 = lines2
    elif len(lines1) > len(lines2):
        data1 = lines1
        data2 = lines2
    else:
        data1 = lines2
        data2 = lines1

    diff = []

    for d in range(len(data1)):
        if not data1[d] == data2[d]:
            diff.append(data2[d])

    for line in diff:
        print(line)
                    
cProfile.run(myApp())
Output:
>>> cProfile.run(myApp())
This is for <b> 20BE2 students </b> , Summer Term 2022. <br>
This webpage will switch off on March 09 2022 17:10:00. <br>
<form action="php/getcw20BE2sW3.php" method="POST" name="myForm" onsubmit="return checkForExpiration();" >
var terminate = new Date("March 09 2022 17:10:00");
expMsg.innerHTML = "Sorry, but I said send the CLASSWORK before March 09 2022 17:10:00 today. You are too late.";
         2 function calls in 0.000 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    0.000    0.000 {built-in method builtins.exec}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
Error:
Traceback (most recent call last):
  File "/usr/lib/python3.8/idlelib/run.py", line 559, in runcode
    exec(code, self.locals)
  File "<pyshell#3>", line 1, in <module>
  File "/usr/lib/python3.8/cProfile.py", line 16, in run
    return _pyprofile._Utils(Profile).run(statement, filename, sort)
  File "/usr/lib/python3.8/profile.py", line 53, in run
    prof.run(statement)
  File "/usr/lib/python3.8/cProfile.py", line 95, in run
    return self.runctx(cmd, dict, dict)
  File "/usr/lib/python3.8/cProfile.py", line 100, in runctx
    exec(cmd, globals, locals)
TypeError: exec() arg 1 must be a string, bytes or code object
My laptop only has an R7 processor, but it's pretty fast! Ubuntu boots in about 10 seconds!
#6
Try it with
cProfile.run(myApp())
changed to
cProfile.run("myApp()")
#7
Sorry guys, the files are 20 MB (not KB). Anyway, there may be even larger files, so I wanted to see if there was a simple way to speed up the sys.stdout.writelines portion of the original code.

All the Best,
David
#8
(Aug-22-2022, 04:52 PM)dgrunwal Wrote: I wanted to see if there was a simple way to expedite processing the sys.stdout.writelines portion
If you send 20 MB to sys.stdout, the time that it takes depends on what sys.stdout actually is. If you are writing to a terminal, it will take a lot of time. On the other hand, if sys.stdout is a file in RAM, it will take very little time. There are many other options, such as a pipe, a socket, or whatever. I'm not sure the performance issue is on the Python side.
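A quick way to see this for yourself: time the exact same writelines call against different targets. This is only a sketch; the diff list here is fake data standing in for the ~20 MB of HTML output:

import io
import sys
import time

def time_write(target, lines, label):
    # time writelines() against a given file-like object
    start = time.perf_counter()
    target.writelines(lines)
    print(f"{label}: {time.perf_counter() - start:.2f}s", file=sys.stderr)

diff = ["x" * 80 + "\n"] * 200_000        # roughly 16 MB of fake lines

time_write(io.StringIO(), diff, "in-memory buffer")
with open("diff_output.html", "w") as f:
    time_write(f, diff, "file on disk")
time_write(sys.stdout, diff, "terminal (when stdout really is a terminal)")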
#9
All I was able to get was about a 20% improvement in the script; it seems sys.stdout.writelines takes the longest. Not sure if it can be made faster in case larger files come along.

Dave

import difflib
import concurrent.futures
import sys
import time
import os

fromfile = "some20MB.dat"
tofile = "another20MB.dat"

with open(fromfile, "r") as f:
    fromlines = f.readlines()
with open(tofile, "r") as f:
    tolines = f.readlines()

# note: the executor is created but never given any work,
# so make_file still runs in the main process
with concurrent.futures.ProcessPoolExecutor() as executor:
    diff = difflib.HtmlDiff().make_file(fromlines, tolines, fromfile, tofile)

# same here: writelines runs in the main process
with concurrent.futures.ProcessPoolExecutor() as executor:
    sys.stdout.writelines(diff)
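For reference, actually handing the diff job to the executor would look more like the sketch below. I have not measured it, and since all the lines have to be pickled over to the worker process and back, it may not help at all.

# on Windows/macOS this needs to live under an  if __name__ == "__main__":  guard
with concurrent.futures.ProcessPoolExecutor(max_workers=2) as executor:
    future = executor.submit(difflib.HtmlDiff().make_file,
                             fromlines, tolines, fromfile, tofile)
    diff = future.result()     # blocks until the worker process finishes

sys.stdout.writelines(diff)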
#10
(Aug-22-2022, 05:57 PM)dgrunwal Wrote: seems the sys.stdout.writelines takes the longest.
See if it runs faster if you redirect stdout to a file, for example with the command
Output:
python myprogram.py > output.txt
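Or skip stdout entirely and write the diff straight to a file from inside the script (the file name here is just an example):

with open("output.html", "w") as out:
    out.writelines(diff)     # same diff list, written to a file instead of the terminal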

