So I am having to process a few different text files; for each file I process, I need to capture the runtime duration, record count, and a timestamp.
Since I'm already using the scan_files() function, is it possible to incorporate the mapcount() functionality into the main function?
I mean, if I'm already opening the first text file, why not get the count returned while it's open? So far, just adding the readline logic below the strip() in the main logic isn't working.
# ROUTINE TO OPEN THE APPROPRIATE TEXT FILES TO PROCESS THE IP LIST
import os

def scan_files():
    directory = '.'
    for entry in os.scandir(directory):
        if entry.is_file() and entry.name.endswith('.txt'):
            if 'ip_list' in entry.name:
                pt = directory + '/' + entry.name
                with open(pt) as file:
                    for ip in file:
                        yield ip.strip()
# ROUTINE TO GET FILE LINE COUNT
import mmap

def mapcount(filename):
    # open in binary read mode and map read-only; "r+" is only needed
    # if you want the mapping to be writable
    with open(filename, "rb") as f:
        buf = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        lines = 0
        readline = buf.readline
        while readline():
            lines += 1
        return lines
I want to open the file once, get the count and move on to the rest of the script.
There is a problem with the interface: to client code, scan_files() is nothing but an iterable of IPs. Suppose it also computes the number of lines in the files; how is it going to output that number?
Also, the time taken by the generator to run depends on the client code. If client code requests the next IP every 10 minutes, the generator will take a very long time to consume.
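One way around that interface problem is to let the caller pass in a mutable container that the generator fills with per-file line counts as it yields. This is only a sketch of that idea, not code from the thread; the `counts` dict and the `directory` parameter are assumptions:

```python
import os

def scan_files(counts, directory='.'):
    """Yield stripped IPs from every *ip_list*.txt file in directory,
    recording per-file line counts in the caller-supplied counts dict."""
    for entry in os.scandir(directory):
        if entry.is_file() and entry.name.endswith('.txt') and 'ip_list' in entry.name:
            counts[entry.name] = 0
            with open(entry.path) as file:
                for ip in file:
                    counts[entry.name] += 1
                    yield ip.strip()

if __name__ == '__main__':
    counts = {}
    for ip in scan_files(counts):
        pass  # process each IP here
    print(counts)  # per-file line counts, valid once iteration completes
```

The counts are only complete after the iteration finishes, so this still shares the timing caveat above: the dict fills in as the client consumes the generator.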
(Aug-13-2023, 05:00 PM)cubangt Wrote: So I am having to process a few different text files; for each file I process, I need to capture the runtime duration, record count, and a timestamp.
You should link to the other thread or continue there, because I guess all these additions should work with the code already written.
Quote: Since I'm already using the scan_files() function, is it possible to incorporate the mapcount() functionality into the main function?
You have to change the scan_files() function and also try to fit this in with the existing code.
So a start could be something like this.
import time, os
import subprocess
from concurrent.futures import ThreadPoolExecutor
import pandas as pd

def ping(ip):
    return (
        ip,
        subprocess.run(
            f"ping {ip} -n 1", stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
        ).returncode,
    )

def scan_files():
    directory = 'G:/div_code/egg/ping'
    ip_files = []
    for entry in os.scandir(directory):
        if entry.is_file() and entry.name.endswith('.txt'):
            if 'ip_list' in entry.name:
                pt = directory + '/' + entry.name
                ip_files.append(pt)
    return ip_files
if __name__ == '__main__':
    df_lst = []
    for fname in scan_files():
        start = time.time()
        with open(fname) as file:
            park = [ip.strip() for ip in file]
        executor = ThreadPoolExecutor(12)
        df = pd.DataFrame(executor.map(ping, park))
        #df.to_csv(r'ip_output.csv', header=False, index=False, quoting=None)
        print(df)
        end = time.time()
        time_used = end - start
        df_lst.append(df)
        df_lst.append(f'File <{fname}> used {time_used:.2f} sec')
        print(f'File <{fname}> used {time_used:.2f} sec')
Output:
0 1
0 python-forum.io 0
1 youtube.com 0
2 youtube.com99 1
3 www.vg.no 0
4 python-forum.io99 1
File <G:/div_code/egg/ping/ip_list1.txt> used 0.13 sec
0 1
0 python-forum.io 0
1 youtube.com 0
2 youtube.com99 1
3 www.vg.no 0
4 python-forum.io99 1
File <G:/div_code/egg/ping/ip_list2.txt> used 0.07 sec
The output above is also stored in the df_lst list.
So it records the time taken on each file, and the Pandas index works as a record count for each file.
Since what's in df_lst is still Pandas objects, you can e.g. count the lines in the files, or divide the list length by 2 to get the file count.
>>> df_lst[0]
0 1
0 python-forum.io 0
1 youtube.com 0
2 youtube.com99 1
3 www.vg.no 0
4 python-forum.io99 1
>>> df_lst[0].count()
0 5
1 5
dtype: int64
>>> len(df_lst) // 2
2
>>> df_lst[0].count() + df_lst[2].count()
0 10
1 10
dtype: int64
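Since df_lst alternates a result DataFrame and a timing string, the DataFrames sit at the even indices and `len(df)` gives each file's record count directly. A small sketch of that aggregation, using made-up sample data in place of the real ping results:

```python
import pandas as pd

# df_lst alternates a result DataFrame and a summary string per file,
# so the DataFrames sit at even indices
df_lst = [
    pd.DataFrame({'address': ['a', 'b', 'c'], 'state': [0, 1, 0]}),
    'File <ip_list1.txt> used 0.13 sec',
    pd.DataFrame({'address': ['d', 'e'], 'state': [0, 0]}),
    'File <ip_list2.txt> used 0.07 sec',
]
frames = df_lst[::2]                   # every DataFrame
total = sum(len(df) for df in frames)  # record count across all files
print(total)  # 5
```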
I don't think it applies to this problem, but you can also pass information through the iterable and user function. In this example I pass the filename and the line number of the IP address in the file. The user function passes all this info along with the ping status and a timestamp.
import subprocess
from datetime import datetime
from pathlib import Path
from concurrent.futures import ThreadPoolExecutor
import pandas as pd

def ping(args):
    ip, file, counter = args
    returncode = subprocess.run(
        f"ping {ip} -n 1", stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
    ).returncode
    return ip, file, counter, datetime.now(), returncode

def scan_files():
    directory = Path('.')
    for name in directory.glob("*ip_list*.txt"):
        count = 0
        with open(name, "r") as file:
            for ip in file:
                count += 1
                yield ip.strip(), name.name, count

executor = ThreadPoolExecutor(125)
df = pd.DataFrame(executor.map(ping, scan_files()))
print(df)
Output:
0 1 2 3 4
0 python-forum.io ip_list.txt 1 2023-08-13 21:53:42.935575 0
1 youtube.com ip_list.txt 2 2023-08-13 21:53:42.929570 0
2 youtube.com99 ip_list.txt 3 2023-08-13 21:53:42.910397 1
3 www.vg.no ip_list.txt 4 2023-08-13 21:53:43.117031 0
4 python-forum.io99 ip_list.txt 5 2023-08-13 21:53:42.911398 1
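Because each row of that DataFrame carries the source file name, grouping by that column gives a per-file record count after the fact. A sketch with made-up rows mirroring the (ip, file, line_no, returncode) tuples above, with named columns for readability:

```python
import pandas as pd

# made-up sample rows; the real DataFrame also carries a timestamp column
df = pd.DataFrame(
    [
        ('python-forum.io', 'ip_list.txt', 1, 0),
        ('youtube.com', 'ip_list.txt', 2, 0),
        ('youtube.com99', 'ip_list2.txt', 1, 1),
    ],
    columns=['address', 'file', 'line_no', 'state'],
)
per_file = df.groupby('file').size()  # record count per source file
print(per_file)
```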
Here is code where I use Loguru, one of my favorite Python libraries.
So now everything is logged into the ip.log file.
It also logs a date/time stamp when the file is run, and has the line number, the name of the file, and the time used.
Just by looking at 19:00:07 and the next file's start at 19:00:16, you can see the time used is about 9 sec.
If you uncomment pd.set_option('display.max_rows', None), it will log all IPs and not just the head/tail as shown now.
import time, os
import subprocess
from concurrent.futures import ThreadPoolExecutor
import pandas as pd
#pd.set_option('display.max_rows', None)
from loguru import logger

logger.remove()  # Only info to file
logger.add("ip.log", rotation="2 day", format="{time:YYYY-MM-DD at HH:mm:ss}\n{message}")

def ping(ip):
    return (
        ip,
        subprocess.run(
            f"ping {ip} -n 1", stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
        ).returncode,
    )

def scan_files():
    directory = 'G:/div_code/egg/ping'
    ip_files = []
    for entry in os.scandir(directory):
        if entry.is_file() and entry.name.endswith('.txt'):
            if 'ip_list' in entry.name:
                pt = directory + '/' + entry.name
                ip_files.append(pt)
    return ip_files

if __name__ == '__main__':
    for fname in scan_files():
        start = time.time()
        with open(fname) as file:
            park = [ip.strip() for ip in file] * 100
        executor = ThreadPoolExecutor(12)
        df = pd.DataFrame(executor.map(ping, park), columns=["address", "state"])
        df.index += 1
        end = time.time()
        time_used = end - start
        logger.info(f'{df}\n File <{fname}> used {time_used:.2f} sec\n')
Output:
2023-08-14 at 19:00:07
address state
1 python-forum.io 0
2 youtube.com 0
3 youtube.com99 1
4 www.vg.no 0
5 python-forum.io99 1
.. ... ...
496 python-forum.io 0
497 youtube.com 0
498 youtube.com99 1
499 www.vg.no 0
500 python-forum.io99 1
[500 rows x 2 columns]
File <G:/div_code/egg/ping/ip_list1.txt> used 9.04 sec
2023-08-14 at 19:00:16
address state
1 python-forum.io 0
2 youtube.com 0
3 youtube.com99 1
4 www.vg.no 0
5 python-forum.io99 1
.. ... ...
496 python-forum.io 0
497 youtube.com 0
498 youtube.com99 1
499 www.vg.no 0
500 python-forum.io99 1
[500 rows x 2 columns]
File <G:/div_code/egg/ping/ip_list2.txt> used 8.53 sec
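Pulling the thread back to the original question, the three metrics asked for (record count, runtime duration, timestamp) can all be captured per file with a small wrapper around the processing loop. This is only a sketch; `process_file` is a hypothetical name, and the actual ping/ThreadPoolExecutor work from the posts above is stubbed out:

```python
import time
from datetime import datetime

def process_file(ips):
    """Return (record_count, duration_sec, start_timestamp) for one file's IPs.
    The per-IP work is stubbed out; plug in the ping/executor code here."""
    stamp = datetime.now()   # timestamp when processing of this file began
    start = time.time()
    count = 0
    for ip in ips:
        count += 1
        # ... ping/handle each ip here ...
    return count, time.time() - start, stamp

count, duration, stamp = process_file(['1.1.1.1', '8.8.8.8'])
print(count, f'{duration:.2f}', stamp)  # e.g. "2 0.00 2023-08-14 ..."
```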