So I am having to process a few different text files; for each file I process, I need to capture the runtime duration, record count, and a timestamp.
Since I'm already using the scan_files() function, is it possible to incorporate the mapcount() functionality into the main function?
I mean, if I'm already opening the first text file, why not get the count returned while it's open? So far, just adding the readline logic below the strip() in the main logic isn't working.
# ROUTINE TO OPEN THE APPROPRIATE TEXT FILES TO PROCESS THE IP LIST
import os

def scan_files():
    directory = '.'
    for entry in os.scandir(directory):
        if entry.is_file() and entry.name.endswith('.txt'):
            if 'ip_list' in entry.name:
                pt = directory + '/' + entry.name
                with open(pt) as file:
                    for ip in file:
                        yield ip.strip()
# ROUTINE TO GET FILE LINE COUNT
import mmap

def mapcount(filename):
    # open in binary read mode and map read-only; "r+" is only needed
    # if you want the mapping to be writable
    with open(filename, "rb") as f:
        buf = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        lines = 0
        readline = buf.readline
        while readline():
            lines += 1
        return lines
I want to open the file once, get the count and move on to the rest of the script.
There is a problem with the interface: to client code, scan_files() is nothing but an iterable of IPs. Suppose it also computes the number of lines in the files; how is it going to output that number?
Also, the time taken by the generator to run depends on the client code. If client code requests the next IP every 10 minutes, the generator will take a very long time to consume.
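One way around that interface problem is to let the caller pass in a mutable container that the generator fills with per-file line counts as it yields. This is only a sketch of that idea, not code from the thread; the `counts` dict and the `directory` parameter are assumptions:

```python
import os

def scan_files(counts, directory='.'):
    """Yield stripped IPs from every *ip_list*.txt file in directory,
    recording per-file line counts in the caller-supplied counts dict."""
    for entry in os.scandir(directory):
        if entry.is_file() and entry.name.endswith('.txt') and 'ip_list' in entry.name:
            counts[entry.name] = 0
            with open(entry.path) as file:
                for ip in file:
                    counts[entry.name] += 1
                    yield ip.strip()

if __name__ == '__main__':
    counts = {}
    for ip in scan_files(counts):
        pass  # process each IP here
    print(counts)  # per-file line counts, valid once iteration completes
```

The counts are only complete after the iteration finishes, so this still shares the timing caveat above: the dict fills in as the client consumes the generator.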
(Aug-13-2023, 05:00 PM)cubangt Wrote: So I am having to process a few different text files; for each file I process, I need to capture the runtime duration, record count, and a timestamp.
You should link to the other thread or continue there, because I guess all these additions should work with the code already written.
Quote: Since I'm already using the scan_files() function, is it possible to incorporate the mapcount() functionality into the main function?
You have to change the scan_files() function and also try to fit this in with the existing code.
So a start could be something like this.
import time, os
import subprocess
from concurrent.futures import ThreadPoolExecutor
import pandas as pd

def ping(ip):
    return (
        ip,
        subprocess.run(
            f"ping {ip} -n 1", stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
        ).returncode,
    )

def scan_files():
    directory = 'G:/div_code/egg/ping'
    ip_files = []
    for entry in os.scandir(directory):
        if entry.is_file() and entry.name.endswith('.txt'):
            if 'ip_list' in entry.name:
                pt = directory + '/' + entry.name
                ip_files.append(pt)
    return ip_files
if __name__ == '__main__':
    df_lst = []
    for fname in scan_files():
        start = time.time()
        with open(fname) as file:
            park = [ip.strip() for ip in file]
        executor = ThreadPoolExecutor(12)
        df = pd.DataFrame(executor.map(ping, park))
        #df.to_csv(r'ip_output.csv', header=False, index=False, quoting=None)
        print(df)
        end = time.time()
        time_used = end - start
        df_lst.append(df)
        df_lst.append(f'File <{fname}> used {time_used:.2f} sec')
        print(f'File <{fname}> used {time_used:.2f} sec')
Output:
0 1
0 python-forum.io 0
1 youtube.com 0
2 youtube.com99 1
3 www.vg.no 0
4 python-forum.io99 1
File <G:/div_code/egg/ping/ip_list1.txt> used 0.13 sec
0 1
0 python-forum.io 0
1 youtube.com 0
2 youtube.com99 1
3 www.vg.no 0
4 python-forum.io99 1
File <G:/div_code/egg/ping/ip_list2.txt> used 0.07 sec
The output above is also stored in the df_lst list.
So it records the time taken on each file, and the Pandas index works as a record count for each file.
Since what's in df_lst is still Pandas objects, you can e.g. count the lines in the files, or divide the list length by 2 to get the file count.
>>> df_lst[0]
0 1
0 python-forum.io 0
1 youtube.com 0
2 youtube.com99 1
3 www.vg.no 0
4 python-forum.io99 1
>>> df_lst[0].count()
0 5
1 5
dtype: int64
>>> len(df_lst) // 2
2
>>> df_lst[0].count() + df_lst[2].count()
0 10
1 10
dtype: int64
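Since df_lst alternates a result DataFrame and a timing string, the DataFrames sit at the even indices and `len(df)` gives each file's record count directly. A small sketch of that aggregation, using made-up sample data in place of the real ping results:

```python
import pandas as pd

# df_lst alternates a result DataFrame and a summary string per file,
# so the DataFrames sit at even indices
df_lst = [
    pd.DataFrame({'address': ['a', 'b', 'c'], 'state': [0, 1, 0]}),
    'File <ip_list1.txt> used 0.13 sec',
    pd.DataFrame({'address': ['d', 'e'], 'state': [0, 0]}),
    'File <ip_list2.txt> used 0.07 sec',
]
frames = df_lst[::2]                   # every DataFrame
total = sum(len(df) for df in frames)  # record count across all files
print(total)  # 5
```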
I don't think it applies to this problem, but you can also pass information through the iterable and user function. In this example I pass the filename and the line number of the IP address in the file. The user function passes all this info along with the ping status and a timestamp.
import subprocess
from datetime import datetime
from pathlib import Path
from concurrent.futures import ThreadPoolExecutor
import pandas as pd

def ping(args):
    ip, file, counter = args
    returncode = subprocess.run(
        f"ping {ip} -n 1", stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
    ).returncode
    return ip, file, counter, datetime.now(), returncode

def scan_files():
    directory = Path('.')
    for name in directory.glob("*ip_list*.txt"):
        count = 0
        with open(name, "r") as file:
            for ip in file:
                count += 1
                yield ip.strip(), name.name, count

executor = ThreadPoolExecutor(125)
df = pd.DataFrame(executor.map(ping, scan_files()))
print(df)
Output:
0 1 2 3 4
0 python-forum.io ip_list.txt 1 2023-08-13 21:53:42.935575 0
1 youtube.com ip_list.txt 2 2023-08-13 21:53:42.929570 0
2 youtube.com99 ip_list.txt 3 2023-08-13 21:53:42.910397 1
3 www.vg.no ip_list.txt 4 2023-08-13 21:53:43.117031 0
4 python-forum.io99 ip_list.txt 5 2023-08-13 21:53:42.911398 1
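Because each row of that DataFrame carries the source file name, grouping by that column gives a per-file record count after the fact. A sketch with made-up rows mirroring the (ip, file, line_no, returncode) tuples above, with named columns for readability:

```python
import pandas as pd

# made-up sample rows; the real DataFrame also carries a timestamp column
df = pd.DataFrame(
    [
        ('python-forum.io', 'ip_list.txt', 1, 0),
        ('youtube.com', 'ip_list.txt', 2, 0),
        ('youtube.com99', 'ip_list2.txt', 1, 1),
    ],
    columns=['address', 'file', 'line_no', 'state'],
)
per_file = df.groupby('file').size()  # record count per source file
print(per_file)
```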
Here is code where I use Loguru, one of my favorite Python libraries.
So now everything is logged into the ip.log file.
It also logs a date/time stamp when the file is run, and has the line number, the name of the file, and the time used.
Just by looking at 19:00:07 and the next file's start at 19:00:16, you can see the time used is about 9 sec.
If you uncomment pd.set_option('display.max_rows', None), it will log all IPs and not just the head/tail as shown now.
import time, os
import subprocess
from concurrent.futures import ThreadPoolExecutor
import pandas as pd
#pd.set_option('display.max_rows', None)
from loguru import logger

logger.remove()  # Only info to file
logger.add("ip.log", rotation="2 day", format="{time:YYYY-MM-DD at HH:mm:ss}\n{message}")

def ping(ip):
    return (
        ip,
        subprocess.run(
            f"ping {ip} -n 1", stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
        ).returncode,
    )

def scan_files():
    directory = 'G:/div_code/egg/ping'
    ip_files = []
    for entry in os.scandir(directory):
        if entry.is_file() and entry.name.endswith('.txt'):
            if 'ip_list' in entry.name:
                pt = directory + '/' + entry.name
                ip_files.append(pt)
    return ip_files

if __name__ == '__main__':
    for fname in scan_files():
        start = time.time()
        with open(fname) as file:
            park = [ip.strip() for ip in file] * 100
        executor = ThreadPoolExecutor(12)
        df = pd.DataFrame(executor.map(ping, park), columns=["address", "state"])
        df.index += 1
        end = time.time()
        time_used = end - start
        logger.info(f'{df}\n File <{fname}> used {time_used:.2f} sec\n')
Output:
2023-08-14 at 19:00:07
address state
1 python-forum.io 0
2 youtube.com 0
3 youtube.com99 1
4 www.vg.no 0
5 python-forum.io99 1
.. ... ...
496 python-forum.io 0
497 youtube.com 0
498 youtube.com99 1
499 www.vg.no 0
500 python-forum.io99 1
[500 rows x 2 columns]
File <G:/div_code/egg/ping/ip_list1.txt> used 9.04 sec
2023-08-14 at 19:00:16
address state
1 python-forum.io 0
2 youtube.com 0
3 youtube.com99 1
4 www.vg.no 0
5 python-forum.io99 1
.. ... ...
496 python-forum.io 0
497 youtube.com 0
498 youtube.com99 1
499 www.vg.no 0
500 python-forum.io99 1
[500 rows x 2 columns]
File <G:/div_code/egg/ping/ip_list2.txt> used 8.53 sec
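Pulling the thread back to the original question, the three metrics asked for (record count, runtime duration, timestamp) can all be captured per file with a small wrapper around the processing loop. This is only a sketch; `process_file` is a hypothetical name, and the actual ping/ThreadPoolExecutor work from the posts above is stubbed out:

```python
import time
from datetime import datetime

def process_file(ips):
    """Return (record_count, duration_sec, start_timestamp) for one file's IPs.
    The per-IP work is stubbed out; plug in the ping/executor code here."""
    stamp = datetime.now()   # timestamp when processing of this file began
    start = time.time()
    count = 0
    for ip in ips:
        count += 1
        # ... ping/handle each ip here ...
    return count, time.time() - start, stamp

count, duration, stamp = process_file(['1.1.1.1', '8.8.8.8'])
print(count, f'{duration:.2f}', stamp)  # e.g. "2 0.00 2023-08-14 ..."
```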