Python Forum
Python script - search Apache access_log.txt for all of the JavaScript (.js)
#1
What is expected from the python script:
1. read all the lines from given file access_log.txt and look for any presence of JavaScript files .js
2. from the lines found remove everything and keep just the "name.js"
3. remove any of duplicated rows
4. sort the lines

So far I have got to this stage, but I do not know how to remove the duplicated rows, drop the unwanted *.css rows, or sort the result.
I have tried various Python functions as well as regular expressions, but at the moment this is too difficult for me.
I would appreciate any help, either showing how my script can be updated or showing a new solution.
Thank you in advance.
jnovak

kali@kali:~$ more test10.py 
#!/usr/bin/python

import re

f = open('/home/kali/Desktop/access_log.txt', "r")
for line in f:
    if re.match("(.*).js", line):
        print(line.split()[6].split('/')[2])
Current result of my script:

kali@kali:~$ python test10.py
jquery.jshowoff.min.js
jquery.js
jquery.jshowoff.min.js
jquery.js
jshowoff.css
jquery.js
jquery.js
jquery.jshowoff2.js
jquery.jshowoff.min.js
jshowoff.css
jquery.js
jshowoff.css
jquery.jshowoff.min.js
jquery.js
jquery.js
jquery.js
jquery.js
#2
Maybe the link to this thread will help
https://python-forum.io/Thread-Trying-to...of-strings
#3
I guess this is homework?
(May-03-2020, 05:19 AM)jnovak Wrote: So far I was able to get to this stage but I do not know how to remove the duplicated rows, unwanted rows with *.css and how to sort them.
.css should not be there, so something must be going on in the raw data that we don't see.
If you run your code again on the output shown, the .css lines are gone.
import re

log = '''\
jquery.jshowoff.min.js
jquery.js
jquery.jshowoff.min.js
jquery.js
jshowoff.css
jquery.js
jquery.js
jquery.jshowoff2.js
jquery.jshowoff.min.js
jshowoff.css
jquery.js
jshowoff.css
jquery.jshowoff.min.js
jquery.js
jquery.js
jquery.js
jquery.js'''

for line in log.splitlines():
    if re.match("(.*).js", line):
        print(line)
Output:
jquery.jshowoff.min.js
jquery.js
jquery.jshowoff.min.js
jquery.js
jquery.js
jquery.js
jquery.jshowoff2.js
jquery.jshowoff.min.js
jquery.js
jquery.jshowoff.min.js
jquery.js
jquery.js
jquery.js
jquery.js
As for removing duplicates and sorting: first collect the results from the loop in, e.g., a list.
Then look into set() and sorted().
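A minimal sketch of that advice, using a few names copied from the output above instead of the real log file (the variable names are only illustrative):

```python
# Sample names taken from the output shown earlier in this thread.
names = [
    "jquery.jshowoff.min.js",
    "jquery.js",
    "jquery.jshowoff2.js",
    "jquery.js",
    "jquery.jshowoff.min.js",
]

found = []  # step 1: collect the results instead of printing them
for name in names:
    if name.endswith(".js"):
        found.append(name)

# step 2: set() drops the duplicates, sorted() returns a sorted list
unique_sorted = sorted(set(found))
for name in unique_sorted:
    print(name)
```

In the original script, the print(...) call inside the loop would become found.append(...), and the last three lines would run once the loop has finished.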
#4
Hi snippsat,
Thanks a lot for looking into my first attempt at a Python script. I have replicated your change and I confirm it works as expected. Unfortunately, running your version against access_log.txt still gives the same result, including the *.css strings.
Yes, this is a kind of homework that I gave myself, as I would like to learn at least one programming language besides bash scripting. So far I am not too good at it, as you can see.
I had found on my own both functions you recommend, sorted() and set().
To be honest, I have spent a day and a half trying to figure out how to connect either of those functions to my script, but with no luck. This is why, for the first time in my IT career, I have asked for help on a forum.
In case you are still interested in advising, I can provide the access_log.txt so you can try it yourself, but I did not find any option on this forum to attach a file.
Have a nice day,
JN
#5
You must have more than 5 posts to attach a file.
You can use Ge.tt
#6
Hi snippsat, thanks for the useful hint.
access_log file location: http://ge.tt/1E3vZy23
JN
#7
The regex r"(.*).js" has a mistake.
It also matches foojs, because an unescaped dot matches any character.
You have to escape the dot with a backslash: r"(.*)\.js"
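A small check of the difference between the unescaped and the escaped dot. The path /css/jshowoff.css is an assumption made for illustration, since the real log lines are not shown in this thread; it is one plausible way a .css row could pass the original test.

```python
import re

loose = re.compile(r"(.*).js")    # unescaped dot: any character before "js"
strict = re.compile(r"(.*)\.js")  # escaped dot: a literal "." before "js"

# "foojs" contains no dot at all, yet the loose pattern accepts it.
loose_foojs = bool(loose.match("foojs"))
strict_foojs = bool(strict.match("foojs"))
print(loose_foojs, strict_foojs)

# Hypothetical request path (an assumption, not taken from the real log).
# The "/" in front of "jshowoff" satisfies the unescaped dot.
css_path = "/css/jshowoff.css"
loose_css = bool(loose.match(css_path))
strict_css = bool(strict.match(css_path))
print(loose_css, strict_css)
```

Note that both patterns are applied to the whole line in the script, so even the escaped version can still match a line where ".js" appears somewhere other than the requested path. Testing only the path is the more robust check.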

You should put your code in a function; if you then use yield instead of return, you get a generator.
import re

def log_reader(file):
    with open(file) as fd:
        for line in fd:
            # raw string, so \. reaches the regex engine as a literal dot
            if re.match(r"(.*)\.js", line):
                yield line.split()[6].split('/')[2]


my_reader = log_reader('/home/kali/Desktop/access_log.txt')
# nothing has happened yet:
# a generator is evaluated lazily,
# so it has to be consumed

paths = set(my_reader) # unique elements
# paths now holds the elements and my_reader is exhausted / empty

print(paths)
# sort unique paths
print(sorted(paths))
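The lazy evaluation and the exhaustion of my_reader can be seen with a tiny stand-in generator. This is only a sketch: fake_reader and its file names are made up so that the example runs without the log file.

```python
def fake_reader():
    # Stand-in for log_reader; the names below are made up for the demo.
    print("generator body started")
    yield from ["b.js", "a.js", "b.js"]

gen = fake_reader()      # nothing is printed yet: the body has not run
first_pass = set(gen)    # consuming the generator runs the body
second_pass = list(gen)  # the generator is now exhausted

print(sorted(first_pass))
print(second_pass)
```

If the results are needed more than once, keep the set or the sorted list rather than the generator object.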
You can also solve it without regex:
def read_log(file, allowed_method=None):
    # use a context manager
    with open(file) as fd:
        # fd is an iterator that yields lines
        # the line ending is not stripped
        for line in fd:
            # splitting the log line by " gives a good result:
            # the request is the second field
            # _ is a placeholder for a throwaway object
            # *_ consumes the rest of the elements
            # note: a line without a quoted three-part request
            # raises ValueError in one of the next two statements
            _, request, *_ = line.split('"')
            # A request consists of: Method, Path, Protocol-Version
            meth, path, proto = request.split()
            # allowed_method is evaluated first;
            # if it is None, the part after the "and" is not evaluated,
            # so setting allowed_method to None skips this check
            if allowed_method and meth.upper() != allowed_method:
                # skip the line if the method is a different one
                continue
            if path.endswith(".js"):
                yield path.rsplit("/", 1)[-1]
Accessing the generator:
log_file = "access.log"
js_files = sorted(set(read_log(log_file)))

# first set consumes the generator read_log
# then sorted consumes set
# sorted returns a sorted list
And if you need to do something with your data for each file:
for js_file in js_files:
    print(js_file)
    # code
    ...
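To see what the split on the double quote in read_log does, here it is applied to a single line. The sample line is made up in the usual Apache access-log layout and is not taken from the real access_log.txt.

```python
# Made-up log line (an assumption for illustration, not from the real file).
sample = '127.0.0.1 - - [03/May/2020:05:19:00 +0000] "GET /js/jquery.js HTTP/1.1" 200 1234'

# Splitting on the double quote isolates the request as the second field.
_, request, *_ = sample.split('"')
meth, path, proto = request.split()
print(request)

# Both approaches from this thread give the same name for this line.
by_index = sample.split()[6].split('/')[2]
by_rsplit = path.rsplit("/", 1)[-1]
print(by_index, by_rsplit)

# With a deeper (again hypothetical) path, the fixed index [2] picks a
# directory name, while rsplit still returns the last path component.
deeper = "/static/js/jquery.js"
deeper_by_index = deeper.split('/')[2]
deeper_by_rsplit = deeper.rsplit("/", 1)[-1]
print(deeper_by_index, deeper_by_rsplit)
```

Taking the last component with rsplit does not depend on how many directories precede the file name, which is why it may be the safer choice when the log contains paths of different depths.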


If you don't want a generator, you need two more lines:
import re

def log_reader(file):
    results = set()
    with open(file) as fd:
        for line in fd:
            if re.match(r"(.*)\.js", line):
                results.add(line.split()[6].split('/')[2])
    return sorted(results)
In this case I return a unique sorted list instead of a generator.
To add an element to a set, you have to use the add method.
A list uses append to add an element.
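A short side-by-side of the two methods, with made-up file names, showing why the set is the convenient container here:

```python
results_set = set()
results_list = []

# Made-up names; "jquery.js" appears twice on purpose.
for name in ["jquery.js", "jshowoff.js", "jquery.js"]:
    results_set.add(name)      # a set silently ignores the duplicate
    results_list.append(name)  # a list keeps every element

print(len(results_set), len(results_list))
```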
#8
As an alternative, here is another regex. Using finditer() and compile() can also speed things up,
not that it makes much difference with a file of this size.
import re

with open('access_log.txt') as f:
    log = f.read()

pattern = re.compile(r"/\w*.\.js.*\.js|\/\w*.\.js")
for match in pattern.finditer(log):
    print(match.group().lstrip('/'))
Output:
jquery.jshowoff.min.js
jquery.js
jquery.jshowoff.min.js
jquery.js
jquery.js
jquery.js
jquery.jshowoff2.js
jquery.jshowoff.min.js
jquery.js
jquery.jshowoff.min.js
jquery.js
jquery.js
jquery.js
jquery.js