Python Forum
First completed 'project' (basic text parser).
Thread Rating:
  • 0 Vote(s) - 0 Average
  • 1
  • 2
  • 3
  • 4
  • 5
First completed 'project' (basic text parser).
#1
Hello there, been learning Python for around 5 days so I am very new and want to form good coding practices/standards early on. I'm sure my code is very messy, badly optomized and not very elegant, just some pointers in the right direction would be amazing. Some comments within my code are likely not needed, they were mostly for my own learning/remembering. I purposely haven't investigated any Python libararies that can do what my code does as for now I want to get a solid grasp of the fundamentals and basics before exploring libraries etc. I'm highly dedicated and spending a minimum of 3 hours each day to learning, I'm likely to increase this to 5-6 hours per day as I learn where to go next/start another course.

Code does the following:
- Takes a long list of emails within text.
- You need to only parse and gather emails that are in lines that start with 'From '.
- Tells you the top 5 most common hours, days, months and emails that occurs within the text.
- Prints them out in a very ugly way (need to make it much better).

Text can be found here:
http://www.py4inf.com/code/mbox-short.txt

My code which needs a lot of work but it's my first ever attempt at something.
# Parse a list of emails, find most common hours, days, months and emails.
filename = input("Enter filename: ")
try :
    filehandle = open(filename)
except :
    print("Unable to open file:", filename)
    quit()

# Take a list and a dictionary, store a list of tuples, swap key / value, reverse order.
def dict_to_list(list, diction) :
    for key, value in diction.items() :
        temptup = (value, key)
        list.append(temptup)
    list = sorted(list, reverse=True)
    return list

hours = dict()
days = dict()
months = dict()
emails = dict()
# Go through each line, extract relevant data, store in dicts.
for line in filehandle :
    if line.startswith("From ") :
        words = line.split()
        email = words[1]
        emails[email] = emails.get(email, 0) + 1
        day = words[2]
        days[day] = days.get(day, 0) + 1
        month = words[3]
        months[month] = months.get(month, 0) + 1
        for word in words :
            if word.find(":") > 1 :
                time = word.split(":")
                hour = time[0]
                hours[hour] = hours.get(hour, 0) + 1

# Create sorted lists of tuples using custom dict_to_list function.
listhours = list()
hourslist = dict_to_list(listhours, hours)
listdays = list()
dayslist = dict_to_list(listdays, days)
listmonths = list()
monthslist = dict_to_list(listmonths, months)
listemails = list()
emailslist = dict_to_list(listemails, emails)

# Print the top 5 most common from each list.
for k, v in hourslist[:5] :
    print("Most common hour:", v, "occurs", k, "times.")
for k, v in dayslist[:5] :
    print("Most common day:", v, "occurs", k, "times.")
for k, v in monthslist[:5] :
    print("Most common month:", v, "occurs", k, "times.")
for k, v in emailslist[:5] :
    print("Most common email:", v, "occurs", k, "times.")
Reply
#2
You didn't close the file. I would recommend to use context-manager form for this (with-statement).
You put a space before colon (is this a new pep8 rule?!);
Wildcard exceptions usually discouraged; probably OSError, IOError will be more suitable here.

dict_to_list can be replaced with list-comprehension expression (inline); Moreover,
you could define dict_to_list without first argument, but define it within the function's scope:

def dict_to_list(diction):
    result_list = list()
    # ... append to result_list
    #...
    return result_list
Reply
#3
(Sep-25-2020, 10:49 PM)scidam Wrote: You didn't close the file. I would recommend to use context-manager form for this (with-statement).
You put a space before colon (is this a new pep8 rule?!);
Wildcard exceptions usually discouraged; probably OSError, IOError will be more suitable here.

dict_to_list can be replaced with list-comprehension expression (inline); Moreover,
you could define dict_to_list without first argument, but define it within the function's scope:

def dict_to_list(diction):
    result_list = list()
    # ... append to result_list
    #...
    return result_list

Thank you very much for the response, I have never used a context-manager before, I have tried my best to implement it and hope the following code matches up. I need to read up on pep8 again to make sure I am following the proper standard, I have fixed the issues with the colon. In regards to using a wild exception I have changed it to a IOError and in future I'll make sure I am handling situations with appropriate exceptions.

I did read up on list-comprehention expressions and decided at the moment it isn't entirely clear to me and I decided to go with creating the function. That one change you suggested to me regarding not needing to pass a list as an arg made my code look much better already, below are the changes, I hope I have done it all correctly. Again a massive thank you for taking the time to review my code.

# Parse a list of emails, find most common hours, days, months and emails.

# Take a list and a dictionary, store a list of tuples, swap key / value, reverse order.
def dict_to_list(diction):
    result_list = list()
    for key, value in diction.items():
        temptup = (value, key)
        result_list.append(temptup)
    result_list = sorted(result_list, reverse=True)
    return result_list

filename = input("Enter filename: ")

hours = dict()
days = dict()
months = dict()
emails = dict()
try:
    with open(filename) as filehandle:
        # Go through each line, extract relevant data, store in dicts.
        for line in filehandle:
            if line.startswith("From "):
                words = line.split()
                email = words[1]
                emails[email] = emails.get(email, 0) + 1
                day = words[2]
                days[day] = days.get(day, 0) + 1
                month = words[3]
                months[month] = months.get(month, 0) + 1
                for word in words:
                    if word.find(":") > 1:
                        time = word.split(":")
                        hour = time[0]
                        hours[hour] = hours.get(hour, 0) + 1
except IOError as err:
    print(err)
    quit()

# Create sorted lists of tuples using custom dict_to_list function.
hourslist = dict_to_list(hours)
dayslist = dict_to_list(days)
monthslist = dict_to_list(months)
emailslist = dict_to_list(emails)

# Print the top 5 most common from each list.
for k, v in hourslist[:5]:
    print("Most common hour:", v, "occurs", k, "times.")
for k, v in dayslist[:5]:
    print("Most common day:", v, "occurs", k, "times.")
for k, v in monthslist[:5]:
    print("Most common month:", v, "occurs", k, "times.")
for k, v in emailslist[:5]:
    print("Most common email:", v, "occurs", k, "times.")
Reply
#4
I would rewrite dict_to_list function as follows:
def dict_to_list(diction):
    return sorted(diction.items(), reverse=True, key=lambda x: (x[0], x[1]))
Another equivalent version (without lambdas) of this function could be the following: for results printing,

from operator import itemgetter
def dict_to_list(diction):
    return sorted(diction.items(), reverse=True, key=itemgetter(0, 1))
There are still too much of semi-repeated code. Probably, the way of refactoring I would suggest is to define
a class responsible to parsing, e.g. ParseMboxString and maybe define a method parse_header and auxiliary methods parse_day, parse_hour etc.
Reply


Forum Jump:

User Panel Messages

Announcements
Announcement #1 8/1/2020
Announcement #2 8/2/2020
Announcement #3 8/6/2020