Python Forum
Using dictionary to find the most sent emails from a file
#1
Hi Everyone,
I have been stuck on this question for a whole week, and the staff of our course have not been helpful at all. I hope I can get some clues from you guys. Thanks.
This is my assignment:
Write a program to read through the mbox-short.txt and figure out who has sent the greatest number of mail messages. The program looks for 'From ' lines and takes the second word of those lines as the person who sent the mail. The program creates a Python dictionary that maps the sender's mail address to a count of the number of times they appear in the file. After the dictionary is produced, the program reads through the dictionary using a maximum loop to find the most prolific committer.

File mbox-short: https://www.py4e.com/code3/mbox-short.txt

This is my code for it:
name=input("Enter file: ")
fh=open(name)
largest=None
counts=dict()
for line in fh:
    if line.startswith("From "):
        x=line.split()
        emails=x[1]
        print(emails)
        for word in emails:
            counts[word]=counts.get(word,0)+1
            if largest is None or counts[word] > largest:
                largest=counts[word]
            print(counts,largest)
It ends up counting every letter rather than whole email addresses. How can I count the addresses?
I also tried looping over x with "for word in x:", but then it counts everything: "From", the email addresses, times and dates. In that case, how can I pick out only the email addresses and their counts? Thank you!
#2
Hi,

Without seeing the format of the mbox file, this seems very straightforward.
All you need are the lines that start with "From "; after the split, element [0] is the word "From" and element [1] is the sender's address.
The keys of the dictionary could be a set() of the [1] occurrences.
The number of "From " lines equals the number of emails sent. Why count anything else?
Unless, and this is not clear from your question, the sender may also appear elsewhere in the file (as a recipient?) and
you need to count those occurrences too?
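
A minimal sketch of that idea, assuming 'From ' lines shaped like the ones in mbox-short.txt. The sample lines and the function name count_senders are made up for illustration. The whole second word is used as the dictionary key; looping over that string instead would visit it one character at a time, which is why the code in the question ends up counting letters.

```python
def count_senders(lines):
    """Map each sender address to the number of 'From ' lines it appears in."""
    counts = {}
    for line in lines:
        # 'From ' (with the trailing space) skips the 'From:' header lines.
        if line.startswith("From "):
            words = line.split()
            sender = words[1]  # second word = sender address
            counts[sender] = counts.get(sender, 0) + 1
    return counts


# Hypothetical sample data, not taken from the real file.
sample = [
    "From alice@example.com Sat Jan  5 09:14:16 2008",
    "From: alice@example.com",
    "From bob@example.com Sat Jan  5 10:00:00 2008",
    "From alice@example.com Sat Jan  5 11:30:00 2008",
]
counts = count_senders(sample)
print(counts)
```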

Paul
#3
The From lines look like this:
Output:
From alice@edu Thu Jun 16 16:12:12 2005
From bob@gov Thu Jun 16 18:13:12 2005
From ted@com Thu Jul 28 09:53:31 2005
From bob@gov Thu Jul 28 09:59:31 2005
From ted@com Thu Jul 28 15:53:31 2005
I got it from an example: doc/networkx-2.1/examples/drawing/unix_email.mbox
Output:
From alice@edu Thu Jun 16 16:12:12 2005
From: Alice <alice@edu>
Subject: NetworkX
Date: Thu, 16 Jun 2005 16:12:13 -0700
To: Bob <bob@gov>
Status: RO
Content-Length: 86
Lines: 5

Bob, check out the new networkx release - you and Carol might really like it. Alice

From bob@gov Thu Jun 16 18:13:12 2005
Return-Path: <bob@gov>
Subject: Re: NetworkX
From: Bob <bob@gov>
To: Alice <alice@edu>
Content-Type: text/plain
Date: Thu, 16 Jun 2005 18:13:12 -0700
Status: RO
Content-Length: 26
Lines: 4

Thanks for the tip. Bob

From ted@com Thu Jul 28 09:53:31 2005
Return-Path: <ted@com>
Subject: Graph package in Python?
From: Ted <ted@com>
To: Bob <bob@gov>
Content-Type: text/plain
Date: Thu, 28 Jul 2005 09:47:03 -0700
Status: RO
Content-Length: 90
Lines: 3

Hey Ted - I'm looking for a Python package for graphs and networks. Do you know of any?

From bob@gov Thu Jul 28 09:59:31 2005
Return-Path: <bob@gov>
Subject: Re: Graph package in Python?
From: Bob <bob@gov>
To: Ted <ted@com>
Content-Type: text/plain
Date: Thu, 28 Jul 2005 09:59:03 -0700
Status: RO
Content-Length: 180
Lines: 9

Check out the NetworkX package - Alice sent me the tip! Bob
>> bob@gov scrawled:
>> Hey Ted - I'm looking for a Python package for
>> graphs and networks. Do you know of any?

From ted@com Thu Jul 28 15:53:31 2005
Return-Path: <ted@com>
Subject: get together for lunch to discuss Networks?
From: Ted <ted@com>
To: Bob <bob@gov>, Carol <carol@gov>, Alice <alice@edu>
Content-Type: text/plain
Date: Thu, 28 Jul 2005 15:47:03 -0700
Status: RO
Content-Length: 139
Lines: 5

Hey everyrone! Want to meet at that restaurant on the island in Konigsburg tonight? Bring your laptops and we can install NetworkX. Ted



from collections import defaultdict
from collections import Counter

# test = defaultdict(int)
# test["Not Existing Key"] -> 0
# test["Not Existing Key"] += 1
# then test["Not Existing Key"] -> 1


# Counter counts the hashable objects in a list or
# in any other iterable. If it is given a mapping like a dict or
# defaultdict, the existing counts are copied


name = input("Enter file: ")
fh = open(name)
# a context manager is better

counts = defaultdict(int)
for line in fh:
    if line.startswith("From "):
        email = line.split(maxsplit=2)[1]
        # maxsplit=2 limits the split to at most 3 elements:
        # "From", the email address, and the rest of the line.
        # We need only the email address, which is the second element;
        # the third element is the rest of the line.
        counts[email] += 1
        # defaultdict and Counter both support this;
        # if you use a defaultdict, the default factory
        # must be int


# you've forgotten to close the file
# this could not happen with a context manager
fh.close()


print("Results:")
for email, count in counts.items():
    print(email, "->", count)


# we already have the results in `counts`,
# so pass them to Counter to reuse the data

counts2 = Counter(counts)
# Counter has the method most_common
print()
print("Top 5:")
for email, count in counts2.most_common(5):
    print(email, "->", count)
Counter could be used in the first place instead of defaultdict.
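
A hedged sketch of that variant, which also opens the file with a context manager. The sample lines are hypothetical and are written to a temporary file only so the example runs on its own; with the real data you would open mbox-short.txt instead.

```python
import os
import tempfile
from collections import Counter

# Hypothetical sample data, written to a temporary file for the demo.
sample = (
    "From alice@example.com Sat Jan  5 09:14:16 2008\n"
    "Subject: hello\n"
    "From bob@example.com Sat Jan  5 10:00:00 2008\n"
    "From alice@example.com Sat Jan  5 11:30:00 2008\n"
)
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write(sample)
    path = tmp.name

# The with-block closes the file automatically, even if an error occurs.
with open(path) as fh:
    counts = Counter(
        line.split(maxsplit=2)[1]
        for line in fh
        if line.startswith("From ")
    )

os.remove(path)  # clean up the temporary file

print(counts.most_common(1))
```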
You can also do this manually, which is a good way to learn how to keep track of elements you have already seen.

If you write your own counter, use a set() as the place to store the emails already seen.

emails = ["a", "a", "c", "b"]
seen = set()
result = {}
for email in emails:
    if email in seen:
        result[email] += 1
    else:
        result[email] = 1
        seen.add(email)
        # a set uses add to add elements
        # a list uses append
        # a set has only unique elements
        # and is very fast in checking containment of an element in the set
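
The separate set is not strictly needed, because a dict offers the same fast membership test on its keys. A sketch of the same manual counter without the set, using the same sample list:

```python
emails = ["a", "a", "c", "b"]
result = {}
for email in emails:
    if email in result:  # membership test on the dict's keys
        result[email] += 1
    else:
        result[email] = 1

print(result)
```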
I hope this helps you understand it a little better.
Almost dead, but too lazy to die: https://sourceserver.info
All humans together. We don't need politicians!
#4
This is homework so special attention must be paid to terms and conditions.

- The program looks for 'From ' lines and takes the second word of those lines as the person who sent the mail.
- The program creates a Python dictionary that maps the sender's mail address to a count of the number of times they appear in the file.
- After the dictionary is produced, the program reads through the dictionary using a maximum loop to find the most prolific committer.

I have trouble understanding what a 'maximum loop' is, so I will use the built-in max.

counter = dict()

with open('mbox-short.txt', 'r') as f:
    for line in f:
        if line.startswith('From '):
            address = line.split(maxsplit=2)[1]
            try:
                counter[address] += 1
            except KeyError:
                counter[address] = 1


print(max(counter, key=lambda rec: rec[1]))
# [email protected]
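
For comparison, 'maximum loop' in the assignment presumably means an explicit loop that remembers the largest count seen so far. That reading is an assumption. A minimal sketch with hypothetical counts standing in for the dictionary built from the file:

```python
# Hypothetical counts; in the real program this dictionary is built from the file.
counter = {"alice@example.com": 2, "bob@example.com": 5, "carol@example.com": 1}

top_address = None
top_count = None
for address, count in counter.items():
    # Keep the pair whenever it beats the best count seen so far.
    if top_count is None or count > top_count:
        top_address = address
        top_count = count

print(top_address, top_count)
```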
I'm not 'in'-sane. Indeed, I am so far 'out' of sane that you appear a tiny blip on the distant coast of sanity. Bucky Katt, Get Fuzzy

Da Bishop: There's a dead bishop on the landing. I don't know who keeps bringing them in here. ....but society is to blame.
#5
Iterating over a dict yields only the keys, not the items (key, value).

In the call below, rec is therefore an email address (a string) and rec[1] is its second character.
This raises an IndexError if a key has a length of 0 or 1; for longer keys,
max compares the second character of each key instead of the counts.

print(max(counter, key=lambda rec: rec[1]))
Instead, use the items method, which returns a (key, value) tuple for each item.
print(max(counter.items(), key=lambda rec: rec[1]))
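
The difference between the two calls shows up on a small dictionary. The addresses and counts here are hypothetical, chosen so that the two results disagree:

```python
counter = {"eve@example.com": 1, "bob@example.com": 5}

# Keys only: rec[1] is the second character ("v" for eve, "o" for bob).
wrong = max(counter, key=lambda rec: rec[1])

# Items: rec is an (address, count) tuple, so rec[1] is the count.
right = max(counter.items(), key=lambda rec: rec[1])

print(wrong)
print(right)

# Keys shorter than two characters have no rec[1] at all.
short_keys = {"a": 1, "b": 2}
try:
    max(short_keys, key=lambda rec: rec[1])
    raised = False
except IndexError:
    raised = True

print(raised)
```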
lambda creates an anonymous function.
Here it serves as the key function, so that max compares the values and not the keys.
The values are your counts and the keys are the email addresses.

key_function = lambda rec: rec[1]

# is similar to:

def key_function(rec):
    return rec[1]
This key_function can be used as the key for max, min, sorted and itertools.groupby.

These functions use the return value of the key function for comparison (or, in the case of groupby, for grouping).


A tiny example:

mapping = {"a": 3, "b": 2, "c": 1}
print(mapping)

# max via key
print("Biggest key (lexicographical order) ->", max(mapping.items(), key=lambda item: item[0]))


# max via value
print("Biggest integer", max(mapping.items(), key=lambda item: item[1]))
Output:
Biggest key (lexicographical order) -> ('c', 1)
Biggest integer ('a', 3)
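
The same key function works with the other callables listed above. A short sketch reusing mapping from the example; operator.itemgetter(1) is a standard-library equivalent of the lambda, and passing mapping.get as the key is one way to get back only the key instead of a tuple:

```python
from operator import itemgetter

mapping = {"a": 3, "b": 2, "c": 1}

# itemgetter(1) behaves like lambda item: item[1]
smallest = min(mapping.items(), key=itemgetter(1))
ranked = sorted(mapping.items(), key=itemgetter(1), reverse=True)

# Iterate over the keys, but compare them by their values.
best_key = max(mapping, key=mapping.get)

print(smallest)
print(ranked)
print(best_key)
```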
#6
(Apr-22-2021, 03:58 PM)DeaD_EyE Wrote: Iterating over a dict yields only the keys and not the items (key, value).

You are absolutely correct.

Strangely enough, on this particular dataset this code produced the correct result (gwen's count is 5, and this is the maximum in this dictionary). It is a gentle reminder that code should always be tested...
#7
Thank you so much for your reply!
I tried the code. It shows a traceback on "counter[address] += 1", saying it must be integers or slices, not str.
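
That wording matches the TypeError Python raises when a list (or tuple) is indexed with a string. This suggests that counter was a list rather than a dict in the code that was actually run, for example counter = [] or counter = list(). That is only a guess, since the code that was run is not shown. A small sketch reproducing the message:

```python
counter = []  # a list, not a dict: the assumed cause
address = "alice@example.com"  # hypothetical address

try:
    counter[address] += 1
except TypeError as exc:
    message = str(exc)

print(message)
```

With counter = dict() the same line raises KeyError for a new address instead, which the try/except in the quoted code handles.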





(Apr-22-2021, 03:30 PM)perfringo Wrote: This is homework so special attention must be paid to terms and conditions.


