Python Forum
Extracting unique email addresses from a folder of emails
#1
I have tested the code at https://gist.github.com/dideler/5219706#...om_text-py and it works. To remove duplicates, I ran it as
Quote:python file-extract_emails_from_text.py 80 81 82 83  | sort | uniq

and that works too. How do I modify the script so that I can simply pass "*.*" instead of listing every filename? There are hundreds of files.
#2
What is the operating system?  Given that command line, it looks similar to linux.  I don't see any reason that it should fail if you give it "*".  What happens when you try?
#3
(Oct-12-2020, 11:14 PM)bowlofred Wrote: What is the operating system?  Given that command line, it looks similar to linux.  I don't see any reason that it should fail if you give it "*".  What happens when you try?

It is Kubuntu, a Linux flavour. I tried this, and it now includes all the files, thank you.

Quote:python file-extract_emails_from_text.py * | sort | uniq

but now I can see that the code isn't doing its job properly. For example, where there was an email address like [email protected], the output contained

[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
#4
Maybe the code is working properly after all, as there are no doubt broken lines in the email headers?
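If folded (line-wrapped) headers are indeed the cause, one option is to let Python's standard email package unfold them instead of matching a regex against raw lines. The sketch below is not part of the gist. The function name addresses_from_headers and the choice of the From, To and Cc headers are assumptions made for illustration only.

```python
from email import message_from_string
from email.utils import getaddresses


def addresses_from_headers(raw_message, headers=("from", "to", "cc")):
    """Return lower-cased addresses found in the given headers of one message.

    Hypothetical helper: the header list is an assumption, adjust as needed.
    """
    msg = message_from_string(raw_message)
    values = []
    for name in headers:
        # get_all() returns every occurrence of the header, folded or not.
        values.extend(msg.get_all(name, []))
    # getaddresses() parses "Name <addr>" lists and copes with folded lines.
    return [addr.lower() for _name, addr in getaddresses(values) if addr]


if __name__ == "__main__":
    sample = (
        "From: Alice <alice@example.com>\n"
        "To: Bob <bob@example.com>,\n"
        " Carol <carol@example.com>\n"
        "\n"
        "Body text\n"
    )
    print(addresses_from_headers(sample))
```

This only looks at headers, so addresses that appear in the message body are ignored, which may or may not be what is wanted here.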
#5
A modified version that can filter for unique e-mail addresses while keeping the order of first occurrence.
In addition, you can sort.

#!/usr/bin/env python3
"""
Extracts email addresses from one or more plain text files.
"""
import re
from argparse import ArgumentParser

# Raw strings keep the regex backslashes (\/, \., \s) from being read as string escapes.
regex = re.compile(
    r"([a-z0-9!#$%&'*+\/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+\/=?^_`"
    r"{|}~-]+)*(@|\sat\s)(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?(\.|"
    r"\sdot\s))+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?)"
)


def get_emails(text):
    """Returns an iterator of matched emails found in string text."""
    # Removing lines that start with '//' because the regular expression
    # mistakenly matches patterns like 'http://[email protected]' as '//[email protected]'.
    return (email[0] for email in regex.findall(text) if not email[0].startswith("//"))


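def sort_by_tld(email):
    # Sort key: domain first, then user. partition() also copes with matches
    # such as "user at example dot com", where split("@") would raise ValueError.
    user, _, domain = email.partition("@")
    return (domain, user)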


def main(files, unique, sort):
    emails = []
    for file in files:
        # errors="replace" keeps one undecodable byte from aborting the whole run.
        with open(file, errors="replace") as fd:
            emails.extend(get_emails(fd.read().lower()))
    if unique:
        emails = list(dict.fromkeys(emails))
    if sort:
        emails = sorted(emails, key=sort_by_tld)
    for email in emails:
        print(email)


def get_args():
    parser = ArgumentParser(description=__doc__)
    parser.add_argument("files", nargs="+", help="files to parse for e-mail addresses")
    parser.add_argument("-u", action="store_true", help="only unique emails")
    parser.add_argument("-s", action="store_true", help="sort result")
    return parser.parse_args()


if __name__ == "__main__":
    args = get_args()
    main(args.files, args.u, args.s)
Then you call the program:
python3 get_mails.py *.txt -u -s > emails.txt
The shell replaces the *.txt with matching filenames.

To understand this behavior, you can make a small program for testing:
import sys

print(sys.argv)
Then execute the program. I have given it the name args_print.py:
python args_print.py a b c d *
The * is replaced with the files in the current working directory. Hidden files, which start with a dot, are excluded.

PS: Windows PowerShell and Terminal do not have this behavior. There, * is not replaced with matching files, and the program receives the literal * as an argument.
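For shells that pass the * through unchanged, the script could expand the patterns itself with the standard glob module. This is a minimal sketch, not part of the script above. The helper name expand_patterns is hypothetical, and falling back to the unchanged argument when nothing matches is a design assumption.

```python
import glob
import sys


def expand_patterns(patterns):
    """Expand wildcard patterns that the shell left untouched."""
    files = []
    for pattern in patterns:
        matches = sorted(glob.glob(pattern))
        # If nothing matches, keep the argument as given so that open()
        # later reports the missing file instead of it vanishing silently.
        files.extend(matches if matches else [pattern])
    return files


if __name__ == "__main__":
    # Prints the expanded file list, similar to args_print.py above.
    print(expand_patterns(sys.argv[1:]))
```

On Linux the shell has already expanded the arguments, so the helper normally returns the names unchanged, except for names that themselves contain wildcard characters. As with shell expansion, glob skips hidden files that start with a dot.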
#6
(Oct-13-2020, 11:02 AM)DeaD_EyE Wrote: A modified version that can filter for unique e-mail addresses while keeping the order of first occurrence.
In addition, you can sort.

#!/usr/bin/env python3
"""
Extracts email addresses from one or more plain text files.
"""
#  {SNIP}

Thanks for that, it works for me.  :)
(Oct-13-2020, 11:02 AM)DeaD_EyE Wrote: Then you call the program:
python3 get_mails.py *.txt -u -s > emails.txt
The shell replaces the *.txt with matching filenames.

That was exactly what I was looking for, thanks. Having the code do the sorting and de-duplication helps, because the next stage is to validate the email addresses. The output contains lines like
Quote:vi1pr07mb607848358ad2c06eed460240a3740@vi1pr07mb8.eurprd07.prod.outlook.com
[email protected]
[email protected]

and of course those broken lines are all through it. I found some Python code to validate emails at https://github.com/karolyi/py3-validate-email, but it seems like overkill for this situation.

It would be great to filter out much of the garbage data first, to reduce the number of queries made over the internet to check whether an address is valid.
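One possible offline pre-filter is sketched below. It drops strings that look like Message-ID values, such as the outlook.com example above, before any network lookup is made. The function names, the 16-character hex run, the 40-character limit and the ".prod.outlook.com" suffix are all assumptions chosen for illustration. They are heuristics and may also discard a few genuine addresses.

```python
import re

# Assumption: a run of 16 or more hex characters suggests a generated ID.
HEX_RUN = re.compile(r"[0-9a-f]{16,}")


def looks_like_message_id(email, max_local_length=40):
    """Heuristic guess that a lower-cased address is really a Message-ID."""
    local, _, domain = email.partition("@")
    return (
        len(local) > max_local_length
        or HEX_RUN.search(local) is not None
        # Assumption: this host suffix mostly shows up in Message-ID headers.
        or domain.endswith(".prod.outlook.com")
    )


def drop_probable_garbage(emails):
    """Keep only the addresses that do not trip the heuristic above."""
    return [email for email in emails if not looks_like_message_id(email)]
```

It could be applied to the list of addresses just before printing, so that only the survivors are passed on to an online validator.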
(Oct-13-2020, 11:02 AM)DeaD_EyE Wrote: To understand this behavior, you can make a small program for testing:
import sys

print(sys.argv)
Then execute the program. I have given it the name args_print.py:
python args_print.py a b c d *
The * is replaced with the files in the current working directory. Hidden files, which start with a dot, are excluded.

Thanks, I have tested that, and it is nice to be able to see what is being passed to the Python program. The email files are all from Claws and have no file extension. How can I process only those and exclude '*.txt' and '*.py'?
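One way to select only the extension-less files is to build the list in Python with pathlib instead of relying on the shell. This is a sketch under the assumption that "no file extension" is the only criterion. The helper name files_without_extension is hypothetical.

```python
from pathlib import Path


def files_without_extension(folder="."):
    """Return sorted paths of regular files in folder that have no suffix."""
    return sorted(
        str(path)
        for path in Path(folder).iterdir()
        # Path(".hidden").suffix is also empty, so dot files are skipped explicitly.
        if path.is_file() and path.suffix == "" and not path.name.startswith(".")
    )


if __name__ == "__main__":
    for name in files_without_extension():
        print(name)
```

The resulting list could be passed to main() in place of args.files. If the mail files are all named with plain numbers, as in the first post, checking path.name.isdigit() would be a stricter alternative.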
#7
The library at https://pypi.org/project/email-validator/ may be useful. It has already caught invalid domains, so many of those 'broken lines' should fail its check.