Parsing Attached .MSG Files with Python3

ericl42 · Apr-10-2019, 02:08 PM

I'm trying to monitor a phishing inbox that can receive both normal emails (i.e. HTML/text based, with potential attachments) and emails that have a .MSG file attached.

The goal is to have users send emails to [email protected]. Once I parse out the various links (potentially malicious) and attachments (also potentially malicious), I'll perform some analysis on them.

The issue I'm running into is with the body of the attached .msg file.

With the code below, I'm able to pull the to, from, subject, and all links from the original email. It also pulls down any attachments within the .msg file (e.g. in my test I was able to pull down a PDF that was inside the .msg). However, I cannot get the to, from, subject, or body of the .msg file itself.

When I print the raw message I can see some of that information in a very ugly format, so I'm apparently handling the multipart structure incorrectly when trying to extract it.

I'm fairly new to Python so any help would be greatly appreciated.

import imaplib
import base64
import os
import email
import email.header  # imported explicitly because email.header.* is used below
from bs4 import BeautifulSoup

server = 'mail.server.com'
email_user = '[email protected]'
email_pass = 'XXXXXXXXXXXX'
output_dir = '/tmp/attachments/'
body = ""

def get_body(msg):
    if msg.is_multipart():
        return get_body(msg.get_payload(0))
    else:
        return msg.get_payload(None, True)

def get_attachments(msg):
    for part in msg.walk():
        if part.get_content_maintype()=='multipart':
            continue
        if part.get('Content-Disposition') is None:
            continue
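        fileName = part.get_filename()
        payload = part.get_payload(decode=True)

        # Skip parts without a filename or without a decodable payload
        # (get_payload(decode=True) returns None for an attached message).
        if fileName and payload is not None:
            # The sender controls the filename, so drop any directory components.
            filePath = os.path.join(output_dir, os.path.basename(fileName))
            with open(filePath,'wb') as f:
                f.write(payload)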

mail = imaplib.IMAP4_SSL(server)
mail.login(email_user, email_pass)
mail.select('INBOX')

result, data = mail.search(None, 'UNSEEN')
mail_ids = data[0]
id_list = mail_ids.split()
print(id_list)

for emailid in id_list:
    result, email_data = mail.fetch(emailid, '(RFC822)')
    raw_email = email_data[0][1]
    # Parse the raw bytes directly; decoding as UTF-8 first can fail on other charsets.
    email_message = email.message_from_bytes(raw_email)
    email_from = str(email.header.make_header(email.header.decode_header(email_message['From'])))
    email_to = str(email.header.make_header(email.header.decode_header(email_message['To'])))
    subject = str(email.header.make_header(email.header.decode_header(email_message['Subject'])))
    print('From: ' + email_from)
    print('To: ' + email_to)
    print('Subject: ' + subject)

    # get_attachments() expects the parsed message, not the raw bytes.
    get_attachments(email_message)

    # Visit every part (including parts inside an attached message)
    # and pull the links out of each HTML body.
    for part in email_message.walk():
        if part.get_content_type() != 'text/html':
            continue
        content = part.get_payload(decode=True)
        soup = BeautifulSoup(content, 'html.parser')
        # href=True skips anchors that have no href attribute.
        for link in soup.find_all('a', href=True):
            print('Link: ' + link['href'])
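
A note on the multipart structure mentioned in the question: before extracting anything, it can help to print the MIME tree and see where the attached message actually sits. The following is a minimal sketch using only the standard library. The helper name describe_parts and the sample addresses are made up for illustration, and the sample message is built locally rather than fetched over IMAP.

```python
import email
from email.message import EmailMessage


def describe_parts(msg, depth=0):
    """Return one indented line per MIME part, parents before children."""
    lines = ['  ' * depth + msg.get_content_type()]
    if msg.is_multipart():
        # For multipart/* and message/rfc822 parts the payload is a list of messages.
        for child in msg.get_payload():
            lines.extend(describe_parts(child, depth + 1))
    return lines


# Hypothetical sample: a reported email with the suspicious message attached.
inner = EmailMessage()
inner['From'] = 'Attacker <attacker@example.com>'
inner['To'] = 'Victim <victim@example.com>'
inner['Subject'] = 'Invoice overdue'
inner.set_content('<a href="http://example.com/pay">Pay now</a>', subtype='html')

outer = EmailMessage()
outer['From'] = 'Victim <victim@example.com>'
outer['To'] = 'Phishing Inbox <phishing@example.com>'
outer['Subject'] = 'FW: suspicious email'
outer.set_content('Please check the attached message.')
outer.add_attachment(inner)  # stored as a message/rfc822 part

# Round-trip through bytes to mimic what an IMAP fetch would return.
parsed = email.message_from_bytes(outer.as_bytes())
tree = describe_parts(parsed)
print('\n'.join(tree))
```

In this sample the attached message appears as a message/rfc822 part with its own HTML body nested one level deeper, which is why taking only get_payload(0) of the outer message never reaches it.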

ericl42 · Apr-12-2019, 06:28 PM

I got this working with the following code. I basically had to walk all the parts of the message, read the original headers from the message/rfc822 part, and pull the links only out of the text/html sections.

import re  # needed for the searches below, in addition to the imports in the first post

for emailid in id_list:
    result, data = mail.fetch(emailid, '(RFC822)')
    raw = email.message_from_bytes(data[0][1])
    get_attachments(raw)
#    print(raw)

    # Headers of the outer (reporting) email, fetched separately from the server.
    header_from = mail.fetch(emailid, "(BODY[HEADER.FIELDS (FROM)])")
    header_from_str = str(header_from)
    mail_from = re.search(r'From:\s.+<(\S+)>', header_from_str)

    header_subject = mail.fetch(emailid, "(BODY[HEADER.FIELDS (SUBJECT)])")
    header_subject_str = str(header_subject)
    mail_subject = re.search(r"Subject:\s(.+)'\)", header_subject_str)
    #mail_body = mail.fetch(emailid, "(BODY[TEXT])")
    # re.search() returns None when a header does not match the pattern.
    if mail_from:
        print(mail_from.group(1))
    if mail_subject:
        print(mail_subject.group(1))

    for part in raw.walk():
        # The attached message shows up as a message/rfc822 part.
        if part.get_content_type() == 'message/rfc822':
            part_string = str(part)
            original_from = re.search(r'From:\s.+<(\S+)>\n', part_string)
            original_to = re.search(r'To:\s.+<(\S+)>\n', part_string)
            original_subject = re.search(r'Subject:\s(.+)\n', part_string)
            for match in (original_from, original_to, original_subject):
                if match:
                    print(match.group(1))
        # walk() also descends into the attached message, so its HTML body is seen here.
        if part.get_content_type() == 'text/html':
            content = part.get_payload(decode=True)
            #print(content)
            soup = BeautifulSoup(content, 'html.parser')
            # href=True skips anchors that have no href attribute.
            for link in soup.find_all('a', href=True):
                print('Link: ' + link['href'])
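
As a possible alternative to running regular expressions over str(part), the headers of an attached message can be read through the email package itself, since a message/rfc822 part carries the inner message as its first payload item. The sketch below is only an illustration under some assumptions: the helper names header_text and attached_message_headers and the sample addresses are made up, and it only covers attachments that arrive as message/rfc822. A native Outlook .msg file is an OLE compound file that the standard library does not parse, so it would need a separate tool.

```python
import email
from email.header import decode_header, make_header
from email.message import EmailMessage


def header_text(msg, name):
    """Return a header as readable text, or '' if the header is missing."""
    value = msg.get(name)
    if value is None:
        return ''
    return str(make_header(decode_header(value)))


def attached_message_headers(msg):
    """Collect From/To/Subject from every attached message/rfc822 part."""
    results = []
    for part in msg.walk():
        if part.get_content_type() == 'message/rfc822':
            inner = part.get_payload(0)  # the attached message itself
            results.append({name: header_text(inner, name)
                            for name in ('From', 'To', 'Subject')})
    return results


# Hypothetical sample built locally instead of fetched over IMAP.
inner = EmailMessage()
inner['From'] = 'Attacker <attacker@example.com>'
inner['To'] = 'Victim <victim@example.com>'
inner['Subject'] = 'Invoice overdue'
inner.set_content('<a href="http://example.com/pay">Pay now</a>', subtype='html')

outer = EmailMessage()
outer['From'] = 'Victim <victim@example.com>'
outer['To'] = 'Phishing Inbox <phishing@example.com>'
outer['Subject'] = 'FW: suspicious email'
outer.set_content('Please check the attached message.')
outer.add_attachment(inner)

parsed = email.message_from_bytes(outer.as_bytes())
found = attached_message_headers(parsed)
for headers in found:
    print(headers)
```

Reading the headers this way avoids depending on the exact text layout of the message, and a missing header yields an empty string instead of a failed match.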
