Python Forum
Parsing Attached .MSG Files with Python3
#1
I'm trying to monitor a phishing inbox that can receive both normal emails (i.e. HTML/text based, possibly with attachments) and emails that have a .MSG file attached.

The goal is to have users send emails to [email protected]. Once I parse out the various links (potentially malicious) and attachments (also potentially malicious), I'll perform some analysis on them.

The issue I'm running into is the body of the .msg file that is attached.

With the code below, I'm able to pull the To, From, Subject, and all links from the original email. It also pulls down any attachments within the .msg file (e.g. in my test I was able to pull down a PDF inside the .msg). However, I cannot get the To, From, Subject, or body of the .msg file itself.

When I print the raw message I can see some of that information in a very ugly format, so I'm apparently doing something wrong when handling the multipart structure.

I'm fairly new to Python so any help would be greatly appreciated.
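
One way to see where the attached message sits is to print the MIME tree of the parsed email. The sketch below is only an illustration and not part of the original post: `describe_structure` and the `SAMPLE` message are hypothetical, and it assumes the attachment arrives as a `message/rfc822` part. An Outlook-format .msg binary is not MIME and would instead show up as a single opaque attachment.

```python
import email

# Hypothetical sample: a report email with the suspicious message attached.
SAMPLE = (
    b"From: Reporter <reporter@example.com>\r\n"
    b"Subject: Fwd: suspicious\r\n"
    b"MIME-Version: 1.0\r\n"
    b'Content-Type: multipart/mixed; boundary="OUTER"\r\n'
    b"\r\n"
    b"--OUTER\r\n"
    b"Content-Type: text/plain\r\n"
    b"\r\n"
    b"See attached.\r\n"
    b"--OUTER\r\n"
    b"Content-Type: message/rfc822\r\n"
    b'Content-Disposition: attachment; filename="original.msg"\r\n'
    b"\r\n"
    b"From: Mallory <mallory@example.net>\r\n"
    b"Subject: Your invoice\r\n"
    b"Content-Type: text/html\r\n"
    b"\r\n"
    b'<a href="http://example.net/x">pay</a>\r\n'
    b"--OUTER--\r\n"
)


def describe_structure(msg, depth=0):
    """Return one indented content-type line per MIME part, outermost first."""
    lines = ["  " * depth + msg.get_content_type()]
    # is_multipart() is also true for message/rfc822, whose payload is a
    # one-element list holding the attached message.
    if msg.is_multipart():
        for sub in msg.get_payload():
            lines.extend(describe_structure(sub, depth + 1))
    return lines


if __name__ == "__main__":
    parsed = email.message_from_bytes(SAMPLE)
    print("\n".join(describe_structure(parsed)))
```

The nesting makes it visible that the attached message is a child of the `message/rfc822` part, so its headers and body live one level deeper than the outer email's.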

import imaplib
import os
import email
import email.header  # decode_header / make_header are used below
from bs4 import BeautifulSoup

server = 'mail.server.com'
email_user = '[email protected]'
email_pass = 'XXXXXXXXXXXX'
output_dir = '/tmp/attachments/'
body = ""

def get_body(msg):
    if msg.is_multipart():
        return get_body(msg.get_payload(0))
    else:
        return msg.get_payload(None, True)

def get_attachments(msg):
    for part in msg.walk():
        if part.get_content_maintype()=='multipart':
            continue
        if part.get('Content-Disposition') is None:
            continue
        fileName = part.get_filename()

        if bool(fileName):
            filePath = os.path.join(output_dir, fileName)
            with open(filePath,'wb') as f:
                f.write(part.get_payload(decode=True))

mail = imaplib.IMAP4_SSL(server)
mail.login(email_user, email_pass)
mail.select('INBOX')

result, data = mail.search(None, 'UNSEEN')
mail_ids = data[0]
id_list = mail_ids.split()
print(id_list)

for emailid in id_list:
    result, email_data = mail.fetch(emailid, '(RFC822)')
    raw_email = email_data[0][1]
    raw_email_string = raw_email.decode('utf-8')
    email_message = email.message_from_string(raw_email_string)
    email_from = str(email.header.make_header(email.header.decode_header(email_message['From'])))
    email_to = str(email.header.make_header(email.header.decode_header(email_message['To'])))
    subject = str(email.header.make_header(email.header.decode_header(email_message['Subject'])))
    print('From: ' + email_from)
    print('To: ' + email_to)
    print('Subject: ' + subject)
	
    # Pass the parsed message; the raw bytes object has no walk() method.
    get_attachments(email_message)

    for part in email_message.walk():
        body = part.get_payload(0)
        content = body.get_payload(decode=True)
        soup = BeautifulSoup(content, 'html.parser')
        for link in soup.find_all('a'):
            print('Link: ' + link.get('href'))
        break
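
The attached message's own To, From, and Subject can be read from the parsed object rather than from the raw text. What follows is a hedged sketch, assuming the attachment is delivered as a `message/rfc822` part; `attached_messages`, `summarize`, and `SAMPLE` are illustrative names and data that do not appear in the post. A true Outlook .msg file (OLE format) is not MIME, so the standard `email` package would not parse it and a separate parser would be needed.

```python
import email
from email import policy
from email.utils import parseaddr

# Hypothetical sample: a report email with the suspicious message attached.
SAMPLE = (
    b"From: Reporter <reporter@example.com>\r\n"
    b"To: Phish Box <phish@example.com>\r\n"
    b"Subject: Fwd: suspicious\r\n"
    b"MIME-Version: 1.0\r\n"
    b'Content-Type: multipart/mixed; boundary="OUTER"\r\n'
    b"\r\n"
    b"--OUTER\r\n"
    b"Content-Type: text/plain\r\n"
    b"\r\n"
    b"See attached.\r\n"
    b"--OUTER\r\n"
    b"Content-Type: message/rfc822\r\n"
    b'Content-Disposition: attachment; filename="original.msg"\r\n'
    b"\r\n"
    b"From: Mallory <mallory@example.net>\r\n"
    b"To: Victim <victim@example.com>\r\n"
    b"Subject: Your invoice\r\n"
    b"Content-Type: text/plain\r\n"
    b"\r\n"
    b"Please pay.\r\n"
    b"--OUTER--\r\n"
)


def attached_messages(msg):
    """Yield each message that is attached as a message/rfc822 part."""
    for part in msg.walk():
        if part.get_content_type() == "message/rfc822":
            # The payload of a message/rfc822 part is a one-element list
            # holding the attached message as a parsed object.
            yield part.get_payload(0)


def summarize(msg):
    """Return the bare addresses and subject of a parsed message."""
    return {
        "from": parseaddr(str(msg.get("From", "")))[1],
        "to": parseaddr(str(msg.get("To", "")))[1],
        "subject": str(msg.get("Subject", "")),
    }


if __name__ == "__main__":
    outer = email.message_from_bytes(SAMPLE, policy=policy.default)
    print("outer:", summarize(outer))
    for inner in attached_messages(outer):
        print("attached:", summarize(inner))
```

Parsing with `policy.default` is a design choice here: it returns decoded header values, so encoded subjects do not need a separate `decode_header` step.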
#2
I got this working with the following code. I had to loop over the parts returned by walk(), handle the attached message (the message/rfc822 part) separately, and pull links only from the text/html sections.

import re  # needed for the regex searches below

for emailid in id_list:
    result, data = mail.fetch(emailid, '(RFC822)')
    raw = email.message_from_bytes(data[0][1])
    get_attachments(raw)
#    print(raw)

    header_from = mail.fetch(emailid, "(BODY[HEADER.FIELDS (FROM)])")
    header_from_str = str(header_from)
    mail_from = re.search(r'From:\s.+<(\S+)>', header_from_str)

    header_subject = mail.fetch(emailid, "(BODY[HEADER.FIELDS (SUBJECT)])")
    header_subject_str = str(header_subject)
    mail_subject = re.search(r"Subject:\s(.+)'\)", header_subject_str)
    #mail_body = mail.fetch(emailid, "(BODY[TEXT])")
    print(mail_from.group(1))
    print(mail_subject.group(1))


    for part in raw.walk():
        if part.get_content_type() == 'message/rfc822':
            part_string = str(part)
            original_from = re.search(r'From:\s.+<(\S+)>\n', part_string)
            original_to = re.search(r'To:\s.+<(\S+)>\n', part_string)
            original_subject = re.search(r'Subject:\s(.+)\n', part_string)
            print(original_from.group(1))
            print(original_to.group(1))
            print(original_subject.group(1))
        if part.get_content_type() == 'text/html':
            content = part.get_payload(decode=True)
            #print(content)
            soup = BeautifulSoup(content, 'html.parser')
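            # href=True skips anchors without an href, which would otherwise
            # make the string concatenation fail on None.
            for link in soup.find_all('a', href=True):
                print('Link: ' + link['href'])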
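
The link extraction step can also be sketched without third-party packages. This is an illustration only: it swaps BeautifulSoup for the standard library's `html.parser`, and `LinkCollector`, `html_links`, and `SAMPLE` are hypothetical names and data. It skips anchors that have no `href` and decodes each part with its declared charset, falling back to UTF-8, which is an assumption.

```python
import email
from html.parser import HTMLParser

# Hypothetical sample: an HTML report email with an HTML message attached.
SAMPLE = (
    b"Subject: Fwd: suspicious\r\n"
    b"MIME-Version: 1.0\r\n"
    b'Content-Type: multipart/mixed; boundary="OUTER"\r\n'
    b"\r\n"
    b"--OUTER\r\n"
    b"Content-Type: text/html; charset=utf-8\r\n"
    b"\r\n"
    b'<p>See <a href="http://example.com/outer">this</a></p>\r\n'
    b"--OUTER\r\n"
    b"Content-Type: message/rfc822\r\n"
    b"\r\n"
    b"Subject: Your invoice\r\n"
    b"Content-Type: text/html\r\n"
    b"\r\n"
    b'<a name="top">top</a><a href="http://example.net/login">pay</a>\r\n'
    b"--OUTER--\r\n"
)


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag that has one."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:  # anchors without an href are skipped
                self.links.append(href)


def html_links(msg):
    """Return the links found in every text/html part, in walk() order."""
    links = []
    for part in msg.walk():  # walk() also descends into attached messages
        if part.get_content_type() != "text/html":
            continue
        payload = part.get_payload(decode=True)
        if payload is None:
            continue
        charset = part.get_content_charset() or "utf-8"
        try:
            text = payload.decode(charset, errors="replace")
        except LookupError:  # unknown charset name declared by the sender
            text = payload.decode("utf-8", errors="replace")
        collector = LinkCollector()
        collector.feed(text)
        links.extend(collector.links)
    return links


if __name__ == "__main__":
    for link in html_links(email.message_from_bytes(SAMPLE)):
        print("Link: " + link)
```

Because `walk()` visits the parts of the attached message too, one pass collects links from both the outer email and the attachment.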