Python Forum

URL regex missing out URL
Hi Guys,

I cannot figure this issue out. I'm downloading emails via POP3:

code:

import email
import poplib
import re
import traceback

def pop3_downloader(username, password, pop3server, port, use_ssl):
    try:
        # use_ssl is passed as the string "yes" or "no"
        if use_ssl == "no":
            server = poplib.POP3(pop3server, port)
        elif use_ssl == "yes":
            server = poplib.POP3_SSL(pop3server, port)
        else:
            raise ValueError("use_ssl must be 'yes' or 'no'")

        server.user(username)
        server.pass_(password)
        numMessages = len(server.list()[1])

        print("--> # Of Messages: " + str(numMessages))

        email_container = []
        for i in range(numMessages):
            (server_msg, body, octets) = server.retr(i+1)
            # body is a list of raw lines (bytes); each line is parsed on its own
            for j in body:
                try:
                    msg = email.message_from_string(j.decode("utf-8"))
                    email_body = msg.get_payload()
                    email_extract_urls = re.findall(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', email_body)
                    if len(email_extract_urls) > 0:
                        activation_links = "/activate/|registration.activate&token="
                        #if any(s in email_extract_urls for s in activation_links.split('|')):
                        email_container.append(email_extract_urls)
                except Exception:
                    # lines that fail to decode or parse are skipped silently
                    pass
            #server.dele(i+1)
        server.quit()
        return email_container

    except Exception:
        # print_exception() was not defined in the snippet; traceback.print_exc() is assumed here
        traceback.print_exc()

This works. A few emails contain:

https://www.site1.com/
https://www.site1.com/wp-login.php?wfls-email-verification=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJ1c2VyIjp7ImRhdGEiOiJCdlZZNHVWXC9Id2tQYStjR2RGOXRYUT09IiwiaXYiOiJmeDVlY1wvbXBib0I0M1VkMlcrb09EUT09In0sIl9leHAiOjE1NjQyNTM5MTV9.ZOpt4jXq5NHdecYygh0EnX5G5v8EMkSMuM2zhuPExmg
Those two, in that order, are extracted fine. Other emails contain:

http://site2.com/
http://site2.com/index.php?option=com_users&task=registration.activate&token=xxxxxxxxxxxxxxxxxx <-- not being extracted
In this case it only extracts the first URL. The second one (marked above) is always missed, and I cannot see why.
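
One possible cause, offered as a sketch and not confirmed against the actual mailbox: the loop parses every raw line as if it were a complete email. When a line starts directly with `http:` and has no space before the colon, the email parser can read it as a header named `http`, so `get_payload()` comes back empty and the regex has nothing to search. A URL that follows other text on its line (for example `Visit http://site2.com/`) would not be treated as a header, which may be why the first URL is found. The snippet below shows that effect and then parses the whole message once. The names `URL_RE`, `extract_urls`, and `sample`, the simplified pattern, and the sample message text are all assumptions made for illustration.

```python
import email
import re

# Hypothetical, simplified pattern (not the regex from the post):
# match http/https up to whitespace, angle brackets, or quotes.
URL_RE = re.compile(r"https?://[^\s<>\"']+")

# 1) A line that begins with "http:" parsed on its own: the parser takes
#    "http" as a header name, so the body (payload) ends up empty.
line = "http://site2.com/index.php?option=com_users&task=registration.activate&token=abc123"
as_msg = email.message_from_string(line)
print(repr(as_msg.get_payload()))  # the payload of the one-line "message"
print(as_msg["http"])              # the rest of the URL, stored as a header value


# 2) Sketch of an alternative: join the lines from retr() and parse once.
def extract_urls(raw_lines):
    """raw_lines: list of bytes lines, as returned in retr()'s second field."""
    msg = email.message_from_bytes(b"\r\n".join(raw_lines))
    urls = []
    for part in msg.walk():
        # Only look inside text parts (text/plain, text/html).
        if part.get_content_maintype() != "text":
            continue
        # decode=True undoes quoted-printable / base64 transfer encoding.
        payload = part.get_payload(decode=True)
        if payload is None:
            continue
        charset = part.get_content_charset() or "utf-8"
        text = payload.decode(charset, errors="replace")
        urls.extend(URL_RE.findall(text))
    return urls


# Made-up sample message standing in for what the server might return.
sample = [
    b"Subject: Activate your account",
    b"Content-Type: text/plain; charset=utf-8",
    b"",
    b"Visit http://site2.com/ or activate here:",
    b"http://site2.com/index.php?option=com_users&task=registration.activate&token=abc123",
]
found = extract_urls(sample)
print(found)
```

In the original function this would mean calling something like `extract_urls(body)` once per message instead of looping over `body` line by line. Parsing the whole message also lets the email package undo quoted-printable encoding, which can otherwise split long URLs across lines.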

Any help would be appreciated!

Regards,

Graham