Python Forum
Email crawler in Python
#1
Hello. I have a piece of Python code that is part of a larger program. It takes a URL from the user, searches that page and every page it links to (a depth of 2), and extracts the email addresses it finds. I would like to remove the depth limit so that the search follows all links and subdomains reachable from the given URL, without restrictions. Could you please guide me and share the modified code?



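import re
import urllib.request

from bs4 import BeautifulSoup

# Compiled once and reused for every page. The original pattern ended in
# {2,4}, which misses addresses on longer TLDs such as .online or .museum.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def extractUrl(url):
    print("Searching, please wait...")
    print("This operation may take several minutes")
    try:
        count = 0
        listUrl = []  # despite the name, holds the emails already reported

        # Depth 1: the page the user entered.
        conn = urllib.request.urlopen(url)
        html = conn.read().decode('utf-8')

        emails = EMAIL_RE.findall(html)
        print("Searching in " + url)

        for email in emails:
            if email not in listUrl:
                count += 1
                print(str(count) + " - " + email)
                listUrl.append(email)

        soup = BeautifulSoup(html, "lxml")
        links = soup.find_all('a')

        # Depth 2: every absolute link found on the first page.
        for tag in links:
            link = tag.get('href', None)
            if link is not None:
                try:
                    print("Searching in " + link)
                    if link[0:4] == 'http':
                        f = urllib.request.urlopen(link)
                        s = f.read().decode('utf-8')
                        emails = EMAIL_RE.findall(s)
                        for email in emails:
                            if email not in listUrl:
                                count += 1
                                print(str(count) + " - " + email)
                                listUrl.append(email)
                                # searchEmail and insertEmail belong to the larger
                                # program and are not shown in this post.
                                if searchEmail("EmailCrawler.db", email, "Especific Search") == 0:
                                    insertEmail("EmailCrawler.db", email, "Especific Search", url)
                except Exception as e:
                    # The posted code is cut off before its handlers; this
                    # minimal one is an assumption so the function parses.
                    print("Could not read " + link + ": " + str(e))
    except Exception as e:
        # Assumed handler as well, for the same reason.
        print("Could not read " + url + ": " + str(e))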
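
One possible way to drop the depth limit is to replace the two nested passes with a breadth-first crawl: keep a queue of pages still to visit and a set of URLs already seen, so the crawl can go as deep as the site goes without looping forever. The following is only a sketch under some assumptions, not a drop-in replacement. The names crawl_emails, in_scope, LinkParser, fetch and max_pages are made up for this example. It uses the standard library's html.parser instead of BeautifulSoup so that it has no third-party dependency, it treats "all subdomains" as any host that ends with the host of the start URL, and it leaves out the searchEmail/insertEmail database calls because those functions are not shown in the post.

```python
import re
import urllib.request
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse

EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


class LinkParser(HTMLParser):
    """Collects the href value of every <a> tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def fetch(url):
    """Download a page as text; undecodable bytes are replaced."""
    with urllib.request.urlopen(url, timeout=10) as conn:
        return conn.read().decode("utf-8", errors="replace")


def in_scope(url, root_host):
    """True for http(s) URLs on root_host or on one of its subdomains."""
    parts = urlparse(url)
    host = parts.hostname or ""
    return parts.scheme in ("http", "https") and (
        host == root_host or host.endswith("." + root_host)
    )


def crawl_emails(start_url, fetch_page=fetch, max_pages=None):
    """Breadth-first crawl from start_url; returns the set of emails found.

    max_pages=None means no limit; pass a number as a safety cap.
    """
    root_host = urlparse(start_url).hostname or ""
    seen = {start_url}          # URLs already queued, so none is visited twice
    queue = deque([start_url])  # pages still waiting to be fetched
    emails = set()
    pages = 0
    while queue and (max_pages is None or pages < max_pages):
        url = queue.popleft()
        pages += 1
        try:
            html = fetch_page(url)
        except Exception as exc:  # broad on purpose: skip pages that fail
            print("Skipping " + url + ": " + str(exc))
            continue
        emails.update(EMAIL_RE.findall(html))
        parser = LinkParser()
        parser.feed(html)
        for href in parser.hrefs:
            # Resolve relative links and drop any "#fragment" part.
            link = urldefrag(urljoin(url, href)).url
            if link not in seen and in_scope(link, root_host):
                seen.add(link)
                queue.append(link)
    return emails
```

Calling crawl_emails(url) with the URL entered by the user would then return every address found on the reachable pages of that host and its subdomains. If the user enters www.example.com, only hosts under www.example.com count as in scope; working out the registrable domain (example.com) reliably would need a public-suffix list, which this sketch does not attempt. On a large site an unlimited crawl can run for a very long time, so a max_pages cap and a short delay between requests are probably worth considering.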