Python Forum
Building a webcrawler for research (HELP!)
#1
Hi there,

I've got a little problem with my code, as I have never coded in Python before but want to (and have to) for my research project. I want to crawl a website for a set of data: the goal is to gather data from the site and put it into Excel.

Here is my code so far:

import requests 
from bs4 import BeautifulSoup 
 
# Build a crawler for icobench.com that visits the links (sub-pages) of all
# ICOs that have ended at the current point in time and prints their titles.
 
 
def ended_ico_spider(max_pages): 
    page = 1 
    while page <= max_pages: 
        url = "https://icobench.com/icos?&filterBonus=&filterBounty=&filterMvp=&filterTeam=&filterExpert=&" \ 
              "filterSort=&filterCategory=all&filterRating=any&filterStatus=ended&filterPublished=&" \ 
              "filterCountry=any&filterRegistration=0&filterExcludeArea=none&filterPlatform=any&filterCurrency=any&" \ 
              "filterTrading=any&s=&filterStartAfter=&filterEndBefore=0&page= " + str(page) 
        source_code = requests.get(url) 
        plain_text = source_code.text 
        soup = BeautifulSoup(plain_text, "lxml") 
        for link in soup.findAll('a', {'class': 'name'}): 
            href = "https://icobench.com/" + link.get('href') 
            title = link.string 
            print(title)
            get_single_ico_rating(href) 
            get_single_ico_fixed_data(href) 
            get_single_ico_financial_token_info(href) 
            get_single_ico_financial_investment_info(href) 
            # get_single_ico_whitepaper(href) 
        page += 1 
 
    # Fetch the individual data blocks of each sub-page. Fields are named after the HTML code.
 
 
def get_single_ico_rating(single_item_url): 
    source_code = requests.get(single_item_url) 
    plain_text = source_code.text 
    soup = BeautifulSoup(plain_text, "lxml") 
    # Data from the rating box
    for data in soup.findAll('div', {'class': ['rate color1', 'rate color2', 'rate color3', 'rate color4', 
                                               'rate color5', 'col_4 col_3']}): 
        print(data.text)
 
 
def get_single_ico_fixed_data(single_item_url): 
    source_code = requests.get(single_item_url) 
    plain_text = source_code.text 
    soup = BeautifulSoup(plain_text, "lxml") 
    for fixed_data in soup.findAll('div', {'class': 'col_2'}): 
        print(fixed_data.text) 
 
 
def get_single_ico_financial_token_info(single_item_url): 
    source_code = requests.get(single_item_url) 
    plain_text = source_code.text 
    soup = BeautifulSoup(plain_text, "lxml") 
    for financial_token_info in soup.findAll('div', {'class': 'box_left'}): 
        print(financial_token_info.text) 
 
 
def get_single_ico_financial_investment_info(single_item_url): 
    source_code = requests.get(single_item_url) 
    plain_text = source_code.text 
    soup = BeautifulSoup(plain_text, "lxml") 
    for investment_info in soup.findAll('div', {'class': 'box_right'}): 
        print(investment_info.text) 
 
# Here I want to find out whether a whitepaper exists on the respective sub-page. If one
# exists, a value X can be returned; otherwise a value Y.
 
 
# def get_single_ico_whitepaper(href): 
    # source_code = requests.get(href) 
    # plain_text = source_code.text 
    # soup = BeautifulSoup(plain_text, "lxml") 
    # for whitepaper_link in soup.findAll('div', {'class': 'onclick'}): 
        # print(whitepaper_link.text) 
 
 
ended_ico_spider(1) 
Well, there are some parts missing, and I would be glad for any help you could offer me. Here are the points I couldn't solve even though I searched the web for hours (guess I'm just a noob in Python).

1. I need to find out whether every single ICO (sub-page) has a whitepaper or not. As it's an onclick field, I don't know how to search for it and tell whether a whitepaper is there.
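For question 1, one possible sketch: instead of printing, parse the sub-page and return a flag. The selectors below are assumptions (an anchor whose text mentions "whitepaper", plus the 'onclick' div your commented-out code targets); inspect the real page in your browser's developer tools and adjust them.

```python
from bs4 import BeautifulSoup


def has_whitepaper(html):
    """Return 'X' if the page appears to link a whitepaper, else 'Y'.

    Both selectors below are guesses and may need adjusting to the
    actual icobench.com markup.
    """
    soup = BeautifulSoup(html, "html.parser")
    # Guess 1: any anchor whose visible text mentions "whitepaper".
    links = soup.find_all("a", string=lambda s: s and "whitepaper" in s.lower())
    # Guess 2: the 'onclick' div from the commented-out function.
    divs = soup.find_all("div", {"class": "onclick"})
    return "X" if (links or divs) else "Y"
```

You would call it as `has_whitepaper(requests.get(href).text)` inside the crawler loop and store the result alongside the other fields instead of printing it.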

2. Exporting the data to a CSV file (for Excel): the printed output looks kind of messy at the moment; some parts are in rows, others in columns, etc. As you might guess, I need a clean table with each ICO in a separate row and the different elements in different columns, so I can use R or another program for the statistics.
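For question 2, a minimal sketch with the standard-library csv module: collect the scraped fields for each ICO into one list (one list per ICO) instead of printing them, then write everything at once. The column names and row values here are made up for illustration.

```python
import csv

# Hypothetical data: in the crawler, append one such list per ICO.
header = ["name", "rating", "category", "whitepaper"]
rows = [
    ["ExampleICO", "4.5", "Platform", "X"],  # made-up values
    ["OtherICO", "3.1", "Finance", "Y"],
]

# newline="" is required so the csv module controls line endings itself.
with open("icos.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f, delimiter=";")  # ';' opens cleanly in many Excel locales
    writer.writerow(header)
    writer.writerows(rows)
```

With one ICO per row and one field per column, the file also loads directly into R via `read.csv2("icos.csv")`.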


I would be glad for any help! Thanks a lot in advance for any support.
aStudent (in urgent need of help)
#2
Do you get any content after requesting the web page? At first glance, I see that at the end of the address there is a space between the page number and the rest of the query string.
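One way to avoid that kind of bug is to build the query string from a dict instead of concatenating it by hand, e.g. with `urllib.parse.urlencode` (only a few of the many filter parameters are shown here, for illustration):

```python
from urllib.parse import urlencode

# A subset of the crawler's filter parameters, as a dict.
params = {
    "filterStatus": "ended",
    "filterCategory": "all",
    "page": 1,
}
# urlencode escapes values, so no stray whitespace can slip in.
url = "https://icobench.com/icos?" + urlencode(params)
```

With requests you can skip urlencode entirely and pass `requests.get("https://icobench.com/icos", params=params)`, which builds the same query string for you.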
"As they say in Mexico 'dosvidaniya'. That makes two vidaniyas."
https://freedns.afraid.org
#3
(May-31-2018, 09:37 AM)wavic Wrote: Do you get any content after requesting the web page? At first glance, I see that at the end of the address there is a space between the page number and the rest of the query string.

Yes, I already get all the data, except whether there is a whitepaper or not. Even though it prints, the output isn't very "pretty".
The page number is just set to '1' because I'm currently testing whether the code works. Once it does and the output is fine, I will let it read and print everything up to the max page.