Python Forum

Full Version: Building a webcrawler for research (HELP!)
Hi there,

I have a small problem with my code. I have not written Python before, but I need to for my research project.
I want to crawl a website for a set of data. The goal of the project is to gather data from the site and put it into Excel.

Here is my code so far:

import requests 
from bs4 import BeautifulSoup 
# Create a crawler for the site that visits the links (sub-pages) of all ICOs that have ended
# as of now and prints their titles.
def ended_ico_spider(max_pages): 
    page = 1 
    while page <= max_pages: 
        url = "" \ 
              "filterSort=&filterCategory=all&filterRating=any&filterStatus=ended&filterPublished=&" \ 
              "filterCountry=any&filterRegistration=0&filterExcludeArea=none&filterPlatform=any&filterCurrency=any&" \ 
              "filterTrading=any&s=&filterStartAfter=&filterEndBefore=0&page= " + str(page) 
        source_code = requests.get(url) 
        plain_text = source_code.text 
        soup = BeautifulSoup(plain_text, "lxml") 
        for link in soup.findAll('a', {'class': 'name'}): 
            href = "" + link.get('href') 
            title = link.string 
            print (title) 
            # get_single_ico_whitepaper(href) 
        page += 1 
# Fetch the individual data blocks of each sub-page. The fields are named after the HTML code.
def get_single_ico_rating(single_item_url):
    source_code = requests.get(single_item_url)
    plain_text = source_code.text
    soup = BeautifulSoup(plain_text, "lxml")
    # Data from the rating field
    for data in soup.findAll('div', {'class': ['rate color1', 'rate color2', 'rate color3', 'rate color4',
                                               'rate color5', 'col_4 col_3']}):
        print(data.text)  # loop body missing from the post; printing is assumed
def get_single_ico_fixed_data(single_item_url):
    source_code = requests.get(single_item_url)
    plain_text = source_code.text
    soup = BeautifulSoup(plain_text, "lxml")
    for fixed_data in soup.findAll('div', {'class': 'col_2'}):
        print(fixed_data.text)  # loop body missing from the post; printing is assumed
def get_single_ico_financial_token_info(single_item_url):
    source_code = requests.get(single_item_url)
    plain_text = source_code.text
    soup = BeautifulSoup(plain_text, "lxml")
    for financial_token_info in soup.findAll('div', {'class': 'box_left'}):
        print(financial_token_info.text)  # loop body missing from the post; printing is assumed
def get_single_ico_financial_investment_info(single_item_url):
    source_code = requests.get(single_item_url)
    plain_text = source_code.text
    soup = BeautifulSoup(plain_text, "lxml")
    for investment_info in soup.findAll('div', {'class': 'box_right'}):
        print(investment_info.text)  # loop body missing from the post; printing is assumed
# Here I want to find out whether a whitepaper exists on the given sub-page. If one
# exists, a value X can be returned, otherwise a value Y.
# def get_single_ico_whitepaper(href): 
    # source_code = requests.get(href) 
    # plain_text = source_code.text 
    # soup = BeautifulSoup(plain_text, "lxml") 
    # for whitepaper_link in soup.findAll('div', {'class': 'onclick'}): 
        # print(whitepaper_link.text) 
There are still some parts missing, and I would be glad for any help you could offer. Here are the points I couldn't solve even though I searched the web for hours (I guess I'm just a noob in Python):

1. I need to find out whether each ICO (sub-page) has a whitepaper or not. As it's an onclick field, I don't know how to search for it and see whether a whitepaper is there.

2. Exporting the data to a CSV file (for Excel): the printed output looks kind of messy at the moment. Some parts are in lines, others in columns, etc. As you might guess, I need a clean table with each ICO on a separate line and the different elements in different columns, so that I can use R or some other program to do the statistics.
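
For point 2, a minimal sketch of one way to get one ICO per line and one field per column is the standard-library `csv` module. The column names and the sample rows below are made up for illustration; the real ones depend on what the crawler collects. The `clean` helper is likewise hypothetical and only collapses line breaks and repeated spaces, which is often what makes printed output look messy.

```python
import csv
import io

# Hypothetical column names; the real ones depend on what the crawler collects.
FIELDNAMES = ["title", "rating", "token_info", "investment_info", "has_whitepaper"]


def clean(text):
    """Collapse line breaks and repeated spaces so each value fits in one cell."""
    return " ".join(str(text).split())


def write_icos(rows, handle):
    """Write a header line, then one line per ICO with one column per field."""
    writer = csv.DictWriter(handle, fieldnames=FIELDNAMES)
    writer.writeheader()
    for row in rows:
        writer.writerow({name: clean(row.get(name, "")) for name in FIELDNAMES})


# Made-up sample data standing in for scraped values.
sample_rows = [
    {"title": "Example ICO A", "rating": "4.1\n", "token_info": "Token:  AAA",
     "investment_info": "Min. investment 0.1 ETH", "has_whitepaper": True},
    {"title": "Example ICO B", "rating": "3.2", "token_info": "Token: BBB",
     "investment_info": "Min. investment 1 ETH", "has_whitepaper": False},
]

buffer = io.StringIO()
write_icos(sample_rows, buffer)
print(buffer.getvalue())
```

To write a real file instead of an in-memory buffer, the same function can be given `open("icos.csv", "w", newline="", encoding="utf-8")`. For this to work, the helper functions would need to return their values rather than print them, so that the spider can collect one dictionary per ICO.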

I would be glad for any help! Thanks a lot in advance for any support.
aStudent (in urgent need of help)
Do you get any content after requesting the web page? At first glance, I see that at the end of the address there will be a space between the page number and the rest of the address.
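
The space comes from the literal `"page= "` in the URL string. A minimal sketch of one way to avoid it is to let `urllib.parse.urlencode` assemble the query string. The base URL below is a placeholder (the real address is not shown in the post), and only a few of the filter parameters are included to keep the example short.

```python
from urllib.parse import urlencode

# Hypothetical placeholder; the real base URL is not given in the post.
BASE_URL = "https://example.com/icos"


def build_page_url(page, base_url=BASE_URL):
    """Return the listing URL for one result page, without stray whitespace."""
    params = {
        "filterCategory": "all",
        "filterRating": "any",
        "filterStatus": "ended",
        "page": page,
    }
    return base_url + "?" + urlencode(params)


print(build_page_url(1))
```

Alternatively, `requests.get` accepts a `params` dictionary and encodes it itself, so the URL does not have to be built by hand.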
(May-31-2018, 09:37 AM) wavic wrote: Do you get any content after requesting the web page? At first glance, I see that at the end of the address there will be a space between the page number and the rest of the address.

Yes, I already get all the data, except for whether there is a whitepaper or not. Even so, the printed output is not very "pretty".
The page number is just set to '1' because I'm currently testing whether the code works. Once it does and the output is fine, I will let it read and print everything up to the max page.
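
On the whitepaper question: `onclick` is an HTML attribute rather than a class, so a search for the class `'onclick'` is unlikely to match anything. The sketch below looks for elements that carry an `onclick` attribute and checks whether their handler or label mentions a whitepaper. It uses the standard library's `html.parser` instead of BeautifulSoup so that it runs without extra packages; with BeautifulSoup, `soup.find_all(attrs={"onclick": True})` selects the same kind of elements. The HTML snippets are invented, since the real page markup is not shown in the thread, and the "mentions a whitepaper" rule is an assumption about how the site labels the link.

```python
from html.parser import HTMLParser


class WhitepaperFinder(HTMLParser):
    """Sets .found when an element with an onclick attribute mentions a whitepaper."""

    def __init__(self):
        super().__init__()
        self.found = False
        self._inside_onclick = False

    @staticmethod
    def _mentions_whitepaper(text):
        # Treat "White Paper" and "whitepaper" alike.
        return "whitepaper" in text.lower().replace(" ", "")

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if "onclick" in attributes:
            self._inside_onclick = True
            # The attribute value is None when it is written without a value.
            if self._mentions_whitepaper(attributes["onclick"] or ""):
                self.found = True

    def handle_data(self, data):
        if self._inside_onclick and self._mentions_whitepaper(data):
            self.found = True

    def handle_endtag(self, tag):
        # Simplification: any closing tag ends the clickable region.
        self._inside_onclick = False


def has_whitepaper(html):
    """Return True if the page source contains a clickable whitepaper element."""
    finder = WhitepaperFinder()
    finder.feed(html)
    finder.close()
    return finder.found


# Invented markup standing in for two ICO sub-pages.
page_with = '<div class="box"><span onclick="openDocument()">White paper</span></div>'
page_without = '<div class="box"><span onclick="share()">Share</span><p>No documents</p></div>'

print(has_whitepaper(page_with), has_whitepaper(page_without))
```

The `True`/`False` result can stand in for the values X and Y mentioned in the code comments. In the crawler, the function would be called on `requests.get(href).text`. If the site only adds the whitepaper element through JavaScript after the page has loaded, it will not appear in the downloaded HTML, and this check would report `False` for every page.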