Posts: 12
Threads: 4
Joined: Aug 2019
Hi guys.
At my work, I was asked to find a solution for this problem. Basically, I work at a newspaper that wishes to scrape some URLs for data, where the data lies in tables that you can download as CSV. It is so some of the journalists can keep track of certain numbers on certain subjects. The thing is, they are asking for some kind of tool which can scrape the data and where they can also see the data - basically, a tool which is quite user friendly.
I hope I explained myself well enough to be understood. Do you know of any solution for scraping the internet without writing a script in Python?
Have a super awesome day!
//Kasper
Posts: 8,159
Threads: 160
Joined: Sep 2016
It depends a lot on the particular website (i.e. it could be as easy as using Data -> From Web in Excel), your goals, etc.
Also, you can check https://scrapinghub.com/
Posts: 12
Threads: 4
Joined: Aug 2019
I thought about using Excel actually, because that is a program that everybody knows. But I was mostly wondering if there were any other tools out there.
Thank you for your reply!
Posts: 5,151
Threads: 396
Joined: Sep 2016
Oct-10-2019, 06:10 PM
(This post was last modified: Oct-10-2019, 06:11 PM by metulburr.)
Writing a script to scrape some data on a site is not that hard. If you can download the data as csv files, then you don't need to scrape the website at all. Actually, can't you import csv files directly into Excel/OpenOffice Calc?
Of course, you can easily create a Python csv reader and put a GUI on it to make it user friendly.
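For example, a minimal sketch of such a viewer using only the standard library (the data.csv filename is just a placeholder for whatever file the journalists download):

import csv
import tkinter as tk
from tkinter import ttk

CSV_FILE = "data.csv"  # placeholder - point this at the downloaded csv

def main():
    # Read the whole csv into memory; the first row is assumed to be a header
    with open(CSV_FILE, newline="", encoding="utf-8") as f:
        rows = list(csv.reader(f))
    header, data = rows[0], rows[1:]

    # Show the rows in a simple spreadsheet-like table
    root = tk.Tk()
    root.title("CSV Viewer")
    tree = ttk.Treeview(root, columns=header, show="headings")
    for col in header:
        tree.heading(col, text=col)
    for row in data:
        tree.insert("", tk.END, values=row)
    tree.pack(fill=tk.BOTH, expand=True)
    root.mainloop()

if __name__ == "__main__":
    main()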
Posts: 7,317
Threads: 123
Joined: Sep 2016
Oct-10-2019, 07:03 PM
(This post was last modified: Oct-10-2019, 07:03 PM by snippsat.)
(Oct-10-2019, 01:10 PM)kasper1903 Wrote: wishes to scrape some URLs for data, where the data lies in tables
One of the simplest ways is to use Pandas.
It will find any table on a website, and you can easily write it to Excel with df.to_excel().
import pandas as pd

# read_html() scans the page for <table> elements and returns a list of
# DataFrames; 'match' keeps only tables whose text matches the given string
wiki_timeline = pd.read_html('https://en.wikipedia.org/wiki/Timeline_of_programming_languages', match='Guido Van Rossum')
wiki_timeline[0].tail()
[Image: screenshot of the resulting DataFrame output]
Here I use JupyterLab; the display looks just like it would in Excel.
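If you want the df.to_excel() step as well, a rough sketch could look like this (the timeline.xlsx filename is arbitrary, and writing .xlsx requires the openpyxl package to be installed):

import pandas as pd

# read_html() returns a list of DataFrames, one per matching table on the page
tables = pd.read_html('https://en.wikipedia.org/wiki/Timeline_of_programming_languages', match='Guido Van Rossum')

# Write the first matching table to a file the journalists can open in Excel
tables[0].to_excel('timeline.xlsx', index=False)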
Posts: 12
Threads: 4
Joined: Aug 2019
I totally agree! The problem is that the constraint was that they needed a tool they could change things in themselves. Hence, I can't write my own script or use pandas - that is why I will probably try out Excel, since that is something everybody knows.
Did not know of JupyterLab. I need to check that out for sure!
Thank you!
Posts: 12
Threads: 4
Joined: Aug 2019
Oct-11-2019, 09:56 AM
(This post was last modified: Oct-11-2019, 09:57 AM by kasper1903.)
Hi guys! Me again. I hope you can help me with something. As I was trying to scrape a specific url, it doesn't recognize any tables. Looking into the html, I can see that divs have been used a lot, and it seems like it cannot understand the structure of the raw data. I also tried with pandas, where I get the error "no tables found".
I tried both of snippsat's suggestions, as well as:
import pandas as pd
import requests  # needed for the requests.get() call below
from bs4 import BeautifulSoup

url = "https://www.sdk.dk/sdkbrugt/#/"
df = pd.read_html(url)
pd.read_html(requests.get(url).text)
Do you know how to solve the problem?
The url I am using is: http://www.traktor-hostspecialisten.dk/b...er.html/#/ (You probably won't be able to understand the language ;) )
Posts: 5,151
Threads: 396
Joined: Sep 2016
That is because that site is using ul/li instead of tables.
But I am unsure of how to use pandas to obtain that.
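A rough sketch of how you might collect the ul/li listing with BeautifulSoup instead and put it into a DataFrame; the CSS selector is only a guess and has to be replaced with the real structure you see when inspecting the page's html:

import pandas as pd
import requests
from bs4 import BeautifulSoup

url = "https://www.sdk.dk/sdkbrugt/#/"
soup = BeautifulSoup(requests.get(url).text, "html.parser")

records = []
# Hypothetical selector - replace "ul.product-list > li" with the tags and
# classes the site actually uses for its listing
for li in soup.select("ul.product-list > li"):
    records.append(li.get_text(" ", strip=True))

df = pd.DataFrame(records, columns=["listing"])
print(df.head())

Note that if the listing is rendered by JavaScript (the /#/ in the url hints at that), requests will only see the empty page shell, and a browser-driving tool like Selenium would be needed instead.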
Posts: 12
Threads: 4
Joined: Aug 2019
Oct-11-2019, 12:58 PM
(This post was last modified: Oct-11-2019, 12:58 PM by kasper1903.)
Well, I got one step closer. I expected it to be something like that.
- Does anybody know how to work around the ul/li structure?
But if you look at this url: http://semleragro.dk/brugte-maskiner/brugte-maskiner/
- It is structured with tables, but the same error occurs.
Thank you for the help guys. Each answer has given me new knowledge! :)
Posts: 5,151
Threads: 396
Joined: Sep 2016
That site has numerous tables, so you might be getting a different one from the one you expect.
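For instance, a quick way to see what read_html actually returned and pick the right one (the "Traktor" match string is only a guess at text that appears in the table you want):

import pandas as pd

url = "http://semleragro.dk/brugte-maskiner/brugte-maskiner/"
tables = pd.read_html(url)

# See how many tables the page really contains and how big each one is
print(f"Found {len(tables)} tables")
for i, df in enumerate(tables):
    print(i, df.shape)

# Alternatively, let pandas keep only tables containing a given string
machines = pd.read_html(url, match="Traktor")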