Python Forum
Scraping Data from Website
#1
Hi, I am trying to learn how to scrape data from a website with Python. I tried to extract this list (List of largest companies by revenue - Wikipedia), but the result has 60 columns instead of 8. I attached a picture showing where I am confused: 'USD millions' should be the last column, but the columns continue with 1, 2, 3… I have also added the code. How should I fix it?
Here is the code:
from bs4 import BeautifulSoup
import requests

url = 'https://en.wikipedia.org/wiki/List_of_largest_companies_by_revenue'
page = requests.get(url)


soup = BeautifulSoup(page.text, 'html')


print(soup)

soup.find_all('table')




soup.find('table', class_ = 'wikitable sortable ')




table = soup.find_all('table')[1]

print(table)

world_titles = table.find_all('th')

world_titles

world_table_titles = [title.text.strip() for title in world_titles]
print ( world_table_titles)

import pandas as pd

df = pd.DataFrame(columns = world_table_titles)
df

column_data = table.find_all('tr')

for row in column_data[2:]: # the empty row at the start is gone
    row_data = row.find_all('td')
    individual_row_Data = [data.text.strip() for data in row_data]
    length = len(df)
    df.loc[length] = individual_row_Data

df
[Image: Whats-App-Image-2023-09-21-at-16-57-48.jpg]
snippsat wrote Sep-21-2023, 02:58 PM:
Added code tags to your post; look at BBCode to see how to use them.
#2
Pandas can read an HTML table straight into a DataFrame.
Some cleanup is needed because of how the Revenue and Profit columns are formatted.
import numpy as np
import pandas as pd
pd.set_option('display.expand_frame_repr', False)

df = pd.read_html('https://en.wikipedia.org/wiki/List_of_largest_companies_by_revenue')[0]
df.columns = df.columns.get_level_values(1)
df.columns = ['Rank', 'Name', 'Industry', 'Revenue', 'Profit', 'Employees', 'Headquarters[note 1]', 'State-owned', 'Ref.'] 
Test.
>>> df.head()
   Rank                             Name     Industry   Revenue    Profit  Employees Headquarters[note 1]  State-owned    Ref.
0     1                          Walmart       Retail  $611,289   $11,680    2100000        United States          NaN     [1]
1     2                     Saudi Aramco  Oil and gas  $603,651  $159,069      70496         Saudi Arabia          NaN     [4]
2     3  State Grid Corporation of China  Electricity  $530,009    $8,192     870287                China          NaN     [5]
3     4                 Amazon.com, Inc.       Retail  $513,983   −$2,722    1541000        United States          NaN     [6]
4     5                            Vitol  Commodities  $505,000   $15,000       1560          Switzerland          NaN  [7][8]

# Look at types
>>> df.dtypes
Rank                      int64
Name                     object
Industry                 object
Revenue                  object
Profit                   object
Employees                 int64
Headquarters[note 1]     object
State-owned             float64
Ref.                     object
dtype: object

# Fix Types
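>>> df['Revenue'] = df['Revenue'].str.replace(r'[\$,]', '', regex=True).astype(float)
>>> # Losses are written with a Unicode minus (U+2212); map it to '-' first so the sign is kept
>>> df['Profit'] = df['Profit'].str.replace('\u2212', '-', regex=False).str.replace(r'[^\d.-]', '', regex=True)
>>> df['Profit'] = df['Profit'].replace('...', np.nan)
>>> df['Profit'] = df['Profit'].astype(float)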

# Types ok
>>> df.dtypes
Rank                      int64
Name                     object
Industry                 object
Revenue                 float64
Profit                  float64
Employees                 int64
Headquarters[note 1]     object
State-owned             float64
Ref.                     object
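
One thing to watch with the Profit cleanup: the Amazon row in the head() output above shows its loss with a Unicode minus sign (U+2212), not the ASCII '-'. A character class like [^\d.-] strips that character, so the loss would come out as a positive number unless it is mapped to '-' first. A minimal sketch on a made-up sample Series (hand-written values, not the scraped data) to check the sign handling:

```python
import numpy as np
import pandas as pd

# Made-up sample values in the same style as the Profit column (assumption).
sample = pd.Series(["$11,680", "\u2212$2,722", "..."])

cleaned = (
    sample
    .str.replace("\u2212", "-", regex=False)   # Unicode minus -> ASCII hyphen-minus
    .str.replace(r"[^\d.-]", "", regex=True)   # drop '$', ',' and other non-numeric characters
    .replace("...", np.nan)                    # treat the '...' placeholder as missing
    .astype(float)
)
print(cleaned.tolist())
```

The second value should stay negative and the '...' entry should become NaN.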
So now we have a working DataFrame.
>>> df['Revenue'].max()
611289.0

>>> df.loc[df['Revenue'].idxmax()]
Rank                                1
Name                          Walmart
Industry                       Retail
Revenue                      611289.0
Profit                        11680.0
Employees                     2100000
Headquarters[note 1]    United States
State-owned                       NaN
Ref.                              [1]
Name: 0, dtype: object
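
As for the 60 columns in post #1, the likely cause is that table.find_all('th') collects every <th> in the table. The rank cells (1, 2, 3…) appear to be <th> elements as well, which matches the column names described in the question. Below is a minimal sketch of the BeautifulSoup approach with that fixed. It uses a small hand-written table as a stand-in for the live page, so the HTML and the name df_bs are assumptions for illustration. The real table also has a second header row ('USD millions'), so there the data rows would start at index 2.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Tiny stand-in for the Wikipedia table (assumption): rank cells are <th>,
# just like the header cells.
html = """
<table class="wikitable">
  <tr><th>Rank</th><th>Name</th><th>Revenue</th></tr>
  <tr><th>1</th><td>Walmart</td><td>$611,289</td></tr>
  <tr><th>2</th><td>Saudi Aramco</td><td>$603,651</td></tr>
</table>
"""

table = BeautifulSoup(html, "html.parser").find("table")
rows = table.find_all("tr")

# Take column names from the first row only, not from every <th> in the table.
headers = [cell.get_text(strip=True) for cell in rows[0].find_all("th")]

# Each data row keeps both its <th> (rank) and <td> cells, in document order.
records = [
    [cell.get_text(strip=True) for cell in row.find_all(["th", "td"])]
    for row in rows[1:]
]

df_bs = pd.DataFrame(records, columns=headers)
print(df_bs)
```

Building the DataFrame once from a list of rows also avoids appending row by row with df.loc inside the loop.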
#3
FYI:

This information can be extracted from an SEC dataset that is freely available to download and is updated quarterly. Q3 2023 was added in the past week or so.

Information on the dataset is contained in a PDF document which can be downloaded here: https://www.sec.gov/files/aqfsn_1.pdf

The actual dataset (earlier files are small; more recent ones can be up to half a gigabyte) can be downloaded as a single file for each quarter since 2009 here

You should be aware of the SEC disclaimer:
Quote: Financial Statement and Notes Data Sets
The data sets provide the text and detailed numeric information in all financial statements and their notes extracted from exhibits to corporate financial reports filed with the Commission using eXtensible Business Reporting Language (XBRL).
Updated Aug. 2023
#4
(Sep-21-2023, 04:16 PM)snippsat Wrote: df.loc[df['Revenue'].idxmax()]
thanks!