Scraping Data from Website

melkaray

Hi, I am trying to learn how to scrape data from a website with Python. I tried to extract this list (List of largest companies by revenue - Wikipedia), but the result has 60 columns instead of 8. I attached a picture showing where I am confused: 'USD millions' should be the last column, but the columns continue with 1, 2, 3… I have also added the code. How should I fix it?
Here is the code:

from bs4 import BeautifulSoup
import requests

url = 'https://en.wikipedia.org/wiki/List_of_largest_companies_by_revenue'
page = requests.get(url)


soup = BeautifulSoup(page.text, 'html')


print(soup)

soup.find_all('table')




soup.find('table', class_ = 'wikitable sortable ')




table = soup.find_all('table')[1]

print(table)

world_titles = table.find_all('th')

world_titles

world_table_titles = [title.text.strip() for title in world_titles]
print ( world_table_titles)

import pandas as pd

df = pd.DataFrame(columns = world_table_titles)
df

column_data = table.find_all('tr')

for row in column_data[2:]:  # the empty entry at the start is gone
    row_data = row.find_all('td')
    individual_row_Data = [data.text.strip() for data in row_data]
    length = len(df)
    df.loc[length] = individual_row_Data  # assignment, not the == comparison

df

[Image: Whats-App-Image-2023-09-21-at-16-57-48.jpg]

snippsat wrote Sep-21-2023, 02:58 PM:
Added code tags to your post; see BBCode for how to use them.

***snippsat*** · (This post was last modified: Sep-21-2023, 04:16 PM by snippsat.)

Pandas can read an HTML table straight into a DataFrame with pd.read_html.
Some cleanup is still needed because of how the Revenue and Profit columns are structured.

import numpy as np
import pandas as pd
pd.set_option('display.expand_frame_repr', False)

df = pd.read_html('https://en.wikipedia.org/wiki/List_of_largest_companies_by_revenue')[0]
df.columns = df.columns.get_level_values(1)
df.columns = ['Rank', 'Name', 'Industry', 'Revenue', 'Profit', 'Employees', 'Headquarters[note 1]', 'State-owned', 'Ref.']
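A note on the two column assignments above: a table with a two-row header usually comes back from read_html with two-level (MultiIndex) columns, and taking a single level can leave duplicate labels, which is presumably why the names are then set by hand. A minimal sketch with a made-up two-level header (the tuples below are an assumption for illustration, not the actual page header):

```python
import pandas as pd

# Hypothetical two-level header, similar in shape to a table whose
# Revenue and Profit columns share a "USD millions" sub-header.
columns = pd.MultiIndex.from_tuples([
    ('Rank', 'Rank'),
    ('Name', 'Name'),
    ('Revenue', 'USD millions'),
    ('Profit', 'USD millions'),
])
demo = pd.DataFrame([[1, 'Walmart', '$611,289', '$11,680']], columns=columns)

# Level 1 alone is ambiguous: two columns are both called 'USD millions'.
level1 = list(demo.columns.get_level_values(1))

# Level 0 keeps the distinguishing names in this made-up header.
level0 = list(demo.columns.get_level_values(0))

# Assigning a flat list of names replaces the MultiIndex entirely.
demo.columns = ['Rank', 'Name', 'Revenue', 'Profit']

print(level1)
print(level0)
print(demo)
```

Setting the names explicitly is the most predictable option when the header layout of the page may change.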

Test.

>>> df.head()
   Rank                             Name     Industry   Revenue    Profit  Employees Headquarters[note 1]  State-owned    Ref.
0     1                          Walmart       Retail  $611,289   $11,680    2100000        United States          NaN     [1]
1     2                     Saudi Aramco  Oil and gas  $603,651  $159,069      70496         Saudi Arabia          NaN     [4]
2     3  State Grid Corporation of China  Electricity  $530,009    $8,192     870287                China          NaN     [5]
3     4                 Amazon.com, Inc.       Retail  $513,983   −$2,722    1541000        United States          NaN     [6]
4     5                            Vitol  Commodities  $505,000   $15,000       1560          Switzerland          NaN  [7][8]

# Look at the types
>>> df.dtypes
Rank                      int64
Name                     object
Industry                 object
Revenue                  object
Profit                   object
Employees                 int64
Headquarters[note 1]     object
State-owned             float64
Ref.                     object
dtype: object

# Fix Types
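>>> df['Revenue'] = df['Revenue'].str.replace(r'[\$,]', '', regex=True).astype(float)
>>> # Negative profits use a Unicode minus (U+2212); map it to '-' first so the sign is kept
>>> df['Profit'] = df['Profit'].str.replace('\u2212', '-', regex=False)
>>> df['Profit'] = df['Profit'].str.replace(r'[^\d.-]', '', regex=True)
>>> df['Profit'] = df['Profit'].replace('...', np.nan)
>>> df['Profit'] = df['Profit'].astype(float)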

# Types ok
>>> df.dtypes
Rank                      int64
Name                     object
Industry                 object
Revenue                 float64
Profit                  float64
Employees                 int64
Headquarters[note 1]     object
State-owned             float64
Ref.                     object
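The cleanup steps above can also be wrapped in a small helper. This is only a sketch: money_to_float is a hypothetical name, the sample values are copied from the output shown earlier, and it swaps in pd.to_numeric with errors='coerce' instead of the explicit '...' replacement.

```python
import pandas as pd


def money_to_float(series: pd.Series) -> pd.Series:
    """Convert strings such as '$611,289' or a Unicode-minus '-$2,722' to floats."""
    cleaned = (
        series.astype(str)
        .str.replace('\u2212', '-', regex=False)   # Unicode minus -> ASCII hyphen
        .str.replace(r'[^\d.\-]', '', regex=True)  # drop '$', ',' and other symbols
    )
    # errors='coerce' turns non-numeric leftovers such as '...' into NaN
    return pd.to_numeric(cleaned, errors='coerce')


sample = pd.Series(['$611,289', '\u2212$2,722', '...', '$15,000'])
result = money_to_float(sample)
print(result)
```

With a helper like this, the same call can be applied to both the Revenue and the Profit column.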

So now we have a working DataFrame.

>>> df['Revenue'].max()
611289.0

>>> df.loc[df['Revenue'].idxmax()]
Rank                                1
Name                          Walmart
Industry                       Retail
Revenue                      611289.0
Profit                        11680.0
Employees                     2100000
Headquarters[note 1]    United States
State-owned                       NaN
Ref.                              [1]
Name: 0, dtype: object
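On the original question: the extra 1, 2, 3… titles suggest that the rank cell of every data row is also a th element, so table.find_all('th') returns those cells along with the real headers. A minimal sketch of reading the titles from the header row only, using a made-up three-column table (the HTML below is a stand-in, not the real page markup, and the real table has a second header row that would need the same care):

```python
from bs4 import BeautifulSoup

# Tiny stand-in for the page's table: the rank cell of each body row is a <th>.
html = """
<table>
  <tr><th>Rank</th><th>Name</th><th>Revenue</th></tr>
  <tr><th>1</th><td>Walmart</td><td>$611,289</td></tr>
  <tr><th>2</th><td>Saudi Aramco</td><td>$603,651</td></tr>
</table>
"""
table = BeautifulSoup(html, 'html.parser').find('table')

# Searching the whole table also picks up the row-header cells.
all_th = [th.text.strip() for th in table.find_all('th')]

rows = table.find_all('tr')
# Only the first row holds the column titles in this stand-in table.
titles = [th.text.strip() for th in rows[0].find_all('th')]
# Collect th and td together so the rank stays in each data row.
data = [
    [cell.text.strip() for cell in row.find_all(['th', 'td'])]
    for row in rows[1:]
]

print(all_th)
print(titles)
print(data)
```

Each data row then has as many entries as there are titles, so it can be assigned to a DataFrame row without a length mismatch.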

**Larz60+** · (This post was last modified: Sep-21-2023, 10:39 PM by Larz60+.)

FYI:

This information can be extracted from an SEC dataset that is freely available to download and is updated quarterly. Q3 for 2023 was just added in the past week or so.

Information on the dataset is contained in a PDF document which can be downloaded here: https://www.sec.gov/files/aqfsn_1.pdf

The actual dataset can be downloaded as a single file for each quarter since 2009 (earlier files are small; more recent ones can be up to half a gigabyte) here

You should be aware of the SEC disclaimer:

Quote: Financial Statement and Notes Data Sets
The data sets provide the text and detailed numeric information in all financial statements and their notes extracted from exhibits to corporate financial reports filed with the Commission using eXtensible Business Reporting Language (XBRL).
Updated Aug. 2023

melkaray · Sep-22-2023, 12:41 PM

(Sep-21-2023, 04:16 PM)snippsat Wrote: df.loc[df['Revenue'].idxmax()]

Thanks!
