Python Forum
Scraping Data from Website - melkaray - Sep-21-2023

Hi, I am trying to learn how to scrape data from websites with Python. I tried to extract this list ( List of largest companies by revenue - Wikipedia ), but the result shows 60 columns instead of 8. I have attached a picture showing where I am confused: ‘USD millions’ should be the last column, but the columns continue with 1, 2, 3… I have also added the code below. How should I fix it?
Here is the code:
from bs4 import BeautifulSoup
import requests

url = 'https://en.wikipedia.org/wiki/List_of_largest_companies_by_revenue'
page = requests.get(url)


soup = BeautifulSoup(page.text, 'html')


print(soup)

soup.find_all('table')




soup.find('table', class_ = 'wikitable sortable ')




table = soup.find_all('table')[1]

print(table)

world_titles = table.find_all('th')

world_titles

world_table_titles = [title.text.strip() for title in world_titles]
print ( world_table_titles)

import pandas as pd

df = pd.DataFrame(columns = world_table_titles)
df

column_data = table.find_all('tr')

for row in column_data[2:]:  # the blank row at the start is gone
    row_data = row.find_all('td')
    individual_row_data = [data.text.strip() for data in row_data]
    length = len(df)
    df.loc[length] = individual_row_data

df
[Image: Whats-App-Image-2023-09-21-at-16-57-48.jpg]


RE: Scraping Data from Website - snippsat - Sep-21-2023

Pandas can read an HTML table straight into a DataFrame.
Some cleanup is needed because of how the Revenue and Profit columns are structured.
import numpy as np
import pandas as pd
pd.set_option('display.expand_frame_repr', False)

df = pd.read_html('https://en.wikipedia.org/wiki/List_of_largest_companies_by_revenue')[0]
df.columns = df.columns.get_level_values(1)
df.columns = ['Rank', 'Name', 'Industry', 'Revenue', 'Profit', 'Employees', 'Headquarters[note 1]', 'State-owned', 'Ref.'] 
Test.
>>> df.head()
   Rank                             Name     Industry   Revenue    Profit  Employees Headquarters[note 1]  State-owned    Ref.
0     1                          Walmart       Retail  $611,289   $11,680    2100000        United States          NaN     [1]
1     2                     Saudi Aramco  Oil and gas  $603,651  $159,069      70496         Saudi Arabia          NaN     [4]
2     3  State Grid Corporation of China  Electricity  $530,009    $8,192     870287                China          NaN     [5]
3     4                 Amazon.com, Inc.       Retail  $513,983   −$2,722    1541000        United States          NaN     [6]
4     5                            Vitol  Commodities  $505,000   $15,000       1560          Switzerland          NaN  [7][8]

# Look at the types
>>> df.dtypes
Rank                      int64
Name                     object
Industry                 object
Revenue                  object
Profit                   object
Employees                 int64
Headquarters[note 1]     object
State-owned             float64
Ref.                     object
dtype: object

# Fix types
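>>> df['Revenue'] = df['Revenue'].str.replace(r'[\$,]', '', regex=True).astype(float)
>>> # Losses use a Unicode minus sign (−); turn it into '-' so the sign survives the cleanup
>>> df['Profit'] = df['Profit'].str.replace('−', '-', regex=False).str.replace(r'[^\d.-]', '', regex=True)
>>> df['Profit'] = df['Profit'].replace('...', np.nan)
>>> df['Profit'] = df['Profit'].astype(float)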

# Types ok
>>> df.dtypes
Rank                      int64
Name                     object
Industry                 object
Revenue                 float64
Profit                  float64
Employees                 int64
Headquarters[note 1]     object
State-owned             float64
Ref.                     object
So now we have a working DataFrame.
>>> df['Revenue'].max()
611289.0

>>> df.loc[df['Revenue'].idxmax()]
Rank                                1
Name                          Walmart
Industry                       Retail
Revenue                      611289.0
Profit                        11680.0
Employees                     2100000
Headquarters[note 1]    United States
State-owned                       NaN
Ref.                              [1]
Name: 0, dtype: object
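
As for the original BeautifulSoup code: the extra columns most likely come from table.find_all('th'), which collects every th cell in the whole table, not only the ones in the header row. If each data row starts with a th cell holding the rank, those 1, 2, 3… values end up in the title list. Below is a minimal sketch of restricting the titles to the first row. The HTML is a tiny made-up stand-in with the same shape as the Wikipedia table (an assumption, not the real page), and the names all_th, headers and records are only for illustration.

```python
from bs4 import BeautifulSoup

# Tiny made-up stand-in for the Wikipedia table (assumption): two header rows,
# and each data row starts with a <th> cell that holds the rank.
html = """
<table class="wikitable">
  <tr><th>Rank</th><th>Name</th><th>Revenue</th></tr>
  <tr><th></th><th></th><th>USD millions</th></tr>
  <tr><th>1</th><td>Walmart</td><td>$611,289</td></tr>
  <tr><th>2</th><td>Saudi Aramco</td><td>$603,651</td></tr>
</table>
"""

table = BeautifulSoup(html, "html.parser").find("table")

# Every <th> in the table: header cells plus the rank cell of each data row.
all_th = [th.text.strip() for th in table.find_all("th")]

rows = table.find_all("tr")

# Only the <th> cells of the first row are the column titles.
headers = [th.text.strip() for th in rows[0].find_all("th")]

# For the data rows, collect <th> and <td> together so the rank is kept.
records = [
    [cell.text.strip() for cell in row.find_all(["th", "td"])]
    for row in rows[2:]
]

print(all_th)   # header cells and rank cells mixed together
print(headers)  # column titles only
print(records)  # one list per data row, same length as headers
```

Once the header list and each row have the same length, rows can be appended with df.loc[len(df)] = row without a length mismatch.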



RE: Scraping Data from Website - Larz60+ - Sep-21-2023

FYI:

This information can also be extracted from an SEC dataset that is freely available to download and is updated quarterly. Q3 for 2023 was just added in the past week or so.

Information on the dataset is contained in a PDF document which can be downloaded here: https://www.sec.gov/files/aqfsn_1.pdf

The actual dataset (earlier files are small, more recent ones can be up to half a gigabyte) can be downloaded as a single file for each quarter since 2009 here

You should be aware of the SEC disclaimer:
Quote: Financial Statement and Notes Data Sets
The data sets provide the text and detailed numeric information in all financial statements and their notes extracted from exhibits to corporate financial reports filed with the Commission using eXtensible Business Reporting Language (XBRL).
Updated Aug. 2023



RE: Scraping Data from Website - melkaray - Sep-22-2023

(Sep-21-2023, 04:16 PM)snippsat Wrote: df.loc[df['Revenue'].idxmax()]
Thanks!