Scraping Data from Website - Printable Version

+- Python Forum (https://python-forum.io)
+-- Forum: Python Coding (https://python-forum.io/forum-7.html)
+--- Forum: General Coding Help (https://python-forum.io/forum-8.html)
+--- Thread: Scraping Data from Website (/thread-40768.html)
Scraping Data from Website - melkaray - Sep-21-2023

Hi, I am trying to learn scraping data from websites with Python. I tried to extract this list (List of largest companies by revenue - Wikipedia), but my DataFrame shows 60 columns instead of 8. I added a picture of where I got confused: 'USD millions' should be the last column, but the columns continue as 1, 2, 3... How should I fix it? Here is the code:

```python
from bs4 import BeautifulSoup
import requests
import pandas as pd

url = 'https://en.wikipedia.org/wiki/List_of_largest_companies_by_revenue'
page = requests.get(url)
soup = BeautifulSoup(page.text, 'html.parser')  # was 'html'; the built-in parser is named 'html.parser'

table = soup.find_all('table')[1]
world_titles = table.find_all('th')
world_table_titles = [title.text.strip() for title in world_titles]
print(world_table_titles)

df = pd.DataFrame(columns=world_table_titles)

column_data = table.find_all('tr')
for row in column_data[2:]:  # skip the blank header rows at the top
    row_data = row.find_all('td')
    individual_row_data = [data.text.strip() for data in row_data]
    length = len(df)  # 'lenght' was a typo
    df.loc[length] = individual_row_data  # original had '==', which compares instead of assigning
print(df)
```

RE: Scraping Data from Website - snippsat - Sep-21-2023

Pandas can read an HTML table straight into a DataFrame. You have to do some cleaning up because of how Revenue and Profit are structured.

```python
import numpy as np
import pandas as pd

pd.set_option('display.expand_frame_repr', False)
df = pd.read_html('https://en.wikipedia.org/wiki/List_of_largest_companies_by_revenue')[0]
df.columns = df.columns.get_level_values(1)
df.columns = ['Rank', 'Name', 'Industry', 'Revenue', 'Profit', 'Employees',
              'Headquarters[note 1]', 'State-owned', 'Ref.']
```
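[Editor's note] The `get_level_values(1)` step above works because `read_html` parses the page's merged "USD millions" header row as a two-level MultiIndex. A minimal offline sketch of that flattening (the column tuples here are hypothetical stand-ins for the real page's header, not fetched from Wikipedia):

```python
import pandas as pd

# Hypothetical two-level header, mimicking how read_html represents the
# Wikipedia table's merged "USD millions" header row.
cols = pd.MultiIndex.from_tuples([
    ("Rank", "Rank"),
    ("Name", "Name"),
    ("Revenue", "USD millions"),
    ("Profit", "USD millions"),
])
df = pd.DataFrame([[1, "Walmart", "$611,289", "$11,680"]], columns=cols)

# Keeping a single level of the MultiIndex gives flat, usable column names,
# which is what get_level_values does in the post above.
flat = df.copy()
flat.columns = flat.columns.get_level_values(0)
print(flat.columns.tolist())  # ['Rank', 'Name', 'Revenue', 'Profit']
```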
Test:

```python
>>> df.head()
   Rank                             Name     Industry   Revenue    Profit  Employees Headquarters[note 1]  State-owned    Ref.
0     1                          Walmart       Retail  $611,289   $11,680    2100000        United States          NaN     [1]
1     2                     Saudi Aramco  Oil and gas  $603,651  $159,069      70496         Saudi Arabia          NaN     [4]
2     3  State Grid Corporation of China  Electricity  $530,009    $8,192     870287                China          NaN     [5]
3     4                 Amazon.com, Inc.       Retail  $513,983   −$2,722    1541000        United States          NaN     [6]
4     5                            Vitol  Commodities  $505,000   $15,000       1560          Switzerland          NaN  [7][8]

# Look at the types
>>> df.dtypes
Rank                      int64
Name                     object
Industry                 object
Revenue                  object
Profit                   object
Employees                 int64
Headquarters[note 1]     object
State-owned             float64
Ref.                     object
dtype: object

# Fix the types
>>> df['Revenue'] = df['Revenue'].str.replace(r'[\$,]', '', regex=True).astype(float)
>>> df['Profit'] = df['Profit'].str.replace(r'[^\d.-]', '', regex=True)
>>> df['Profit'].replace('...', np.nan, inplace=True)
>>> df['Profit'] = df['Profit'].astype(float)

# Types OK
>>> df.dtypes
Rank                      int64
Name                     object
Industry                 object
Revenue                 float64
Profit                  float64
Employees                 int64
Headquarters[note 1]     object
State-owned             float64
Ref.                     object
dtype: object
```

So now we have a working DataFrame:

```python
>>> df['Revenue'].max()
611289.0
>>> df.loc[df['Revenue'].idxmax()]
Rank                                1
Name                          Walmart
Industry                       Retail
Revenue                      611289.0
Profit                        11680.0
Employees                     2100000
Headquarters[note 1]    United States
State-owned                       NaN
Ref.                              [1]
Name: 0, dtype: object
```

RE: Scraping Data from Website - Larz60+ - Sep-21-2023

FYI: this information can also be extracted from an SEC dataset that is freely available to download and is updated quarterly; Q3 2023 was added in the past week or so. Information on the dataset is contained in a PDF document, which can be downloaded here: https://www.sec.gov/files/aqfsn_1.pdf. The actual dataset can be downloaded as a single file for each quarter going back to 2009 (earlier files are small; more recent ones can be up to half a gigabyte). You should be aware of the SEC disclaimer:

Quote: Financial Statement and Notes Data Sets

RE: Scraping Data from Website - melkaray - Sep-22-2023

(Sep-21-2023, 04:16 PM) snippsat Wrote:
df.loc[df['Revenue'].idxmax()]

Thanks!