Scraping Columns with Pandas (Column Entries w/ more than 1 word writes two columns) - Printable Version +- Python Forum (https://python-forum.io) +-- Forum: Python Coding (https://python-forum.io/forum-7.html) +--- Forum: Web Scraping & Web Development (https://python-forum.io/forum-13.html) +--- Thread: Scraping Columns with Pandas (Column Entries w/ more than 1 word writes two columns) (/thread-36005.html) |
Scraping Columns with Pandas (Column Entries w/ more than 1 word writes two columns) - BrandonKastning - Jan-09-2022 Scraping Columns with Pandas (Column Entries w/ more than 1 word writes two columns) My code is the following: import pandas as pd url = "https://en.wikipedia.org/wiki/List_of_counties_in_Alabama" tables = pd.read_html(url) tables[1] print(tables[1]) df = tables[1] df.to_csv('AL_Counties.csv', sep='\t', encoding='utf-8', index=False)Using Pandas, Once the file is written; which works great. Any entries in any column; for instance as shown in the following screen shots in Libre Calc is writing the names into multiple columns. This makes my obstacles difficult. How do I remedy this? Screenshot #1: Screenshot #2: Thank you everyone for this forum! RE: Scraping Columns with Pandas (Column Entries w/ more than 1 word writes two columns) - snippsat - Jan-09-2022 (Jan-09-2022, 05:47 AM)BrandonKastning Wrote: for instance as shown in the following screen shots in Libre CalcDon't take the detour trough df.to_csv. Pandas can save to .odf format,need odfpy and openpyxl installed. Then you get same look as DataFrame show. pip install odfpy openpyxl # Or if use Anaconda conda install odfpy openpyxlCan also delete Map column. import pandas as pd url = "https://en.wikipedia.org/wiki/List_of_counties_in_Alabama" tables = pd.read_html(url) df = tables[1] df = df.drop('Map', axis=1) df.to_excel("wiki.ods", index=False, engine="odf") # To Excel would be df.to_excel('wiki.xlsx', index=False) RE: Scraping Columns with Pandas (Column Entries w/ more than 1 word writes two columns) - BrandonKastning - Jan-09-2022 snippsat, Thank you! This worked wonderfully! Much better formatting! Much obliged! :) Best Regards, Brandon Kastning (Jan-09-2022, 01:54 PM)snippsat Wrote:(Jan-09-2022, 05:47 AM)BrandonKastning Wrote: for instance as shown in the following screen shots in Libre CalcDon't take the detour trough df.to_csv. RE: Scraping Columns with Pandas (Column Entries w/ more than 1 word writes two columns) - BrandonKastning - Jan-09-2022 Should I open a new thread? How to download the map images and store them (Either DB or Local) ? RE: Scraping Columns with Pandas (Column Entries w/ more than 1 word writes two columns) - snippsat - Jan-10-2022 (Jan-09-2022, 09:52 PM)BrandonKastning Wrote: Should I open a new thread?It's part of same task,so no problem. (Jan-09-2022, 09:52 PM)BrandonKastning Wrote: How to download the map images and store them (Either DB or Local) ?You have to give it try,now need to use more common scraping tool. Here a demo how to start. import requests from bs4 import BeautifulSoup url = "https://en.wikipedia.org/wiki/List_of_counties_in_Alabama" response = requests.get(url) soup = BeautifulSoup(response.content, 'lxml') print(soup.find('h1').text) # First image img = soup.find_all('a', class_="image") img_link= img[0].find('img').get('src') img_link = img_link.replace('//', 'http://') print(img_link)
RE: Scraping Columns with Pandas (Column Entries w/ more than 1 word writes two columns) - BrandonKastning - Jan-13-2022 snippsat, Thank you for this! This is a great point for me to start regarding images and I believe since you wrote the comment "First Image" that I will need to learn loops. I will be coming back to this. In the meantime, I ran into some troubles with wikipedia panda table scraping. I changed from Counties to "Municipalities" and regardless of the tables[0], tables[1], tables[2] result in all the wrong data displayed on the wikipedia article. Code is as follows: import pandas as pd url = "https://en.wikipedia.org/wiki/List_of_municipalities_in_Alabama" tables = pd.read_html(url) df = tables[1] df.to_excel("AL_Alabama_Municipalities.ods", index=False, engine="odf")Thank you again for this forum! How do I determine the tables[#]? Is it a guessing game or is is there an attribute or property within the browser code that could aid me in finding the correct tables[#]? Best Regards, Brandon Kastning (Jan-10-2022, 06:52 PM)snippsat Wrote:(Jan-09-2022, 09:52 PM)BrandonKastning Wrote: Should I open a new thread?It's part of same task,so no problem. RE: Scraping Columns with Pandas (Column Entries w/ more than 1 word writes two columns) - snippsat - Jan-13-2022 (Jan-13-2022, 04:56 AM)BrandonKastning Wrote: Thank you again for this forum! How do I determine the tables[#]? Is it a guessing game or is is there an attribute or property within the browser code that could aid me in finding the correct tables[#]?A web site can have many tables,so have to look at site(count) or test out like tables[0], tables[1],tables[6].... and see if get wanted result. There is match in pandas.read_html that can use string or regex to match something i table wanted.Example Timeline of programming languages ,let say we want Python table we can match name Guido van Rossum. import pandas as pd df = pd.read_html('https://en.wikipedia.org/wiki/Timeline_of_programming_languages', match='Guido van Rossum') df[0].head(13) So if a match it will always be df[0] .Without match it would be table 9:df[9].head(13) RE: Scraping Columns with Pandas (Column Entries w/ more than 1 word writes two columns) - BrandonKastning - Jan-13-2022 snippsat, Thank you for this new knowledge and sharing this code. Wonderful example! Without applying your teaching to my code (yet); I found a work around that managed to pull the data in. I disabled: index=FalseFlag/Parameter. Then I had to do the following using Libre Calc Once disabled; Python picked up the correct table with tables[0] (Strange; I am still unsure what index=False truly does at this time). The output against the .ods it generated an additional row of header names. One that is different from another. I tried removing the duplicate row w/ 3 column headings that were under a single header name. Libre Calc gave me an error; so I decided to try copy and paste and the following worked great for a CSV save in 7 steps. Default Settings on Save As Dialogs were used and worked fine! then I use step2.py as the Payload after Manually Naming the Column Headers in Libre Calc! step1.py code (to generate the .ods): import pandas as pd url = "https://en.wikipedia.org/wiki/List_of_municipalities_in_Alabama" tables = pd.read_html(url) df = tables[0] #df = df.drop('Map', axis=1) #df.to_excel("AL_Alabama_Cities.ods", index=False, engine="odf") df.to_excel("AL_Alabama_Cities.ods", engine="odf")step2.py code: import pandas as pd import mysql.connector from sqlalchemy import create_engine myd = pd.read_csv('AL_Alabama_Cities.CSV.csv') engine = create_engine('mysql+mysqlconnector://brandon:[email protected]/Exodus_J3x_Dev_Bronson') myd.to_sql(name='AL_Cities_CSV', con=engine, if_exists='replace', index=False)Thank you again snippsat and everyone for this forum time/expertise! Best Regards, Brandon Kastning (Jan-13-2022, 02:46 PM)snippsat Wrote:(Jan-13-2022, 04:56 AM)BrandonKastning Wrote: Thank you again for this forum! How do I determine the tables[#]? Is it a guessing game or is is there an attribute or property within the browser code that could aid me in finding the correct tables[#]?A web site can have many tables,so have to look at site(count) or test out like tables[0], tables[1],tables[6].... and see if get wanted result. |