Python Forum
Scraping Wikipedia Article (Name in 1 column & URL in 2nd column) ->CSV! Anyone?
#1

Targeted Columns & All Links

Does anyone know how to accomplish this feat?

I followed this tutorial/blog:

https://www.kindacode.com/article/extrac...ul-soup-4/

This code:

import requests

# BeautifulSoup is imported via the bs4 package
import bs4

URL = 'https://en.wikipedia.org/wiki/List_of_counties_in_Washington'

# Fetch all the HTML source from the url
response = requests.get(URL)


soup = bs4.BeautifulSoup(response.text, 'html.parser')
links = soup.select('a')

# Print out the result
for link in links:
  print(link.get_text())
  if link.get('href') is not None:
    if 'https://' in link.get('href'):
      print(link.get('href'))
    else:
      print('https://en.wikipedia.org' + link.get('href')) # Convert relative URL to absolute URL

  print('----------------------------') # Separator line
This prints the output with "name on top and URL on bottom", such as:

Counties
----------------------------
Adams
https://en.wikipedia.org/wiki/Adams_County,_Washington
----------------------------
Asotin
https://en.wikipedia.org/wiki/Asotin_County,_Washington
----------------------------
Benton
https://en.wikipedia.org/wiki/Benton_County,_Washington
----------------------------
I would like to store the name in 1 column and the URL in a 2nd column and send it to a CSV

Thank you everyone for this forum! I will append this thread as I find answers, rather than posting replies as I was doing earlier in error. I appreciate the correction and the forum's assistance!

Best Regards,

Brandon Kastning
“And one of the elders saith unto me, Weep not: behold, the Lion of the tribe of Juda, the Root of David, hath prevailed to open the book,...” - Revelation 5:5 (KJV)

“And oppress not the widow, nor the fatherless, the stranger, nor the poor; and ...” - Zechariah 7:10 (KJV)

#LetHISPeopleGo

#2
(Jan-19-2022, 11:39 PM)BrandonKastning Wrote: I would like to store the name in 1 column and the URL in a 2nd column and send it to a CSV
I would first store it in a data structure like a list or dict, then create the CSV from that.
You get a lot of garbage when grabbing all links from a Wiki page, e.g. links to the article in other languages.
You should target the HTML more specifically, so you only get the table you want.
Output:
Magyar
https://hu.wikipedia.org/wiki/Washington_megy%C3%A9inek_list%C3%A1ja
----------------------------
Nederlands
https://nl.wikipedia.org/wiki/Lijst_van_county%27s_in_Washington
----------------------------
日本語
https://ja.wikipedia.org/wiki/%E3%83%AF%E3%82%B7%E3%83%B3%E3%83%88%E3%83%B3%E5%B7%9E%E3%81%AE%E9%83%A1%E4%B8%80%E8%A6%A7
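One way to cut that garbage is to select anchors only inside the county table rather than every `<a>` on the page. A minimal sketch of the idea, using an inline HTML snippet in place of the live page (the `wikitable` class is what Wikipedia uses for these tables; the snippet contents are made up for illustration):

```python
import bs4

# Stand-in for response.text from the live page
html = """
<table class="wikitable">
  <tr><th><a href="/wiki/Adams_County,_Washington">Adams</a></th></tr>
  <tr><th><a href="/wiki/Asotin_County,_Washington">Asotin</a></th></tr>
</table>
<a href="https://hu.wikipedia.org/">Magyar</a>  <!-- language link we want to skip -->
"""

soup = bs4.BeautifulSoup(html, "html.parser")

# Only anchors inside the wikitable, so language/footer links are excluded
rows = []
for link in soup.select("table.wikitable a"):
    href = link.get("href")
    if href and href.startswith("/wiki/"):
        rows.append((link.get_text(), "https://en.wikipedia.org" + href))

for name, url in rows:
    print(name, url)
```

Collecting the name and URL together as one pair per row also avoids the two parallel lists getting out of sync.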
Example.
import requests
import bs4

URL = "https://en.wikipedia.org/wiki/List_of_counties_in_Washington"
response = requests.get(URL)
lst1 = []
lst2 = []
soup = bs4.BeautifulSoup(response.text, "html.parser")
links = soup.select("a")
for link in links:
    print(link.get_text())
    lst1.append(link.get_text())
    if link.get("href") is not None:
        if "https://" in link.get("href"):
            print(link.get("href"))
            lst2.append(link.get("href"))
        else:
            print(f"https://en.wikipedia.org{link.get('href')}")
            lst2.append(f"https://en.wikipedia.org{link.get('href')}")
So now you can zip() and also dict() this together.
Output:
>>> record = dict(zip(lst1, lst2))
>>> record
{'': 'https://foundation.wikimedia.org/wiki/Privacy_policy',
 '"Area Transferred"': "https://en.wikipedia.org link.get('href')",
 '"Article XI, Section 3: New Counties"': "https://en.wikipedia.org link.get('href')",
 '"Chapter 77 (S.B. 297), Changing Name of Chehalis County"': "https://en.wikipedia.org link.get('href')",
 '"Chehalis – Thumbnail History"': "https://en.wikipedia.org link.get('href')",
I guess the info you want is in there, so you can do lookups like this.
>>> record['Alabama']
'https://en.wikipedia.org/wiki/List_of_boroughs_and_census_areas_in_Alaska'
>>> record['Massachusetts']
'https://en.wikipedia.org/wiki/List_of_counties_in_Michigan'
Also, if you want the language link in Hebrew 📜
>>> record['עברית']
'https://frr.wikipedia.org/wiki/Washington_Counties'
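To get from the two lists to the CSV the OP asked for, the stdlib csv module is enough. A sketch, using a few hard-coded pairs in place of the scraped lst1/lst2:

```python
import csv

# Stand-ins for the scraped lists
names = ["Adams", "Asotin", "Benton"]
urls = [
    "https://en.wikipedia.org/wiki/Adams_County,_Washington",
    "https://en.wikipedia.org/wiki/Asotin_County,_Washington",
    "https://en.wikipedia.org/wiki/Benton_County,_Washington",
]

with open("counties.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["Name", "URL"])       # header row
    writer.writerows(zip(names, urls))     # one (name, url) pair per row
```

Note that these URLs contain commas, so naively writing "name,url" strings by hand would break the CSV; csv.writer quotes such fields for you. Also note that zip() truncates to the shorter list, so if the names and URLs fall out of sync (as the mismatched record lookups above show) the pairing will be silently wrong. It is safer to append the (name, url) pair in one step while scraping.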
#3
FYI: County data for all US states is available from: https://www.census.gov/geographies/refer...files.html (use the link under Counties).
That link gets you a zip file which extracts to a text file containing State, GEOID, CountyName, and links to additional data (files which can be downloaded from the same page).
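Once extracted, that text file can be read with the stdlib csv module. The exact delimiter and column names depend on the release you download, so treat both as assumptions and check the first line of your file. A sketch against an inline, made-up sample:

```python
import csv
import io

# Hypothetical sample mimicking the extracted file; check your actual
# download for the real column names and delimiter.
sample = io.StringIO(
    "STATE|GEOID|NAME\n"
    "WA|53001|Adams County\n"
    "WA|53003|Asotin County\n"
)

# DictReader keys each row by the header line
counties = list(csv.DictReader(sample, delimiter="|"))
for row in counties:
    print(row["NAME"], row["GEOID"])
```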
#4
Larz60,

Thank you for the link to the U.S. Census Data! :)

(Jan-21-2022, 12:08 PM)Larz60+ Wrote: FYI: County data for all US states is available from: https://www.census.gov/geographies/refer...files.html Use link under Counties.
The link will get you a zip file which extracts to a text file containing State, GEOID, CountyName and links to additional data (files for which can be downloaded from the same page)

Reply
#5
Quote:Thank you for the link to the U.S. Census Data! :)
You're welcome.

You may also be interested in this URL: https://www2.census.gov/
It contains the entire download tree for census public files, though it takes a while to work out what each link is for.
Lots of useful stuff here.