Python Forum

Full Version: Scraping Wikipedia Article (Name in 1 column & URL in 2nd column) ->CSV! Anyone?

Targeted Columns & All Links

Does anyone know how to accomplish this feat?

I followed this tutorial/blog:

https://www.kindacode.com/article/extrac...ul-soup-4/

This code:

import requests

# BeautifulSoup is imported from the bs4 package
import bs4

URL = 'https://en.wikipedia.org/wiki/List_of_counties_in_Washington'

# Fetch all the HTML source from the url
response = requests.get(URL)


soup = bs4.BeautifulSoup(response.text, 'html.parser')
links = soup.select('a')

# Print out the result
for link in links:
  print(link.get_text())
  if link.get('href') is not None:
    if 'https://' in link.get('href'):
      print(link.get('href'))
    else:
      print('https://en.wikipedia.org' + link.get('href')) # Convert relative URL to absolute URL

  print('----------------------------') # Just a line break
This prints the output with "name on top and URL on bottom", such as:

Counties
----------------------------
Adams
https://en.wikipedia.org/wiki/Adams_County,_Washington
----------------------------
Asotin
https://en.wikipedia.org/wiki/Asotin_County,_Washington
----------------------------
Benton
https://en.wikipedia.org/wiki/Benton_County,_Washington
----------------------------
I would like to store the name in 1 column and the URL in a 2nd column and send it to a CSV

Thank you everyone for this forum! I will append to this thread if I find answers, rather than posting replies as I was doing earlier in error! I appreciate the correction and the assistance!

Best Regards,

Brandon Kastning
(Jan-19-2022, 11:39 PM)BrandonKastning Wrote: [ -> ]I would like to store the name in 1 column and the URL in a 2nd column and send it to a CSV
I would first store it in a data structure like a list or dict, then create the CSV from that.
You get a lot of garbage when grabbing all links from a Wikipedia page, e.g. the links to the same article in other languages.
You should target the HTML more precisely, so you only get the table you want.
Output:
Magyar
https://hu.wikipedia.org/wiki/Washington_megy%C3%A9inek_list%C3%A1ja
----------------------------
Nederlands
https://nl.wikipedia.org/wiki/Lijst_van_county%27s_in_Washington
----------------------------
日本語
https://ja.wikipedia.org/wiki/%E3%83%AF%E3%82%B7%E3%83%B3%E3%83%88%E3%83%B3%E5%B7%9E%E3%81%AE%E9%83%A1%E4%B8%80%E8%A6%A7
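One way to act on that advice, assuming the county table on the page carries the wikitable class (true at the time of writing, but worth re-checking in the browser), is to narrow the CSS selector to links inside that table. A sketch against an inline HTML stand-in for the page, so the filtering logic is visible without a network call:

```python
import bs4

# Inline stand-in for the Wikipedia page: one county table plus a
# language link of the kind that shows up as garbage above.
html = """
<a href="https://hu.wikipedia.org/wiki/Washington">Magyar</a>
<table class="wikitable">
  <tr><td><a href="/wiki/Adams_County,_Washington">Adams</a></td></tr>
  <tr><td><a href="/wiki/Asotin_County,_Washington">Asotin</a></td></tr>
</table>
"""
soup = bs4.BeautifulSoup(html, "html.parser")

# Only anchors inside the county table match; the language link does not.
county_links = soup.select("table.wikitable a")
for link in county_links:
    print(link.get_text(), "https://en.wikipedia.org" + link["href"])
```

On the real page the same selector applies to response.text; if the page layout changes, inspect the table's class first.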
Example.
import requests
import bs4

URL = "https://en.wikipedia.org/wiki/List_of_counties_in_Washington"
response = requests.get(URL)
lst1 = []
lst2 = []
soup = bs4.BeautifulSoup(response.text, "html.parser")
links = soup.select("a")
for link in links:
    print(link.get_text())
    lst1.append(link.get_text())
    if link.get("href") is not None:
        if "https://" in link.get("href"):
            print(link.get("href"))
            lst2.append(link.get("href"))
        else:
            print(f"https://en.wikipedia.org{link.get('href')}")
            lst2.append(f"https://en.wikipedia.org{link.get('href')}")
Now you can zip() the two lists together and turn the result into a dict().
Output:
>>> record = dict(zip(lst1, lst2))
>>> record
{'': 'https://foundation.wikimedia.org/wiki/Privacy_policy',
 '"Area Transferred"': "https://en.wikipedia.org link.get('href')",
 '"Article XI, Section 3: New Counties"': "https://en.wikipedia.org link.get('href')",
 '"Chapter 77 (S.B. 297), Changing Name of Chehalis County"': "https://en.wikipedia.org link.get('href')",
 '"Chehalis – Thumbnail History"': "https://en.wikipedia.org link.get('href')",
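The zipped name/URL pairs can also go straight into the two-column CSV the question asks for, using the standard csv module. A sketch with hypothetical sample data standing in for the scraped lst1 and lst2:

```python
import csv

# Hypothetical sample data standing in for the scraped lst1/lst2.
lst1 = ["Adams", "Asotin", "Benton"]
lst2 = [
    "https://en.wikipedia.org/wiki/Adams_County,_Washington",
    "https://en.wikipedia.org/wiki/Asotin_County,_Washington",
    "https://en.wikipedia.org/wiki/Benton_County,_Washington",
]

with open("counties.csv", "w", newline="") as fp:
    writer = csv.writer(fp)
    writer.writerow(["County", "URL"])  # header: name in column 1, URL in column 2
    writer.writerows(zip(lst1, lst2))   # one row per (name, url) pair
```

The newline="" argument is what the csv docs recommend when writing, so the writer controls line endings itself.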
I guess the info you want is in there, so you can look entries up like this.
>>> record['Alabama']
'https://en.wikipedia.org/wiki/List_of_boroughs_and_census_areas_in_Alaska'
>>> record['Massachusetts']
'https://en.wikipedia.org/wiki/List_of_counties_in_Michigan'
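The mismatched lookups just above (Alabama returning the Alaska link, and so on) come from lst1 and lst2 drifting out of step: lst1 gets an entry for every <a> tag, but lst2 is only appended to when the tag has an href. Collecting each name and URL together as one pair avoids that; a sketch, again against an inline stand-in:

```python
import bs4

# Inline stand-in: one anchor without href, one relative, one absolute.
html = """
<a>no target</a>
<a href="/wiki/Adams_County,_Washington">Adams</a>
<a href="https://hu.wikipedia.org/wiki/Washington">Magyar</a>
"""
soup = bs4.BeautifulSoup(html, "html.parser")

rows = []
for link in soup.select("a"):
    href = link.get("href")
    if href is None:
        continue  # skip the name too, so names and URLs cannot drift apart
    if not href.startswith("https://"):
        href = "https://en.wikipedia.org" + href  # relative -> absolute
    rows.append((link.get_text(), href))
```

Each entry in rows is a ready-made CSV row, so there is no separate zip step that can misalign.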
Also, if you want the language link, e.g. in Hebrew 📜:
>>> record['עברית']
'https://frr.wikipedia.org/wiki/Washington_Counties'
FYI: County data for all US states is available from: https://www.census.gov/geographies/refer...files.html Use link under Counties.
The link will get you a zip file which extracts to a text file containing State, GEOID, CountyName and links to additional data (files for which can be downloaded from the same page)
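Once extracted, that text file can be read with the csv module. The exact delimiter and column names vary by file vintage, so the layout below (pipe-delimited with STATE, GEOID, NAME headers) is an assumption to illustrate the parsing, not the guaranteed format:

```python
import csv
import io

# Hypothetical excerpt of the extracted county file (layout assumed).
sample = """STATE|GEOID|NAME
WA|53001|Adams County
WA|53003|Asotin County
OR|41001|Baker County
"""

# DictReader keys each row by the header line, so a vintage with extra
# columns still works as long as these three names are present.
reader = csv.DictReader(io.StringIO(sample), delimiter="|")
wa_counties = [(row["GEOID"], row["NAME"]) for row in reader if row["STATE"] == "WA"]
```

Check the header line of the file you actually download and adjust the delimiter and field names to match.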
Larz60,

Thank you for the link to the U.S. Census Data! :)

(Jan-21-2022, 12:08 PM)Larz60+ Wrote: [ -> ]FYI: County data for all US states is available from: https://www.census.gov/geographies/refer...files.html Use link under Counties.
The link will get you a zip file which extracts to a text file containing State, GEOID, CountyName and links to additional data (files for which can be downloaded from the same page)
Quote:Thank you for the link to the U.S. Census Data! :)
You're welcome.

You may also be interested in this url: https://www2.census.gov/
This page contains the entire download tree for Census public files. It takes a while to figure out what each link is for, but there is a lot of useful stuff here.