Python Forum
Scraping Wikipedia Article (Name in 1 column & URL in 2nd column) ->CSV! Anyone?
#1

Targeted Columns & All Links

Does anyone know how to accomplish this feat?

I followed this tutorial/blog:

https://www.kindacode.com/article/extrac...ul-soup-4/

This code:

import requests

# BeautifulSoup is imported via the bs4 package
import bs4

URL = 'https://en.wikipedia.org/wiki/List_of_counties_in_Washington'

# Fetch all the HTML source from the url
response = requests.get(URL)


soup = bs4.BeautifulSoup(response.text, 'html.parser')
links = soup.select('a')

# Print out the result
for link in links:
  print(link.get_text())
  if link.get('href') is not None:
    if 'https://' in link.get('href'):
      print(link.get('href'))
    else:
      print('https://en.wikipedia.org' + link.get('href')) # Convert relative URL to absolute URL

  print('----------------------------') # Separator line
This prints the output with "name on top and URL on bottom", such as:

Counties
----------------------------
Adams
https://en.wikipedia.org/wiki/Adams_County,_Washington
----------------------------
Asotin
https://en.wikipedia.org/wiki/Asotin_County,_Washington
----------------------------
Benton
https://en.wikipedia.org/wiki/Benton_County,_Washington
----------------------------
I would like to store the name in 1 column and the URL in a 2nd column and send it to a CSV

Thank you everyone for this forum! I will append this thread as I find answers, rather than posting replies as I was doing earlier in error. I appreciate the correction and the forum's assistance!

Best Regards,

Brandon Kastning
“And one of the elders saith unto me, Weep not: behold, the Lion of the tribe of Juda, the Root of David, hath prevailed to open the book,...” - Revelation 5:5 (KJV)

“And oppress not the widow, nor the fatherless, the stranger, nor the poor; and ...” - Zechariah 7:10 (KJV)

#LetHISPeopleGo

#2
(Jan-19-2022, 11:39 PM)BrandonKastning Wrote: I would like to store the name in 1 column and the URL in a 2nd column and send it to a CSV
I would first store it in a data structure like a list or dict, then create the CSV from that.
You get a lot of garbage when grabbing all links from a Wiki page, e.g. links to the article in other languages.
You should target the HTML more specifically, so you only get the table you want.
Output:
Magyar
https://hu.wikipedia.org/wiki/Washington_megy%C3%A9inek_list%C3%A1ja
----------------------------
Nederlands
https://nl.wikipedia.org/wiki/Lijst_van_county%27s_in_Washington
----------------------------
日本語
https://ja.wikipedia.org/wiki/%E3%83%AF%E3%82%B7%E3%83%B3%E3%83%88%E3%83%B3%E5%B7%9E%E3%81%AE%E9%83%A1%E4%B8%80%E8%A6%A7
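One way to cut that garbage is to select anchors only inside the county table rather than every `<a>` on the page. A minimal sketch of the idea, using an inline HTML snippet in place of the live page (the `wikitable` class is what Wikipedia uses for these tables; the snippet contents are made up for illustration):

```python
import bs4

# Stand-in for response.text from the live page
html = """
<table class="wikitable">
  <tr><th><a href="/wiki/Adams_County,_Washington">Adams</a></th></tr>
  <tr><th><a href="/wiki/Asotin_County,_Washington">Asotin</a></th></tr>
</table>
<a href="https://hu.wikipedia.org/">Magyar</a>  <!-- language link we want to skip -->
"""

soup = bs4.BeautifulSoup(html, "html.parser")

# Only anchors inside the wikitable, so language/footer links are excluded
rows = []
for link in soup.select("table.wikitable a"):
    href = link.get("href")
    if href and href.startswith("/wiki/"):
        rows.append((link.get_text(), "https://en.wikipedia.org" + href))

for name, url in rows:
    print(name, url)
```

Collecting the name and URL together as one pair per row also avoids the two parallel lists getting out of sync.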
Example.
import requests
import bs4

URL = "https://en.wikipedia.org/wiki/List_of_counties_in_Washington"
response = requests.get(URL)
lst1 = []
lst2 = []
soup = bs4.BeautifulSoup(response.text, "html.parser")
links = soup.select("a")
for link in links:
    print(link.get_text())
    lst1.append(link.get_text())
    if link.get("href") is not None:
        if "https://" in link.get("href"):
            print(link.get("href"))
            lst2.append(link.get("href"))
        else:
            print(f"https://en.wikipedia.org{link.get('href')}")
            lst2.append(f"https://en.wikipedia.org{link.get('href')}")
So now you can zip() and also dict() this together.
Output:
>>> record = dict(zip(lst1, lst2))
>>> record
{'': 'https://foundation.wikimedia.org/wiki/Privacy_policy',
 '"Area Transferred"': "https://en.wikipedia.org link.get('href')",
 '"Article XI, Section 3: New Counties"': "https://en.wikipedia.org link.get('href')",
 '"Chapter 77 (S.B. 297), Changing Name of Chehalis County"': "https://en.wikipedia.org link.get('href')",
 '"Chehalis – Thumbnail History"': "https://en.wikipedia.org link.get('href')",
I guess the info you want is in there, so you can do lookups like this.
>>> record['Alabama']
'https://en.wikipedia.org/wiki/List_of_boroughs_and_census_areas_in_Alaska'
>>> record['Massachusetts']
'https://en.wikipedia.org/wiki/List_of_counties_in_Michigan'
Also, if you want the language link in Hebrew 📜
>>> record['עברית']
'https://frr.wikipedia.org/wiki/Washington_Counties'
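To get from the two lists to the CSV the OP asked for, the stdlib csv module is enough. A sketch, using a few hard-coded pairs in place of the scraped lst1/lst2:

```python
import csv

# Stand-ins for the scraped lists
names = ["Adams", "Asotin", "Benton"]
urls = [
    "https://en.wikipedia.org/wiki/Adams_County,_Washington",
    "https://en.wikipedia.org/wiki/Asotin_County,_Washington",
    "https://en.wikipedia.org/wiki/Benton_County,_Washington",
]

with open("counties.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["Name", "URL"])       # header row
    writer.writerows(zip(names, urls))     # one (name, url) pair per row
```

Note that these URLs contain commas, so naively writing "name,url" strings by hand would break the CSV; csv.writer quotes such fields for you. Also note that zip() truncates to the shorter list, so if the names and URLs fall out of sync (as the mismatched record lookups above show) the pairing will be silently wrong. It is safer to append the (name, url) pair in one step while scraping.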
#3
FYI: County data for all US states is available from: https://www.census.gov/geographies/refer...files.html (use the link under Counties).
That link gets you a zip file which extracts to a text file containing State, GEOID, CountyName, and links to additional data (files which can be downloaded from the same page).
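Once extracted, that text file can be read with the stdlib csv module. The exact delimiter and column names depend on the release you download, so treat both as assumptions and check the first line of your file. A sketch against an inline, made-up sample:

```python
import csv
import io

# Hypothetical sample mimicking the extracted file; check your actual
# download for the real column names and delimiter.
sample = io.StringIO(
    "STATE|GEOID|NAME\n"
    "WA|53001|Adams County\n"
    "WA|53003|Asotin County\n"
)

# DictReader keys each row by the header line
counties = list(csv.DictReader(sample, delimiter="|"))
for row in counties:
    print(row["NAME"], row["GEOID"])
```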
#4
Larz60,

Thank you for the link to the U.S. Census Data! :)

(Jan-21-2022, 12:08 PM)Larz60+ Wrote: FYI: County data for all US states is available from: https://www.census.gov/geographies/refer...files.html Use link under Counties.
The link will get you a zip file which extracts to a text file containing State, GEOID, CountyName and links to additional data (files for which can be downloaded from the same page)

Reply
#5
Quote:Thank you for the link to the U.S. Census Data! :)
You're welcome.

You may also be interested in this URL: https://www2.census.gov/
It contains the entire download tree for census public files, though it takes a while to work out what each link is for.
Lots of useful stuff here.