Python Forum

Full Version: Trying to write a code to get a long list of unknown URLs
I am using the Princeton Review Top 384 Colleges as a source in a research project. Ultimately, I want to write a program that visits the website of each of these colleges and builds a database of professors listed by subject. For now, I'm concerned with the part of the code that finds the URL of each college's website. I am new to Python but have been reading about requests, urllib, BeautifulSoup, etc. All the questions and tutorials I have found so far focus on requesting data from sites (one site or a short list of sites) where the URLs are already known. Finding the websites of all 384 colleges by hand and putting them into a list would take hours, and I am trying to avoid that. Advice?
You can expect each of them to have a .edu URL. So make a list of the colleges' full names. Then write an algorithm that comes up with possible ways the name might be used in a URL. Say you have 'University of Virginia' in your list. The algorithm might try the initials (uv.edu). That's not valid. It might try the initial plus the two-letter state abbreviation (uva.edu). That's not the main address, but it redirects to the right one, which is something you might want to check for. Or it might treat 'university' and 'of' as words too common in college names to be useful, drop them, and try virginia.edu (that's the right one).
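
As a rough sketch of that name-to-URL step (not a complete solution): the function below builds a few guesses from a college's full name. The function name `candidate_domains`, the word lists, and the optional `state` argument are all made up for illustration, and the rules are only the ones described above (initials, initial plus state abbreviation, and the name with common words dropped).

```python
import re

# Assumed word lists; extend them as you find names they handle badly.
SMALL_WORDS = {"of", "the", "at", "and", "for", "in"}
GENERIC_WORDS = SMALL_WORDS | {"university", "college", "institute", "school"}


def candidate_domains(name, state=None):
    """Return guessed .edu domains for a college name, in the order they were built."""
    words = re.findall(r"[a-z]+", name.lower())
    # Initials of every word except the small connecting words: "uv", "rit", ...
    initials = "".join(w[0] for w in words if w not in SMALL_WORDS)
    # Words that are not shared by most college names: "virginia", "rochester", ...
    distinctive = [w for w in words if w not in GENERIC_WORDS]

    stems = [initials]
    if state:
        # First initial plus the two-letter state code: "u" + "va" -> "uva"
        stems.append(initials[:1] + state.lower())
    if distinctive:
        stems.append("".join(distinctive))  # all distinctive words run together
        stems.append(distinctive[0])        # just the first distinctive word

    # Drop empty stems and duplicates while keeping the order.
    domains = []
    for stem in stems:
        domain = stem + ".edu"
        if stem and domain not in domains:
            domains.append(domain)
    return domains


print(candidate_domains("University of Virginia", state="VA"))
```

For 'University of Virginia' with state 'VA' this produces uv.edu, uva.edu and virginia.edu, the three guesses discussed above. The guesses still have to be checked, since nothing here knows which of them actually exist.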

So you need a way to translate a full name into potential URLs, and you need a way to check those URLs to see if they are correct. Also watch for sites that belong to a college, but not the one you were looking for. For example, Rochester, NY has two colleges: Rochester Institute of Technology (RIT) and University of Rochester (UofR). But rochester.edu goes to UofR, not RIT.
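
One possible way to do the checking, sketched with only the standard library. The names `fetch` and `judge` are hypothetical, as is the assumption that the site answers at `https://www.` plus the domain. Looking for the college's full name in the page text is a crude test I am assuming here; it will miss pages that only use a short form of the name, so treat its answer as a hint rather than a verdict.

```python
import http.client
import urllib.error
import urllib.request
from urllib.parse import urlparse


def fetch(domain, timeout=10):
    """Return (final_url, html) for a domain, or None if it cannot be reached."""
    request = urllib.request.Request(
        "https://www." + domain,  # assumed address pattern
        headers={"User-Agent": "college-url-check"},
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            # Read only the start of the page; the name usually appears early.
            html = response.read(200_000).decode("utf-8", errors="replace")
            # geturl() gives the address after any redirects were followed.
            return response.geturl(), html
    except (urllib.error.URLError, http.client.HTTPException, OSError, ValueError):
        return None


def judge(name, domain, result):
    """Classify a fetch result: 'unreachable', 'wrong school', 'redirect' or 'match'."""
    if result is None:
        return "unreachable"
    final_url, html = result
    # Crude identity test: does the page mention the college's full name?
    if name.lower() not in html.lower():
        return "wrong school"
    final_host = urlparse(final_url).hostname or ""
    if final_host == domain or final_host.endswith("." + domain):
        return "match"
    # The page is the right school, but it lives on a different domain.
    return "redirect"
```

Calling `judge(name, domain, fetch(domain))` for each guess gives one of the four labels. Keeping the network call in `fetch` and the decision in `judge` means the decision logic can be tried out on saved pages without hitting 384 websites every time.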
Code removed, sorry. It violates the terms of use.