Python Forum
Downloading txt files
#1
I am trying to learn how to download txt files from the web. I am familiar with downloading pdfs but when I've tried text files I haven't had that much luck.

I'm not in a class, but I am trying to learn this, which is why I posted my question here.

This is the code I'm trying to run.

from __future__ import print_function

import requests
from bs4 import BeautifulSoup
import os


def file_links_filter(tag):
    """
    Link filter. Return True for links that end with 'pdf', 'htm' or 'txt'.
    """
    # BeautifulSoup passes the href value here, or None if the <a> has no href
    if isinstance(tag, str):
        return tag.endswith('pdf') or tag.endswith('htm') or tag.endswith('txt')
    return False


def get_links(tags_list):
    return [WEB_ROOT + tag.attrs['href'] for tag in tags_list]


def download_file(file_link, folder):
    content = requests.get(file_link).content
    name = file_link.split('/')[-1]
    save_path = os.path.join(folder, name)

    print("Saving file:", save_path)
    with open(save_path, 'wb') as fp:
        fp.write(content)


WEB_ROOT = 'https://www.sec.gov'
# directory in which files will be downloaded; open() does not expand '~', so do it here
SAVE_FOLDER = os.path.expanduser('~/download_files/')
os.makedirs(SAVE_FOLDER, exist_ok=True)  # create the folder if it does not exist yet

r = requests.get("https://www.sec.gov/litigation/suspensions.shtml")

soup = BeautifulSoup(r.content, 'html.parser')

years = soup.select("p#archive-links > a")  # css selector for all <a> directly inside the <p id='archive-links'> tag
years_links = get_links(years)

links_to_download = []
for year_link in years_links:
    page = requests.get(year_link)
    beautiful_page = BeautifulSoup(page.content, 'html.parser')

    links = beautiful_page.find_all("a", href=file_links_filter)
    links = get_links(links)

    links_to_download.extend(links)

# make set to exclude duplicate links
links_to_download = set(links_to_download)

print("Got links:", links_to_download)

for link in links_to_download:
    download_file(link, SAVE_FOLDER)
This is the error I receive.

Error:
===================== RESTART: C:/Python365/SEC Test.py =====================
Traceback (most recent call last):
  File "C:/Python365/SEC Test.py", line 3, in <module>
    import requests
ModuleNotFoundError: No module named 'requests'
>>>
I installed requests using pip install. I've tried uninstalling it and then reinstalling it. No luck. Can you point me in another direction?

Any help you can provide will be most appreciated!
Reply
#2
You can install requests with:

py -m pip install requests
But you can also use urllib.request.urlopen from the standard library:

from urllib.request import urlopen
from bs4 import BeautifulSoup


req = urlopen('http://google.de')
bs = BeautifulSoup(req.read(), 'html.parser')
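If the goal is only to save a file such as a .txt to disk, urlopen can do that without BeautifulSoup. The following is a rough sketch, not a drop-in replacement for the script above: save_url is a made-up helper name, and the file name is simply taken from the last part of the URL, which is an assumption that will not hold for every site.

```python
import shutil
from pathlib import Path
from urllib.request import urlopen


def save_url(url, folder):
    """Download url into folder and return the path of the saved file.

    save_url is a hypothetical helper; the file name is assumed to be
    the last path segment of the URL.
    """
    folder = Path(folder).expanduser()          # turn '~' into the home directory
    folder.mkdir(parents=True, exist_ok=True)   # create the folder if needed
    target = folder / url.rstrip('/').split('/')[-1]
    # stream the response straight into the file, in binary mode
    with urlopen(url) as response, open(target, 'wb') as fp:
        shutil.copyfileobj(response, fp)
    return target


# Example call (placeholder URL, replace with a real link to a txt file):
# save_url('https://example.com/some/file.txt', '~/download_files/')
```

Opening the file in 'wb' mode writes the bytes exactly as they were received, so the same function works for txt, htm and pdf files.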
Reply
#3
Do you have more than one python installation?
Reply
#4
Note that one can write
tag.endswith(('pdf', 'htm', 'txt'))
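str.endswith accepts either a single string or a tuple of strings. The inner parentheses build the tuple and the outer ones belong to the method call, so one call can replace the chain of 'or' tests. A small sketch follows; has_wanted_suffix and WANTED_SUFFIXES are names made up for this illustration, as are the sample file names.

```python
# The tuple of suffixes; passing it to endswith() checks all of them at once
WANTED_SUFFIXES = ('pdf', 'htm', 'txt')


def has_wanted_suffix(href):
    """Return True if href is a string ending with one of WANTED_SUFFIXES."""
    return isinstance(href, str) and href.endswith(WANTED_SUFFIXES)


print(has_wanted_suffix('report.pdf'))    # True
print(has_wanted_suffix('notes.txt'))     # True
print(has_wanted_suffix('about.shtml'))   # False
print(has_wanted_suffix(None))            # False, e.g. an <a> without href
```

Without the inner parentheses, tag.endswith('pdf', 'htm', 'txt') would pass three separate arguments, and endswith would treat the second and third as start and end positions and raise a TypeError.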
Reply
#5
Gribouillis - Thank you for your response. Does this mean I can lose the 'or' statements? Also, why do you have the double parentheses?

I appreciate the insight!

Thanks!

(Aug-27-2018, 04:51 PM)buran Wrote: Do you have more than one python installation?

Yes I do. Should I uninstall all but one?

Thanks!
Reply
#6
(Aug-27-2018, 05:36 PM)tjnichols Wrote: Yes I do. Should I uninstall all but one?
You can have more than one installation, it's OK. But, as in this case, when you install third-party packages you need to make sure you install them for the correct Python installation. You installed the requests package for a different Python installation, not the one used to run your script.
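One way to check this is to run a few lines with the same Python that runs your script (for example from the same IDLE window). This is only a sketch; describe_environment is a made-up helper name, and it just reports which interpreter is running and whether that interpreter can see the module.

```python
import importlib.util
import sys


def describe_environment(module_name="requests"):
    """Report the running interpreter and whether it can find module_name.

    describe_environment is a hypothetical helper for illustration.
    """
    spec = importlib.util.find_spec(module_name)  # None if not importable
    return {
        "interpreter": sys.executable,
        "version": sys.version.split()[0],
        "module_found": spec is not None,
    }


info = describe_environment()
print("Running under:", info["interpreter"], "(Python", info["version"] + ")")
print("requests importable:", info["module_found"])
# Installing with this exact interpreter puts the package in the right place
print("To install for this interpreter, run:")
print('"' + sys.executable + '"', "-m pip install requests")
```

If the interpreter path printed here differs from the one your pip command belongs to, that explains the ModuleNotFoundError.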
Reply
#7
An update - it works! Thanks for your help!
Reply

