Python Forum
Multiprocessing OSError 'too many open files'
#1
I am using multiprocessing to do some link processing.
The code I am using to create the processes is the same as shown in one of my earlier posts:
while len(processes) < len(urls): # keep going until a process has been created for every URL
    if len(processes) - len([p for p in processes if not p.is_alive()]) < OPTIONS['max_proc']: # fewer live processes than the cap
        p = Process(target=Links.process_link, args=(urls[index], OPTIONS)) # create a new process
        processes.append(p) # add it to the list
        p.start()

        index += 1

for p in processes:
    p.join() # don't continue the main script until all processes have finished
The problem is I get this error:
Error:
OSError: [Errno 24] Too many open files
The first thing I tried was using contextlib's closing function to close the requests I make to web pages, like this:
 with closing(urlopen(Request(url, headers={'User-Agent': 'Mozilla/3.0'}), context=CONTEXT)) as response:
That didn't work.
I do open a file and write to it many times, so I used the same closing function to see if that would help:
with closing(gzip.GzipFile(DIR_PATH+'/links.data.gz', 'a')) as lnk: #close file after we have finished with it
Again, that didn't work.
That means it must be something to do with the number of processes being opened. There are a few answers online, but none worked. The closest one to my exact problem said to add p.terminate() under p.join(), so each process is closed after it's done.
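For reference, that suggestion amounts to something like this:
for p in processes:
    p.join()      # wait for the process to finish
    p.terminate() # then explicitly release its resources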
However, the error occurs before it even makes it to the for loop. And since it's an OSError, there isn't really much (any) useful information in the stack trace.

What is the problem here?
#2
I expect you know by now that it's incredibly difficult for us to help with code when we're provided only 10 lines of who knows how much. What you should do is make a copy of your code and simplify it until there's nothing left that can be removed. Any user input should be hard-coded to reproduce the problem, and any line of code that can be removed without preventing the problem from being reproduced should be removed.

Once you've gotten to that point, you'll have code that we can actually look at and start trying to figure out what's wrong. The fewer lines of code you're able to reduce to, the more likely you will be to get a satisfactory response.
#3
A Python error message usually also tells on which line the error occurs. Can you share that information?
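It may also help to check what the limit on open files actually is for your process. On Unix-like systems the standard resource module can show it:
import resource

# soft and hard limits on open file descriptors for the current process
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print('soft limit:', soft, 'hard limit:', hard)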
#4
(Dec-27-2019, 04:36 PM)micseydel Wrote: I expect you know by now that it's incredibly difficult for us to help with code when we're provided only 10 lines of who knows how much. What you should do is make a copy of your code and simplify it until there's nothing left that can be removed. Any user input should be hard-coded to reproduce the problem, and any line of code that can be removed without preventing the problem from being reproduced should be removed.

Once you've gotten to that point, you'll have code that we can actually look at and start trying to figure out what's wrong. The fewer lines of code you're able to reduce to, the more likely you will be to get a satisfactory response.
That's fair enough. I was trying to keep the code to a minimum.
I've removed unnecessary code, and added a few hardcoded values.

Here's the code - I believe the libraries you need are requests, urllib, lxml and psaw.
import gzip
import json
import requests
from contextlib import closing
from multiprocessing import Process
from urllib.request import urlopen, Request
from urllib.error import HTTPError
from lxml import etree
from psaw import PushshiftAPI

# OPTIONS (a settings dict) and CONTEXT (an ssl.SSLContext) are defined elsewhere in the full program

class Links():
    def start_sorting_links():
        processes = []
        index = 0

        api = PushshiftAPI()
        SUBMISSIONS = api.search_submissions(subreddit='wellthatsucks', filter=['url', 'over_18'], limit=500)

        urls = [subs.d_['url'] for subs in SUBMISSIONS] #get all the urls and put in array

        while len(processes) < len(urls): # keep going until a process has been created for every URL
            if len(processes) - len([p for p in processes if not p.is_alive()]) < OPTIONS['max_proc']: # fewer live processes than the cap
                p = Process(target=Links.process_link, args=(urls[index], )) # create a new process
                processes.append(p) # add it to the list
                p.start()

                index += 1

        for p in processes:
            p.join() # don't continue the main script until all processes have finished

    def process_link(url):
        with closing(gzip.GzipFile('/links.data.gz', 'a')) as lnk: #close file after we have finished with it
            if('/imgur.com/' in url): #imgur plain
                tree = Links.parse_link(url)
                Links.Imgur.imgur_plain(tree, lnk)
            elif('/i.imgur.com/' in url): #imgur direct
                Links.Imgur.imgur_direct(url, lnk)
            elif('/gfycat.com/' in url): #gfycat plain
                tree = Links.parse_link(url)
                Links.Gfycat.gfycat_plain(tree, lnk)
            elif('/thumbs.gfycat.com/' in url): #gfycat direct
                Links.Gfycat.gfycat_direct(url, lnk)
            elif('/i.redd.it/' in url): #reddit image / direct
                Links.Reddit.i_reddit(url, lnk)
            elif('/v.redd.it' in url): #reddit video / plain
                Links.Reddit.v_reddit(url, lnk)
            elif('/giphy.com/' in url): #giphy plain
                tree = Links.parse_link(url)
                Links.Giphy.giphy_plain(tree, lnk)
            elif('/media.giphy.com/' in url): #giphy partial direct 
                Links.Giphy.giphy_part_direct(url, lnk)
            elif('/i.giphy.com/' in url): #giphy full direct 
                Links.Giphy.giphy_full_direct(url, lnk)
            else:
                pass #clearly not anything we want here!
            lnk.flush() #flush the buffer
        
    class Imgur():
        def imgur_plain(tree, file):
            l = tree.xpath('/html/head/link[12]')
            try:
                direct_link = [i.attrib['href'] for i in l][0]
            except Exception:
                return
            n = tree.xpath('/html/body/div[7]/p[2]')
            nsfw = [i.text for i in n] # nsfw flag text (collected but not used yet)
            file.write('{0}\n'.format(str(direct_link)).encode())
        def imgur_direct(url, file): #direct links don't need processing
            file.write('{0}\n'.format(str(url)).encode())
            
    class Gfycat():
        def gfycat_plain(tree, file):
            l = tree.xpath('/html/head/meta[51]')
            direct_link = [i.attrib['content'] for i in l][0]
            #almost impossible to find out whether gif is nsfw or not on gfycat
            file.write('{0}\n'.format(str(direct_link)).encode()) #just gonna risk it
        def gfycat_direct(url, file): #direct links don't need processing
            file.write('{0}\n'.format(str(url)).encode()) #just gonna risk it

    class Reddit(): ####NSFW
        def i_reddit(url, file): #direct links don't need processing
            file.write('{0}\n'.format(str(url)).encode())

        def v_reddit(url, file): #v.reddit is video, and will direct you back to normal reddit page so i need to get direct link
            full_url_jsonified = str(requests.get(url).url) + '.json' # v.redd.it redirects to the normal reddit page; append .json to that URL to get JSON data about the video
            try:
                req = urlopen(full_url_jsonified, context=CONTEXT)
            except HTTPError:
                    return                
            data = json.load(req) #load the request into a json format
            direct_link = data[0]['data']['children'][0]['data']['secure_media']['reddit_video']['fallback_url'] #it uses the fallback url which is just a direct url to the video      
            file.write('{0}\n'.format(str(direct_link)).encode())

    class Giphy():
        def giphy_plain(tree, file):
            l = tree.xpath('/html/head/meta[19]')
            direct_link = [i.attrib['content'] for i in l][0].replace('media', 'i', 1) #replace first instance of 'media' with 'i' and that will get you direct link
            #very few giphy gifs have a pg rating / nsfw tag
            file.write('{0}\n'.format(str(direct_link)).encode())

        def giphy_part_direct(url, file):
            suffix = url.split(".")[-1]
            direct_link = url.replace('media', 'i', 1) #changing one of the 'media's makes it a direct link
            file.write('{0}\n'.format(str(direct_link)).encode())

        def giphy_full_direct(url, file): # direct links don't need processing
            file.write('{0}\n'.format(str(url)).encode())

    def parse_link(url):
        htmlparser = etree.HTMLParser() #create parser
        with closing(urlopen(Request(url, headers={'User-Agent': 'Mozilla/3.0'}), context=CONTEXT)) as response: # get the page html; the User-Agent header makes it look like a browser is accessing the page, otherwise some sites detect Python and block the connection
            return etree.parse(response, htmlparser) #create an element tree out of it
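To reproduce it, calling the entry point from under the usual main guard should be enough (multiprocessing needs that guard on platforms that use the spawn start method):
if __name__ == '__main__':
    Links.start_sorting_links()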

(Dec-27-2019, 04:46 PM)ibreeden Wrote: A Python error message usually also tells on which line the error occurs. Can you share that information?
I'm actually struggling to reproduce the error.
I've run it twice, and the error hasn't occurred. I'm wondering if I was getting the error because there were too many processes left over from other times I have run the program.

Multiprocessing makes a huge speed difference in this part of the code, but if it's going to be finicky like this, I might not even continue with it.
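If I do keep it, I might try a fixed-size worker pool instead of creating one process per URL. A rough sketch (max_proc here stands in for OPTIONS['max_proc']):
from multiprocessing import Pool

def process_all(urls, max_proc):
    # at most max_proc worker processes ever exist at once,
    # so the number of open pipes and descriptors stays bounded
    with Pool(processes=max_proc) as pool:
        pool.map(Links.process_link, urls)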