Resolving YouTube search links
Python Forum (https://python-forum.io), Forum: Web Scraping & Web Development, Thread: /thread-28658.html

Resolving YouTube search links - pythonnewbie138 - Jul-28-2020
The goal of the program is as follows:
Libraries being utilized:
I went through a number of example scripts, trying to understand how the process works and to get some sort of successful result. This is as close as I've gotten. I hit a wall: no search results are printed to the terminal for selection. I'll paste my code below. I've commented as much as I understand well, but I would appreciate help explaining the lines that have no comment at the end (or whose comment is incorrect), and help understanding why the query returns no search results. I know little about HTML and think that's part of my struggle in writing this one.

# import necessary modules
import requests
import urllib.parse
import urllib.request
from bs4 import BeautifulSoup

# define youtube search function
def getYoutube(name, year):  # 2 arguments - movie name and year
    choice = ""  # start choice as empty
    textToSearch = '{0} {1} + "trailer"'.format(name, year)
    query = urllib.parse.quote(textToSearch)  # create search string
    url = "https://www.youtube.com/results?search_query=" + query  # create search URL object
    response = urllib.request.urlopen(url)
    html = response.read()
    soup = BeautifulSoup(html, "html.parser")  # bs parses html
    inc = 1  # incremental counter to break loop
    for vid in soup.findAll(attrs={'class': 'yt-uix-tile-link'}):  # identify trailer links and print to terminal
        print("#{0}: {1}".format(inc, vid['title']))  # format text to print nicely and print
        print(" ")
        inc += 1  # counter increase by 1 for each loop
        if inc > 5:  # break out after 5 loops
            break
    while True:  # loop to choose trailer
        choice = input("Pick a trailer: ")  # input trailer choice per printed items
        if choice.isdigit():  # if user entered a number,
            choice = int(choice)  # convert input to integer
            break  # break out of loop
        else:
            return 0
    print("You chose: " + soup.findAll(attrs={'class': 'yt-uix-tile-link'})[choice - 1]['title'] + "\n")  # print user choice
    yt_url_end = soup.findAll(attrs={'class': 'yt-uix-tile-link'})[choice - 1]['href']  # create video string object to concat with baseurl
    return "http://www.youtube.com" + yt_url_end  # return chosen URL
    print(yt_url_end)

getYoutube(input(str("Enter a Movie: ")), input(str("Enter a Year: ")))

RE: Resolving YouTube search links - Larz60+ - Jul-28-2020
Using one of the Google packages can probably help (it's up to you to decide which might help). See: https://pypi.org/search/?q=google&o=

RE: Resolving YouTube search links - pythonnewbie138 - Jul-28-2020
(Jul-28-2020, 04:28 PM)Larz60+ Wrote: using one of the google packages can probably help (It's up to you to decide what might help) see: https://pypi.org/search/?q=google&o=
Thanks for the reference!

RE: Resolving YouTube search links - pythonnewbie138 - Jul-31-2020
OK, I've spent hours sifting through the libraries and I'm still failing. I've simplified the script, based on one of the libraries I found, to help identify the issue. It looks like the vids object is not working as expected. I've added comments showing the output I get when printing at various points. Since vids ends up empty, how can I tell if the issue lies with my code or if there are no matching results in the soup? Or both, lol.

from bs4 import BeautifulSoup as bs
import requests

base = "https://www.youtube.com/results?search_query="
qstring = "life"
r = requests.get(base + qstring)
page = r.text
soup = bs(page, 'html.parser')
# print(soup)  # prints html
vids = soup.findAll('a', attrs={'class': 'yt-uix-tile-link'}, limit=5)
# print(vids)  # prints []
videolist = []
for v in vids:
    tmp = 'https://www.youtube.com' + v['href']
    videolist.append(tmp)
    inc = 1
    inc += 1
    if inc > 5:
        break
print(videolist)  # prints []

EDIT: Looks like this task can't be accomplished in this way because the results are generated by JavaScript. Will look for alternative solutions.

RE: Resolving YouTube search links - Larz60+ - Aug-01-2020
It's been a very busy day for me. I'll take a look at this in the AM (EDT) if no one else has done so.

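On the question of whether an empty result means a code bug or missing markup: one rough way to separate the two is to run the same selector against a small page you know contains the class, then against the fetched page. The sketch below uses only the standard library's html.parser instead of BeautifulSoup so it runs anywhere; the helper names (AnchorClassCounter, count_anchor_class, diagnose) and both sample pages are hypothetical illustrations, not real YouTube output.

```python
from html.parser import HTMLParser


class AnchorClassCounter(HTMLParser):
    """Counts <a> tags whose class attribute contains a given class name."""

    def __init__(self, class_name):
        super().__init__()
        self.class_name = class_name
        self.count = 0

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        # attrs is a list of (name, value) pairs; value can be None
        classes = (dict(attrs).get("class") or "").split()
        if self.class_name in classes:
            self.count += 1


def count_anchor_class(html, class_name):
    """Return how many <a> tags in html carry class_name."""
    parser = AnchorClassCounter(class_name)
    parser.feed(html)
    return parser.count


def diagnose(html, class_name):
    """Return a short verdict on where the class name shows up."""
    if count_anchor_class(html, class_name) > 0:
        return "selector matches: the bug is in the later code"
    if class_name in html:
        return "class name is in the page, but not on an <a> tag"
    return "class name is absent from the served HTML"


# Hypothetical sample markup (assumption), not real YouTube output.
static_page = '<a class="yt-uix-tile-link spf-link" href="/watch?v=abc" title="Demo">Demo</a>'
script_page = '<html><body><script>var ytInitialData = {};</script></body></html>'

print(diagnose(static_page, "yt-uix-tile-link"))
print(diagnose(script_page, "yt-uix-tile-link"))
```

If the class name is absent from the raw text that requests returns, no parser or selector will find it, which would be consistent with the results being built in the browser by JavaScript.
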
RE: Resolving YouTube search links - j.crater - Aug-01-2020
Hello, it seems you are facing the same issue I did a little while ago. The reason was that, since recently, JavaScript seems to be used to create the list of videos (search results). Here is the thread with the solution @snippsat kindly provided (which is to use Selenium instead of BS): https://python-forum.io/Thread-Beautiful-Soup-suddenly-doesn-t-get-full-webpage-html

RE: Resolving YouTube search links - pythonnewbie138 - Aug-01-2020
(Aug-01-2020, 08:08 AM)j.crater Wrote: Hello, it seems you are facing the same issue I did a little while ago. The reason was that, since recently, JavaScript seems to be used to create the list of videos (search results).
Thanks for the input! I started looking into Selenium, but I want the script to be OS-independent and Selenium only works on Linux. I'll probably use a YouTube API wrapper when I get back to this. It's way more than I should need to achieve this, but I haven't found a cross-platform method that works from the public search results. Or I could source the links from another site like TMDB, if they allow hotlinking.

RE: Resolving YouTube search links - j.crater - Aug-01-2020
"Selenium only works on Linux"
This is not the case, I am using Windows and Selenium works just fine ;)

RE: Resolving YouTube search links - pythonnewbie138 - Aug-01-2020
(Aug-01-2020, 07:16 PM)j.crater Wrote: This is not the case, I am using Windows and Selenium works just fine ;)
Thanks for letting me know. I'll have to do some more reading on it. Much appreciated!

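For the YouTube API route mentioned above, a minimal sketch might look like the following. It assumes the YouTube Data API v3 search endpoint and a valid API key, which nobody in the thread has set up; the function names and the sample response are hypothetical, and the actual HTTP request is left out so the snippet runs offline. Treat the response shape as an assumption to check against the API documentation.

```python
from urllib.parse import urlencode

# Assumed endpoint: YouTube Data API v3 search (requires an API key).
API_ENDPOINT = "https://www.googleapis.com/youtube/v3/search"


def build_search_url(query, api_key, max_results=5):
    """Build the request URL for a video search."""
    params = {
        "part": "snippet",
        "type": "video",
        "q": query,
        "maxResults": max_results,
        "key": api_key,
    }
    return API_ENDPOINT + "?" + urlencode(params)


def extract_videos(response_json):
    """Turn a decoded search response into (title, watch_url) pairs."""
    videos = []
    for item in response_json.get("items", []):
        video_id = item.get("id", {}).get("videoId")
        if not video_id:
            continue  # skip channels, playlists, or malformed entries
        title = item.get("snippet", {}).get("title", "")
        videos.append((title, "https://www.youtube.com/watch?v=" + video_id))
    return videos


# Hand-written sample in the assumed response shape, not real API output.
sample_response = {
    "items": [
        {"id": {"kind": "youtube#video", "videoId": "abc123"},
         "snippet": {"title": "Example Movie (2020) Trailer"}},
        {"id": {"kind": "youtube#channel", "channelId": "chan1"},
         "snippet": {"title": "Some Channel"}},
    ]
}

url = build_search_url("Alien 1979 trailer", "YOUR_API_KEY")
print(url)
for title, link in extract_videos(sample_response):
    print(title, link)
```

Because this only needs an HTTP client and a JSON decoder, it should behave the same on Windows, macOS, and Linux; the trade-off is that it needs a key and is subject to the API's quota.
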
RE: Resolving YouTube search links - Axel_Erfurt - Aug-01-2020
Save this as searchyoutube.py in the same folder (credits: searchyt by LBLZR_, https://github.com/LaBlazer/searchyt)

import requests
import logging
import json
import re


class searchyt(object):
    ua = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.113 Safari/537.36"
    # matches the JSON object passed to ytcfg.set(...) on the home page
    config_regexp = re.compile(r'ytcfg\.set\(({.+?})\);')

    def __init__(self):
        self.req = requests.Session()
        self.log = logging.getLogger("ytsearch")
        headers = {"connection": "keep-alive",
                   "pragma": "no-cache",
                   "cache-control": "no-cache",
                   "upgrade-insecure-requests": "1",
                   "user-agent": searchyt.ua,
                   "accept": "*/*",
                   "accept-language": "en-US,en;q=0.9",
                   "referer": "https://www.youtube.com/",
                   "dnt": "1",
                   "maxResults": "500"}
        self.req.headers.update(headers)
        self._populate_headers()

    def _populate_headers(self):
        # fetch the home page and copy client settings into request headers
        resp = self.req.get("https://www.youtube.com/")
        if resp.status_code != 200:
            self.log.debug(resp.text)
            raise Exception(f"error while scraping youtube (response code {resp.status_code})")
        result = searchyt.config_regexp.search(resp.text)
        if not result:
            self.log.debug(resp.text)
            raise Exception("error while searching for configuration")
        config = json.loads(result.group(1))
        if not config:
            self.log.debug(resp.text)
            raise Exception("error while parsing headers")
        updated_headers = {
            "x-spf-referer": "https://www.youtube.com/",
            "x-spf-previous": "https://www.youtube.com/",
            "x-youtube-utc-offset": "120",
            "x-youtube-client-name": str(config["INNERTUBE_CONTEXT_CLIENT_NAME"]),
            "x-youtube-variants-checksum": str(config["VARIANTS_CHECKSUM"]),
            "x-youtube-page-cl": str(config["PAGE_CL"]),
            "x-youtube-client-version": str(config["INNERTUBE_CONTEXT_CLIENT_VERSION"]),
            "x-youtube-page-label": str(config["PAGE_BUILD_LABEL"])
        }
        self.log.debug(f"Headers: {updated_headers}")
        self.req.headers.update(updated_headers)

    def _traverse_data(self, data, match):
        # list: recurse into nested containers only
        if isinstance(data, list):
            for d in data:
                if isinstance(d, (dict, list)):
                    yield from self._traverse_data(d, match)
            return
        # dict
        for key, value in data.items():
            #print(key)
            # if key matches
            if key == match:
                yield value
            if isinstance(value, (dict, list)):
                yield from self._traverse_data(value, match)

    def _parse_videos(self, json_result):
        try:
            json_dict = json.loads(json_result)[1]
            #self.log.debug(json_dict)
            videos = []
            for v in self._traverse_data(json_dict, "videoRenderer"):
                vid = {}
                vid['title'] = v['title']['runs'][0]['text']
                vid['author'] = v['ownerText']['runs'][0]['text']
                vid['id'] = v["videoId"]
                vid['thumb'] = v['thumbnail']['thumbnails'][-1]['url'].split('?', maxsplit=1)[0]
                videos.append(vid)
            return videos
        except Exception as ex:
            self.log.debug(json_result)
            raise ex

    def search(self, query):
        if not isinstance(query, str):
            raise Exception("search query must be a string type")
        resp = self.req.get("https://www.youtube.com/results", params={"search_query": query, "pbj": "1"})
        if resp.status_code != 200:
            self.log.debug(resp.text)
            raise Exception(f"error while getting search results page (status code {resp.status_code})")
        return self._parse_videos(resp.text)

Then, as an example:

import searchyoutube

syt = searchyoutube.searchyt()
findList = []

def find_it(searchtext):
    res = syt.search(searchtext)
    if len(res):
        for x in range(len(res)):
            title = res[x].get("title")
            id = res[x].get("id")
            url = f"https://www.youtube.com/watch?v={id}"
            findList.append(f"{str(x + 1)}:\n{title}\n{id}\n{url}")
    return '\n'.join(findList)

print(find_it("Movie Trailer 2020"))
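To see what the recursive traversal in the class above is doing, here is a standalone copy of the same idea run on a small hand-made structure. The function name traverse and the sample data are hypothetical; real YouTube responses are much larger and their layout may differ or change.

```python
def traverse(data, match):
    """Yield every value stored under the key `match`, at any depth."""
    if isinstance(data, list):
        for item in data:
            if isinstance(item, (dict, list)):
                yield from traverse(item, match)
        return
    for key, value in data.items():
        if key == match:
            yield value
        if isinstance(value, (dict, list)):
            yield from traverse(value, match)


# Hand-made nested sample (assumption), loosely shaped like a results page.
sample = {
    "contents": [
        {"videoRenderer": {"videoId": "aaa",
                           "title": {"runs": [{"text": "First"}]}}},
        {"shelf": {"items": [
            {"videoRenderer": {"videoId": "bbb",
                               "title": {"runs": [{"text": "Second"}]}}}
        ]}},
        {"adRenderer": {"videoId": "zzz"}},
    ]
}

ids = [v["videoId"] for v in traverse(sample, "videoRenderer")]
print(ids)  # ['aaa', 'bbb']
```

Searching by key name rather than by a fixed path means the code does not care how deeply a videoRenderer is nested, which makes it somewhat more tolerant of layout changes than indexing into the response directly.
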