 Resolving YouTube search links
#1
The goal of the program is as follows:
  1. Perform a YouTube search based on user input
  2. Print the first 5 search results to the terminal for user selection
  3. The user makes a selection
  4. Print the chosen URL to the terminal


Libraries being utilized:
  • requests
  • urllib
  • BeautifulSoup

I went through a number of example scripts trying to understand how the process works and then trying to get some sort of successful result. This is as close as I've gotten. I hit a wall with having no search results printed to the terminal for selection. I'll paste my code below. I've commented as much as I understand well but would appreciate some help with explaining the lines that don't have comments at the end (or are incorrect) and help understanding why no search results are being returned from the query.

I know little about html and think that's part of my struggle in writing this one.

# import necessary modules

import requests
import urllib.parse      # URL-encoding helpers (quote)
import urllib.request    # urlopen
from bs4 import BeautifulSoup

# define YouTube search function
def getYoutube(name, year):    # 2 arguments - movie name and year
    choice = ""    # start choice as empty
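    textToSearch = '{0} {1} trailer'.format(name, year)    # build the search text, e.g. "Alien 1979 trailer"
    query = urllib.parse.quote(textToSearch)    # URL-encode the search text (spaces become %20)
    url = "https://www.youtube.com/results?search_query=" + query    # create search URL string
    response = urllib.request.urlopen(url)    # send a GET request and get the HTTP response object
    html = response.read()    # read the response body (raw HTML as bytes)
    soup = BeautifulSoup(html, "html.parser")    # bs parses html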
    inc = 1    # incremental counter to break loop
    for vid in soup.findAll(attrs={'class': 'yt-uix-tile-link'}):    # identify trailer links and print to terminal
        print("#{0}: {1}".format(inc, vid['title']))    # format text to print nicely and print
        print(" ")
        inc += 1    # counter increase by 1 for each loop
        if inc >5:    # break out after 5 loops
            break
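    results = soup.findAll(attrs={'class': 'yt-uix-tile-link'})    # all matching links, in page order
    if not results:    # nothing matched, so there is nothing to choose from
        print("No search results found.")
        return None
    while True:    # loop until the user enters a valid trailer number
        choice = input("Pick a trailer: ")    # input trailer choice per printed items
        if choice.isdigit() and 1 <= int(choice) <= min(5, len(results)):    # must be one of the listed numbers
            choice = int(choice)    # convert input to integer
            break    # break out of loop
        print("Please enter one of the numbers listed above.")    # otherwise ask again

    print("You chose: " + results[choice - 1]['title'] + "\n")    # print user choice
    yt_url_end = results[choice - 1]['href']    # relative video path to append to the base URL
    chosen_url = "https://www.youtube.com" + yt_url_end
    print(chosen_url)    # print chosen URL
    return chosen_url    # return chosen URL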


getYoutube(input("Enter a Movie: "), input("Enter a Year: "))
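
For the lines that build the request, a small offline sketch may help show what `urllib.parse.quote` does to the search text before it is appended to the URL. The helper name `build_search_url` and the movie names are made up for illustration, and nothing is sent over the network.

```python
import urllib.parse

BASE = "https://www.youtube.com/results?search_query="


def build_search_url(name, year):
    """Return the YouTube search URL for '<name> <year> trailer'."""
    text = "{0} {1} trailer".format(name, year)
    # quote() percent-encodes characters that are not safe in a URL,
    # e.g. a space becomes %20 and an ampersand becomes %26.
    return BASE + urllib.parse.quote(text)


print(build_search_url("Inception", 2010))
print(build_search_url("Fast & Furious", 2009))
```

Without the encoding step, a title containing `&` would be read by the server as the start of a second query parameter.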
#2
Using one of the Google packages can probably help (it's up to you to decide which might); see: https://pypi.org/search/?q=google&o=
#3
(Jul-28-2020, 04:28 PM)Larz60+ Wrote: using one of the google packages can probably help (It's up to you to decide what might help) see: https://pypi.org/search/?q=google&o=

Thanks for the reference!
#4
Ok I've spent hours sifting through the libraries and I'm still failing. I've simplified the script based on one of the libraries I found to help identify the issue.

It looks like the vids object is not working as expected. I've commented the response I get with attempts to print at various points.

Since vids ends up empty, how can I tell if the issue lies with my code or if there are no matching results in the soup? Or both lol.

from bs4 import BeautifulSoup as bs
import requests

base = "https://www.youtube.com/results?search_query="
qstring = "life"

r = requests.get(base + qstring)

page = r.text
soup = bs(page, 'html.parser')
# print(soup) #prints html
vids = soup.findAll('a', attrs={'class':'yt-uix-tile-link'}, limit=5)
# print(vids) # prints []
videolist = []

for v in vids:    # limit=5 in findAll already caps this at five links
    tmp = 'https://www.youtube.com' + v['href']
    videolist.append(tmp)

print(videolist) # prints []
EDIT: Looks like this task can't be accomplished this way: the search results are built by JavaScript in the browser, so they never appear in the HTML that requests downloads. Will look for alternative solutions.
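
On the question of telling a code problem from a page that simply lacks the markup, one rough check is to look for the class name both as a parsed tag attribute and as plain text anywhere in the response. The sketch below uses only the standard library's `html.parser`; the names `ClassCounter` and `diagnose` and the two tiny pages are made up for illustration and stand in for a real response body.

```python
from html.parser import HTMLParser


class ClassCounter(HTMLParser):
    """Count start tags whose class attribute contains a given class name."""

    def __init__(self, class_name):
        super().__init__()
        self.class_name = class_name
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None.
        classes = (dict(attrs).get("class") or "").split()
        if self.class_name in classes:
            self.count += 1


def diagnose(html, class_name):
    """Return (tags_with_class, name_appears_in_raw_text) for a page's HTML."""
    counter = ClassCounter(class_name)
    counter.feed(html)
    return counter.count, class_name in html


# Two made-up pages: one with the link in the markup, one with data only in a script.
static_page = '<a class="yt-uix-tile-link" href="/watch?v=abc">Title</a>'
js_page = '<script>var ytInitialData = {"videoId": "abc"};</script>'

print(diagnose(static_page, "yt-uix-tile-link"))  # (1, True)
print(diagnose(js_page, "yt-uix-tile-link"))      # (0, False)
```

A result of `(0, False)` suggests the class is not in the downloaded page at all, so the selector logic is not what is failing.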
#5
It's been a very busy day for me, I'll take a look at this in the AM (EDT) if no one else has done so.
#6
Hello, it seems you are facing the same issue I did a little while ago. The reason is that YouTube has recently started using JavaScript to build the list of videos (search results).
Here is the thread with the solution @snippsat kindly provided (which is to use Selenium instead of BS):
https://python-forum.io/Thread-Beautiful...bpage-html
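
As a rough idea of what the Selenium route could look like (this is not the code from the linked thread), the sketch below lets a real browser render the page and then reads the result links. The CSS selector `a#video-title` is an assumption about YouTube's current markup and may need adjusting, the function names are made up, and a local Chrome install with a matching driver is assumed.

```python
from urllib.parse import urlencode


def build_search_url(query):
    """Build the results-page URL; urlencode handles spaces and symbols."""
    return "https://www.youtube.com/results?" + urlencode({"search_query": query})


def first_results(query, limit=5, timeout=10):
    """Return up to `limit` (title, url) pairs from the rendered results page."""
    # Imported here so the URL helper above works even without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    # ASSUMPTION: result links match this selector; YouTube's markup changes often.
    selector = "a#video-title"

    driver = webdriver.Chrome()  # needs Chrome plus a matching driver
    try:
        driver.get(build_search_url(query))
        # Wait until JavaScript has inserted at least one result link.
        links = WebDriverWait(driver, timeout).until(
            EC.presence_of_all_elements_located((By.CSS_SELECTOR, selector))
        )
        return [
            (link.get_attribute("title"), link.get_attribute("href"))
            for link in links[:limit]
        ]
    finally:
        driver.quit()  # always close the browser


# Example (opens a browser window): first_results("life")
```

The explicit wait is the part that plain requests plus BeautifulSoup cannot do: it gives the page's scripts time to add the links before they are read.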
#7
(Aug-01-2020, 08:08 AM)j.crater Wrote: Hello, it seems you are facing same issues I did a little while ago. And also the reason was that since recently, it seems JavaScript has been used to create the list of videos (search results).
Here is the thread with solution @snippsat kindly provided (which is to use Selenium instead of BS):
https://python-forum.io/Thread-Beautiful...bpage-html

Thanks for the input!

I started looking into Selenium but I want the script to be OS-independent and Selenium only works on Linux. I'll probably use a YouTube API wrapper when I get back to this. It's way more than I should need to achieve this, but I haven't found a cross-platform method that works from the public search results. Alternatively, I could source the links from another site like TMDB, if they allow hotlinking.
#8
"Selenium only works on Linux"
This is not the case, I am using Windows and Selenium works just fine ;)
#9
(Aug-01-2020, 07:16 PM)j.crater Wrote: This is not the case, I am using Windows and Selenium works just fine ;)

Thanks for letting me know. I'll have to do some more reading on it. Much appreciated!
#10
Save this as searchyoutube.py in the same folder as your script (the example further down imports it under that name).
(Credits: searchyoutube made by LBLZR_ https://github.com/LaBlazer/searchyt)

import requests
import logging
import json
import re

class searchyt(object):
    ua = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.113 Safari/537.36"
    config_regexp = re.compile(r'ytcfg\.set\(({.+?})\);')

    def __init__(self):
        self.req = requests.Session()
        self.log = logging.getLogger("ytsearch")
        headers = {"connection": "keep-alive",
                    "pragma": "no-cache",
                    "cache-control": "no-cache",
                    "upgrade-insecure-requests": "1",
                    "user-agent": searchyt.ua,
                    "accept": "*/*",
                    "accept-language": "en-US,en;q=0.9",
                    "referer": "https://www.youtube.com/",
                    "dnt": "1",
                    "maxResults": "500"}
        self.req.headers.update(headers)
        self._populate_headers()
    
    def _populate_headers(self):
        resp = self.req.get("https://www.youtube.com/")

        if resp.status_code != 200:
            self.log.debug(resp.text)
            raise Exception(f"error while scraping youtube (response code {resp.status_code})")

        result = searchyt.config_regexp.search(resp.text)
        if not result:
            self.log.debug(resp.text)
            raise Exception("error while searching for configuration")

        config = json.loads(result.group(1))
        if not config:
            self.log.debug(resp.text)
            raise Exception("error while parsing headers")

        updated_headers = {
            "x-spf-referer": "https://www.youtube.com/",
            "x-spf-previous": "https://www.youtube.com/",
            "x-youtube-utc-offset": "120",
            "x-youtube-client-name": str(config["INNERTUBE_CONTEXT_CLIENT_NAME"]),
            "x-youtube-variants-checksum": str(config["VARIANTS_CHECKSUM"]),
            "x-youtube-page-cl" : str(config["PAGE_CL"]),
            "x-youtube-client-version": str(config["INNERTUBE_CONTEXT_CLIENT_VERSION"]),
            "x-youtube-page-label": str(config["PAGE_BUILD_LABEL"])
        }
        self.log.debug(f"Headers: {updated_headers}")
        self.req.headers.update(updated_headers)

    def _traverse_data(self, data, match):
        # list
        if isinstance(data, list):
            for d in data:
                if isinstance(d, (dict, list)):
                    yield from self._traverse_data(d, match)
            return
        
        # dict
        for key, value in data.items():
            #print(key)
            # if key matches
            if key == match:
                yield value
            if isinstance(value, (dict, list)):
                yield from self._traverse_data(value, match)

    def _parse_videos(self, json_result):
        try:
            json_dict = json.loads(json_result)[1]

            #self.log.debug(json_dict)
            videos = []
            for v in self._traverse_data(json_dict, "videoRenderer"):
                vid = {}
                vid['title'] = v['title']['runs'][0]['text']
                vid['author'] = v['ownerText']['runs'][0]['text']
                vid['id'] = v["videoId"]
                vid['thumb'] = v['thumbnail']['thumbnails'][-1]['url'].split('?', maxsplit=1)[0]
                videos.append(vid)

            return videos
        except Exception as ex:
            self.log.debug(json_result)
            raise ex

    def search(self, query):
        if not isinstance(query, str):
            raise Exception("search query must be a string type")
        
        resp = self.req.get("https://www.youtube.com/results", params = {"search_query": query, "pbj": "1"})

        if resp.status_code != 200:
            self.log.debug(resp.text)
            raise Exception(f"error while getting search results page (status code {resp.status_code})")

        return self._parse_videos(resp.text)
Then, as an example:

import searchyoutube

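syt = searchyoutube.searchyt()

def find_it(searchtext):
    # search() returns a list of dicts with 'title', 'author', 'id' and 'thumb'
    res = syt.search(searchtext)
    found = []    # built fresh on every call so repeated searches don't accumulate
    for n, video in enumerate(res, start=1):
        title = video.get("title")
        video_id = video.get("id")    # avoid shadowing the built-in id()
        url = f"https://www.youtube.com/watch?v={video_id}"
        found.append(f"{n}:\n{title}\n{video_id}\n{url}")
    return '\n'.join(found)    # empty string when there are no results

print(find_it("Movie Trailer 2020"))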
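
The recursive `_traverse_data` helper in the class above is the part that digs the video entries out of the nested JSON. A standalone copy of the same idea, run on a small made-up structure (the `sample` data is invented for illustration and is not a real YouTube response), might make its behaviour easier to follow:

```python
def traverse(data, match):
    """Yield every value stored under key `match`, at any nesting depth."""
    if isinstance(data, list):
        for item in data:
            if isinstance(item, (dict, list)):
                yield from traverse(item, match)
        return
    for key, value in data.items():
        if key == match:
            yield value
        if isinstance(value, (dict, list)):
            yield from traverse(value, match)


# Made-up data, loosely shaped like a nested JSON response.
sample = {
    "contents": [
        {"videoRenderer": {"videoId": "aaa", "title": "First"}},
        {"shelf": {"items": [{"videoRenderer": {"videoId": "bbb", "title": "Second"}}]}},
        {"adRenderer": {"text": "skip me"}},
    ]
}

ids = [v["videoId"] for v in traverse(sample, "videoRenderer")]
print(ids)  # ['aaa', 'bbb']
```

Because the search is by key name rather than by a fixed path, it keeps working when the entries move to a different nesting level, as long as the key itself stays the same.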