Nov-08-2021, 10:01 PM
Hello;
I am learning python and tried the following code which is posted here,
the code should work to scrape details of given video URL in CSV file,
the video URLs are in a CSV file, tab-delimited, it has a header (urls), for testing purposes I put only 4 youtube URLs in the CSV file.
when I run the following code on my machine ( Windows 7 64bit, running python 3.8, using visual studio code), I get no results, no error either, it supposed to export the result in a CSV file, no CSV file is created either.
I think the indentation is correct.
Does anyone have an idea why it does not work?
I am grateful for your help.
Here is the code:
I am learning python and tried the following code which is posted here,
the code should work to scrape details of given video URL in CSV file,
the video URLs are in a CSV file, tab-delimited, it has a header (urls), for testing purposes I put only 4 youtube URLs in the CSV file.
when I run the following code on my machine ( Windows 7 64bit, running python 3.8, using visual studio code), I get no results, no error either, it supposed to export the result in a CSV file, no CSV file is created either.
I think the indentation is correct.
Does anyone have an idea why it does not work?
I am grateful for your help.
Here is the code:
from fake_useragent import UserAgent from bs4 import BeautifulSoup import requests import pandas as pd import time import re import numpy as np def get_youtube_info(url, ua, crawl_delay): header = {"user-agent": ua.random} request = requests.get(url, headers=header, verify=True) soup = BeautifulSoup(request.content, "html.parser") tags = soup.find_all("meta", property="og:video:tag") titles = soup.find("title").text try: getdesc = re.search('description":{"simpleText":".*"', request.text) desc = getdesc.group(0) desc = desc.replace('description":{"simpleText":"', "") desc = desc.replace('"', "") desc = desc.replace("\n", "") except: desc = "n/a" getdate = re.search("[a-zA-z]{3}\s[0-9]{1,2},\s[0-9]{4}", request.text) vid_date = getdate.group(0) return tags, titles, vid_date, desc def tag_matches(desc, vid_tag_list): vid_tag_list = vid_tag_list.split(",") matches = "" for x in vid_tag_list: if desc.find(x) != -1: matches += x + "," return matches df = pd.read_csv("video-urls-4.csv") urls_list = df["urls"].to_list() ua = UserAgent() delays = [*range(10, 22, 1)] df2 = pd.DataFrame( columns=["URL", "Title", "Date", "Views", "Tags", "Tag Matches in Desc"] ) for x in urls_list: crawl_delay = np.random.choice(delays) vid_tags, title, vid_date, desc, views = get_youtube_info( x, ua, crawl_delay ) vid_tag_list = "" for i in vid_tags: vid_tag_list += i["content"] + ", " matches = tag_matches(desc, vid_tag_list) title = title.replace(" - YouTube", "") dict1 = { "URL": x, "Title": title, "Date": vid_date, "Views": views, "Tags": vid_tag_list, "Tag Matches in Desc": matches, } df2 = df2.append(dict1, ignore_index=True) df2.to_csv("vid-detail.csv")