Python Forum
Code worked in shell but not when I tried in my project.
#1
First of all, I really don't know what the error is, so I can't come up with a better header.

I was learning scraping with Scrapy by following the tutorial on Scrapy's site. Everything worked fine until I tried it on another website that I wanted to scrape. When I ran the code step by step in the Scrapy shell, it worked as I wanted. The following spider gave me a blank file when I ran it and tried to store the output as a .json file [scrapy crawl -O xxx.json].
import scrapy
from datetime import datetime

class FinvizNewsSpider(scrapy.Spider):
    name = "finvizNews"

    start_urls = [
        'https://finviz.com/quote.ashx?t=ANPC'
    ]

    def parse(self, response):
        for news in response.css("tr"):
            yield {
                'news_time' : datetime.strptime(news.css("td::text").get().replace('\xa0',''),'%b-%d-%y %I:%M%p')#,
                
            }
Could you guys point out to me what's wrong with my code?
The data that I'm interested in: https://imgur.com/a/mz3bTnr time in the red box

Thank you very much!
#2
(Apr-18-2021, 03:48 AM)yoohooos Wrote: Could you guys point out to me what's wrong with my code?
The data that I'm interested in: https://imgur.com/a/mz3bTnr time in the red box
You do not find the data because it is generated by JavaScript.
This is a common problem that everyone faces when starting out with scraping.
One solution is to use Selenium, which can be used together with Scrapy,
or, if you only want data from this site, it is easier to just use Selenium on its own.
Example:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from time import sleep

#--| Setup
options = Options()
options.add_argument("--headless")
options.add_argument("--user-agent=Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.50 Safari/537.36")
#options.add_argument("--window-size=1980,1020")
# Selenium 4 style: the driver path is passed through Service (older versions used executable_path=)
browser = webdriver.Chrome(service=Service(r'C:\cmder\bin\chromedriver.exe'), options=options)
#--| Parse or automation
url = "https://finviz.com/quote.ashx?t=ANPC"
browser.get(url)
sleep(3)  # Give the page time to finish loading
# Selenium 4 style: find_element(By.CSS_SELECTOR, ...) replaces find_elements_by_css_selector(...)[0]
date_1 = browser.find_element(By.CSS_SELECTOR, '#news-table > tbody > tr:nth-child(1) > td:nth-child(1)')
date_2 = browser.find_element(By.CSS_SELECTOR, '#news-table > tbody > tr:nth-child(2) > td:nth-child(1)')
print(f'{date_1.text.strip()}\n{date_2.text.strip()}')
browser.quit()  # Close the headless browser when done
Output:
Apr-16-21 04:15PM
Mar-10-21 07:25AM
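If you want datetime objects rather than strings, as in the original spider, the same format string can be applied to the text shown above. This is only a minimal sketch: `parse_news_time` is a hypothetical helper name, and it assumes each cell holds the full `Apr-16-21 04:15PM` form. Anything else (a missing cell, or a cell that shows only a time) comes back as None instead of raising.

```python
from datetime import datetime


def parse_news_time(raw):
    """Parse text like 'Apr-16-21 04:15PM' into a datetime, or return None."""
    if raw is None:
        # Nothing was selected for this row, so there is nothing to parse.
        return None
    # Turn non-breaking spaces into normal spaces and trim the ends.
    cleaned = raw.replace('\xa0', ' ').strip()
    try:
        return datetime.strptime(cleaned, '%b-%d-%y %I:%M%p')
    except ValueError:
        # The text does not match the assumed date + time format.
        return None


print(parse_news_time('Apr-16-21 04:15PM'))  # 2021-04-16 16:15:00
print(parse_news_time('07:25AM'))            # None
```

This may also be relevant to the spider itself: in Scrapy, `.get()` returns None when a selector matches nothing, so calling `.replace()` directly on its result raises an error for any `tr` without matching `td` text.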