Mechanize and BeautifulSoup read incorrect hours
#1
Hi all.
I'm experiencing a problem while scraping information from this URL.

The problem is that the hours come out changed when mechanize retrieves the HTML source: every time is one hour earlier than expected. I think it might depend on some local configuration on my system (I live in Italy and the site might use another time zone).

That said, I could not solve the problem, so I am asking for some help :)

This is a brief working extract of my code:
from __future__ import print_function

from bs4 import BeautifulSoup

import regex as re
import mechanize
from datetime import datetime

URL_PAGE = 'https://www.myfxbook.com/forex-economic-calendar'

# retrieve html code      
br = mechanize.Browser()
br.set_handle_robots(False)
br.addheaders = [('User-agent', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1')]        
html_content = br.open(URL_PAGE).read()

# soup
soup = BeautifulSoup(html_content, "html.parser")

#regex for extraction
cal_row_re  = re.compile(r'^calRow.*')             # <-- name
date_re     = re.compile(r'\w+\s?\d+:\d+')         # <-- date

#extracting events
CalEvents = soup.find_all(id=cal_row_re)

for singleEvent in CalEvents:
    date = singleEvent.find(text=date_re).strip()
    eventName = singleEvent.find(class_='noUnderline').get_text().strip()
    print(date, eventName, sep = ';')
Thank you in advance.
#2
Why are you using regex together with BeautifulSoup? You shouldn't use regex for web parsing, and it makes even less sense when you are already using BeautifulSoup.

Quote:The problem is that the hours come out changed when mechanize retrieves the HTML source: every time is one hour earlier than expected. I think it might depend on some local configuration on my system (I live in Italy and the site might use another time zone).
Sites usually default to GMT. If you log in and set the time zone of your account, you will get different times from the ones shown to guests (and this program does not appear to log in). GMT is one hour behind the Italian time zone, so you probably just need to log in to your account to get the proper GMT+1 for Italy. Or just add an hour to each time after parsing it.
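A minimal sketch of the "add an hour after parsing" option, assuming the page shows times such as "Jan 12 23:30" in GMT and that the year is known from elsewhere. The format string, the shift_from_gmt helper, and the sample value are illustrative assumptions, not taken from the site. A fixed one-hour offset also ignores daylight saving time, when Italy is two hours ahead of GMT.

```python
from datetime import datetime, timedelta

# Assumption: scraped times look like "Jan 12 23:30" and are in GMT.
# The thread does not show the real format, so adjust this as needed.
ASSUMED_FORMAT = "%Y %b %d %H:%M"


def shift_from_gmt(text, year, offset_hours=1):
    """Parse a scraped time string and shift it by a fixed number of hours.

    Hypothetical helper: the year is passed in because the assumed
    format does not carry one.
    """
    parsed = datetime.strptime("%d %s" % (year, text), ASSUMED_FORMAT)
    return parsed + timedelta(hours=offset_hours)


shifted = shift_from_gmt("Jan 12 23:30", 2019)
print(shifted)
```

Working on a full datetime rather than patching the hour digits means a time close to midnight rolls over to the next day instead of producing an invalid hour.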
#3
Using regex makes sense here, since it searches for matches in tag attributes once the tags have been determined.

I am now trying to see what happens if I log in to the site. I'll keep you informed.

Manipulating the hours I get is not the solution, because the code completely skips the events that occur in the first hour of the day.
I lose information.
#4
(Jan-12-2019, 08:16 AM)vaeVictis Wrote: Using regex makes sense here, since it searches for matches in tag attributes once the tags have been determined.

Sorry, I mixed threads up. If you're just compiling patterns to pass to BeautifulSoup, you're fine. You can still do it without compiled regexes, though, using BeautifulSoup alone.
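As a rough illustration of doing this without a compiled pattern, BeautifulSoup accepts a function as an attribute filter. The markup below is hypothetical and only shaped like what the original script expects (rows whose id starts with calRow, event names in a noUnderline element); the real page may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, not copied from the real page.
SAMPLE_HTML = """
<table>
  <tr id="calRow1"><td>Jan 12 00:30</td><td><a class="noUnderline">Event A</a></td></tr>
  <tr id="header"><td>ignored</td></tr>
  <tr id="calRow2"><td>Jan 12 09:00</td><td><a class="noUnderline">Event B</a></td></tr>
</table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# A function filter stands in for the compiled '^calRow.*' pattern.
# It receives None for tags that have no id attribute at all.
rows = soup.find_all(
    id=lambda value: value is not None and value.startswith("calRow")
)

names = [row.find(class_="noUnderline").get_text().strip() for row in rows]
print(names)
```

Whether this is clearer than the regex is a matter of taste; both approaches select the same rows in this sample.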

(Jan-11-2019, 09:38 AM)vaeVictis Wrote: import regex as re
Now that I see it again: what module are you using for regex? The standard library module is re, not regex. So my next questions are: 1) are you sure the regex is correct, and 2) is it actually extracting what you expect?
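One quick way to answer both questions is to run the date pattern against a few strings with the standard library re module. The sample strings here are made up, since the thread does not show the page's real time format.

```python
import re

# Same pattern as in the original script, compiled with the standard library.
date_re = re.compile(r'\w+\s?\d+:\d+')

# Hypothetical cell texts; replace them with strings copied from the page.
samples = ["Jan 12 00:30", "00:30", "Jan 12", "All Day"]

for text in samples:
    match = date_re.search(text)
    # Show which part of the string the pattern actually matched, if any.
    print(repr(text), "->", match.group(0) if match else None)
```

Note that the matched part can be shorter than the whole string. When a pattern like this is passed to a BeautifulSoup search, the whole text node is returned if the pattern matches anywhere inside it, so a loose pattern can still appear to work.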

(Jan-12-2019, 08:16 AM)vaeVictis Wrote: I am now trying to see what happens if I log in to the site. I'll keep you informed.
I think this would fix your issue.
#5
It actually fixed the problem, using selenium instead of mechanize.
For now, the regex works. I'll switch to the correct module (I'm still coding in Python 2.7 for this project).
By the way, the patterns correctly match what I'm looking for.

Thanks for your support.
#6
If selenium works where mechanize does not, then maybe JavaScript was a part of collecting the correct time. There is definitely JavaScript on that site: hovering over the time zones changes their class names, and even clicking one requires JavaScript.

Either way, glad you figured out a method.