Mechanize and BeautifulSoup read not correct hours

vaeVictis · Jan-11-2019, 09:38 AM

Hi all.
I'm experiencing a problem while scraping information from this URL.

The problem arises because mechanize seems to change the hours while retrieving the HTML source code. Every time is shifted by -1 hour. I think it might depend on some local configuration on my system (I live in Italy and the site might use another time zone).

That said, I could not solve the problem, so I'm asking for some help :)

This is a brief working extract of my code:

from __future__ import print_function

from bs4 import BeautifulSoup

import regex as re
import mechanize
from datetime import datetime

URL_PAGE = 'https://www.myfxbook.com/forex-economic-calendar'

# retrieve html code      
br = mechanize.Browser()
br.set_handle_robots(False)
br.addheaders = [('User-agent', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1')]        
html_content = br.open(URL_PAGE).read()

# soup
soup = BeautifulSoup(html_content, "html.parser")

#regex for extraction
cal_row_re  = re.compile(r'^calRow.*')             # <-- name
date_re     = re.compile(r'\w+\s?\d+:\d+')         # <-- date

#extracting events
CalEvents = soup.find_all(id=cal_row_re)

for singleEvent in CalEvents:
    date = singleEvent.find(text=date_re).strip()
    eventName = singleEvent.find(class_='noUnderline').get_text().strip()
    print(date, eventName, sep = ';')

Thank you in advance

***metulburr*** · (This post was last modified: Jan-11-2019, 03:12 PM by metulburr.)

Why are you using regex and BeautifulSoup? You shouldn't use regex for web parsing. And it makes even less sense to use it when you are already using BeautifulSoup.

Quote: The problem arises because mechanize seems to change the hours while retrieving the HTML source code. Every time is shifted by -1 hour. I think it might depend on some local configuration on my system (I live in Italy and the site might use another time zone).

Sites usually default to GMT. If you log in and set the time zone on your account, you will get different times from what guests see (and this program does not appear to log in). GMT is one hour behind Italy's time zone, so you probably just need to log into your account to get the proper GMT+1 for Italy. Or just add an hour to whatever you get after parsing it.
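The "add an hour after parsing" idea could look roughly like the sketch below. The string format ("Jan 11, 09:30"), the `fmt` default, the `year` argument, and the helper name `shift_event_time` are all assumptions made for illustration; the real page may format its times differently. A fixed one-hour offset also ignores daylight saving time, and it only relabels times that are already in the page.

```python
from datetime import datetime, timedelta


def shift_event_time(text, year, offset_hours=1, fmt="%b %d, %H:%M"):
    """Parse a scraped time string and shift it by a fixed number of hours.

    Assumption: the scraped text looks like "Jan 11, 09:30" and carries
    no year, so the caller has to supply one.
    """
    parsed = datetime.strptime("{} {}".format(year, text), "%Y " + fmt)
    # Adding a timedelta handles the roll-over past midnight.
    return parsed + timedelta(hours=offset_hours)


print(shift_event_time("Jan 11, 09:30", 2019))
print(shift_event_time("Jan 11, 23:30", 2019))  # lands on the next day
```

Working with full datetime objects instead of editing the hour digits in the string keeps the date correct when an event crosses midnight.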

vaeVictis · Jan-12-2019, 08:16 AM

Using regex makes sense here, since it searches for matches in tag attributes once the tags are determined.

I am actually trying to see what happens if I log in to the site. I'll keep you informed.

Manipulating the hour I get is not the solution. The reason is that the code completely skips the events which occur in the first hour of the day, so I lose information.

***metulburr*** · (This post was last modified: Jan-12-2019, 03:09 PM by metulburr.)

(Jan-12-2019, 08:16 AM)vaeVictis Wrote: Using regex makes sense here, since it searches for matches in tag attributes once the tags are determined.

Sorry, I mixed threads up. If you're just compiling a pattern to pass to BeautifulSoup, you're fine, although you can still do it with BeautifulSoup alone, without compiling a regex.
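One way to do the row lookup without a compiled regex is a CSS attribute-prefix selector. This is only a sketch: the sample markup below is made up for illustration, and only the `calRow` id prefix and the `noUnderline` class come from the original post. The helper name `event_names` is hypothetical.

```python
from bs4 import BeautifulSoup

# Made-up sample markup; the real page structure may differ.
SAMPLE_HTML = """
<table>
  <tr id="calRow101"><td>Jan 11, 09:30</td><td><a class="noUnderline">CPI</a></td></tr>
  <tr id="calRow102"><td>Jan 11, 14:00</td><td><a class="noUnderline">GDP</a></td></tr>
  <tr id="other"><td>ignored</td></tr>
</table>
"""


def event_names(html):
    """Return the event names from every row whose id starts with 'calRow'."""
    soup = BeautifulSoup(html, "html.parser")
    # [id^="calRow"] selects elements whose id attribute begins with calRow.
    rows = soup.select('tr[id^="calRow"]')
    return [row.select_one(".noUnderline").get_text(strip=True) for row in rows]


print(event_names(SAMPLE_HTML))
```

The selector replaces `cal_row_re`; the time text would still need its own lookup, for example by cell position, if the page layout allows it.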

(Jan-11-2019, 09:38 AM)vaeVictis Wrote: import regex as re

Now that I see it again: what module are you using for regex? The standard library module is re, not regex. So my next questions are: 1) are you sure the regex is correct, and 2) is it actually extracting what you expect it to?
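A quick way to answer both questions is to run the pattern from the original post against a few strings by hand. The sample strings below are invented, since the thread does not show what the page actually contains, and the sketch uses the standard `re` module rather than the third-party `regex` package.

```python
import re

# Same pattern as date_re in the original post.
DATE_RE = re.compile(r'\w+\s?\d+:\d+')


def first_match(text):
    """Return the first substring matched by DATE_RE, or None if there is none."""
    found = DATE_RE.search(text)
    return found.group(0) if found else None


# Invented samples; replace them with strings copied from the real page.
for sample in ["Jan 11, 09:30", "09:30", "9:30", "All Day"]:
    print(repr(sample), "->", first_match(sample))
```

With this pattern, a bare single-digit hour such as "9:30" does not match, because `\w+` and `\d+` each need at least one character before the colon. Whether the site ever produces such strings is not known from the thread, so this is something to check against the real data.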

(Jan-12-2019, 08:16 AM)vaeVictis Wrote: I am actually trying to see what happens if I log in to the site. I'll keep you informed.

I think this would fix your issue.

vaeVictis · (This post was last modified: Jan-15-2019, 10:35 AM by vaeVictis.)

It actually fixed the problem, using Selenium instead of mechanize.
For now, the regex module works. I'll replace it with the correct module (I'm still coding in Python 2.7 for this project).
By the way, the patterns correctly match what I'm looking for.

Thanks for your support.

***metulburr*** · (This post was last modified: Jan-15-2019, 01:27 PM by metulburr.)

If Selenium works where mechanize does not, then maybe JavaScript was a part of getting the correct time. There is definitely JavaScript on that site: hovering over the time zones changes their class names, and even clicking one requires JavaScript.
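The difference matters because mechanize only downloads the HTML the server sends, while a Selenium-driven browser also runs the page's scripts before you read the DOM. A minimal sketch of the Selenium side follows; the helper name `fetch_rendered_html` is hypothetical, the login steps from the thread are not shown because the thread does not describe them, and scripts that finish late may need an explicit wait that is omitted here.

```python
def fetch_rendered_html(driver, url):
    """Load url in a browser session and return the DOM as currently rendered.

    `driver` is expected to be a Selenium WebDriver instance; the browser
    is always closed afterwards, even if loading fails.
    """
    try:
        driver.get(url)
        # page_source reflects the DOM after the browser has processed the page.
        return driver.page_source
    finally:
        driver.quit()


# Possible usage, assuming Selenium and a Firefox driver are installed:
#
#     from selenium import webdriver
#     html_content = fetch_rendered_html(webdriver.Firefox(), URL_PAGE)
#     soup = BeautifulSoup(html_content, "html.parser")
```

Passing the driver in as an argument keeps the choice of browser outside the function, and the returned string can be handed to BeautifulSoup exactly like the mechanize response in the original code.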

Either way, glad you figured a method out.
