Python Forum
Mechanize and BeautifulSoup read incorrect hours
#1
Hi all.
I'm experiencing a problem while scraping information from this URL.

The problem arises because mechanize changes the hours while retrieving the HTML source code: every time is shifted by -1 hour. I think it might depend on some local configuration on my system (I live in Italy and the site might use another time zone).

That said, I could not solve the problem, so I'm asking for some help :)

This is a brief working extract of my code:
from __future__ import print_function

from bs4 import BeautifulSoup

import regex as re
import mechanize
from datetime import datetime

URL_PAGE = 'https://www.myfxbook.com/forex-economic-calendar'

# retrieve html code      
br = mechanize.Browser()
br.set_handle_robots(False)
br.addheaders = [('User-agent', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1')]        
html_content = br.open(URL_PAGE).read()

# soup
soup = BeautifulSoup(html_content, "html.parser")

# regexes for extraction
cal_row_re  = re.compile(r'^calRow.*')             # <-- id of an event row
date_re     = re.compile(r'\w+\s?\d+:\d+')         # <-- date and time text

#extracting events
CalEvents = soup.find_all(id=cal_row_re)

for singleEvent in CalEvents:
    date = singleEvent.find(text=date_re).strip()
    eventName = singleEvent.find(class_='noUnderline').get_text().strip()
    print(date, eventName, sep = ';')
Thank you in advance
#2
Why are you using regex and BeautifulSoup together? You shouldn't use regex for parsing web pages, and it makes even less sense when you are already using BeautifulSoup.

Quote:The problem arises because mechanize changes the hours while retrieving the html source code. Any hour has a delay of -1 hours. I think it might depend on some local configuration on my system (I live in Italy and the site might have another time zone).
Sites usually default to GMT. If you log in and set the time zone of your account, you will get a different time from the one shown to guests (and this program does not appear to log in). GMT is one hour behind Italy's time zone, so you probably just need to log into your account to get the proper GMT+1 for Italy. Or just add an hour to each time after parsing it.
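
A minimal sketch of the "add an hour" option, assuming the scraped text looks like "Jan 14, 23:30" and carries no year. The format string and the `shift_to_local` helper are hypothetical, so adjust them to whatever the page really returns. Note that the shift can roll over into the next day.

```python
from datetime import datetime, timedelta

# Assumed format of the scraped text, e.g. "Jan 14, 23:30"; adjust to the real page.
SCRAPED_FORMAT = "%b %d, %H:%M"


def shift_to_local(scraped, year, offset_hours=1):
    """Parse a scraped time string and shift it by offset_hours."""
    # The sample strings carry no year, so the caller supplies it.
    parsed = datetime.strptime("{} {}".format(year, scraped), "%Y " + SCRAPED_FORMAT)
    return parsed + timedelta(hours=offset_hours)


shifted = shift_to_local("Jan 14, 23:30", 2019)
print(shifted.strftime("%Y-%m-%d %H:%M"))  # 2019-01-15 00:30
```

This only corrects the displayed times. It does nothing about which events the site decides to list for a given day.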
#3
Using regex makes sense here, since it searches for matches in tag attributes once the tags are determined.

I am now trying to see what happens if I log in to the site. I'll keep you informed.

Manipulating the hour I get is not a solution, because the code completely skips the events that occur in the first hour of the day.
I would lose information.
#4
(Jan-12-2019, 08:16 AM)vaeVictis Wrote: regex use makes sense, since it search form matches in tag attrs once the tag are determined.

Sorry, I mixed threads up. If you're just compiling patterns to pass to BeautifulSoup, you're fine, although you can still do it with BeautifulSoup alone, without compiled regexes.
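
For example, a CSS attribute selector can stand in for the compiled `^calRow` pattern. This is only a sketch: the HTML below is made-up sample markup that borrows the `calRow` id prefix and the `noUnderline` class from the code above, not the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, only loosely modeled on the ids and class used in the thread.
SAMPLE_HTML = """
<table>
  <tr id="calRow101"><td>Jan 14, 09:30</td><td><a class="noUnderline">Trade Balance</a></td></tr>
  <tr id="calRow102"><td>Jan 14, 13:00</td><td><a class="noUnderline">CPI</a></td></tr>
  <tr id="otherRow"><td>ignored</td></tr>
</table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# CSS attribute selector: every element whose id starts with "calRow".
rows = soup.select('[id^="calRow"]')

event_names = [row.find(class_="noUnderline").get_text().strip() for row in rows]
print(event_names)
```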

(Jan-11-2019, 09:38 AM)vaeVictis Wrote: import regex as re
Now that I see it again: what module are you using for regex? The standard library module is re, not regex. So my next questions are: 1) are you sure the regex is correct, and 2) is it actually extracting what you expect?

(Jan-12-2019, 08:16 AM)vaeVictis Wrote: I am actually trying to see what happen if I log in the site. I'll keep you informed.
I think this would fix your issue.
#5
It actually fixed the problem, using selenium instead of mechanize.
For now, the regex works. I'll switch to the correct module (I'm still coding in Python 2.7 for this project).
By the way, the patterns correctly match what I'm looking for.

Thanks for your support.
#6
If selenium works where mechanize did not, then maybe JavaScript was part of collecting the correct time. There is definitely JavaScript on that site: hovering over the time zones changes their class names, and even clicking one requires JavaScript.
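
For anyone landing here later, a rough sketch of the selenium route, assuming Firefox and geckodriver are installed. The `fetch_rendered_html` helper is a made-up name, and this is not necessarily how the original poster did it. The point is that the browser runs the page's scripts before the source is read, which mechanize never does.

```python
def fetch_rendered_html(url):
    """Return the page source after a real browser has loaded the page."""
    # Imported here so this module still loads on machines without selenium.
    from selenium import webdriver

    driver = webdriver.Firefox()  # assumption: Firefox and geckodriver are available
    try:
        driver.get(url)
        # page_source reflects the DOM after the scripts that ran during load.
        return driver.page_source
    finally:
        # Always close the browser, even if loading the page fails.
        driver.quit()
```

The returned string can be passed to BeautifulSoup in place of html_content from the first post. If the times are filled in by a script that runs after the page load finishes, an explicit wait may be needed before reading page_source.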

Either way, glad you figured a method out.