Python Forum
Mechanize and BeautifulSoup read not correct hours


Mechanize and BeautifulSoup read not correct hours - vaeVictis - Jan-11-2019

Hi all.
I'm experiencing a problem while scraping information from the URL shown in the code below.

The problem is that the hours in the HTML source retrieved by mechanize differ from what I see in my browser: every time is shifted by -1 hour. I think it might depend on some local configuration on my system (I live in Italy and the site might use another time zone).

That said, I could not solve the problem, so I'm asking for some help :)

This is a brief working extract of my code:
from __future__ import print_function

from bs4 import BeautifulSoup

import regex as re
import mechanize
from datetime import datetime

URL_PAGE = 'https://www.myfxbook.com/forex-economic-calendar'

# retrieve html code      
br = mechanize.Browser()
br.set_handle_robots(False)
br.addheaders = [('User-agent', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1')]        
html_content = br.open(URL_PAGE).read()

# soup
soup = BeautifulSoup(html_content, "html.parser")

#regex for extraction
cal_row_re  = re.compile(r'^calRow.*')             # <-- name
date_re     = re.compile(r'\w+\s?\d+:\d+')         # <-- date

#extracting events
CalEvents = soup.find_all(id=cal_row_re)

for singleEvent in CalEvents:
    date = singleEvent.find(text=date_re).strip()
    eventName = singleEvent.find(class_='noUnderline').get_text().strip()
    print(date, eventName, sep = ';')

Thank you in advance.


RE: Mechanize and BeautifulSoup read not correct hours - metulburr - Jan-11-2019

Why are you using regex and BeautifulSoup together? You shouldn't use regex for parsing web pages, and it makes even less sense when you are already using BeautifulSoup.

Quote:The problem is that the hours in the HTML source retrieved by mechanize differ from what I see in my browser: every time is shifted by -1 hour. I think it might depend on some local configuration on my system (I live in Italy and the site might use another time zone).
Sites usually default to GMT. If you log in and set the time zone of your account, you will get different times from those shown to guests (and this program does not appear to log in). GMT is one hour behind Italy's time zone, so you probably just need to log in to your account to get the proper GMT+1 times for Italy. Or just add an hour to each time after parsing it.
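
The "add an hour" idea could be sketched roughly as follows. This is only an illustration: it assumes the scraped time is a plain HH:MM string in GMT and that the date is known separately. The helper name shift_to_local and the sample values are hypothetical, and a fixed +1 offset ignores daylight saving time.

```python
from datetime import datetime, timedelta


def shift_to_local(day, hhmm, offset_hours=1):
    """Combine a YYYY-MM-DD date with an HH:MM time and shift it by a fixed offset.

    Hypothetical helper: the real page may use a different date/time format.
    """
    gmt = datetime.strptime(day + " " + hhmm, "%Y-%m-%d %H:%M")
    return gmt + timedelta(hours=offset_hours)


print(shift_to_local("2019-01-11", "09:30"))  # 2019-01-11 10:30:00
print(shift_to_local("2019-01-11", "23:30"))  # 2019-01-12 00:30:00
```

Note the second call: an event late in the GMT day lands on the next calendar day in local time, so shifting the hour alone can move events across a day boundary.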


RE: Mechanize and BeautifulSoup read not correct hours - vaeVictis - Jan-12-2019

Using regex makes sense here, since it searches for matches in tag attributes once the tags are determined.

I am actually trying to see what happens if I log in to the site. I'll keep you informed.

Manipulating the hour I get is not the solution, because the code completely skips the events that occur in the first hour of the day.
I lose information.


RE: Mechanize and BeautifulSoup read not correct hours - metulburr - Jan-12-2019

(Jan-12-2019, 08:16 AM)vaeVictis Wrote: Using regex makes sense here, since it searches for matches in tag attributes once the tags are determined.

Sorry, I mixed threads up. If you're just compiling a pattern to pass to BeautifulSoup, you're fine. Although you can still do it with BeautifulSoup alone, without compiling a regex.
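
As a rough sketch of the regex-free approach, BeautifulSoup accepts a function as an attribute filter. The markup below is a hypothetical, simplified stand-in for the real page, and is_cal_row is an illustrative name, not something from the original code.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup; the real calendar page is more complex.
SAMPLE_HTML = """
<table>
  <tr id="calRow1"><td>09:30</td><td><a class="noUnderline">Event A</a></td></tr>
  <tr id="calRow2"><td>10:00</td><td><a class="noUnderline">Event B</a></td></tr>
  <tr id="other"><td>ignored</td></tr>
</table>
"""


def is_cal_row(value):
    # BeautifulSoup passes the attribute value, or None when the tag has no id.
    return value is not None and value.startswith("calRow")


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
rows = soup.find_all(id=is_cal_row)
names = [row.find(class_="noUnderline").get_text().strip() for row in rows]
print(names)  # ['Event A', 'Event B']
```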

(Jan-11-2019, 09:38 AM)vaeVictis Wrote: import regex as re
Now that I see it again: what module are you using for regex? The standard library module is re, not regex. So my next questions are: 1) are you sure the regex is correct, and 2) is it actually extracting what you expect it to?
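
One quick way to answer both questions is to run the pattern against a few sample strings using the standard re module. The sample strings here are hypothetical; the real page's cell text may be formatted differently.

```python
import re

date_re = re.compile(r'\w+\s?\d+:\d+')  # pattern from the original post

# Hypothetical cell texts, not taken from the real page.
samples = ["Jan 11 09:30", "09:30", "All Day"]

results = {}
for text in samples:
    match = date_re.search(text)
    results[text] = match.group() if match else None
    print(repr(text), "->", results[text])
# 'Jan 11 09:30' -> 11 09:30
# '09:30' -> 09:30
# 'All Day' -> None
```

On these samples the pattern never captures the month name, only the digits next to the time. As far as I know, that does not matter when the pattern is passed to find(), because BeautifulSoup only uses it as a filter and returns the whole matching string.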

(Jan-12-2019, 08:16 AM)vaeVictis Wrote: I am actually trying to see what happens if I log in to the site. I'll keep you informed.
I think this would fix your issue.


RE: Mechanize and BeautifulSoup read not correct hours - vaeVictis - Jan-15-2019

Logging in actually fixed the problem, though I used selenium instead of mechanize.
For now, the regex module works. I'll replace it with the standard re module (I'm still coding in Python 2.7 for this project).
By the way, the patterns correctly match what I'm looking for.

Thanks for your support.


RE: Mechanize and BeautifulSoup read not correct hours - metulburr - Jan-15-2019

If selenium works where mechanize did not, then maybe JavaScript was a part of producing the correct time. There is definitely JavaScript on that site: hovering over the time zones changes their class names, and even clicking them requires JavaScript.

Either way, glad you figured a method out.