Python Forum
ReGex With Python
#11
(Oct-21-2016, 10:37 PM)snippsat Wrote: Now the source is bytes; add .decode('utf-8')
and you will see a cleaner result.
Python 3.x needs this to get a normal string.
In Python 3.x all strings are sequences of Unicode characters, unless they are bytes.

That does not work. See:

from urllib.request import urlopen

# 'find' is the compiled regex pattern defined earlier in the script
connect_to = urlopen("https://instagram.com/p/BL1rrSQDu48")
reading = connect_to.read()
filter1 = find.findall(reading.decode('utf-8'))
print(filter1)
#12
Not on the regex result; decode the whole source:
reading = connect_to.read().decode('utf-8')
But you should use Requests; then you get the correct encoding that the source uses.
import requests

url = 'https://instagram.com/p/BL1rrSQDu48'
url_get = requests.get(url)
#print(url_get.text) # All source
print(url_get.encoding) # ISO-8859-1
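A minimal sketch of then running the regex on the decoded text (the pattern below is only a placeholder; swap in whatever your actual 'find' regex is):

import re
import requests

url = 'https://instagram.com/p/BL1rrSQDu48'
url_get = requests.get(url)

# Placeholder pattern; replace with the real 'find' regex from your script.
find = re.compile(r'<title>(.*?)</title>', re.DOTALL)
matches = find.findall(url_get.text)  # .text is already a decoded str
print(matches)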
#13
(Oct-21-2016, 11:13 PM)snippsat Wrote: Not on the regex result; decode the whole source:
reading = connect_to.read().decode('utf-8')
But you should use Requests; then you get the correct encoding that the source uses.
import requests

url = 'https://instagram.com/p/BL1rrSQDu48'
url_get = requests.get(url)
#print(url_get.text) # All source
print(url_get.encoding) # ISO-8859-1

I tried it, and I get an error:

Traceback (most recent call last):
  File "instagram.py", line 89, in <module>
    connect()
  File "instagram.py", line 69, in connect
    print(url_get.text) # All source
  File "E:\Programs\lib\encodings\cp850.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_map)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\x80' in position 9364: character maps to <undefined>
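That traceback comes from the Windows console code page (cp850), not from Requests itself: print() has to encode the text for the console, and cp850 cannot represent some of the characters. A minimal sketch of one way to sidestep it, assuming you only need to inspect the source, is to write the text to a UTF-8 file instead of printing it:

import requests

url = 'https://instagram.com/p/BL1rrSQDu48'
url_get = requests.get(url)

# Write the page to a UTF-8 file so the console code page never has to encode it.
with open('source.html', 'w', encoding='utf-8') as f:
    f.write(url_get.text)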
#14
Use Requests; then you will never use urllib again.
I have tested the code I posted above, and it works.
#15
(Oct-21-2016, 11:31 PM)snippsat Wrote: Use Requests; then you will never use urllib again.
I have tested the code I posted above, and it works.

I have tested it again, and it does not work...
print(url_get.encoding) # ISO-8859-1
This line works, but the next one does not:
print(url_get.text) # All source
I keep getting the same error:
return codecs.charmap_encode(input,self.errors,encoding_map)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\x80' in position 9364: character maps to <undefined>

(Oct-21-2016, 11:31 PM)snippsat Wrote: Use Requests; then you will never use urllib again.
I have tested the code I posted above, and it works.

Hey! That works, but only after running this command in the Windows cmd:
Output:
chcp 65001
#16
Strange, this is my output.
I am using Python 3.4 and:
>>> requests.__version__
'2.9.1'
Try:
print(url_get.content)
But then it is back to bytes again, I guess?
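As a rough sketch of that route (the UTF-8 assumption is mine; the server declares ISO-8859-1, so check which encoding the page really uses), .content can be decoded by hand:

import requests

url = 'https://instagram.com/p/BL1rrSQDu48'
url_get = requests.get(url)

# .content is raw bytes; decode it explicitly instead of relying on the
# declared encoding (ISO-8859-1 here), assuming the page is actually UTF-8.
html = url_get.content.decode('utf-8')
print(type(html))  # <class 'str'>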
#17
(Oct-21-2016, 11:59 PM)snippsat Wrote: Strange, this is my output.
I am using Python 3.4 and:
>>> requests.__version__
'2.9.1'
Try:
print(url_get.content)
But then it is back to bytes again, I guess?

The code works, but I still have the problem of getting the text and the caption.

See:
http://pastebin /XxqbzBAQ (add .com)
#18
(Oct-21-2016, 11:48 PM)Kalet Wrote: Hey! That works, but only after running this command in the Windows cmd:
Yes, I get the same in cmd, which has always been broken when it comes to Unicode,
so don't use it for output in cases like this.
#19
(Oct-22-2016, 12:07 AM)snippsat Wrote:
(Oct-21-2016, 11:48 PM)Kalet Wrote: Hey! That works, but only after running this command in the Windows cmd:
Yes, I get the same in cmd, which has always been broken when it comes to Unicode,
so don't use it for output in cases like this.

Oh, that's great!
But I still have the problem of getting the text and the caption.

See:
http://pastebin /XxqbzBAQ (add .com)
#20
(Oct-22-2016, 12:15 AM)Kalet Wrote: See:
http://pastebin /XxqbzBAQ (add .com)
You should be able to post links now; the restriction should only apply to your first post.

Look through the source, because the data may have changed by now,
so the regex may no longer be valid. What do you want to get out of it?
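A rough sketch of one way to pull a caption out of the source, assuming the page exposes it in an og:description meta tag (that markup is my assumption, so check the current source and adjust the pattern):

import re
import requests

url = 'https://instagram.com/p/BL1rrSQDu48'
html = requests.get(url).text

# Assumed markup: an og:description meta tag holding the caption text.
match = re.search(r'<meta property="og:description" content="(.*?)"', html)
if match:
    print(match.group(1))
else:
    print('Pattern not found; inspect the source and adjust the regex.')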

