Python Forum
ReGex With Python
#21
(Oct-22-2016, 12:22 AM)snippsat Wrote:
(Oct-22-2016, 12:15 AM)Kalet Wrote: See:
http://pastebin /XxqbzBAQ (add .com)
You should be able to post links now; that restriction should only apply to the first post.

Look through the source, because the data may have changed by now.
So the regex may no longer be valid. And what do you want to extract?

Thanks...

Then:
"caption": "#Plebiscito 2016

{"text": "Por el simple hecho de dar tu nombre completo, es sencillo buscar tu n\u00famero de c\u00e9dula."

I need to extract all the text values (those in red) and the caption (also in red).

Reply
#22
You should try it yourself; here are some hints.
Output:
<script type="text/javascript">window._sharedData = all data inside here in JSON </script>
Regex:
print(re.findall(r'<script type="text/javascript">window._sharedData = (.*);</script>', data)[0])
Parse the matched string with Python's built-in JSON parser (it becomes a Python dictionary).
Then take out what you want.
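The two hints above (regex, then JSON parsing) can be tried without touching the network. This is only a minimal sketch: `sample_html` is a made-up, heavily trimmed stand-in for the real page source, and the names `pattern` and `shared_data` are chosen here for illustration.

```python
import json
import re

# Hypothetical, trimmed-down page source; the real page is far larger.
sample_html = (
    '<html><body>'
    '<script type="text/javascript">window._sharedData = '
    '{"entry_data": {"PostPage": [{"media": {"caption": "#Plebiscito 2016"}}]}};'
    '</script></body></html>'
)

# Same pattern as in the hint: capture everything between "= " and ";</script>".
pattern = r'<script type="text/javascript">window._sharedData = (.*);</script>'
match = re.search(pattern, sample_html)
if match is None:
    raise ValueError("window._sharedData script tag not found")

# json.loads() turns the captured JSON text into Python objects.
shared_data = json.loads(match.group(1))
print(type(shared_data).__name__)
print(list(shared_data))
```

Using `re.search` and checking for `None` is a small design choice: it gives a clear error when the page layout changes, whereas `re.findall(...)[0]` would raise a less descriptive `IndexError`.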
Reply
#23
(Oct-22-2016, 12:56 AM)snippsat Wrote: You should try it yourself; here are some hints.
Output:
<script type="text/javascript">window._sharedData = all data inside here in JSON </script>
Regex:
print(re.findall(r'<script type="text/javascript">window._sharedData = (.*);</script>', data)[0])
Parse the matched string with Python's built-in JSON parser.
Then take out what you want.

Thank you very much!

When you have the solution I'll post here... :D
Reply
#24
(Oct-22-2016, 01:04 AM)Kalet Wrote: When you have the solution I'll post here... :D
I know the solution; I don't need to test it out.
The point was for you to try to figure it out.

One more hint: post the JSON data in here.
Then you can see whether it is valid and get a better view of how it is structured.
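The same check can also be done locally. A minimal sketch, assuming the extracted JSON text is already in a string; `raw` below is a made-up sample and `pretty_json` is a hypothetical helper name, not something from the thread.

```python
import json

# Hypothetical JSON text standing in for the data extracted from the page.
raw = (
    '{"entry_data": {"PostPage": [{"media": '
    '{"caption": "#Plebiscito 2016", "comments": {"count": 6}}}]}}'
)


def pretty_json(text):
    """Return an indented rendering of text, or None if it is not valid JSON."""
    try:
        parsed = json.loads(text)
    except json.JSONDecodeError:
        return None
    # indent=2 shows the nesting; ensure_ascii=False keeps accented characters readable.
    return json.dumps(parsed, indent=2, ensure_ascii=False)


pretty = pretty_json(raw)
print(pretty if pretty is not None else "Not valid JSON")
```

In the indented output, `[` marks a list and `{` marks a dictionary, which makes it easier to see where an index such as `[0]` is needed.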
Reply
#25
(Oct-22-2016, 01:15 AM)snippsat Wrote:
(Oct-22-2016, 01:04 AM)Kalet Wrote: When you have the solution I'll post here... :D
I know the solution; I don't need to test it out.
The point was for you to try to figure it out.

One more hint: post the JSON data in here.
Then you can see whether it is valid and get a better view of how it is structured.


I know you know the solution, and yes, I want to try. Thanks, really.

I'll keep working on the solution, xD.
LOL

Are you the one who created this forum?

Reply
#26
Quote: Are you the one who created this forum?
We were a group of people on the old forum who decided to move to this forum.
metulburr created this forum and ran it before we decided to move.
I liked NodeBB, which I made a demo version of,
but I am really pleased with how this forum has turned out.
Reply
#27
(Oct-22-2016, 01:46 AM)snippsat Wrote:
Quote: Are you the one who created this forum?
We were a group of people on the old forum who decided to move to this forum.
metulburr created this forum and ran it before we decided to move.
I liked NodeBB, which I made a demo version of,
but I am really pleased with how this forum has turned out.

I found this forum by chance, but the users are very helpful. Once I have solved all this, I will do my part on this forum, because it is clear that a lot of effort went into it. :D
PS: Sorry for my English, it is very bad; I speak Spanish.
Reply
#28
(Oct-22-2016, 01:46 AM)snippsat Wrote:
Quote: Are you the one who created this forum?
We were a group of people on the old forum who decided to move to this forum.
metulburr created this forum and ran it before we decided to move.
I liked NodeBB, which I made a demo version of,
but I am really pleased with how this forum has turned out.

I tried:

import json
import re

import requests

url = "https://www.instagram.com/p/BLExlG_gs9M/"
url_get = requests.get(url)
# print(url_get.text)  # full page source
# Capture the JSON text assigned to window._sharedData
raw = re.findall(r'<script type="text/javascript">window._sharedData = (.*);</script>', url_get.text)[0]
data = json.loads(raw)
# Print the list stored under entry_data -> PostPage
print(data["entry_data"]["PostPage"])
Output:
[{'media': {'comments_disabled': False, 'location': None, 'is_video': False, 'likes': {'count': 735, 'nodes': [{'user': {'username': 'luisfey_tm', 'id': '1467220529', 'profile_pic_url': 'https://igcdn-photos-f-a.akamaihd.net/hphotos-ak-xpa1/t51.2885-19/s150x150/14723629_1706527029667493_2750772870568214528_a.jpg'}}, {'user': {'username': 'wendylineth21', 'id': '3901636916', 'profile_pic_url': 'https://igcdn-photos-d-a.akamaihd.net/hphotos-ak-xpa1/t51.2885-19/s150x150/14727520_1371614739517187_4451579541827092480_a.jpg'}}, {'user': {'username': 'vaneyiseth', 'id': '905640633', 'profile_pic_url': 'https://igcdn-photos-e-a.akamaihd.net/hphotos-ak-xpa1/t51.2885-19/s150x150/12317947_444661005731956_146114981_a.jpg'}}, {'user': {'username': 'kesofiia', 'id': '1206442330', 'profile_pic_url': 'https://igcdn-photos-g-a.akamaihd.net/hphotos-ak-xpa1/t51.2885-19/s150x150/14701167_1301507009894718_7730435821307691008_a.jpg'}}, {'user': {'username': 'fergi130885', 'id': '3829079506', 'profile_pic_url': 'https://igcdn-photos-h-a.akamaihd.net/hphotos-ak-xpa1/t51.2885-19/s150x150/14052264_564103313776023_453051203_a.jpg'}}, {'user': {'username': 'astridjasil', 'id': '3661209399', 'profile_pic_url': 'https://igcdn-photos-f-a.akamaihd.net/hphotos-ak-xpa1/t51.2885-19/s150x150/13671943_312970385717957_289304673_a.jpg'}}, {'user': {'username': 'laurubio_29', 'id': '1419151225', 'profile_pic_url': 'https://igcdn-photos-c-a.akamaihd.net/hphotos-ak-xpa1/t51.2885-19/s150x150/13734527_302746513408498_1533830342_a.jpg'}}, {'user': {'username': 'obras_blancas', 'id': '1697214155', 'profile_pic_url': 'https://igcdn-photos-a-a.akamaihd.net/hphotos-ak-xpa1/t51.2885-19/s150x150/14448303_1034705766647048_5355164386581282816_a.jpg'}}, {'user': {'username': 'rebellious_oficial', 'id': '1405468900', 'profile_pic_url': 'https://igcdn-photos-b-a.akamaihd.net/hphotos-ak-xpa1/t51.2885-19/s150x150/14693809_1144934862267689_1402297549309607936_a.jpg'}}, {'user': {'username': 'roberconsul', 'id': 
'2127270975', 'profile_pic_url': 'https://igcdn-photos-f-a.akamaihd.net/hphotos-ak-xpa1/t51.2885-19/s150x150/14294751_1909360602624925_1191931093_a.jpg'}}], 'viewer_has_liked': False}, 'display_src': 'https://igcdn-photos-a-a.akamaihd.net/hphotos-ak-xpa1/t51.2885-15/e35/14474448_381736542214624_4830854127913271296_n.jpg?ig_cache_key=MTM1MjQyMzg0MjUyNTY2MzA1Mg%3D%3D.2', 'dimensions': {'width': 1080, 'height': 1080}, 'caption_is_edited': False, 'usertags': {'nodes': []}, 'is_ad': False, 'code': 'BLExlG_gs9M', 'owner': {'username': 'youngfelprefe', 'is_private': False, 'blocked_by_viewer': False, 'followed_by_viewer': False, 'requested_by_viewer': False, 'id': '331844759', 'is_unpublished': False, 'has_blocked_viewer': False, 'full_name': 'Young F El Efecto F', 'profile_pic_url': 'https://igcdn-photos-b-a.akamaihd.net/hphotos-ak-xpa1/t51.2885-19/s150x150/14482775_197434037351945_7374535478038495232_a.jpg'}, 'caption': '#Plebiscito 2016', 'id': '1352423842525663052', 'comments': {'count': 6, 'nodes': [{'created_at': 1475442983.0, 'id': '17862897478024941', 'user': {'username': 'luisfelipetv', 'id': '2298791058', 'profile_pic_url': 'http://scontent-icn1-1.cdninstagram.com/t51.2885-19/11906329_960233084022564_1448528159_a.jpg'}, 'text': '@youngfelprefe listo ya voto tambn bien mijo★'}, {'created_at': 1475443748.0, 'id': '17862897862024941', 'user': {'username': 'omeganr', 'id': '202000611', 'profile_pic_url': 'https://igcdn-photos-h-a.akamaihd.net/hphotos-ak-xpa1/t51.2885-19/s150x150/14156414_1079735282112695_1636583007_a.jpg'}, 'text': 'Si'}, {'created_at': 1475446598.0, 'id': '17862899284024941', 'user': {'username': 'nandocolombia', 'id': '479496344', 'profile_pic_url': 'https://igcdn-photos-f-a.akamaihd.net/hphotos-ak-xpa1/t51.2885-19/s150x150/14676778_1715076945484909_6612390138339131392_a.jpg'}, 'text': '?\U0001f3fb'}, {'created_at': 1475447612.0, 'id': '17862899803024941', 'user': {'username': 'lauspath', 'id': '261070560', 'profile_pic_url': 
'https://igcdn-photos-e-a.akamaihd.net/hphotos-ak-xpa1/t51.2885-19/s150x150/14727396_1693767414179196_7348884820950253568_a.jpg'}, 'text': 'Por el NOO te conozco tanto jajajajaaj'}, {'created_at': 1475448945.0, 'id': '17862900988024941', 'user': {'username': 'jhonny_reyes23', 'id': '3021395284', 'profile_pic_url': 'https://igcdn-photos-h-a.akamaihd.net/hphotos-ak-xpa1/t51.2885-19/s150x150/14553081_161587474301255_7594124386945204224_a.jpg'}, 'text': 'Primo Jaja ??'}, {'created_at': 1475519921.0, 'id': '17862953830024941', 'user': {'username': 'jjargel', 'id': '646190648', 'profile_pic_url': 'https://igcdn-photos-h-a.akamaihd.net/hphotos-ak-xpa1/t51.2885-19/s150x150/13381051_2033323503560343_1374158480_a.jpg'}, 'text': 'Por el simple hecho de dar tu nombre completo, es sencillo buscar tu número de cédula.'}], 'page_info': {'has_next_page': False, 'start_cursor': None, 'has_previous_page': False, 'end_cursor': None}}, 'date': 1475441507}}]
How could I filter this last block?
Reply
#29
import requests
import re
import json

url = "https://www.instagram.com/p/BLExlG_gs9M/"
url_get = requests.get(url)
source = url_get.text
data_json = re.findall(r'<script type="text/javascript">window._sharedData = (.*);</script>', source)[0]
data = json.loads(data_json)
Use it:
>>> data['entry_data']['PostPage'][0]['media']['caption']
'#Plebiscito 2016'
json.loads() returns a Python dictionary.
This dictionary contains a mix of nested dictionaries and lists.
Here ['PostPage'] holds a list,
so [0] is needed to get the content inside that list before navigating further.
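One way to find out where an index like [0] is needed is to check the type at each level. A minimal sketch, using a made-up miniature of the parsed structure; only the key names are taken from the output shown earlier in the thread.

```python
# Hypothetical miniature of the dictionary returned by json.loads().
data = {"entry_data": {"PostPage": [{"media": {"caption": "#Plebiscito 2016"}}]}}

# 'PostPage' holds a list, so it is indexed by position, not by key.
post_pages = data["entry_data"]["PostPage"]
print(type(post_pages).__name__)

# The first list element is a dictionary again, so key lookups work from here.
first_post = post_pages[0]
print(type(first_post).__name__)

caption = first_post["media"]["caption"]
print(caption)
```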
Reply
#30
(Oct-22-2016, 04:46 PM)snippsat Wrote:
import requests
import re
import json

url = "https://www.instagram.com/p/BLExlG_gs9M/"
url_get = requests.get(url)
source = url_get.text
data_json = re.findall(r'<script type="text/javascript">window._sharedData = (.*);</script>', source)[0]
data = json.loads(data_json)
Use it:
>>> data['entry_data']['PostPage'][0]['media']['caption']
'#Plebiscito 2016'
json.loads() returns a Python dictionary.
This dictionary contains a mix of nested dictionaries and lists.
Here ['PostPage'] holds a list,
so [0] is needed to get the content inside that list before navigating further.
Oh, I understand.
I will try with the comments (text):
print(data['entry_data']['PostPage'][0]['media']['comments'])
# print(data['entry_data']['PostPage'][0]['media']['comments']['nodes']['text'])  # Error
Output:
{'page_info': {'has_next_page': False, 'start_cursor': None, 'end_cursor': None, 'has_previous_page': False}, 'nodes': [{'text': '@youngfelprefe listo ya voto tambn bien mijo★', 'user': {'profile_pic_url': 'http://scontent-lax3-1.cdninstagram.com/t51.2885-19/11906329_960233084022564_1448528159_a.jpg', 'id': '2298791058', 'username': 'luisfelipetv'}, 'id': '17862897478024941', 'created_at': 1475442983.0}, {'text': 'Si', 'user': {'profile_pic_url': 'https://igcdn-photos-h-a.akamaihd.net/hphotos-ak-xpa1/t51.2885-19/s150x150/14156414_1079735282112695_1636583007_a.jpg', 'id': '202000611', 'username': 'omeganr'}, 'id': '17862897862024941', 'created_at': 1475443748.0}, {'text': '?\U0001f3fb', 'user': {'profile_pic_url': 'https://igcdn-photos-f-a.akamaihd.net/hphotos-ak-xpa1/t51.2885-19/s150x150/14676778_1715076945484909_6612390138339131392_a.jpg', 'id': '479496344', 'username': 'nandocolombia'}, 'id': '17862899284024941', 'created_at': 1475446598.0}, {'text': 'Por el NOO te conozco tanto jajajajaaj', 'user': {'profile_pic_url': 'https://igcdn-photos-f-a.akamaihd.net/hphotos-ak-xpa1/t51.2885-19/s150x150/14736230_1778075029077069_7479698963461832704_a.jpg', 'id': '261070560', 'username': 'lauspath'}, 'id': '17862899803024941', 'created_at': 1475447612.0}, {'text': 'Primo Jaja ??', 'user': {'profile_pic_url': 'https://igcdn-photos-h-a.akamaihd.net/hphotos-ak-xpa1/t51.2885-19/s150x150/14553081_161587474301255_7594124386945204224_a.jpg', 'id': '3021395284', 'username': 'jhonny_reyes23'}, 'id': '17862900988024941', 'created_at': 1475448945.0}, {'text': 'Por el simple hecho de dar tu nombre completo, es sencillo buscar tu número de cédula.', 'user': {'profile_pic_url': 'https://igcdn-photos-h-a.akamaihd.net/hphotos-ak-xpa1/t51.2885-19/s150x150/13381051_2033323503560343_1374158480_a.jpg', 'id': '646190648', 'username': 'jjargel'}, 'id': '17862953830024941', 'created_at': 1475519921.0}], 'count': 6}
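In the output above, 'nodes' is a list of dictionaries, which would explain the error: a list cannot be indexed with the string 'text'. A minimal sketch of one way to collect every comment text, using a made-up two-comment subset of that output; `comments` and `texts` are names chosen here for illustration.

```python
# Hypothetical subset of the 'comments' dictionary printed above.
comments = {
    "count": 2,
    "nodes": [
        {"text": "Si", "user": {"username": "omeganr"}},
        {"text": "Por el NOO te conozco tanto jajajajaaj", "user": {"username": "lauspath"}},
    ],
}

# Indexing the list with a string fails, because lists only take integer positions.
try:
    comments["nodes"]["text"]
except TypeError as exc:
    print("Error:", exc)

# Loop over the list instead and read 'text' from each comment dictionary.
texts = [node["text"] for node in comments["nodes"]]
for text in texts:
    print(text)
```

With the real data, the same loop would presumably run over data['entry_data']['PostPage'][0]['media']['comments']['nodes'].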
Reply

