Python Forum
ReGex With Python - Printable Version



RE: ReGex With Python - Kalet - Oct-22-2016

(Oct-22-2016, 12:22 AM)snippsat Wrote:
(Oct-22-2016, 12:15 AM)Kalet Wrote: See:
http://pastebin /XxqbzBAQ (add .com)
You should be able to post links now; the restriction should only apply to your first post.

Look through the source, because the data may have changed by now.
So the regex may no longer be valid. And what do you want to get out?

Thanks...

Then:
"caption": "#Plebiscito 2016

{"text": "Por el simple hecho de dar tu nombre completo, es sencillo buscar tu n\u00famero de c\u00e9dula."

I need to extract all the text values and the caption (the parts shown in red).




RE: ReGex With Python - snippsat - Oct-22-2016

You should try it yourself; here are some hints.
Output:
<script type="text/javascript">window._sharedData = all data inside here in json </script>
Regex
print(re.findall(r'<script type="text/javascript">window._sharedData = (.*);</script>', data)[0])
Parse the JSON with Python's built-in JSON parser (it becomes a Python dictionary).
Then take out what you want.
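
The hints above can be sketched on a small, made-up HTML string instead of the live page. The sample markup, the constant names, and the helper extract_shared_data are assumptions for illustration; the real page source is much larger and may have changed. This sketch also escapes the dot and uses a non-greedy group, which is a small variation on the regex in the post.

```python
import json
import re

# Hypothetical, shortened page source; the real page contains far more data.
SAMPLE_HTML = (
    '<html><body>'
    '<script type="text/javascript">window._sharedData = '
    '{"entry_data": {"PostPage": [{"media": {"caption": "#Plebiscito 2016"}}]}};'
    '</script>'
    '</body></html>'
)

# Capture everything between "window._sharedData = " and ";</script>".
PATTERN = re.compile(
    r'<script type="text/javascript">window\._sharedData = (.*?);</script>',
    re.DOTALL,
)


def extract_shared_data(html):
    """Return the embedded JSON as a Python dict, or None if it is not found."""
    match = PATTERN.search(html)
    if match is None:
        return None
    return json.loads(match.group(1))


shared = extract_shared_data(SAMPLE_HTML)
print(type(shared).__name__)
print(sorted(shared.keys()))
```

Returning None when nothing matches avoids the IndexError that re.findall(...)[0] raises if the page layout changes.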


RE: ReGex With Python - Kalet - Oct-22-2016


Thank you very much!.

When you have the solution I'll post here... :D



RE: ReGex With Python - snippsat - Oct-22-2016

(Oct-22-2016, 01:04 AM)Kalet Wrote: When you have the solution I'll post here... :D
I know the solution; I don't need to test it out.
The point was for you to try to figure it out.

One more hint: post the JSON data in here.
Then you can see whether it is valid, and its structure is easier to read.
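
The same check can also be done locally. This is only a sketch, assuming a short made-up JSON string in place of the real page data; the helper name check_and_format is hypothetical. It uses json.loads to test validity and json.dumps with indent to make the nesting visible.

```python
import json

# Hypothetical snippet standing in for the JSON taken from the page.
raw = (
    '{"entry_data": {"PostPage": [{"media": {"caption": "#Plebiscito 2016", '
    '"comments": {"count": 6, "nodes": []}}}]}}'
)


def check_and_format(text):
    """Return indented JSON if text parses, otherwise a short error message."""
    try:
        parsed = json.loads(text)
    except json.JSONDecodeError as exc:
        return "Invalid JSON: {}".format(exc.msg)
    # ensure_ascii=False keeps accented characters readable in the output.
    return json.dumps(parsed, indent=2, ensure_ascii=False)


print(check_and_format(raw))
print(check_and_format('{"caption": '))
```

In the indented output, every [ marks a list and every { marks a dictionary, which shows where an index such as [0] is needed.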


RE: ReGex With Python - Kalet - Oct-22-2016



I know you know the solution, and yes, I want to try. Thanks, really.

I'll keep working on the solution and post it here once I have it.
LOL

Are you the one who created this forum?




RE: ReGex With Python - snippsat - Oct-22-2016

Quote: Are you the one who created this forum?
We were a group of people from the old forum who decided on this forum.
metulburr created this forum and ran it before we decided to move.
I liked NodeBB, which I made a demo version of,
but I am really pleased with how this forum has turned out now.


RE: ReGex With Python - Kalet - Oct-22-2016


I found this forum by chance, but the users are very helpful. Once I have solved all this, I will do my part on this forum, because it is clear that a lot of effort went into it. :D
PS: Sorry for my English, it is very bad; I speak Spanish...


RE: ReGex With Python - Kalet - Oct-22-2016


I tried:

import json
import re

import requests

url = "https://www.instagram.com/p/BLExlG_gs9M/"
url_get = requests.get(url)
# print(url_get.text)  # All source
# Pull the JSON assigned to window._sharedData out of the page source
a = re.findall(r'<script type="text/javascript">window._sharedData = (.*);</script>', url_get.text)[0]
data = json.loads(a)
# Print the list stored under "PostPage"
print(data["entry_data"]["PostPage"])
Output:
[{'media': {'comments_disabled': False, 'location': None, 'is_video': False, 'likes': {'count': 735, 'nodes': [{'user': {'username': 'luisfey_tm', 'id': '1467220529', 'profile_pic_url': 'https://igcdn-photos-f-a.akamaihd.net/hphotos-ak-xpa1/t51.2885-19/s150x150/14723629_1706527029667493_2750772870568214528_a.jpg'}}, {'user': {'username': 'wendylineth21', 'id': '3901636916', 'profile_pic_url': 'https://igcdn-photos-d-a.akamaihd.net/hphotos-ak-xpa1/t51.2885-19/s150x150/14727520_1371614739517187_4451579541827092480_a.jpg'}}, {'user': {'username': 'vaneyiseth', 'id': '905640633', 'profile_pic_url': 'https://igcdn-photos-e-a.akamaihd.net/hphotos-ak-xpa1/t51.2885-19/s150x150/12317947_444661005731956_146114981_a.jpg'}}, {'user': {'username': 'kesofiia', 'id': '1206442330', 'profile_pic_url': 'https://igcdn-photos-g-a.akamaihd.net/hphotos-ak-xpa1/t51.2885-19/s150x150/14701167_1301507009894718_7730435821307691008_a.jpg'}}, {'user': {'username': 'fergi130885', 'id': '3829079506', 'profile_pic_url': 'https://igcdn-photos-h-a.akamaihd.net/hphotos-ak-xpa1/t51.2885-19/s150x150/14052264_564103313776023_453051203_a.jpg'}}, {'user': {'username': 'astridjasil', 'id': '3661209399', 'profile_pic_url': 'https://igcdn-photos-f-a.akamaihd.net/hphotos-ak-xpa1/t51.2885-19/s150x150/13671943_312970385717957_289304673_a.jpg'}}, {'user': {'username': 'laurubio_29', 'id': '1419151225', 'profile_pic_url': 'https://igcdn-photos-c-a.akamaihd.net/hphotos-ak-xpa1/t51.2885-19/s150x150/13734527_302746513408498_1533830342_a.jpg'}}, {'user': {'username': 'obras_blancas', 'id': '1697214155', 'profile_pic_url': 'https://igcdn-photos-a-a.akamaihd.net/hphotos-ak-xpa1/t51.2885-19/s150x150/14448303_1034705766647048_5355164386581282816_a.jpg'}}, {'user': {'username': 'rebellious_oficial', 'id': '1405468900', 'profile_pic_url': 'https://igcdn-photos-b-a.akamaihd.net/hphotos-ak-xpa1/t51.2885-19/s150x150/14693809_1144934862267689_1402297549309607936_a.jpg'}}, {'user': {'username': 'roberconsul', 'id': 
'2127270975', 'profile_pic_url': 'https://igcdn-photos-f-a.akamaihd.net/hphotos-ak-xpa1/t51.2885-19/s150x150/14294751_1909360602624925_1191931093_a.jpg'}}], 'viewer_has_liked': False}, 'display_src': 'https://igcdn-photos-a-a.akamaihd.net/hphotos-ak-xpa1/t51.2885-15/e35/14474448_381736542214624_4830854127913271296_n.jpg?ig_cache_key=MTM1MjQyMzg0MjUyNTY2MzA1Mg%3D%3D.2', 'dimensions': {'width': 1080, 'height': 1080}, 'caption_is_edited': False, 'usertags': {'nodes': []}, 'is_ad': False, 'code': 'BLExlG_gs9M', 'owner': {'username': 'youngfelprefe', 'is_private': False, 'blocked_by_viewer': False, 'followed_by_viewer': False, 'requested_by_viewer': False, 'id': '331844759', 'is_unpublished': False, 'has_blocked_viewer': False, 'full_name': 'Young F El Efecto F', 'profile_pic_url': 'https://igcdn-photos-b-a.akamaihd.net/hphotos-ak-xpa1/t51.2885-19/s150x150/14482775_197434037351945_7374535478038495232_a.jpg'}, 'caption': '#Plebiscito 2016', 'id': '1352423842525663052', 'comments': {'count': 6, 'nodes': [{'created_at': 1475442983.0, 'id': '17862897478024941', 'user': {'username': 'luisfelipetv', 'id': '2298791058', 'profile_pic_url': 'http://scontent-icn1-1.cdninstagram.com/t51.2885-19/11906329_960233084022564_1448528159_a.jpg'}, 'text': '@youngfelprefe listo ya voto tambn bien mijo★'}, {'created_at': 1475443748.0, 'id': '17862897862024941', 'user': {'username': 'omeganr', 'id': '202000611', 'profile_pic_url': 'https://igcdn-photos-h-a.akamaihd.net/hphotos-ak-xpa1/t51.2885-19/s150x150/14156414_1079735282112695_1636583007_a.jpg'}, 'text': 'Si'}, {'created_at': 1475446598.0, 'id': '17862899284024941', 'user': {'username': 'nandocolombia', 'id': '479496344', 'profile_pic_url': 'https://igcdn-photos-f-a.akamaihd.net/hphotos-ak-xpa1/t51.2885-19/s150x150/14676778_1715076945484909_6612390138339131392_a.jpg'}, 'text': '?\U0001f3fb'}, {'created_at': 1475447612.0, 'id': '17862899803024941', 'user': {'username': 'lauspath', 'id': '261070560', 'profile_pic_url': 
'https://igcdn-photos-e-a.akamaihd.net/hphotos-ak-xpa1/t51.2885-19/s150x150/14727396_1693767414179196_7348884820950253568_a.jpg'}, 'text': 'Por el NOO te conozco tanto jajajajaaj'}, {'created_at': 1475448945.0, 'id': '17862900988024941', 'user': {'username': 'jhonny_reyes23', 'id': '3021395284', 'profile_pic_url': 'https://igcdn-photos-h-a.akamaihd.net/hphotos-ak-xpa1/t51.2885-19/s150x150/14553081_161587474301255_7594124386945204224_a.jpg'}, 'text': 'Primo Jaja ??'}, {'created_at': 1475519921.0, 'id': '17862953830024941', 'user': {'username': 'jjargel', 'id': '646190648', 'profile_pic_url': 'https://igcdn-photos-h-a.akamaihd.net/hphotos-ak-xpa1/t51.2885-19/s150x150/13381051_2033323503560343_1374158480_a.jpg'}, 'text': 'Por el simple hecho de dar tu nombre completo, es sencillo buscar tu número de cédula.'}], 'page_info': {'has_next_page': False, 'start_cursor': None, 'has_previous_page': False, 'end_cursor': None}}, 'date': 1475441507}}]
How can I filter this last block?



RE: ReGex With Python - snippsat - Oct-22-2016

import requests
import re
import json

url = "https://www.instagram.com/p/BLExlG_gs9M/"
url_get = requests.get(url)
source = url_get.text
data_json = re.findall(r'<script type="text/javascript">window._sharedData = (.*);</script>', source)[0]
data = json.loads(data_json)
Use it:
>>> data['entry_data']['PostPage'][0]['media']['caption']
'#Plebiscito 2016'
json.loads() gives back a Python dictionary.
This dictionary contains a mix of nested dictionaries and lists.
Here ['PostPage'] holds a list,
so [0] is needed to get the content inside that list and continue navigating.
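
The navigation described above can be tried step by step on a small dictionary. The data below is a hypothetical, trimmed-down copy of the structure shown in the thread, and the names post_pages and first_post are only for illustration.

```python
# Hypothetical, trimmed-down copy of the structure shown in the thread.
data = {
    "entry_data": {
        "PostPage": [
            {"media": {"caption": "#Plebiscito 2016", "likes": {"count": 735}}}
        ]
    }
}

post_pages = data["entry_data"]["PostPage"]
print(type(post_pages).__name__)   # the value under 'PostPage' is a list

first_post = post_pages[0]         # [0] selects the first item of that list
print(type(first_post).__name__)   # that item is a dictionary again

# From here on, plain dictionary keys are enough.
print(first_post["media"]["caption"])
print(first_post["media"]["likes"]["count"])
```

Checking type() at each level is a quick way to tell whether the next step needs a key or an index.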


RE: ReGex With Python - Kalet - Oct-22-2016

Oh, I understand.
I will try with the comments (text):
print(data['entry_data']['PostPage'][0]['media']['comments'])
#print(data['entry_data']['PostPage'][0]['media']['comments']['nodes']['text'])  # Error
Output:
{'page_info': {'has_next_page': False, 'start_cursor': None, 'end_cursor': None, 'has_previous_page': False}, 'nodes': [{'text': '@youngfelprefe listo ya voto tambn bien mijo★', 'user': {'profile_pic_url': 'http://scontent-lax3-1.cdninstagram.com/t51.2885-19/11906329_960233084022564_1448528159_a.jpg', 'id': '2298791058', 'username': 'luisfelipetv'}, 'id': '17862897478024941', 'created_at': 1475442983.0}, {'text': 'Si', 'user': {'profile_pic_url': 'https://igcdn-photos-h-a.akamaihd.net/hphotos-ak-xpa1/t51.2885-19/s150x150/14156414_1079735282112695_1636583007_a.jpg', 'id': '202000611', 'username': 'omeganr'}, 'id': '17862897862024941', 'created_at': 1475443748.0}, {'text': '?\U0001f3fb', 'user': {'profile_pic_url': 'https://igcdn-photos-f-a.akamaihd.net/hphotos-ak-xpa1/t51.2885-19/s150x150/14676778_1715076945484909_6612390138339131392_a.jpg', 'id': '479496344', 'username': 'nandocolombia'}, 'id': '17862899284024941', 'created_at': 1475446598.0}, {'text': 'Por el NOO te conozco tanto jajajajaaj', 'user': {'profile_pic_url': 'https://igcdn-photos-f-a.akamaihd.net/hphotos-ak-xpa1/t51.2885-19/s150x150/14736230_1778075029077069_7479698963461832704_a.jpg', 'id': '261070560', 'username': 'lauspath'}, 'id': '17862899803024941', 'created_at': 1475447612.0}, {'text': 'Primo Jaja ??', 'user': {'profile_pic_url': 'https://igcdn-photos-h-a.akamaihd.net/hphotos-ak-xpa1/t51.2885-19/s150x150/14553081_161587474301255_7594124386945204224_a.jpg', 'id': '3021395284', 'username': 'jhonny_reyes23'}, 'id': '17862900988024941', 'created_at': 1475448945.0}, {'text': 'Por el simple hecho de dar tu nombre completo, es sencillo buscar tu número de cédula.', 'user': {'profile_pic_url': 'https://igcdn-photos-h-a.akamaihd.net/hphotos-ak-xpa1/t51.2885-19/s150x150/13381051_2033323503560343_1374158480_a.jpg', 'id': '646190648', 'username': 'jjargel'}, 'id': '17862953830024941', 'created_at': 1475519921.0}], 'count': 6}
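
The commented-out line fails because 'nodes' holds a list of comment dictionaries, not a single dictionary. A minimal sketch of one way to collect each comment's text follows, using a shortened, hand-copied sample of the output above; the variable names comments and texts are assumptions for illustration.

```python
# Hypothetical, shortened copy of the 'comments' value printed above.
comments = {
    "count": 3,
    "nodes": [
        {"text": "Si", "user": {"username": "omeganr"}},
        {"text": "Por el NOO te conozco tanto jajajajaaj",
         "user": {"username": "lauspath"}},
        {"text": "Primo Jaja ??", "user": {"username": "jhonny_reyes23"}},
    ],
}

# 'nodes' is a list, so comments["nodes"]["text"] raises a TypeError.
# Looping over the list reaches each comment dictionary in turn.
texts = [node["text"] for node in comments["nodes"]]
for text in texts:
    print(text)
```

With the real data, the same loop would run over data['entry_data']['PostPage'][0]['media']['comments']['nodes'].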