Python Forum
Regex to retrieve data from json in script tag.
#1
I use bs4 to get all the script tags from a web page like this:
soup = BeautifulSoup(page.content, 'html.parser')
images = soup.find_all('script', type='text/javascript')
Unfortunately, this returns just a single-element array with all the script tags in it.
In one of the script tags is some JS that includes some JSON about an image. Here's the JSON:
All I need to get is "nsfw":false from that JSON. I would use something like find(), but 'nsfw' occurs in other places within the script tags.
I did some research, and everything I found used some sort of regex to get data from a script tag.
I have tried creating my own regex for this but, honestly, I didn't get far. Regexes are pretty much black magic to me.

What regex should I use for this (if any)?
#2
I have almost got one working - [\"-\"]\b(nsfw)\b\"[^\"]+[$,]
This is it normally:
[Image: vEJG8Wf.png]
However, if there is another comma later on, not separated by a quotation mark, this happens:
[Image: ol11lrf.png]
I need it to stop searching after it has found the first comma. How do I do this?
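One way to stop at the first comma is to capture with a negated character class that excludes the comma itself, so the match cannot run past it. The snippet below is only a sketch: the sample string is a made-up fragment, and the pattern is a suggested alternative, not the one posted above. Note also that inside a character class such as [$,] the $ is a literal dollar sign, not an end-of-string anchor.

```python
import re

# Hypothetical fragment modelled on the kind of text found in the script tag.
sample = '"comment_count":115,"nsfw":false,"topic":"No Topic","topic_id":29'

# [^,}]+ consumes characters up to (but not including) the next comma or
# closing brace, so the capture ends at the first comma after the key.
pattern = re.compile(r'"nsfw":([^,}]+)')

match = pattern.search(sample)
value = match.group(1) if match else None
print(value)  # false
```

Because the comma is excluded from the class, a later comma that is not preceded by a quotation mark no longer stretches the match.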
#3
Can you show what the script tag looks like?
#4
The full script tag is this:
Output:
</script>, <script type="text/javascript"> (function(widgetFactory) { widgetFactory.mergeConfig('gallery', { account_url : 'SocialKeked', favs_account_url : null, sort : 'viral', section : 'hot', window : 'day', tag : null, isHotImage : '1', hash : 'cLQ8lVI', baseURL : decodeURIComponent('%2Fgallery'), page : 0, isPro : false, searchQuery : '', advSearch : null, isRandom : false, safe_tags : true, hasAccess : false, inGallery : false, hashes : null, image : {"id":339385563,"hash":"cLQ8lVI","account_id":"117321139","account_url":"SocialKeked","title":"Noice!","score":795,"starting_score":0,"virality":18864.156525,"size":37779342,"views":"170307","is_hot":true,"is_album":false,"album_cover":null,"album_cover_width":0,"album_cover_height":0,"mimetype":"image\/gif","ext":".gif","width":728,"height":408,"animated":true,"looping":true,"ups":737,"downs":27,"points":710,"reddit":null,"description":"","bandwidth":"5.85 TB","timestamp":"2019-12-19 12:46:19","hot_datetime":"2019-12-19 16:32:01","gallery_datetime":"2019-12-19 12:45:40","in_gallery":true,"section":"","tags":["0","0"],"subtype":null,"spam":"0","pending":"0","comment_count":115,"nsfw":false,"topic":"No Topic","topic_id":29,"meme_name":null,"meme_top":null,"meme_bottom":null,"prefer_video":true,"video_source":"https:\/\/img-9gag-fun.9cache.com\/photo\/ad5OR4N_460sv.mp4","video_host":"img-9gag-fun.9cache.com","num_images":1,"platform":null,"readonly":false,"ad_type":0,"ad_url":"","weight":-1,"favorite_count":173,"processing":{"status":"completed"},"galleryTags":[{"id":"197940815","hash":"cLQ8lVI","account_id":"117321139","tag_id":"547","display":"football","ups":"0","downs":"0","score":"0","timestamp":"2019-12-19 
12:46:19","blocked":"0","tag":"football","subscribers":"16415","images":"11097","background_hash":"dMdNvgJ","thumbnail_hash":null,"spam":"0","nsfw":"0","is_promoted":"0","animated":"0","thumbnail_animated":null,"metadata":{"tag_id":"547","title":null,"description":"touchdoooowwwwnnn!","logo_hash":null,"logo_destination_url":null,"is_promoted":"0","accent":"a88680"},"image":{"animated":"0"},"thumbnail":{"animated":null}},{"id":"197940811","hash":"cLQ8lVI","account_id":"117321139","tag_id":"1024","display":"awesome","ups":"0","downs":"0","score":"0","timestamp":"2019-12-19 12:46:19","blocked":"0","tag":"awesome","subscribers":"981004","images":"756530","background_hash":"4kmYoey","thumbnail_hash":null,"spam":"0","nsfw":"0","is_promoted":"0","animated":"0","thumbnail_animated":null,"metadata":{"tag_id":"1024","title":null,"description":"neat and amazing","logo_hash":null,"logo_destination_url":null,"is_promoted":"0","accent":"8472BD"},"image":{"animated":"0"},"thumbnail":{"animated":null}}],"favorited":false,"adConfig":{"safeFlags":["in_gallery","sixth_mod_safe","gallery"],"highRiskFlags":[],"unsafeFlags":[],"wallUnsafeFlags":[],"showsAds":true},"vote":null}, group : null, comment_sort : 'best', comment_id : '', captionsEnabled : true, onTheFlyThreshold : 10485760, galleryTitle : 'Imgur: The magic of the Internet', votedFavedRecently: false, tagSectionIsPromoted: false, lastModLog: null, }); widgetFactory.mergeConfig('groups', { groups: { } });
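Since the value after "image :" in the output above looks like plain JSON, it may be possible to skip the regex and let the json module parse it. This is a rough sketch under that assumption: script_text is a shortened, made-up stand-in for the real script body, and extract_image and its marker parameter are hypothetical names used only for illustration.

```python
import json

# Hypothetical, shortened stand-in for the script text shown above.
script_text = (
    "widgetFactory.mergeConfig('gallery', { hash : 'cLQ8lVI', "
    'image : {"id":339385563,"nsfw":false,'
    '"galleryTags":[{"tag":"football","nsfw":"0"}],"vote":null}, '
    "group : null });"
)


def extract_image(text, marker="image :"):
    """Return the JSON object that follows `marker`, parsed into a dict."""
    # Locate the first opening brace after the marker.
    start = text.index("{", text.index(marker))
    # raw_decode parses exactly one JSON value beginning at `start` and
    # ignores whatever JavaScript comes after it.
    obj, _end = json.JSONDecoder().raw_decode(text, start)
    return obj


image = extract_image(script_text)
print(image["nsfw"])  # False
```

Reading the key from the parsed dict picks up only the top-level "nsfw", not the "nsfw":"0" entries nested inside galleryTags.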
#5
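It turns out that, although the result looked like one element containing all the script tags, you can actually index it. This is the working solution:
import re

import requests
from bs4 import BeautifulSoup

# Matches '"nsfw":<value>,'; [^\"]+ cannot cross a quotation mark.
pattern = re.compile(r'[\"]\b(nsfw)\b\"[^\"]+[$,]')

page = requests.get("https://i.imgur.com/7QHohZV.jpg")
soup = BeautifulSoup(page.content, 'html.parser')

# find_all() returns an indexable result; on this page the config script is at index 7.
data = soup.find_all('script', type='text/javascript')[7].string
b = re.search(pattern, data)
# Drop the trailing comma, keep the text after the colon, and capitalize it.
nsfw = b.group(0).replace(',', '').split(':')[-1].capitalize()
print(nsfw)

>>>True / False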
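The hard-coded index 7 will probably break if the page adds or removes a script tag. A possible alternative, sketched below, is to pick the script by its content. The pick_script helper, its marker default, and the sample list are all hypothetical; on the real page the list would come from something like [tag.string for tag in soup.find_all('script')].

```python
def pick_script(script_texts, marker="mergeConfig('gallery'"):
    """Return the first script body that contains `marker`, or None."""
    for text in script_texts:
        # tag.string can be None (for example, a script tag with no body),
        # so skip empty entries before testing for the marker.
        if text and marker in text:
            return text
    return None


# Hypothetical stand-ins for the .string values of the page's script tags.
script_texts = [
    None,
    "var analytics = {};",
    "widgetFactory.mergeConfig('gallery', { image : {\"nsfw\":false} });",
]

data = pick_script(script_texts)
print(data is not None)  # True
```

The returned text can then be passed to re.search exactly as in the solution above.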