Posts: 161
Threads: 36
Joined: Jun 2018
I use bs4 to get all the script tags from a web page like this:
soup = BeautifulSoup(page.content, 'html.parser')
images = soup.find_all('script', type='text/javascript')
Unfortunately, this seems to return just a single-element array with all the script tags in it.
One of the script tags contains some JS that includes some JSON about an image. Here's the JSON:
Output: {"id":339385563,"hash":"cLQ8lVI","account_id":"117321139","account_url":"SocialKeked","title":"Noice!","score":795,"starting_score":0,"virality":18864.156525,"size":37779342,"views":"170307","is_hot":true,"is_album":false,"album_cover":null,"album_cover_width":0,"album_cover_height":0,"mimetype":"image\/gif","ext":".gif","width":728,"height":408,"animated":true,"looping":true,"ups":737,"downs":27,"points":710,"reddit":null,"description":"","bandwidth":"5.85 TB","timestamp":"2019-12-19 12:46:19","hot_datetime":"2019-12-19 16:32:01","gallery_datetime":"2019-12-19 12:45:40","in_gallery":true,"section":"","tags":["0","0"],"subtype":null,"spam":"0","pending":"0","comment_count":115,"nsfw":false,"topic":"No Topic","topic_id":29,"meme_name":null,"meme_top":null,"meme_bottom":null,"prefer_video":true,"video_source":"https:\/\/img-9gag-fun.9cache.com\/photo\/ad5OR4N_460sv.mp4","video_host":"img-9gag-fun.9cache.com","num_images":1,"platform":null,"readonly":false,"ad_type":0,"ad_url":"","weight":-1,"favorite_count":173,"processing":{"status":"completed"},"galleryTags":[{"id":"197940815","hash":"cLQ8lVI","account_id":"117321139","tag_id":"547","display":"football","ups":"0","downs":"0","score":"0","timestamp":"2019-12-19 12:46:19","blocked":"0","tag":"football","subscribers":"16415","images":"11097","background_hash":"dMdNvgJ","thumbnail_hash":null,"spam":"0","nsfw":"0","is_promoted":"0","animated":"0","thumbnail_animated":null,"metadata":{"tag_id":"547","title":null,"description":"touchdoooowwwwnnn!","logo_hash":null,"logo_destination_url":null,"is_promoted":"0","accent":"a88680"},"image":{"animated":"0"},"thumbnail":{"animated":null}},{"id":"197940811","hash":"cLQ8lVI","account_id":"117321139","tag_id":"1024","display":"awesome","ups":"0","downs":"0","score":"0","timestamp":"2019-12-19 
12:46:19","blocked":"0","tag":"awesome","subscribers":"981004","images":"756530","background_hash":"4kmYoey","thumbnail_hash":null,"spam":"0","nsfw":"0","is_promoted":"0","animated":"0","thumbnail_animated":null,"metadata":{"tag_id":"1024","title":null,"description":"neat and amazing","logo_hash":null,"logo_destination_url":null,"is_promoted":"0","accent":"8472BD"},"image":{"animated":"0"},"thumbnail":{"animated":null}}],"favorited":false,"adConfig":{"safeFlags":["in_gallery","sixth_mod_safe","gallery"],"highRiskFlags":[],"unsafeFlags":[],"wallUnsafeFlags":[],"showsAds":true},"vote":null},
group : null,
comment_sort : 'best',
comment_id : '',
captionsEnabled : true,
onTheFlyThreshold : 10485760,
galleryTitle : 'Imgur: The magic of the Internet',
votedFavedRecently: false,
tagSectionIsPromoted: false,
lastModLog: null,
});
All I need to get is "nsfw":false from that JSON. I would use something like find(), but 'nsfw' occurs in other places within the script tags.
I did some research, and everything I found used some sort of regex to get data from a script tag.
I have tried creating my own regex for this, but honestly, I didn't get that far. Regexes are pretty much black magic to me.
What regex should I use for this (if any)?
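One possible answer, as a minimal sketch: the top-level value is a bare JSON boolean, while the nested galleryTags entries use the quoted string "0", so a pattern that only accepts true or false after the colon skips the nested occurrences. The `sample` string is a shortened stand-in for the script text, and `NSFW_RE` is just an illustrative name.

```python
import re

# Accept only a bare boolean after the key, so "nsfw":"0" never matches.
NSFW_RE = re.compile(r'"nsfw"\s*:\s*(true|false)\b')

# Shortened stand-in for the script text (assumption: not the full page).
sample = (
    '{"comment_count":115,"nsfw":false,"topic":"No Topic",'
    '"galleryTags":[{"spam":"0","nsfw":"0"}]}'
)

match = NSFW_RE.search(sample)
# Convert the captured JS literal into a Python bool; None if the key is absent.
nsfw = (match.group(1) == "true") if match else None
print(nsfw)
```

Using a capture group avoids the later replace()/split() clean-up, since only the value itself is captured.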
Posts: 161
Threads: 36
Joined: Jun 2018
Dec-19-2019, 08:36 PM
(This post was last modified: Dec-19-2019, 08:37 PM by DreamingInsanity.)
I have almost got one working - [\"-\"]\b(nsfw)\b\"[^\"]+[$,]
This is it normally:
![[Image: vEJG8Wf.png]](https://i.imgur.com/vEJG8Wf.png)
However, if there is another comma later on, not separated by a quotation mark, this happens:
![[Image: ol11lrf.png]](https://i.imgur.com/ol11lrf.png)
I need it to stop searching after it has found the first comma - how do I do this?
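One way to make the match stop at the first comma, sketched under the assumption that the value itself never contains a comma: add the comma to the negated character class so the greedy + cannot run past it. Inside [...] the $ is a literal dollar sign rather than an end-of-string anchor, so it is dropped here. The `text` string is made up to reproduce the over-match.

```python
import re

# Made-up input: an unquoted comma follows later on the same line.
text = '"nsfw":false, group : null, "topic":"No Topic"'

# The pattern from the post above, unchanged.
original = re.compile(r'[\"-\"]\b(nsfw)\b\"[^\"]+[$,]')
# Excluding ',' from the class forces the value to end at the first comma.
tightened = re.compile(r'"nsfw"[^",]+,')

print(original.search(text).group(0))   # runs on to the last comma before a quote
print(tightened.search(text).group(0))  # stops at the first comma
```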
Posts: 8,151
Threads: 160
Joined: Sep 2016
Can you show what the script tag looks like?
Posts: 161
Threads: 36
Joined: Jun 2018
The full script tag is this:
Output: </script>, <script type="text/javascript">
(function(widgetFactory) {
widgetFactory.mergeConfig('gallery', {
account_url : 'SocialKeked',
favs_account_url : null,
sort : 'viral',
section : 'hot',
window : 'day',
tag : null,
isHotImage : '1',
hash : 'cLQ8lVI',
baseURL : decodeURIComponent('%2Fgallery'),
page : 0,
isPro : false,
searchQuery : '',
advSearch : null,
isRandom : false,
safe_tags : true,
hasAccess : false,
inGallery : false,
hashes : null,
image : {"id":339385563,"hash":"cLQ8lVI","account_id":"117321139","account_url":"SocialKeked","title":"Noice!","score":795,"starting_score":0,"virality":18864.156525,"size":37779342,"views":"170307","is_hot":true,"is_album":false,"album_cover":null,"album_cover_width":0,"album_cover_height":0,"mimetype":"image\/gif","ext":".gif","width":728,"height":408,"animated":true,"looping":true,"ups":737,"downs":27,"points":710,"reddit":null,"description":"","bandwidth":"5.85 TB","timestamp":"2019-12-19 12:46:19","hot_datetime":"2019-12-19 16:32:01","gallery_datetime":"2019-12-19 12:45:40","in_gallery":true,"section":"","tags":["0","0"],"subtype":null,"spam":"0","pending":"0","comment_count":115,"nsfw":false,"topic":"No Topic","topic_id":29,"meme_name":null,"meme_top":null,"meme_bottom":null,"prefer_video":true,"video_source":"https:\/\/img-9gag-fun.9cache.com\/photo\/ad5OR4N_460sv.mp4","video_host":"img-9gag-fun.9cache.com","num_images":1,"platform":null,"readonly":false,"ad_type":0,"ad_url":"","weight":-1,"favorite_count":173,"processing":{"status":"completed"},"galleryTags":[{"id":"197940815","hash":"cLQ8lVI","account_id":"117321139","tag_id":"547","display":"football","ups":"0","downs":"0","score":"0","timestamp":"2019-12-19 12:46:19","blocked":"0","tag":"football","subscribers":"16415","images":"11097","background_hash":"dMdNvgJ","thumbnail_hash":null,"spam":"0","nsfw":"0","is_promoted":"0","animated":"0","thumbnail_animated":null,"metadata":{"tag_id":"547","title":null,"description":"touchdoooowwwwnnn!","logo_hash":null,"logo_destination_url":null,"is_promoted":"0","accent":"a88680"},"image":{"animated":"0"},"thumbnail":{"animated":null}},{"id":"197940811","hash":"cLQ8lVI","account_id":"117321139","tag_id":"1024","display":"awesome","ups":"0","downs":"0","score":"0","timestamp":"2019-12-19 
12:46:19","blocked":"0","tag":"awesome","subscribers":"981004","images":"756530","background_hash":"4kmYoey","thumbnail_hash":null,"spam":"0","nsfw":"0","is_promoted":"0","animated":"0","thumbnail_animated":null,"metadata":{"tag_id":"1024","title":null,"description":"neat and amazing","logo_hash":null,"logo_destination_url":null,"is_promoted":"0","accent":"8472BD"},"image":{"animated":"0"},"thumbnail":{"animated":null}}],"favorited":false,"adConfig":{"safeFlags":["in_gallery","sixth_mod_safe","gallery"],"highRiskFlags":[],"unsafeFlags":[],"wallUnsafeFlags":[],"showsAds":true},"vote":null},
group : null,
comment_sort : 'best',
comment_id : '',
captionsEnabled : true,
onTheFlyThreshold : 10485760,
galleryTitle : 'Imgur: The magic of the Internet',
votedFavedRecently: false,
tagSectionIsPromoted: false,
lastModLog: null,
});
widgetFactory.mergeConfig('groups', {
groups: {
}
});
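Since the value after `image :` is itself JSON, an alternative to a regex on the key is to parse that object and read the key directly. This is a minimal sketch using the standard library's `json.JSONDecoder.raw_decode`, which parses one JSON value starting at a given index and ignores whatever follows it. The helper name `extract_image_config` and the shortened `script_text` are assumptions for illustration.

```python
import json
import re

def extract_image_config(script_text):
    """Return the object assigned to the `image` key as a dict, or None."""
    # Locate the opening brace that follows `image :` in the JS config.
    found = re.search(r'\bimage\s*:\s*(?=\{)', script_text)
    if found is None:
        return None
    # raw_decode parses exactly one JSON value and returns (value, end_index).
    obj, _end = json.JSONDecoder().raw_decode(script_text, found.end())
    return obj

# Shortened stand-in for the script shown above.
script_text = """
widgetFactory.mergeConfig('gallery', {
hashes : null,
image : {"id":339385563,"nsfw":false,"galleryTags":[{"tag":"football","nsfw":"0"}],"vote":null},
group : null,
});
"""

image = extract_image_config(script_text)
print(image["nsfw"])
```

With the object parsed, the top-level flag and the nested galleryTags flags are separate values, so there is no ambiguity about which nsfw is being read.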
Posts: 161
Threads: 36
Joined: Jun 2018
Dec-20-2019, 06:18 PM
(This post was last modified: Dec-20-2019, 06:18 PM by DreamingInsanity.)
It turns out, although it looked like one element containing all script tags, you could actually index it. This is the working solution:
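import re
import requests
from bs4 import BeautifulSoup

# Matches '"nsfw":<value>,' only when the value is unquoted (true/false)
pattern = re.compile(r'[\"]\b(nsfw)\b\"[^\"]+[$,]')
page = requests.get("https://i.imgur.com/7QHohZV.jpg")
soup = BeautifulSoup(page.content, 'html.parser')
# find_all() returns a list-like ResultSet; on this page, index 7 was the tag with the image JSON
data = soup.find_all('script', type='text/javascript')[7].string
b = re.search(pattern, data)
# Drop the trailing comma and keep the text after the colon: 'false' -> 'False'
nsfw = b.group(0).replace(',','').split(':')[-1].capitalize()
print(nsfw)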
>>>True / False
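The hard-coded index [7] depends on the page's current layout, and re.search() returns None when nothing matches. A slightly more defensive variant, sketched here with hypothetical helper names (`pick_script`, `parse_nsfw`), selects the script by a marker string and handles a missing match. It works on plain strings, so the list could come from `[tag.string for tag in soup.find_all('script', type='text/javascript')]`.

```python
import re

def pick_script(script_texts, marker="mergeConfig('gallery'"):
    """Return the first script body containing the marker, or None."""
    for text in script_texts:
        # tag.string is None for empty or src-only script tags, so skip those.
        if text and marker in text:
            return text
    return None

def parse_nsfw(script_text):
    """Return True/False for the top-level nsfw flag, or None if not found."""
    if script_text is None:
        return None
    found = re.search(r'"nsfw":(true|false)', script_text)
    return None if found is None else found.group(1) == "true"

# Stand-in for the extracted script bodies (assumption: shortened contents).
scripts = [
    None,
    "var x = 1;",
    "widgetFactory.mergeConfig('gallery', { image : {\"nsfw\":true,\"tags\":[{\"nsfw\":\"0\"}]}, });",
]
print(parse_nsfw(pick_script(scripts)))
```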