Python Forum

Full Version: extract specific data from a group of json-files
Hello everyone!

I have a large collection of json-files (a few thousand), each containing metadata about a text post, such as the post-ID, the username (and full name, if made public by the user), timestamp and so on. I would like to extract this information from each file without having to do so manually, but I am not yet familiar enough with Python to figure out how to do this (I have only been able to follow one course so far, and unfortunately it isn't useful in this case: it only covered basic calculations in Python).
Another problem I have is that some of the files will contain several sets of different data with the same name when someone commented on the post. However, I only need this information about the main post (thus the first time this information appears in the file).

Does anyone have any idea how I might be able to extract this information?
Thank you so much in advance!

Kind regards
First you should know the data structure of the json file.
You can investigate it by opening the file with Python and calling json.load() on the open file object.

import json

with open('your_data0001.json') as fd:
    data = json.load(fd)
Usually json data is a dictionary with subdictionaries and lists.
To show the keys, you can use list(data.keys()).
If you find the right key, you can dig deeper.

For example, if you have the key 'metadata', accessing it is very easy:
data['metadata']
The value of 'metadata' could be a list or a dict or something else (int, float, str).
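To make the "dig deeper" step less manual, here is a minimal sketch of a helper that summarises the shape of the parsed data a couple of levels down. The function name describe and the sample content are made up for illustration; your files will have different keys.

```python
import json

def describe(value, depth=0, max_depth=2):
    """Return indented lines summarising the shape of parsed JSON."""
    indent = "  " * depth
    lines = []
    if isinstance(value, dict):
        for key, child in value.items():
            lines.append(f"{indent}{key}: {type(child).__name__}")
            if depth < max_depth:
                lines.extend(describe(child, depth + 1, max_depth))
    elif isinstance(value, list) and value:
        # Only look at the first item; list items usually share one shape
        lines.append(f"{indent}[list of {len(value)}, first item: {type(value[0]).__name__}]")
        if depth < max_depth:
            lines.extend(describe(value[0], depth + 1, max_depth))
    return lines

# Hypothetical sample standing in for the content of one file
sample = json.loads(
    '{"metadata": {"id": 1, "owner": {"username": "janedoe"}}, "comments": [{"id": 2}]}'
)
print("\n".join(describe(sample)))
```

In practice you would pass the data you got from json.load() instead of the sample, and raise max_depth if the interesting keys sit deeper.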


After you know the structure, you can write a transformer function for it.
It should transform the json data into the form you want to have.

This is just an example and does not have to fit your data.
def transformer(mapping_from_json):
    """
    A generator which takes a mapping (dict)
    and yields name, age, active
    """
    # Assumes 'metadata' is a list of lists of dicts; adapt this to your structure
    for items in mapping_from_json['results']['metadata']:
        for element in items:
            name = element.get('name', 'NO NAME')
            age = element.get('age', 0)
            active = element.get('active', False)
            yield (name, age, active)
I used a generator because the logic is easier to follow. Any function that contains a yield returns a generator when you call it, and iterating over that generator produces the elements one at a time.

To consume it, pass it to anything that accepts an iterable, or use a for loop.
list(transformer(my_dict))
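For the original question (a few thousand files, and only the first occurrence of each field), something along these lines might work. It is only a sketch: the helper names find_first and extract_folder are mine, and the key names in the example are guesses about what the files contain.

```python
import json
from pathlib import Path

def find_first(value, key):
    """Return the first value stored under `key`.

    A dict's own keys are checked before descending into its children,
    so a field of the main post wins over the same field in nested comments.
    Returns None if the key is not found.
    """
    if isinstance(value, dict):
        if key in value:
            return value[key]
        children = value.values()
    elif isinstance(value, list):
        children = value
    else:
        return None
    for child in children:
        found = find_first(child, key)
        if found is not None:
            return found
    return None

def extract_folder(folder, keys):
    """Yield one dict per .json file in `folder` with the first value for each key."""
    for path in sorted(Path(folder).glob("*.json")):
        with open(path, encoding="utf-8") as fd:
            data = json.load(fd)
        row = {"file": path.name}
        for key in keys:
            row[key] = find_first(data, key)
        yield row

# Hypothetical data: the same key appears in a comment and in the main post
sample = {"comments": [{"username": "someone_else"}], "username": "janedoe"}
print(find_first(sample, "username"))  # prints janedoe
```

You could then loop over extract_folder('path/to/your/folder', ['id', 'username', 'timestamp']) and print each row or write it to a CSV file. If the main post's fields are nested deeper than the comments' fields in your files, this search order will pick the wrong one, so check a few results by hand.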
Hi @DeaD_EyE! Thank you for helping me!
Where do I need to put my file to open it? Right now it's in my Documents folder, but when I tried to open it, I received the following error:
Error:
---------------------------------------------------------------------------
FileNotFoundError                         Traceback (most recent call last)
<ipython-input-1-b22700e301c2> in <module>
      1 import json
      2
----> 3 with open('2015-05-14_16-35-57_UTC_janedoe.json') as fd:
      4     data=jsonloafs(fd)

FileNotFoundError: [Errno 2] No such file or directory: '2015-05-14_16-35-57_UTC_janedoe.json'
Either give the full path to the file, or put it in the current working directory. That is the directory Python (or the notebook) was started from, which is often, but not always, the directory where the script resides.
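A small sketch of how you might check this, using pathlib from the standard library. The helper name load_json is made up, and the path in the usage note assumes your Documents folder sits directly under your home directory, which may not be true on your system.

```python
import json
from pathlib import Path

# Relative file names are looked up relative to this directory
print(Path.cwd())

def load_json(path):
    """Load a JSON file, with a more helpful message if it is missing."""
    path = Path(path).expanduser()  # turns a leading '~' into the home directory
    if not path.is_file():
        raise FileNotFoundError(
            f"{path} not found; current working directory is {Path.cwd()}"
        )
    with open(path, encoding="utf-8") as fd:
        return json.load(fd)
```

With that in place, load_json('~/Documents/2015-05-14_16-35-57_UTC_janedoe.json') would look in the Documents folder no matter where the notebook was started.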

(Dec-05-2019, 09:40 AM) DeaD_EyE wrote: Usually json data is a dictionary with subdictionaries and lists.

I don't think it's reasonable to assume that. You can have an array as the top level thing, for example.
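If you are unsure which case you have, one option is to check the type after loading and normalise it. A minimal sketch, where iter_records is a made-up helper name:

```python
import json

def iter_records(data):
    """Yield dict records whether the top level is an object or an array."""
    if isinstance(data, dict):
        # Top level is a single object
        yield data
    elif isinstance(data, list):
        # Top level is an array; skip items that are not objects
        for item in data:
            if isinstance(item, dict):
                yield item
    else:
        raise TypeError(f"unexpected top-level JSON type: {type(data).__name__}")

print(list(iter_records(json.loads('[{"a": 1}, 2, {"b": 2}]'))))
print(list(iter_records(json.loads('{"a": 1}'))))
```

Code further down can then treat every file the same way, whichever form the top level takes.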