OK,
I'm working on an idea. It's not quite there yet and I need some sleep (about 4 hours), but I'll present what I have so far
and you can play with it.
The data in the institution file seems to be corrupted and will probably have to be edited.
For now, I have excluded it from the load loop.
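If you want to see where the institution file goes wrong before editing it by hand, something like the following might help. It is only a sketch: `find_bad_lines` is a name I made up, it assumes one JSON object per line (as the README says), and I have not confirmed what is actually wrong with the file.

```python
import json


def find_bad_lines(lines):
    """Return (line_number, error_message) for each line that is not valid JSON."""
    bad = []
    for lineno, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # blank lines are skipped, not reported
        try:
            json.loads(line)
        except json.JSONDecodeError as err:
            bad.append((lineno, str(err)))
    return bad


# Made-up sample: the second entry is deliberately truncated
sample = ['{"pred": "p", "sub": "s"}', '{"pred": "p", "sub": ', '']
for lineno, msg in find_bad_lines(sample):
    print(lineno, msg)
```

To run it on the real data, open 'data/20130403-institution.json' with encoding="utf8" and pass the file object to `find_bad_lines` in place of `sample`.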
import json
from collections import namedtuple, defaultdict


class DynamicNestedDict(defaultdict):
    def __init__(self, *args, **kwargs):
        super(DynamicNestedDict, self).__init__(DynamicNestedDict, *args, **kwargs)

    def set_from_list(self, keylist, value):
        level = self
        for key in keylist[:-1]:
            level = level[key]
        level[keylist[-1]] = value

    def get_from_list(self, keylist):
        level = self
        for key in keylist[:-1]:
            level = level[key]
        return level[keylist[-1]]

class LoadDicts:
    '''
LoadDicts - Class which loads all json data into dictionaries
License:
This data set is made available under a Creative Commons License:
http://creativecommons.org/licenses/by-sa/3.0/
You are free:
to Share — to copy, distribute and transmit the work
to Remix — to adapt the work
to make commercial use of the work
Under the following conditions:
Attribution — You must attribute the work in the manner specified by the author or licensor (but not in any way that suggests that they endorse you or your use of the work).
Share Alike — If you alter, transform, or build upon this work, you may distribute the resulting work only under the same or similar license to this one.
With the understanding that:
Waiver — Any of the above conditions can be waived if you get permission from the copyright holder.
Public Domain — Where the work or any of its elements is in the public domain under applicable law, that status is in no way affected by the license.
Other Rights — In no way are any of the following rights affected by the license:
- Your fair dealing or fair use rights, or other applicable copyright exceptions and limitations;
- The author's moral rights;
- Rights other persons may have either in the work itself or in how the work is used, such as publicity or privacy rights.
Notice — For any reuse or distribution, you must make clear to others the license terms of this work. The best way to do this is with a link to this web page.
http://creativecommons.org/licenses/by-sa/3.0/
Introduction
=============
This relation extraction corpus consists of snippets from wikipedia
annotated with rater judgments recording whether they are evidence that
indicates a relation between two entities.
There are two files. 20130403-place_of_birth.json contains 9566 judgments
of evidences concerning place of birth, and 20130403-institution.json contains
42628 judgments of evidences concerning attending or graduating from an
institution.
The files are in JSON format. Each line is a triple with the following fields:
    pred: predicate of a triple
    sub: subject of a triple
    obj: object of a triple
    evidences: an array of evidences for this triple
        url: the web page from which this evidence was obtained
        snippet: short piece of text supporting the triple
    judgments: an array of judgements from human annotators
        rator: hash code of the identity of the annotator
        judgment: judgement of the annotator. It can take the values "yes" or "no"
The software class LoadDicts was written by Larz60+ python-forum.io
Enjoy!
'''
    def __init__(self):
        self.institution = []
        self.place_of_birth = []
        self.date_of_birth = []
        self.education = []
        self.place_of_death = []
        self.dictkeys = ['pred', 'sub', 'obj', 'evidences', 'url', 'snippet', 'judgments', 'rator', 'judgment']
        self.jsonlist = {
            'institution': {
                'filename': 'data/20130403-institution.json',
                'dict': self.institution
            },
            'place_of_birth': {
                'filename': 'data/20130403-place_of_birth.json',
                'dict': self.place_of_birth
            },
            'date_of_birth': {
                'filename': 'data/20131104-date_of_birth.json',
                'dict': self.date_of_birth
            },
            'education': {
                'filename': 'data/20131104-education-degree.json',
                'dict': self.education
            },
            'place_of_death': {
                'filename': 'data/20131104-place_of_death.json',
                'dict': self.place_of_death
            }
        }
        self.data_help = {
            'pred': 'predicate of a triple',
            'sub': 'subject of a triple',
            'obj': 'object of a triple',
            'evidences': 'an array of evidences for this triple',
            'url': 'the web page from which this evidence was obtained',
            'snippet': 'short piece of text supporting the triple',
            'judgments': 'an array of judgements from human annotators',
            'rator': 'hash code of the identity of the annotator',
            'judgment': 'judgement of the annotator. It can take the values "yes" or "no"'
        }

        # Load the files in alphabetical order of their keys
        for key in sorted(self.jsonlist):
            # The institution file appears to be corrupted, so skip it for now
            if key == 'institution':
                continue
            jfilename = self.jsonlist[key]['filename']
            # Still a plain list at this point; each item is one raw JSON line
            records = self.jsonlist[key]['dict']
            print(jfilename)
            with open(jfilename, 'r', encoding="utf8") as f:
                for line in f:
                    records.append(line)


def try_dynamic_nested_dict():
    keylist = ['bind', 'application level', 'binding']
    loc = [20, 96, 100, 101, 102, 104, 105, 115, 193, 434, 546]
    dd = DynamicNestedDict()
    dd.set_from_list(keylist, loc)
    result = dd.get_from_list(keylist)
    print("dd['bind']['application level']['binding']: {}".format(result))


if __name__ == '__main__':
    try_dynamic_nested_dict()
    ld = LoadDicts()

The goal is to get everything loaded into nested dictionaries dynamically.
The code for the dynamic dict is there, with a test, but it is not hooked into the loader yet.
The lists (stored under the 'dict' key in jsonlist, because dictionaries are what they will become):
- self.institution = []
- self.place_of_birth = []
- self.date_of_birth = []
- self.education = []
- self.place_of_death = []
are getting loaded now, and each cell contains one dictionary entry (still as a raw JSON string);
these need to get loaded as dynamic dictionaries.
I believe that to do so, each of the lists above needs to be redefined like
self.institution = DynamicNestedDict()
and then loaded using the set_from_list method,
with dictkeys and the corresponding values from the current list (such as self.place_of_birth),
on an item-by-item basis. You can use the 'try_dynamic_nested_dict' function as a guide,
where loc will be an entry from any of the lists.
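Here is a rough sketch of what that loading step could look like. It is not finished code: `load_nested` and its `top_keys` parameter are names I invented for illustration, and the key path [sub, obj, field] is just one possible layout, so change it to whatever nesting you actually want. It assumes each line parses with json.loads and has 'sub' and 'obj' fields, as the README describes.

```python
import json
from collections import defaultdict


class DynamicNestedDict(defaultdict):
    """Same class as in the listing above, repeated so this runs on its own."""
    def __init__(self, *args, **kwargs):
        super().__init__(DynamicNestedDict, *args, **kwargs)

    def set_from_list(self, keylist, value):
        level = self
        for key in keylist[:-1]:
            level = level[key]
        level[keylist[-1]] = value

    def get_from_list(self, keylist):
        level = self
        for key in keylist[:-1]:
            level = level[key]
        return level[keylist[-1]]


def load_nested(lines, top_keys=('pred', 'sub', 'obj', 'evidences', 'judgments')):
    """Build a DynamicNestedDict from raw JSON lines.

    Hypothetical layout: nested[sub][obj][field] = value.
    """
    nested = DynamicNestedDict()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        record = json.loads(line)
        for field in top_keys:
            if field in record:
                keylist = [record['sub'], record['obj'], field]
                nested.set_from_list(keylist, record[field])
    return nested


# Made-up record in the shape the README describes (not real corpus data)
sample = [
    '{"pred": "/people/person/place_of_birth", "sub": "/m/abc", "obj": "/m/xyz", '
    '"evidences": [{"url": "http://example.org", "snippet": "some text"}], '
    '"judgments": [{"rator": "r1", "judgment": "yes"}]}'
]
place_of_birth = load_nested(sample)
print(place_of_birth.get_from_list(['/m/abc', '/m/xyz', 'judgments']))
```

In LoadDicts, the same function could be fed the raw lines that are currently appended to each list. Note that only the top-level fields go through set_from_list here; the items inside 'evidences' and 'judgments' stay as the ordinary lists of dicts that json.loads returns.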