Posts: 24
Threads: 7
Joined: Dec 2016
Hi everyone
I want to apply a supervised approach to learning relations between entities in a text. The relations are: place of birth and institution.
I'm working with this data: code.google.com/p/relation-extraction-corpus/downloads/list
My question is: What kind of features would you extract? And how would you implement this?
(At the end, I want to use the features as input to the logistic regression.)
Thanks a lot for any help and hint!
Posts: 12,041
Threads: 487
Joined: Sep 2016
What are your ideas so far?
Posts: 24
Threads: 7
Joined: Dec 2016
My ideas so far:
- tokenization (done)
- pos-tagging (done)
- chunking (done)
- token distance (how would you implement this?)
- appearance of words
- order of entities
But these are my ideas. I don't know if there would be better ones...
If you want to, I can send you the whole task / code.
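As a rough sketch of two of the ideas above (appearance of words and order of entities), something like the following could turn a snippet into numeric features for logistic regression. Everything here is an assumption for illustration: the regex tokenizer stands in for whatever tokenizer is already in place, the cue word list is made up, and the entity names are assumed to be available as plain strings (the sub and obj fields in the corpus may be identifiers rather than names, in which case they would need to be resolved first).

```python
import re


def tokenize(text):
    """Lowercase word tokens; a simple stand-in for the real tokenizer."""
    return re.findall(r"\w+", text.lower())


def find_span(tokens, phrase_tokens):
    """Return the index where phrase_tokens first occurs in tokens, or -1."""
    n = len(phrase_tokens)
    if n == 0:
        return -1
    for i in range(len(tokens) - n + 1):
        if tokens[i:i + n] == phrase_tokens:
            return i
    return -1


def extract_features(snippet, subject, obj,
                     cue_words=("born", "graduated", "studied")):
    """Build a dict of 0/1 features from one snippet.

    cue_words is a hypothetical list; choose words that fit the relation.
    """
    tokens = tokenize(snippet)
    sub_pos = find_span(tokens, tokenize(subject))
    obj_pos = find_span(tokens, tokenize(obj))

    # Appearance of words: 1 if the cue word occurs in the snippet.
    features = {"has_" + word: int(word in tokens) for word in cue_words}

    # Order of entities: 1 if both were found and the subject comes first.
    both_found = sub_pos >= 0 and obj_pos >= 0
    features["both_found"] = int(both_found)
    features["subject_first"] = int(both_found and sub_pos < obj_pos)
    return features


example = extract_features(
    "Albert Einstein was born in Ulm in 1879.", "Albert Einstein", "Ulm")
print(example)
```

The resulting dictionaries can be converted to fixed-order lists of numbers before being passed to a classifier.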
Posts: 12,041
Threads: 487
Joined: Sep 2016
OK,
I'm working on an idea. It's not quite there yet and I need some sleep (about 4 hours), but I'll present what I have so far
and you can play with it.
The data in the institution file seems to be corrupted and will probably have to be edited.
For now, I have excluded it from the load loop.
import json
from collections import namedtuple, defaultdict


class DynamicNestedDict(defaultdict):
    def __init__(self, *args, **kwargs):
        super(DynamicNestedDict, self).__init__(DynamicNestedDict, *args, **kwargs)

    def set_from_list(self, keylist, value):
        level = self
        for key in keylist[:-1]:
            level = level[key]
        level[keylist[-1]] = value

    def get_from_list(self, keylist):
        level = self
        for key in keylist[:-1]:
            level = level[key]
        return level[keylist[-1]]


class LoadDicts:
    '''
LoadDicts - Class which loads all json data into dictionaries
License:
This data set is made available under a Creative Commons License:
http://creativecommons.org/licenses/by-sa/3.0/
You are free:
to Share — to copy, distribute and transmit the work
to Remix — to adapt the work
to make commercial use of the work
Under the following conditions:
Attribution — You must attribute the work in the manner specified by the author or licensor (but not in any way that suggests that they endorse you or your use of the work).
Share Alike — If you alter, transform, or build upon this work, you may distribute the resulting work only under the same or similar license to this one.
With the understanding that:
Waiver — Any of the above conditions can be waived if you get permission from the copyright holder.
Public Domain — Where the work or any of its elements is in the public domain under applicable law, that status is in no way affected by the license.
Other Rights — In no way are any of the following rights affected by the license:
- Your fair dealing or fair use rights, or other applicable copyright exceptions and limitations;
- The author's moral rights;
- Rights other persons may have either in the work itself or in how the work is used, such as publicity or privacy rights.
Notice — For any reuse or distribution, you must make clear to others the license terms of this work. The best way to do this is with a link to this web page.
http://creativecommons.org/licenses/by-sa/3.0/
Introduction
=============
This relation extraction corpus consists of snippets from wikipedia
annotated with rater judgments recording whether they are evidence that
indicates a relation between two entities.
There are two files. 20130403-place_of_birth.json contains 9566 judgments
of evidences concerning place of birth, and 20130403-institution.json contains
42628 judgments of evidences concerning attending or graduating from an
institution.
The files are in JSON format. Each line is a triple with the following fields:
pred: predicate of a triple
sub: subject of a triple
obj: object of a triple
evidences: an array of evidences for this triple
url: the web page from which this evidence was obtained
snippet: short piece of text supporting the triple
judgments: an array of judgements from human annotators
rator: hash code of the identity of the annotator
judgment: judgement of the annotator. It can take the values "yes" or "no"
The software class LoadDicts was written by Larz60+ python-forum.io
Enjoy!
    '''
    def __init__(self):
        self.institution = []
        self.place_of_birth = []
        self.date_of_birth = []
        self.education = []
        self.place_of_death = []
        self.dictkeys = ['pred', 'sub', 'obj', 'evidences', 'url', 'snippet', 'judgments', 'rator', 'judgment']
        self.jsonlist = {
            'institution': {
                'filename': 'data/20130403-institution.json',
                'dict': self.institution
            },
            'place_of_birth': {
                'filename': 'data/20130403-place_of_birth.json',
                'dict': self.place_of_birth
            },
            'date_of_birth': {
                'filename': 'data/20131104-date_of_birth.json',
                'dict': self.date_of_birth
            },
            'education': {
                'filename': 'data/20131104-education-degree.json',
                'dict': self.education
            },
            'place_of_death': {
                'filename': 'data/20131104-place_of_death.json',
                'dict': self.place_of_death
            }
        }
        self.data_help = {
            'pred': 'predicate of a triple',
            'sub': 'subject of a triple',
            'obj': 'object of a triple',
            'evidences': 'an array of evidences for this triple',
            'url': 'the web page from which this evidence was obtained',
            'snippet': 'short piece of text supporting the triple',
            'judgments': 'an array of judgements from human annotators',
            'rator': 'hash code of the identity of the annotator',
            'judgment': 'judgement of the annotator. It can take the values "yes" or "no"'
        }
        # Load the files in alphabetical order of their keys.
        for key in sorted(self.jsonlist):
            # The institution file seems to be corrupted, so skip it for now.
            if key == 'institution':
                continue
            jfilename = self.jsonlist[key]['filename']
            # Renamed from 'dict' so the builtin is not shadowed.
            records = self.jsonlist[key]['dict']
            print(jfilename)
            with open(jfilename, 'r', encoding="utf8") as f:
                for line in f:
                    records.append(line)


def try_dynamic_nested_dict():
    keylist = ['bind', 'application level', 'binding']
    loc = [20, 96, 100, 101, 102, 104, 105, 115, 193, 434, 546]
    dd = DynamicNestedDict()
    dd.set_from_list(keylist, loc)
    result = dd.get_from_list(keylist)
    print("dd['bind']['application level']['binding']: {}".format(result))


if __name__ == '__main__':
    try_dynamic_nested_dict()
    ld = LoadDicts()

The goal is to get it all loaded into nested dictionaries dynamically.
The code for the dynamic dict is there, with a test, but it is not used yet.
The lists (stored under the key 'dict' because that is what they will become):
- self.institution = []
- self.place_of_birth = []
- self.date_of_birth = []
- self.education = []
- self.place_of_death = []
are getting loaded now, and each item contains one dictionary entry (still as a raw line from the file).
These need to get loaded as dynamic dictionaries.
I believe that to do so, each of the lists above needs to be redefined like
self.institution = DynamicNestedDict()
Each then gets loaded using the set_from_list method,
with dictkeys and the corresponding values from the current list (such as self.place_of_birth),
on an item-by-item basis. You can use the 'try_dynamic_nested_dict' function as a guide,
where loc will be an entry from any of the lists.
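To make that a bit more concrete, here is a minimal sketch of loading parsed lines through set_from_list. The class is repeated so the snippet runs on its own. The sample line is made up for illustration (it is not taken from the corpus), and keying the nesting by sub, pred and obj is just one assumption about how the levels could be arranged.

```python
import json
from collections import defaultdict


class DynamicNestedDict(defaultdict):
    def __init__(self, *args, **kwargs):
        super().__init__(DynamicNestedDict, *args, **kwargs)

    def set_from_list(self, keylist, value):
        level = self
        for key in keylist[:-1]:
            level = level[key]
        level[keylist[-1]] = value

    def get_from_list(self, keylist):
        level = self
        for key in keylist[:-1]:
            level = level[key]
        return level[keylist[-1]]


# Hypothetical line in the documented format; real values will differ.
sample_lines = [
    '{"pred": "p1", "sub": "s1", "obj": "o1", '
    '"evidences": [{"url": "http://example.com", "snippet": "text"}], '
    '"judgments": [{"rator": "r1", "judgment": "yes"}]}',
]


def load_lines(lines):
    """Parse each JSON line and file it under sub -> pred -> obj."""
    nested = DynamicNestedDict()
    for line in lines:
        record = json.loads(line)
        base = [record["sub"], record["pred"], record["obj"]]
        nested.set_from_list(base + ["evidences"], record["evidences"])
        nested.set_from_list(base + ["judgments"], record["judgments"])
    return nested


place_of_birth = load_lines(sample_lines)
print(place_of_birth.get_from_list(["s1", "p1", "o1", "judgments"]))
```

Intermediate levels are created on demand, so no level has to be initialised by hand before a value is stored.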
Posts: 24
Threads: 7
Joined: Dec 2016
Hey Larz60+
WOW!! This is just great!
What about - in addition to your code - the following feature (since features have to be, at the end, values)?
--> The token distance of the word "born" and the location?
For example, if we have this in a sentence:
"Born in New York City, he graduated from Union College in Schenectady in 1798."
How can you calculate the token distance between "born" and the location? (I mean I know how to do it the ugly way: just search "born" and then count till the location. But is there any more intelligent solution?)
Thanks a lot for your proposition and your help!
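One way to avoid counting by hand is to record token positions once and subtract them. The sketch below assumes a simple regex tokenizer (a stand-in for the real one) and defines the distance as the number of tokens strictly between the cue word and the location phrase; both the function name and that definition are choices made for this example, not a standard.

```python
import re


def tokenize(text):
    """Lowercase word tokens; punctuation is dropped."""
    return re.findall(r"\w+", text.lower())


def token_distance(snippet, cue, phrase):
    """Smallest number of tokens between cue and phrase, or None if absent."""
    tokens = tokenize(snippet)
    phrase_tokens = tokenize(phrase)
    n = len(phrase_tokens)
    if n == 0:
        return None
    cue_positions = [i for i, tok in enumerate(tokens) if tok == cue.lower()]
    phrase_starts = [i for i in range(len(tokens) - n + 1)
                     if tokens[i:i + n] == phrase_tokens]
    gaps = []
    for c in cue_positions:
        for start in phrase_starts:
            end = start + n - 1
            if c < start:
                gaps.append(start - c - 1)   # cue before the phrase
            elif c > end:
                gaps.append(c - end - 1)     # cue after the phrase
            else:
                gaps.append(0)               # cue inside the phrase
    return min(gaps) if gaps else None


sentence = ("Born in New York City, he graduated from Union College "
            "in Schenectady in 1798.")
born_to_city = token_distance(sentence, "born", "New York City")
print(born_to_city)  # 1, because only "in" sits between them
```

A return value of None (cue or location not found) has to be mapped to some number, for example a large constant, before it is used as a feature.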
Posts: 12,041
Threads: 487
Joined: Sep 2016
This is your homework assignment. I am showing you one way to approach the data.
It is not our intent to do the work for you; you need to step up and make an effort to put it all together.
Then, if and when you run into trouble, we'll be glad to help with specific items.
Posts: 24
Threads: 7
Joined: Dec 2016
In your code, how do you access, for example, the place-of-birth dict? (I mean, how can you see and check the entries?)
And, in the function "try_dynamic_nested_dict", where do the values of "loc" come from? Are they just arbitrary?
Posts: 12,041
Threads: 487
Joined: Sep 2016
As stated in the text of the post that contains the code, the dictionary hasn't even been created yet.
The try routine (as stated in same post as before) is only a test routine for the dynamic dict class.
That was to show how it works.
I will get back to you soon.
Posts: 12,041
Threads: 487
Joined: Sep 2016
Dec-13-2016, 04:13 AM
(This post was last modified: Dec-13-2016, 04:13 AM by Larz60+.)
There is still a problem with the nesting of evidences and judgments (they are dictionaries within lists in the JSON data).
The dictionaries are still not being built (and won't be until the above is fixed).
The dynamic nested dict is no longer necessary, though that may change.
Here's the new code (there are a lot of print statements, as it is still in debug mode) so you can play with it.
The football game (American, I know, not real football; I did a stint in Birmingham, England and got razzed about that all the time)
is about to start, so I will be done for the evening.
import json
import sys


class LoadDicts:
    '''
    LoadDicts - Class which loads all json data into dictionaries
    '''
    def __init__(self):
        self.institution = {}
        self.place_of_birth = {}
        self.date_of_birth = {}
        self.education = {}
        self.place_of_death = {}
        self.dictkeys = ['obj', 'pred', 'sub', 'evidences', 'url', 'snippet', 'judgments', 'rator', 'judgment']
        self.jsonlist = {
            'institution': {
                'filename': 'data/20130403-institution.json',
                'dict': self.institution
            },
            'place_of_birth': {
                'filename': 'data/20130403-place_of_birth.json',
                'dict': self.place_of_birth
            },
            'date_of_birth': {
                'filename': 'data/20131104-date_of_birth.json',
                'dict': self.date_of_birth
            },
            'education': {
                'filename': 'data/20131104-education-degree.json',
                'dict': self.education
            },
            'place_of_death': {
                'filename': 'data/20131104-place_of_death.json',
                'dict': self.place_of_death
            }
        }
        self.data_help = {
            'pred': 'predicate of a triple',
            'sub': 'subject of a triple',
            'obj': 'object of a triple',
            'evidences': 'an array of evidences for this triple',
            'url': 'the web page from which this evidence was obtained',
            'snippet': 'short piece of text supporting the triple',
            'judgments': 'an array of judgements from human annotators',
            'rator': 'hash code of the identity of the annotator',
            'judgment': 'judgement of the annotator. It can take the values "yes" or "no"'
        }
        self.load_data()

    def __repr__(self):
        # LoadDicts is not iterable, so dict(self) would raise a TypeError;
        # show the loaded dictionaries by name instead.
        return repr({name: item['dict'] for name, item in self.jsonlist.items()})

    def load_data(self):
        for name, item in self.jsonlist.items():
            print('name: {}, item: {}'.format(name, item))
            filename = item['filename']
            print("item['filename']: {}".format(filename))
            d = item['dict']  # target dictionary, not filled yet
            print('name: {}'.format(name))
            with open(filename, 'r', encoding="utf8") as f:
                try:
                    for line in f:
                        tempdict = json.loads(line)
                        print('tempdict: {}'.format(tempdict))
                        for key1, value1 in tempdict.items():
                            print('key1: {}, value1: {}'.format(key1, value1))
                            if (key1 == 'judgments') or (key1 == 'evidences'):
                                # These values are lists of dictionaries.
                                for subdict in value1:
                                    print('    type subdict: {}'.format(type(subdict)))
                                    print('    subdict: {}'.format(subdict))
                                    # for key2, value2 in subdict:
                                    #     print('    key2: {}, value2: {}'.format(key2, value2))
                        print()
                        # input()
                except Exception:
                    print("Unexpected error:", sys.exc_info()[0])
                    print("error in file: {}".format(filename))


if __name__ == '__main__':
    ld = LoadDicts()
    # ld.show_data(ld.education)

There's an interesting article on this here:
http://searchengineland.com/demystifying-knowledge-graph-201976
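On the nesting issue: since evidences and judgments are lists of dictionaries, each list has to be looped over, and the key/value pairs of each inner dictionary come from .items() (iterating a dict directly yields only its keys). A rough sketch of flattening one line into one row per evidence follows. The sample line and the yes_votes / total_votes field names are invented for this example.

```python
import json


def flatten_record(line):
    """Turn one JSON line into a list of flat rows, one per evidence."""
    record = json.loads(line)
    votes = [j.get("judgment") for j in record.get("judgments", [])]
    rows = []
    for evidence in record.get("evidences", []):
        rows.append({
            "pred": record.get("pred"),
            "sub": record.get("sub"),
            "obj": record.get("obj"),
            "url": evidence.get("url"),
            "snippet": evidence.get("snippet"),
            "yes_votes": votes.count("yes"),   # hypothetical field name
            "total_votes": len(votes),         # hypothetical field name
        })
    return rows


# Made-up line in the documented format, for illustration only.
sample = ('{"pred": "p1", "sub": "s1", "obj": "o1", '
          '"evidences": [{"url": "http://example.com", "snippet": "text"}], '
          '"judgments": [{"rator": "r1", "judgment": "yes"}, '
          '{"rator": "r2", "judgment": "no"}, '
          '{"rator": "r3", "judgment": "yes"}]}')

rows = flatten_record(sample)
for row in rows:
    for key, value in row.items():
        print("{}: {}".format(key, value))
```

A majority vote over yes_votes and total_votes would be one way to derive a training label from the annotator judgments.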
Posts: 24
Threads: 7
Joined: Dec 2016
Dec-13-2016, 02:27 PM
(This post was last modified: Dec-13-2016, 02:27 PM by MattaFX.)
Dear Larz60+
Thanks a lot for your proposition! Currently, I'm playing with the code and trying out my ideas.
I'll write again if I have questions :)
Edit: I already have one question: Could you please show, as an example, how you would build and fill a dictionary?