Python Forum

Pages: 1 2

Hi everyone

I want to apply a supervised approach to learning relations between entities in a text. The relations are: place of birth and institution.

I'm working with this data: code.google.com/p/relation-extraction-corpus/downloads/list

My question is: What kind of features would you extract? And how would you implement this?
(At the end, I want to use the features as input to the logistic regression.)

Thanks a lot for any help and hint!

What are your ideas so far?

My ideas so far:
- tokenization (done)
- pos-tagging (done)
- chunking (done)

- token distance (how would you implement this?)
- appearance of words
- order of entities

But these are my ideas. I don't know if there would be better ones...
If you want to, I can send you the whole task / code.
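The three open ideas above (token distance, appearance of words, order of entities) could be turned into numeric features roughly as follows. This is only a minimal sketch: the function names, the cue-word list and the regex tokenizer are illustrative assumptions, and it assumes the surface strings of both entities are known and appear literally in the snippet.

```python
import re


def tokenize(text):
    """Crude word tokenizer (a stand-in for the real tokenization step)."""
    return re.findall(r"\w+", text.lower())


def find_span(tokens, phrase_tokens):
    """Index of the first occurrence of phrase_tokens in tokens, or -1."""
    if not phrase_tokens:
        return -1
    n = len(phrase_tokens)
    for i in range(len(tokens) - n + 1):
        if tokens[i:i + n] == phrase_tokens:
            return i
    return -1


def pair_features(snippet, subject, obj,
                  cue_words=("born", "graduated", "attended")):
    """Build a flat feature dict for one (subject, object) pair.

    cue_words is a hypothetical list; pick words that fit your relations.
    """
    tokens = tokenize(snippet)
    sub_toks, obj_toks = tokenize(subject), tokenize(obj)
    sub_at = find_span(tokens, sub_toks)
    obj_at = find_span(tokens, obj_toks)
    found = sub_at >= 0 and obj_at >= 0

    features = {}
    # Appearance of words: one binary feature per cue word.
    for word in cue_words:
        features["has_" + word] = int(word in tokens)
    # Order of entities: 1 if the subject is mentioned before the object.
    features["both_found"] = int(found)
    features["subject_first"] = int(found and sub_at < obj_at)
    # Token distance: number of tokens strictly between the two mentions.
    if found:
        if sub_at < obj_at:
            gap = obj_at - (sub_at + len(sub_toks))
        else:
            gap = sub_at - (obj_at + len(obj_toks))
        features["token_gap"] = max(gap, 0)
    else:
        # Sentinel value; "both_found" lets the model tell this case apart.
        features["token_gap"] = -1
    return features
```

Each snippet then becomes one dict of numbers, which can be converted to a fixed-order list of values before it is handed to the logistic regression.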

OK,

I'm working on an idea. It's not quite there and I need some sleep (about 4 hours), but I'll present what I have so far
and you can play with it.

The data from the institution file seems to be corrupted and will probably have to be edited.
For now, I removed it from the load loop.

import json
from collections import namedtuple, defaultdict


class DynamicNestedDict(defaultdict):
    def __init__(self, *args, **kwargs):
        super(DynamicNestedDict, self).__init__(DynamicNestedDict, *args, **kwargs)

    def set_from_list(self, keylist, value):
        level = self
        for key in keylist[:-1]:
            level = level[key]
        level[keylist[-1]] = value

    def get_from_list(self, keylist):
        level = self
        for key in keylist[:-1]:
            level = level[key]
        return level[keylist[-1]]

class LoadDicts:
    '''
    LoadDicts - Class which loads all json data into dictionaries

    License:
        This data set is made available under a Creative Commons License:
        http://creativecommons.org/licenses/by-sa/3.0/

    You are free:
        to Share — to copy, distribute and transmit the work
        to Remix — to adapt the work
        to make commercial use of the work

    Under the following conditions:

     Attribution — You must attribute the work in the manner specified by the author or licensor (but not in any way that suggests that they endorse you or your use of the work).

     Share Alike — If you alter, transform, or build upon this work, you may distribute the resulting work only under the same or similar license to this one.

    With the understanding that:

    Waiver — Any of the above conditions can be waived if you get permission from the copyright holder.
    Public Domain — Where the work or any of its elements is in the public domain under applicable law, that status is in no way affected by the license.
    Other Rights — In no way are any of the following rights affected by the license:
     - Your fair dealing or fair use rights, or other applicable copyright exceptions and limitations;
     - The author's moral rights;
     - Rights other persons may have either in the work itself or in how the work is used, such as publicity or privacy rights.


    Notice — For any reuse or distribution, you must make clear to others the license terms of this work. The best way to do this is with a link to this web page.
    http://creativecommons.org/licenses/by-sa/3.0/


    Introduction
    =============

    This relation extraction corpus consists of snippets from wikipedia
    annotated with rater judgments recording whether they are evidence that
    indicates a relation between two entities.

    There are two files. 20130403-place_of_birth.json contains 9566 judgments
    of evidences concerning place of birth, and 20130403-institution.json contains
    42628 judgments of evidences concerning attending or graduating from an
    institution.

    The files are in JSON format. Each line is a triple with the following fields:


     pred: predicate of a triple
     sub: subject of a triple
     obj:  object of a triple
     evidences: an array of evidences for this triple
     url: the web page from which this evidence was obtained
     snippet: short piece of text supporting the triple
     judgments: an array of judgements from human annotators
     rator: hash code of the identity of the annotator
     judgment: judgement of the annotator. It can take the values "yes" or "no"

     The software class LoadDicts was written by Larz60+ python-forum.io

    Enjoy!

    '''
    def __init__(self):
        self.institution = []
        self.place_of_birth = []
        self.date_of_birth = []
        self.education = []
        self.place_of_death = []

        self.dictkeys = ['pred', 'sub', 'obj', 'evidences', 'url', 'snippet', 'judgments', 'rator', 'judgment']
        self.jsonlist = {
            'institution': {
                'filename': 'data/20130403-institution.json',
                'dict': self.institution
            },
            'place_of_birth': {
                'filename': 'data/20130403-place_of_birth.json',
                'dict': self.place_of_birth
            },
            'date_of_birth': {
                'filename': 'data/20131104-date_of_birth.json',
                'dict': self.date_of_birth
            },
            'education': {
                'filename': 'data/20131104-education-degree.json',
                'dict': self.education
            },
            'place_of_death': {
                'filename': 'data/20131104-place_of_death.json',
                'dict': self.place_of_death
            }
        }

        self.data_help = {
            'pred': 'predicate of a triple',
            'sub': 'subject of a triple',
            'obj': 'object of a triple',
            'evidences': 'an array of evidences for this triple',
            'url': 'the web page from which this evidence was obtained',
            'snippet': 'short piece of text supporting the triple',
            'judgments': 'an array of judgements from human annotators',
            'rator': 'hash code of the identity of the annotator',
            'judgment': 'judgement of the annotator. It can take the values "yes" or "no"'
        }


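        # Load each file in a stable (alphabetical) order.
        for key in sorted(self.jsonlist):
            # The institution file appears to be corrupted, so skip it for now.
            if key == 'institution':
                continue
            jfilename = self.jsonlist[key]['filename']
            # Named 'records' rather than 'dict' to avoid shadowing the built-in.
            records = self.jsonlist[key]['dict']
            print(jfilename)
            with open(jfilename, 'r', encoding="utf8") as f:
                for line in f:
                    # Each entry is still the raw JSON text of one line.
                    records.append(line)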

def try_dynamic_nested_dict():
    keylist = ['bind', 'application level', 'binding']
    loc = [20, 96, 100, 101, 102, 104, 105, 115, 193, 434, 546]
    dd = DynamicNestedDict()
    dd.set_from_list(keylist, loc)
    result = dd.get_from_list(keylist)
    print("dd['bind']['application level']['binding']: {}".format(result))


if __name__ == '__main__':
    try_dynamic_nested_dict()
    ld = LoadDicts()

The goal is to get it all loaded into nested dictionaries dynamically.
The code for the dynamic dict is there, with a test, but it is not used yet.
The lists (named dict because that is what they will become):

self.institution = []
self.place_of_birth = []
self.date_of_birth = []
self.education = []
self.place_of_death = []

are getting loaded now, and each cell contains one entry (one line of the JSON file, still as text).
These need to get loaded as dynamic dictionaries.

I believe that to do so, each of the lists above needs to be redefined like this:



    self.institution = DynamicNestedDict()

Each one then gets loaded using the set_from_list method, with the
dictkeys and the corresponding values from the current list (such as self.place_of_birth),
on an item-by-item basis. You can use the 'try_dynamic_nested_dict' function as a guide,
where loc would be an entry from any of the lists.
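As a rough illustration of that loading step, the sketch below parses each line with json.loads and stores the top-level fields under [line index, field name]. The sample line, the load_nested helper and the choice of key path are assumptions made for the example, not part of the posted code.

```python
import json
from collections import defaultdict


class DynamicNestedDict(defaultdict):
    """defaultdict that creates nested levels on first access."""

    def __init__(self, *args, **kwargs):
        super().__init__(DynamicNestedDict, *args, **kwargs)

    def set_from_list(self, keylist, value):
        level = self
        for key in keylist[:-1]:
            level = level[key]
        level[keylist[-1]] = value


# Hypothetical line in the documented format (not real corpus data).
SAMPLE_LINES = [
    json.dumps({
        "pred": "place_of_birth",
        "sub": "subject-1",
        "obj": "object-1",
        "evidences": [{"url": "http://example.org/a",
                       "snippet": "Example snippet."}],
        "judgments": [{"rator": "r1", "judgment": "yes"}],
    }),
]


def load_nested(lines, dictkeys):
    """Store every top-level field of each line under [index][field]."""
    nested = DynamicNestedDict()
    for index, line in enumerate(lines):
        record = json.loads(line)
        for key in dictkeys:
            # Fields such as 'url' or 'rator' live inside the nested
            # lists, so they are not top-level keys and are skipped here.
            if key in record:
                nested.set_from_list([index, key], record[key])
    return nested


dictkeys = ['pred', 'sub', 'obj', 'evidences', 'url', 'snippet',
            'judgments', 'rator', 'judgment']
nested = load_nested(SAMPLE_LINES, dictkeys)
print(nested[0]['sub'])
```

Keying by line index is just one option; keying by subject or predicate would work the same way with a different key list.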

Hey Larz60+

WOW!! This is just great!

What about - in addition to your code - the following feature (since features have to be, at the end, values)?
--> The token distance of the word "born" and the location?

For example, if we have this in a sentence:
"Born in New York City, he graduated from Union College in Schenectady in 1798."

How can you calculate the token distance between "born" and the location? (I mean I know how to do it the ugly way: just search "born" and then count till the location. But is there any more intelligent solution?)

Thanks a lot for your proposition and your help!
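One way to avoid counting by hand is to tokenize once, collect the positions of the cue word and of the location phrase, and take the smallest index difference. The sketch below assumes a simple regex tokenizer and that the location string is already known; token_distance is a made-up helper name.

```python
import re


def token_distance(sentence, cue, phrase):
    """Smallest token-index distance between `cue` and the start of `phrase`.

    Returns None if either one does not occur in the sentence.
    """
    tokens = re.findall(r"\w+", sentence.lower())
    phrase_toks = re.findall(r"\w+", phrase.lower())
    if not phrase_toks:
        return None
    cue = cue.lower()
    n = len(phrase_toks)

    # All positions of the cue word and of the (multi-token) phrase.
    cue_positions = [i for i, tok in enumerate(tokens) if tok == cue]
    phrase_positions = [i for i in range(len(tokens) - n + 1)
                        if tokens[i:i + n] == phrase_toks]
    if not cue_positions or not phrase_positions:
        return None
    return min(abs(p - c) for c in cue_positions for p in phrase_positions)


SENTENCE = ("Born in New York City, he graduated from Union College "
            "in Schenectady in 1798.")
print(token_distance(SENTENCE, "born", "New York City"))  # 2
print(token_distance(SENTENCE, "born", "Schenectady"))    # 11
```

Taking the minimum over all occurrences handles sentences where "born" or the location appears more than once.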

This is your homework assignment. I am showing you one way to approach the data.
It is not our intent to do the work for you; you need to step up and make an effort to put it all together.
Then, if and when you run into trouble, we'll be glad to help with specific items.

In your code, how do you access for example the place-of-birth-dict? (I mean, how can you see and check the entries?)

And, in the function "try_dynamic_nested_dict", where do the values of "loc" come from? Are they just arbitrary?

As stated in the text of the post that contains the code, the dictionary hasn't even been created yet.

The try routine (as stated in same post as before) is only a test routine for the dynamic dict class.
That was to show how it works.

I will get back to you soon.

There is still a problem with the nesting on evidences and judgements (they are dictionaries within lists in json data).

The dictionaries are still not being built (until above is fixed)

The dynamic nested dict is no longer necessary, though that may change.

Here's the new code (there are a lot of print statements, as it is still in debug mode) so you can play with it.

The football game (American; I know, not real football. I did a stint in Birmingham, England and got razzed on that all the time)
is about to start, so I will be done for the evening.

import json
import sys


class LoadDicts:
    '''
    LoadDicts - Class which loads all json data into dictionaries

    The license text and corpus description are unchanged from the
    docstring of the first LoadDicts listing earlier in this thread.

    The software class LoadDicts was written by Larz60+ python-forum.io
    '''
    def __init__(self):
        self.institution = {}
        self.place_of_birth = {}
        self.date_of_birth = {}
        self.education = {}
        self.place_of_death = {}

        self.dictkeys = ['obj', 'pred', 'sub', 'evidences', 'url', 'snippet', 'judgments', 'rator', 'judgment']
        self.jsonlist = {
            'institution': {
                'self.filename': 'data/20130403-institution.json',
                'dict': self.institution
            },
            'place_of_birth': {
                'self.filename': 'data/20130403-place_of_birth.json',
                'dict': self.place_of_birth
            },
            'date_of_birth': {
                'self.filename': 'data/20131104-date_of_birth.json',
                'dict': self.date_of_birth
            },
            'education': {
                'self.filename': 'data/20131104-education-degree.json',
                'dict': self.education
            },
            'place_of_death': {
                'self.filename': 'data/20131104-place_of_death.json',
                'dict': self.place_of_death
            }
        }

        self.data_help = {
            'pred': 'predicate of a triple',
            'sub': 'subject of a triple',
            'obj': 'object of a triple',
            'evidences': 'an array of evidences for this triple',
            'url': 'the web page from which this evidence was obtained',
            'snippet': 'short piece of text supporting the triple',
            'judgments': 'an array of judgements from human annotators',
            'rator': 'hash code of the identity of the annotator',
            'judgment': 'judgement of the annotator. It can take the values "yes" or "no"'
        }
        self.load_data()

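    def __repr__(self):
        # dict(self) would raise TypeError because LoadDicts is not iterable,
        # so show the file/dict mapping instead.
        return repr(self.jsonlist)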

    def load_data(self):
        for name, item in self.jsonlist.items():
            print('name: {}, item: {}'.format(name, item))
            print("item['self.filename']: {}".format(item['self.filename']))
            d = item['dict']
            self.filename = item['self.filename']
            print('name: {}'.format(name))
            with open(self.filename, 'r', encoding="utf8") as f:
                try:
                    for line in f:
                        tempdict = json.loads(line)
                        print('tempdict: {}'.format(tempdict))
                        for key1, value1 in tempdict.items():
                            print('key1: {}, value1: {}'.format(key1, value1))
                            if (key1 == 'judgments') or (key1 == 'evidences'):
                                for subdict in value1:
                                    print('    type subdict: {}'.format(type(subdict)))
                                    print('    subdict: {}'.format(subdict))
                                    # for key2, value2 in subdict.items():
                                    #     print('        key2: {}, value2: {}'.format(key2, value2))
                                print()
                        # input()
                except Exception:
                    # Catch Exception rather than using a bare except, so Ctrl-C still works.
                    print("Unexpected error:", sys.exc_info()[0])
                    print("error in file: {}".format(self.filename))


if __name__ == '__main__':
    ld = LoadDicts()
    # ld.show_data(ld.education)
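The nesting problem with evidences and judgments (lists of dictionaries) could be handled by flattening each record into one row per evidence, with the annotator votes counted once per record. This is a sketch under the documented field layout; flatten_record and the sample line are hypothetical. Note that the inner loop over a sub-dictionary needs .items(), since iterating a dict directly yields only its keys.

```python
import json

# Hypothetical line in the documented format (not real corpus data).
SAMPLE_LINE = json.dumps({
    "pred": "place_of_birth",
    "sub": "subject-1",
    "obj": "object-1",
    "evidences": [
        {"url": "http://example.org/a", "snippet": "First snippet."},
        {"url": "http://example.org/b", "snippet": "Second snippet."},
    ],
    "judgments": [
        {"rator": "r1", "judgment": "yes"},
        {"rator": "r2", "judgment": "yes"},
        {"rator": "r3", "judgment": "no"},
    ],
})


def flatten_record(record):
    """Return one flat dict per evidence, with the vote counts attached."""
    votes = [j.get("judgment") for j in record.get("judgments", [])]
    rows = []
    for evidence in record.get("evidences", []):
        row = {
            "pred": record.get("pred"),
            "sub": record.get("sub"),
            "obj": record.get("obj"),
            "yes_votes": votes.count("yes"),
            "no_votes": votes.count("no"),
        }
        # .items() yields (key, value) pairs from the nested dictionary.
        for key, value in evidence.items():
            row[key] = value
        rows.append(row)
    return rows


rows = flatten_record(json.loads(SAMPLE_LINE))
for row in rows:
    print(row["snippet"], row["yes_votes"], row["no_votes"])
```

Flat rows like these are easier to feed into a feature extractor than the original nested structure.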

There's an interesting article on this here:
http://searchengineland.com/demystifying-knowledge-graph-201976

Dear Larz60+

Thanks a lot for your suggestion! I'm currently playing with the code and trying out my ideas.
I'll write again if I have questions :)

Edit: I already have one question: could you please show, as an example, how you would build and fill a dictionary?
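One possible way to build and fill a plain dictionary from the JSON lines is sketched below: it groups entries by subject and attaches a 0/1 label from a majority vote over the judgments. The helper names, the sample lines and the tie-breaking rule (a tie counts as 0) are assumptions for illustration only.

```python
import json
from collections import Counter

# Hypothetical lines in the documented format (not real corpus data).
SAMPLE_LINES = [
    json.dumps({"pred": "place_of_birth", "sub": "subject-1", "obj": "object-1",
                "evidences": [{"url": "http://example.org/a", "snippet": "Snippet A."}],
                "judgments": [{"rator": "r1", "judgment": "yes"},
                              {"rator": "r2", "judgment": "yes"},
                              {"rator": "r3", "judgment": "no"}]}),
    json.dumps({"pred": "place_of_birth", "sub": "subject-1", "obj": "object-2",
                "evidences": [{"url": "http://example.org/b", "snippet": "Snippet B."}],
                "judgments": [{"rator": "r1", "judgment": "no"}]}),
]


def majority_label(judgments):
    """1 if 'yes' votes outnumber 'no' votes, else 0 (ties count as 0)."""
    votes = Counter(j["judgment"] for j in judgments)
    return 1 if votes["yes"] > votes["no"] else 0


def build_index(lines):
    """Map each subject to a list of {obj, snippets, label} entries."""
    index = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        entry = {
            "obj": record["obj"],
            "snippets": [ev["snippet"] for ev in record["evidences"]],
            "label": majority_label(record["judgments"]),
        }
        # setdefault creates the list the first time a subject is seen.
        index.setdefault(record["sub"], []).append(entry)
    return index


index = build_index(SAMPLE_LINES)
for entry in index["subject-1"]:
    print(entry["obj"], entry["label"], entry["snippets"])
```

The label field can serve directly as the target value for the logistic regression, with the snippets as the source of the features.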


MattaFX

Larz60+

MattaFX

Larz60+

MattaFX

Larz60+

MattaFX

Larz60+

Larz60+

MattaFX