Python Forum
Python Based Keyword and Stemming
Hello All,

I have a Python script that pulls keywords out of a data set. The data set has three columns:
1. SysID 2. ID 3. Comment

At the moment the script extracts keywords from the Comment column only to a limited extent, and it prints just the keywords without the other columns.

Could someone help me alter the script so that it trims the Comment column down to the precise keywords for each row, while leaving the SysID and ID columns intact?

#!/usr/bin/env python2.7
import numpy as np
from collections import Counter
import csv

class Preprocess_data():


        def __init__(self, data, k_number_of_features=5000):
                self.k = k_number_of_features
                self.words = zip(*data)[2]


        def get_word(self, data):
                punc1 = ("~`!@#$%^&*()_-+=[]{}\|;:',<.>/?")
                punc2 = ('"')
                wordsbag = []
                words = zip(*data)[2]
                words = [item.lower().translate(None, punc1).translate(None, punc2) for item in words]
                self.words = [item.split() for item in words]
                for line in self.words:
                        wordsbag.extend(set(line))
                return wordsbag


        def count_attr(self,data):
                c = Counter(self.get_word(data))
                feature = c.most_common(100+self.k)[100:100+self.k]
                return feature


        def summarize_feature(self, data):
                words = self.words
                feature = self.count_attr(data)
                feature_value = np.zeros((len(data), len(feature)))
                for i in range(len(words)):
                        for j in range(len(feature)):
                                if (feature[j][0] in words[i]):
                                        feature_value[i][j] = 1
                                else:
                                        feature_value[i][j] = 0
                return feature_value



if __name__=='__main__':
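        with open('testfile', 'rU') as infile:
                data = list(csv.reader(infile, delimiter='\t'))
        # k_number_of_features must be an integer: a string such as 'n'
        # makes 100+self.k in count_attr raise a TypeError
        preprocessed = Preprocess_data(data, k_number_of_features=5000)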
        wordsbag = preprocessed.get_word(data)
        feature = preprocessed.count_attr(data)
        feature_value = preprocessed.summarize_feature(data)
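        #-------print the selected words (count_attr skips the 100 most common)---------#
        # iterate over what was actually returned; range(3000) raises
        # an IndexError when fewer than 3000 features exist
        for i, (word, count) in enumerate(feature):
                print('WORD' + str(i+1) + ' ' + word)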
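
A note for anyone running this under Python 3: `str.translate(None, ...)` and indexing the result of `zip` only work in Python 2. Below is a minimal sketch of the same tokenizing and counting steps, assuming Python 3. The names `tokenize` and `document_frequency` are illustrative and not part of the script above; `string.punctuation` covers the same characters as `punc1` plus `punc2`.

```python
import string
from collections import Counter

# Translation table that deletes every ASCII punctuation character.
_STRIP_PUNCT = str.maketrans('', '', string.punctuation)


def tokenize(comment):
    """Lowercase a comment, drop punctuation, and split on whitespace."""
    return comment.lower().translate(_STRIP_PUNCT).split()


def document_frequency(comments):
    """Count in how many comments each word appears (once per comment)."""
    counts = Counter()
    for comment in comments:
        counts.update(set(tokenize(comment)))
    return counts


# Two comments taken from the sample data in this post.
comments = [
    'The menu is damaged.',
    'The terrace tiles are cracked /damaged.',
]
df = document_frequency(comments)
print(df.most_common(3))
```

As in `get_word`, each word is counted at most once per comment, so the totals are document frequencies rather than raw word counts.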
Sample Dataset

Output:
SAMPLE INPUT FILE CONTENTS
==========================
4819 810 The locker doors "Inside" were marked and not polished properly.
4885 1313 The seal around / on top of the flush panel is damaged.
4932 825 The clock facing the bag drop drive way is not set correctly / displays incorrect time.
5067 744 Gaps are visible between the interlock flooring tiles.
5027 737 The menu is damaged.
5067 748 The wall is seen blistered.
4845 825 The left side of the panel is fused.
4952 810 The terrace tiles are damaged.
5496 1044 tetst
5022 732 The service door is left open and construction equipment is left unattended.
5496 1044 test
5496 2009 test
4952 810 The terrace tiles are cracked /damaged.
5058 1110 The
5067 2022 The umbrella's bases of the restaurant are seen dusty and dirty.
5058 1110 The Interlock flooring is seen damaged and stained.
5058 1110 Gaps are visible between Interlock flooring.
5058 1110 Several toilet cubicles doors are seen chipped.
5489 824 tttt
5058 1110 The prayer timings electrical board has been removed during painting and never returned back and a mark is visible on the wall.
4771 693 The toilet cubicle skirtings are scratched.
5026 52 The terrace is damaged.
5027 737 The menu is damaged.
5026 743 The terrace is damaged.
4906 24 fgfgf
5059 829 The wall around the A/C grill is stained.
5059 829 The door stopper is missing and tile is damaged by door handle.
5059 829 The soap holder is missing.
5059 829 The douche tap fitting is loose.
5059 829 The corner of the wall is damaged and moldy.
5059 829 The ping pong table is damaged.
5059 829 The sign at the gate to pool area is faded.
5059 829 The protective net is not properly installed. The fitting is untidy.
5059 829 The pool loungers are stained.
5059 829 The corner of the wall is damaged and moldy.
5059 829 The corner of the wall is damaged and moldy.
5059 829 The corner of the wall is damaged and moldy.
5058 1117 The empty unit is seen not hoarded; window is dirty and dust is visible from the window.
5058 1110 The flooring arrows are faded and worn.
5490 1957 test
5022 732 There appears to be water damage on the dipped ceiling.
4825 833 The
5022 727 The information about where the stairs lead to is missing.
5022 732 The stairs walls are all blank. Information about what is at the top of the stairs needs to be added to those walls.
5022 732 The yellow exit sign painted on the wall is damaged above it and the paint is uneven and untidy.
5022 732 The yellow car park sign hanging from the ceiling is chipped at the lower left ledge.
5056 833 Ceiling access panels are still found missing.
5056 833 Main door is damaged on lower edge.
5022 732 There is yellow tape in a square shape left above the Tche Tche Cafe sign on the wall.
5056 833 Tiles panels are damaged.
The current output from the script is below:

Output:
WORD1 working
WORD2 correctly
WORD3 cover
WORD4 ac
WORD5 doors
WORD6 it
WORD7 full
WORD8 display
WORD9 parking
WORD10 heavily
WORD11 wooden
WORD12 for
WORD13 edges
WORD14 humidity
WORD15 cubicles
WORD16 fitted
WORD17 out
WORD18 room
WORD19 tree
WORD20 behind
WORD21 fence
WORD22 ok
WORD23 dusty
WORD24 cabinet
WORD25 along
WORD26 rusty
WORD27 overgrown
WORD28 as
WORD29 signs
WORD30 protruding
WORD31 painted
WORD32 fountain
WORD33 covered
WORD34 does
WORD35 dry
WORD36 availability
WORD37 lift
WORD38 operational
WORD39 severally
WORD40 poor
WORD41 found
WORD42 litter
WORD43 blistered
The expected result is:

Output:
SysID ID Keywords
5067 2022 umbrella's, dusty, dirty.
5058 1110 Interlock, damaged, stained.
5058 1110 Gaps, flooring.
5058 1110 toilet, doors, chipped.
Thanks in advance; I hope someone can help.
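
One possible starting point for the per-row result described above, offered as a rough sketch rather than a finished answer: keep SysID and ID as they are and reduce each comment to the words that survive a stop-word filter. It assumes Python 3 and a tab-delimited file as in the script. The `STOP_WORDS` set and the names `row_keywords` and `extract` are hypothetical, and a plain stop-word filter will not reproduce the expected result exactly (it keeps "visible", for example), so the list would need tuning or a stemming step on top.

```python
import csv
import io
import string

# Hypothetical stop-word list (an assumption); extend it to suit the data.
STOP_WORDS = {
    'a', 'an', 'and', 'are', 'at', 'between', 'by', 'in', 'is', 'not',
    'of', 'on', 'seen', 'several', 'the', 'to', 'were',
}


def row_keywords(comment):
    """Return the non-stop words of one comment, in order, without repeats."""
    keywords = []
    for token in comment.split():
        # Strip leading/trailing punctuation but keep inner apostrophes.
        word = token.strip(string.punctuation)
        if word and word.lower() not in STOP_WORDS and word not in keywords:
            keywords.append(word)
    return keywords


def extract(rows):
    """Yield (SysID, ID, keywords) for every row that has three columns."""
    for row in rows:
        if len(row) < 3:
            continue  # skip malformed or empty lines
        sys_id, row_id, comment = row[0], row[1], row[2]
        yield sys_id, row_id, ', '.join(row_keywords(comment))


# Two rows from the sample data, tab-delimited as the script expects.
sample = (
    "5058\t1110\tGaps are visible between Interlock flooring.\n"
    "5058\t1110\tSeveral toilet cubicles doors are seen chipped.\n"
)
results = list(extract(csv.reader(io.StringIO(sample), delimiter='\t')))
for sys_id, row_id, keywords in results:
    print('\t'.join((sys_id, row_id, keywords)))
```

To run it on the real file, replace the `io.StringIO(sample)` argument with a file opened via `open('testfile', newline='')`.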