Nov-16-2018, 10:05 AM
Hello All,
I have a Python script that pulls keywords out of a data set. The data set contains 3 columns:
1. SysID 2. ID 3. Comment Section.
At present the script only extracts keywords from the Comment section, and only to a certain extent, and it displays just the keywords without the other columns.
Can someone help me alter the script so that it trims the Comment column down to the precise keywords for each row, without dropping the other columns?
#!/usr/bin/env python2.7
import numpy as np
from collections import Counter
import csv

class Preprocess_data():
    def __init__(self, data, k_number_of_features=5000):
        self.k = k_number_of_features
        self.words = zip(*data)[2]

    def get_word(self, data):
        punc1 = ("~`!@#$%^&*()_-+=[]{}\|;:',<.>/?")
        punc2 = ('"')
        wordsbag = []
        words = zip(*data)[2]
        words = [item.lower().translate(None, punc1).translate(None, punc2) for item in words]
        self.words = [item.split() for item in words]
        for line in self.words:
            wordsbag.extend(set(line))
        return wordsbag

    def count_attr(self,data):
        c = Counter(self.get_word(data))
        feature = c.most_common(100+self.k)[100:100+self.k]
        return feature

    def summarize_feature(self, data):
        words = self.words
        feature = self.count_attr(data)
        feature_value = np.zeros((len(data), len(feature)))
        for i in range(len(words)):
            for j in range(len(feature)):
                if (feature[j][0] in words[i]):
                    feature_value[i][j] = 1
                else:
                    feature_value[i][j] = 0
        return feature_value

if __name__=='__main__':
    file = open('testfile', 'rU')
    data = list(csv.reader(file, delimiter='\t'))
    preprocessed = Preprocess_data(data, k_number_of_features='n')
    wordsbag = preprocessed.get_word(data)
    feature = preprocessed.count_attr(data)
    feature_value = preprocessed.summarize_feature(data)
    #-------print the most common ten words---------#
    for i in range(3000):
        print 'WORD' + str(i+1), feature[i][0]

Sample Dataset
SAMPLE INPUT FILE CONTENTS
==========================
4819 810 The locker doors "Inside" were marked and not polished properly.
4885 1313 The seal around / on top of the flush panel is damaged.
4932 825 The clock facing the bag drop drive way is not set correctly / displays incorrect time.
5067 744 Gaps are visible between the interlock flooring tiles.
5027 737 The menu is damaged.
5067 748 The wall is seen blistered.
4845 825 The left side of the panel is fused.
4952 810 The terrace tiles are damaged.
5496 1044 tetst
5022 732 The service door is left open and construction equipment is left unattended.
5496 1044 test
5496 2009 test
4952 810 The terrace tiles are cracked /damaged.
5058 1110 The
5067 2022 The umbrella's bases of the restaurant are seen dusty and dirty.
5058 1110 The Interlock flooring is seen damaged and stained.
5058 1110 Gaps are visible between Interlock flooring.
5058 1110 Several toilet cubicles doors are seen chipped.
5489 824 tttt
5058 1110 The prayer timings electrical board has been removed during painting and never returned back and a mark is visible on the wall.
4771 693 The toilet cubicle skirtings are scratched.
5026 52 The terrace is damaged.
5027 737 The menu is damaged.
5026 743 The terrace is damaged.
4906 24 fgfgf
5059 829 The wall around the A/C grill is stained.
5059 829 The door stopper is missing and tile is damaged by door handle.
5059 829 The soap holder is missing.
5059 829 The douche tap fitting is loose.
5059 829 The corner of the wall is damaged and moldy.
5059 829 The ping pong table is damaged.
5059 829 The sign at the gate to pool area is faded.
5059 829 The protective net is not properly installed. The fitting is untidy.
5059 829 The pool loungers are stained.
5059 829 The corner of the wall is damaged and moldy.
5059 829 The corner of the wall is damaged and moldy.
5059 829 The corner of the wall is damaged and moldy.
5058 1117 The empty unit is seen not hoarded; window is dirty and dust is visible from the window.
5058 1110 The flooring arrows are faded and worn.
5490 1957 test
5022 732 There appears to be water damage on the dipped ceiling.
4825 833 The
5022 727 The information about where the stairs lead to is missing.
5022 732 The stairs walls are all blank. Information about what is at the top of the stairs needs to be added to those walls.
5022 732 The yellow exit sign painted on the wall is damaged above it and the paint is uneven and untidy.
5022 732 The yellow car park sign hanging from the ceiling is chipped at the lower left ledge.
5056 833 Ceiling access panels are still found missing.
5056 833 Main door is damaged on lower edge.
5022 732 There is yellow tape in a square shape left above the Tche Tche Cafe sign on the wall.
5056 833 Tiles panels are damaged.
The current output from the script is below.
Output:
WORD1 working
WORD2 correctly
WORD3 cover
WORD4 ac
WORD5 doors
WORD6 it
WORD7 full
WORD8 display
WORD9 parking
WORD10 heavily
WORD11 wooden
WORD12 for
WORD13 edges
WORD14 humidity
WORD15 cubicles
WORD16 fitted
WORD17 out
WORD18 room
WORD19 tree
WORD20 behind
WORD21 fence
WORD22 ok
WORD23 dusty
WORD24 cabinet
WORD25 along
WORD26 rusty
WORD27 overgrown
WORD28 as
WORD29 signs
WORD30 protruding
WORD31 painted
WORD32 fountain
WORD33 covered
WORD34 does
WORD35 dry
WORD36 availability
WORD37 lift
WORD38 operational
WORD39 severally
WORD40 poor
WORD41 found
WORD42 litter
WORD43 blistered
The expected result should be:
Output:
SysID ID Keywords
5067 2022 umbrella's, dusty, dirty.
5058 1110 Interlock, damaged, stained.
5058 1110 Gaps, flooring.
5058 1110 toilet, doors, chipped.
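To make the desired shape concrete, here is a minimal sketch of the per-row output, not a finished solution. It is written for Python 3 rather than the 2.7 of the script above. The names `STOPWORDS`, `extract_keywords` and `trim_rows` are placeholders invented for this illustration, and the stopword list is a small hand-picked assumption. Plain stopword removal will keep more words than the hand-chosen keywords in the expected result, so the filtering rule would still need tuning.

```python
import csv
import io
import re

# Assumption: a small hand-picked stopword list; extend it for real data.
STOPWORDS = {"the", "a", "an", "of", "is", "are", "and", "on", "in", "to",
             "between", "seen", "not", "by", "at", "has", "been", "be"}


def extract_keywords(comment):
    """Return the non-stopword tokens of a comment, in order, without repeats."""
    tokens = re.findall(r"[A-Za-z][A-Za-z']*", comment)
    seen = set()
    keywords = []
    for token in tokens:
        key = token.lower()
        if key in STOPWORDS or key in seen:
            continue
        seen.add(key)
        keywords.append(token)
    return keywords


def trim_rows(rows):
    """Yield (SysID, ID, keywords) for each row, leaving the first two columns intact."""
    for row in rows:
        if len(row) < 3:
            continue  # skip lines that do not have all three columns
        sys_id, item_id, comment = row[0], row[1], row[2]
        yield sys_id, item_id, ", ".join(extract_keywords(comment))


# Two tab-separated rows taken from the sample data set above.
sample = ("5058\t1110\tGaps are visible between Interlock flooring.\n"
          "5027\t737\tThe menu is damaged.\n")
rows = list(csv.reader(io.StringIO(sample), delimiter="\t", quoting=csv.QUOTE_NONE))
result = list(trim_rows(rows))
for sys_id, item_id, keywords in result:
    print(sys_id, item_id, keywords)
```

The point of the sketch is that each row is processed on its own and SysID and ID are carried through unchanged, instead of pooling all comments into one word count as the current script does.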
Thank you in advance; I hope someone can help.