Feb-05-2019, 11:32 PM
Hello folks,
I am currently running some experiments in the area of topic modeling. For this I am using code from a tutorial. So far everything works, but I would like to make one adjustment that I am stuck on at the moment.
```python
#SOURCE: https://www.machinelearningplus.com/nlp/topic-modeling-gensim-python/
def format_topics_sentences(ldamodel=lda_model, corpus=corpus, texts=data):
    # Init output
    sent_topics_df = pd.DataFrame()

    # Get main topic in each document
    for i, row in enumerate(ldamodel[corpus]):
        row = sorted(row, key=lambda x: (x[1]), reverse=True)
        # Get the Dominant topic, Perc Contribution and Keywords for each document
        for j, (topic_num, prop_topic) in enumerate(row):
            if j == 0:  # => dominant topic
                wp = ldamodel.show_topic(topic_num)
                topic_keywords = ", ".join([word for word, prop in wp])
                sent_topics_df = sent_topics_df.append(pd.Series([int(topic_num), round(prop_topic,4), topic_keywords]), ignore_index=True)
            else:
                break
    sent_topics_df.columns = ['Dominant_Topic', 'Perc_Contribution', 'Topic_Keywords']

    # Add original text to the end of the output
    contents = pd.Series(texts)
    sent_topics_df = pd.concat([sent_topics_df, contents], axis=1)
    return(sent_topics_df)


df_topic_sents_keywords = format_topics_sentences(ldamodel=optimal_model, corpus=corpus, texts=data)

# Format
df_dominant_topic = df_topic_sents_keywords.reset_index()
df_dominant_topic.columns = ['Document_No', 'Dominant_Topic', 'Topic_Perc_Contrib', 'Keywords', 'Text']

#Save .CSV File
df_dominant_topic.to_csv('OUTPUT/Topic_Overview.csv')
#Save .XLSX File
df_dominant_topic.to_excel('OUTPUT/Topic_Overview.xlsx', 'Data_Overview')

# Show
df_dominant_topic.head()
```
At the moment, a pandas DataFrame is written to an Excel file with the following six columns:
Column 1: Continuous Index
Column 2: Document number
Column 3: Topic number
Column 4: Topic percentage
Column 5: Keywords on the corresponding topic
Column 6: Document
[Image: 1zyd7jk.jpg]
Screenshot_01
At the moment, only the topic with the highest percentage out of the x topics is shown for each document (Screenshot_01).
Example: for document 0, topic 1 is listed with 0.5491 because that is the highest of the Topic_Perc_Contrib values across all x topics for this document.
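To make the selection step concrete, here is a minimal sketch of what the sort and the `j == 0` check in `format_topics_sentences` do for a single document. The topic/percentage pairs are made up for illustration; only the value 0.5491 for topic 1 is taken from the example above.

```python
# Made-up (topic_num, prop_topic) pairs for one document.
# Only the 0.5491 for topic 1 comes from the post; the rest is illustrative.
row = [(0, 0.2003), (1, 0.5491), (2, 0.1506), (3, 0.1)]

# Same sort as in format_topics_sentences: highest proportion first.
row_sorted = sorted(row, key=lambda x: x[1], reverse=True)

# The loop only keeps j == 0, i.e. the first pair after sorting.
dominant_topic, dominant_perc = row_sorted[0]
print(dominant_topic, dominant_perc)

# Every pair after the first is skipped by the "else: break" branch.
discarded = row_sorted[1:]
print(discarded)
```

So the other percentages are available in each row; they are just not written to the DataFrame.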
What I would like is a variable that sets how many topics there are, and then to output every topic with its percentage for each document (Screenshot_02).
Here is a manually created example with 4 topics, but I would like to be able to change this number so that the output changes accordingly. Of course, this should be repeated for all documents.
[Image: 2zebcwm.png]
Screenshot_02
Is there someone who can quickly see what needs to change and adapt the code for me?
Thank you for your answers.
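One possible direction, as a rough sketch rather than a finished solution: keep every (topic, percentage) pair instead of only the first one, and fill in 0.0 for topics the model did not report for a document. The function name `all_topics_per_document`, the `Topic_<n>_Perc` column names, and the one-column-per-topic layout are assumptions made for illustration, since the exact layout of Screenshot_02 cannot be recovered from the text. The input here is made-up data standing in for what `ldamodel[corpus]` yields, so the sketch runs without a trained model.

```python
def all_topics_per_document(doc_topics, texts, num_topics):
    """Build one row per document with a percentage column for every topic.

    doc_topics: one list of (topic_num, prop_topic) pairs per document.
    num_topics: the variable that controls how many topic columns appear.
    """
    rows = []
    for doc_no, (topics, text) in enumerate(zip(doc_topics, texts)):
        # Topics missing from the model's sparse output count as 0.0.
        perc = dict(topics)
        row = {"Document_No": doc_no}
        for topic_num in range(num_topics):
            row[f"Topic_{topic_num}_Perc"] = round(perc.get(topic_num, 0.0), 4)
        row["Text"] = text
        rows.append(row)
    return rows


# Made-up distributions standing in for the output of ldamodel[corpus].
example_doc_topics = [
    [(0, 0.2003), (1, 0.5491), (2, 0.1506), (3, 0.1)],
    [(1, 0.3), (3, 0.7)],  # topics 0 and 2 are absent from the sparse output
]
example_texts = ["first document", "second document"]

rows = all_topics_per_document(example_doc_topics, example_texts, num_topics=4)
for row in rows:
    print(row)
```

With a real model, `doc_topics` could be `list(ldamodel[corpus])`, provided each item is a plain list of (topic, percentage) pairs as the posted code assumes. Passing the rows to `pd.DataFrame(rows)` should then give a table that can be saved with `to_csv` and `to_excel` in the same way as `df_dominant_topic`.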
