May-24-2021, 02:23 PM
(This post was last modified: May-24-2021, 02:58 PM by Gribouillis.)
Hello everyone!
I was wondering if you could help a student out :)
My name is Marie and I am currently studying Strategic Management at Erasmus University Rotterdam. For my master's thesis I am investigating whether the relationship between media coverage and strategic change is contingent on the type of publication. I have downloaded several articles from the NexisUni database and I use a Python script to analyse the media sentiment of each article. The output file of the sentiment script (an Excel file) shows the date, file name, text, number of positive/negative words and the sentiment score. However, in order to distinguish the various media outlets, I also need to know the type of publication (e.g. The New York Times). The type of publication is currently not shown in the Excel file.
Do you have an idea how to include the variable 'publication type' in the Python code, so that the Excel sheet also shows the publication type for each article?
Please find attached the Python script I use. I downloaded it from GitHub: https://github.com/snauhaus/lexi_sent
I also attached an example of an Excel sheet with the analysis of two articles.
I hope that you are willing to take a look at this matter.
Best wishes,
Marie
Sentiments Example Find & Pivot 3.xlsx (Size: 12.72 KB / Downloads: 384)
#!/usr/bin/env python3
"""
"""
import pandas as pd
import numpy as np
import argparse
import os
import re  # Regex
# import nltk
import string
import io  # Handles encoding of text files


def janis_fadner(pos, neg):
    """Returns Janis-Fadner Coefficient of Imbalance"""
    jfci = [0] * len(pos)
    for i, (p, n) in enumerate(zip(pos, neg)):
        # Check the all-zero case first, otherwise (p + n)**2 divides by zero
        if p == 0 and n == 0:
            jfci[i] = 0
        elif p > n:
            jfci[i] = (p**2 - p * n) / (p + n)**2
        else:
            jfci[i] = (p * n - n**2) / (p + n)**2
    return jfci


def word_counter(words, text):
    """Vectorized string search"""
    total = [0] * len(text)  # One counter per document
    for i, txt in enumerate(text):
        for word in words:
            if word in txt:
                total[i] = total[i] + 1
    return total


def sentiment_analysis(df, wordlist):
    """Sentiment analysis routine using Janis-Fadner coefficient of imbalance"""
    # Get wordlist
    pos_words = wordlist[wordlist['sentiment'] > 0]['token'].to_list()
    neg_words = wordlist[wordlist['sentiment'] < 0]['token'].to_list()
    # Calculate sentiment
    df['PositiveWords'] = word_counter(pos_words, df['Text'])
    df['NegativeWords'] = word_counter(neg_words, df['Text'])
    df['Sentiment'] = janis_fadner(df['PositiveWords'], df['NegativeWords'])
    return df


def clean_doc(doc):
    """Cleans a document, extracts meta data, returns a dictionary"""
    # Split header and text
    topmarker = "Body"
    if re.search("\n" + topmarker + ".?\n", doc) is not None:
        headersplit = re.split("\n" + topmarker + ".?\n", doc)
        header = headersplit[0]
        body = headersplit[1]
        cleaned = 1
    else:
        body = doc
        header = ''
        cleaned = 0
    # Try getting the date
    try:
        dateresult = re.findall(r'\n\s{5}.*\d+.*\d{4}\s', header, flags=re.IGNORECASE)
        if header:
            dateresult += re.findall(r'\w+\s\d+.*\d{4}', header)
            dateresult += re.findall(r'\w+\s*\d{4}', header)
        date = dateresult[0].strip()
    except IndexError:  # No date pattern matched
        date = ''
    # Clean text body
    # words = nltk.word_tokenize(body)  # Tokenize words
    words = body.split()
    words = [w.lower() for w in words]  # Lowercase everything
    words = list(set(words))  # Unique words only
    words = [w for w in words if w.isalpha()]  # Letters only
    nb_words = len(words)
    words = ' '.join(words)
    # Collect results
    cleaned_doc = {
        'Text': words,
        'Date': date,
        'UniqueWords': nb_words,
    }
    return cleaned_doc


def folder_import(path):
    """Imports each document in path, cleans it, and collects the results in a data frame"""
    files = os.listdir(path)
    # Text files only
    files = [f for f in files if f.split(".")[-1] == "txt"]
    # Loop through files in folder
    rows = []
    for f in files:
        # Read file
        with io.open(os.path.join(path, f), 'r', encoding='latin1') as fh:
            fp = fh.read()
        # Clean file
        fp_clean = clean_doc(fp)
        # Add file name to results
        fp_clean['File'] = f
        rows.append(fp_clean)
    # DataFrame.append was removed in pandas 2.0; build the frame from a list of dicts
    df = pd.DataFrame(rows)
    return df


def main():
    # Get command line arguments
    parser = argparse.ArgumentParser(description='Perform sentiment analysis on a list of documents.')
    parser.add_argument('input', type=str, nargs=1,
                        help='A CSV file with a single column, containing the text to one document per row')
    # Note: the original script called group.add_argument() without defining the
    # (commented-out) mutually exclusive group, which raises a NameError; adding
    # the options directly to the parser fixes this. nargs=1 is also dropped so
    # the parsed values are plain strings, like the defaults.
    parser.add_argument('-w', '--wordlist', required=False, default='MPQA.csv',
                        help='CSV file containing a word list with positive and negative words. '
                             'Default is the MPQA word list, which ships with this script. '
                             'Different files must follow the same format.')
    parser.add_argument('-o', '--output', required=False, default='Sentiments.xlsx',
                        help='Name for output file. Defaults to "Sentiments.xlsx"')
    # Parse arguments
    args = vars(parser.parse_args())
    input_arg = args['input'][0]
    wordlist_file = args['wordlist']
    output_file = args['output']
    # Download nltk's punkt if missing
    # try:
    #     nltk.data.find('tokenizers/punkt')
    # except LookupError:
    #     nltk.download('punkt')
    # Import text data
    if input_arg.split(".")[-1] == "csv":  # Check if input is a csv file
        text_data = pd.read_csv(input_arg, names=["Text"], encoding='latin1')
    elif os.path.isdir(input_arg):  # Check if input is a folder
        text_data = folder_import(input_arg)
    else:
        raise ValueError("input should be path to a folder or csv file")
    # Import wordlist
    wordlist = pd.read_csv(wordlist_file)
    # Sentiment analysis
    results = sentiment_analysis(text_data, wordlist)
    # Export results
    results.to_excel(output_file, index=False)


if __name__ == '__main__':
    main()
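[Editor's note] Regarding the publication-type question: in the script above, clean_doc() already separates the NexisUni header (everything before the "Body" marker) from the article text, so the outlet name could be pulled from that header and added to the returned dictionary, which would make it appear as an extra column in the exported Excel sheet. The sketch below is only a suggestion, not part of the original script; the helper name extract_publication and the assumption that the outlet name is the first non-blank line of the header are mine, so it should be checked against the actual NexisUni export format.

```python
def extract_publication(header):
    """Return a best guess at the publication name from a NexisUni header.

    Assumption (verify against your exports): the outlet name, e.g.
    'The New York Times', is the first non-blank line above the 'Body' marker.
    """
    for line in header.splitlines():
        line = line.strip()
        if line:
            return line
    return ''  # No header available (e.g. the 'Body' marker was not found)

# Inside clean_doc(), just before 'return cleaned_doc', one could then add:
#     cleaned_doc['Publication'] = extract_publication(header)
# so that to_excel() writes a 'Publication' column automatically.
```

Since folder_import() copies every key of the dictionary into the data frame, no other part of the script would need to change.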





Gribouillis write May-24-2021, 02:58 PM:
Please post all code, output and errors (in their entirety) between their respective tags. Refer to the BBCode help topic on how to post. Use the "Preview Post" button to make sure the code is presented as you expect before hitting the "Post Reply/Thread" button.
I fixed it for you this time. Please use code tags in future posts.