Python Forum
Adding variable to Python code - media sentiment analysis
#1
Hello everyone!

I was wondering if you could help a student out :)

My name is Marie and I am currently studying Strategic Management at Erasmus University Rotterdam. For my master's thesis I am investigating whether the relationship between media coverage and strategic change is contingent on the type of publication. I have downloaded several articles from the NexisUni database and use a Python script to analyse the media sentiment of each article. The output of the sentiment script (an Excel file) shows the date, file name, text, number of positive/negative words and the sentiment score. However, in order to distinguish the media outlets, I also need to know the type of publication (e.g. The New York Times). The publication type is currently not shown in the Excel file.

Do you have an idea how to include the variable 'publication type' into the python code so that the Excel sheet will also show the publication type for the articles?

Please find attached the python script I use. I have downloaded the script from Github: https://github.com/snauhaus/lexi_sent
I also attached an example of an Excel sheet with the analysis of two articles.

I hope that you are willing to take a look at this matter :)

Best wishes,

Marie

#!/usr/bin/env python3
"""

"""
import pandas as pd
import numpy as np
import argparse
import os
import re # Regex
# import nltk
import string
import io # Handles encoding of text files

def janis_fadner(pos, neg):
    """Returns Fanis-Fadner Coefficient of Imbalance"""
    jfci = [0]*len(pos)
    for i, (p, n) in enumerate(zip(pos, neg)):
        if p > n:
            jfci[i] = (p**2 - p * n) / (p + n)**2
        elif p == 0 and n == 0:  # 'and', not '&': bitwise & binds tighter than ==
            jfci[i] = 0
        else:
            jfci[i] = (p * n - n**2) / (p + n)**2
    return jfci


def word_counter(words, text):
    """Vectorized string search"""
    total = [0]*len(text) # Empty list
    for i, txt in enumerate(text):
        for word in words:
            if word in txt:
                total[i] = total[i] + 1
    return total


def sentiment_analysis(df, wordlist):
    """Sentiment analysis routine using janis fadner coefficient of imbalance"""

    # Get wordlist
    pos_words = wordlist[wordlist['sentiment'] > 0]['token'].to_list()
    neg_words = wordlist[wordlist['sentiment'] < 0]['token'].to_list()
       
    # Calculate sentiment
    df['PositiveWords'] = word_counter(pos_words, df['Text'])
    df['NegativeWords'] = word_counter(neg_words, df['Text'])
    df['Sentiment'] = janis_fadner(df['PositiveWords'], df['NegativeWords'])

   
    return df


def clean_doc(doc):
    """Cleans a document, extracts meta data, returns a dictionary"""
   
    # Split header and text
    topmarker = "Body"
    if re.search("\n" + topmarker + ".?\n", doc) is not None:
        headersplit = re.split("\n" + topmarker + ".?\n", doc)
        header = headersplit[0]
        body = headersplit[1]
        cleaned = 1
    else:
        body = doc
        header = ''
        cleaned = 0

    # Try getting the date
    try:
        dateresult = re.findall(r'\n\s{5}.*\d+.*\d{4}\s', header, flags=re.IGNORECASE)
        if header:
            dateresult += re.findall(r'\w+\s\d+.*\d{4}', header)
            dateresult += re.findall(r'\w+\s*\d{4}', header)
        date = dateresult[0].strip()
    except IndexError:  # No date pattern matched
        date = ''

    # Clean text body
    # words = nltk.word_tokenize(body) # Tokenize words
    words = body.split()
    words = [w.lower() for w in words] # Lowercase everything
    words = list(set(words)) # Unique words only
    words = [w for w in words if w.isalpha()] # Letters only
    nb_words = len(words)
    words = ' '.join(words)

    # Collect results
    cleaned_doc = {
        'Text': words,
        'Date': date,
        'UniqueWords': nb_words
    }
   
    return cleaned_doc


def folder_import(path):
    """Function imports each document in path, cleans it, and appends to a data frame"""
    files = os.listdir(path)
    # Text files only
    files = [f for f in files if f.split(".")[-1]=="txt"]
    # Accumulate one dict per file, then build the frame once
    # (DataFrame.append was removed in pandas 2.0)
    rows = []
    # Loop through files in folder
    for f in files:
        # Read file
        with io.open(os.path.join(path, f), 'r', encoding='latin1') as fh:
            doc = fh.read()
        # Clean file
        fp_clean = clean_doc(doc)
        # Add file name to results
        fp_clean['File'] = f
        rows.append(fp_clean)
    return pd.DataFrame(rows)
   

def main():
    # Get command line arguments
    parser = argparse.ArgumentParser(description='Perform sentiment analysis on a list of documents.')
    parser.add_argument('input', type=str, nargs=1, help='A CSV file with a single column, containing the text of one document per row, or a folder of .txt files')
    # -w and -o are independent options, so no mutually exclusive group;
    # nargs=1 is dropped so the parsed value is a plain string, like the default
    parser.add_argument('-w', '--wordlist', default='MPQA.csv', help='CSV file containing a word list with positive and negative words. Default is the MPQA word list, which ships with this script. Different files must follow the same format.')
    parser.add_argument('-o', '--output', default='Sentiments.xlsx', help='Name for output file. Defaults to "Sentiments.xlsx"')

    # Parse arguments
    args = vars(parser.parse_args())
    input_arg = args['input'][0]
    wordlist_file = args['wordlist']
    output_file = args['output']
       
    # Download nltk's punkt if missing
    # try:
    #     nltk.data.find('tokenizers/punkt')
    # except LookupError:
    #     nltk.download('punkt')
   
    # Import text data
    if input_arg.split(".")[-1]=="csv": # Check if input is csv file
        text_data = pd.read_csv(input_arg, names=["Text"], encoding='latin1')
    elif os.path.isdir(input_arg): # Check if input is a folder
        text_data = folder_import(input_arg)
    else:
        raise ValueError("input should be path to a folder or csv file")
       
    # Import wordlist
    wordlist = pd.read_csv(wordlist_file)
   
    # Sentiment analysis
    results = sentiment_analysis(text_data, wordlist)
   
    # Export results
    results.to_excel(output_file, index=False)


if __name__ == '__main__':
    main()
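For reference, here is a quick sanity check of the Janis-Fadner coefficient as implemented above (the toy counts are made up for illustration). The coefficient lands in [-1, 1]: positive when positive words dominate, negative when negative words dominate, and 0 for an empty or balanced extreme case.

```python
def janis_fadner(pos, neg):
    """Janis-Fadner coefficient of imbalance, per document."""
    jfci = [0] * len(pos)
    for i, (p, n) in enumerate(zip(pos, neg)):
        if p > n:
            jfci[i] = (p**2 - p * n) / (p + n)**2
        elif p == 0 and n == 0:
            jfci[i] = 0
        else:
            jfci[i] = (p * n - n**2) / (p + n)**2
    return jfci

# 10 pos / 2 neg -> positive score; 0/0 -> 0; 2 pos / 10 neg -> mirrored negative score
scores = janis_fadner([10, 0, 2], [2, 0, 10])
print(scores)  # -> [0.5555555555555556, 0, -0.5555555555555556]
```

Note the symmetry: swapping the positive and negative counts flips the sign of the score, which is what makes it usable as a sentiment measure.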

.xlsx   Sentiments Example Find & Pivot 3.xlsx (Size: 12.72 KB / Downloads: 258)
Gribouillis write May-24-2021, 02:58 PM:
Please post all code, output and errors (in its entirety) between their respective tags. Refer to the BBCode help topic on how to post. Use the "Preview Post" button to make sure the code is presented as you expect before hitting the "Post Reply/Thread" button.

I fixed for you this time. Please use code tags on future posts.
Reply
#2
Where would the script get the type/name of publication? Is it in the text file? Input for each file from keyboard?
Gribouillis likes this post
Reply
#3
Hello Jef! Thanks a lot for your response.

The input should be a CSV file with a single unnamed column, containing one document per row (see test.csv). The folder import expects the input to be a folder; only .txt files can be imported, and each article has its own text file. The text file contains all the necessary information, such as the publication type/name and the date of publication.
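Since the publication name is already in each file's header (above the "Body" marker that `clean_doc` splits on), one option is a small helper that pulls it out. This is only a sketch under the assumption that NexisUni exports put the outlet name on the first non-empty header line; the function name and sample header below are hypothetical, so check it against your actual files.

```python
def extract_publication(header):
    """Return the first non-empty header line, assumed to be the outlet name."""
    for line in header.splitlines():
        line = line.strip()
        if line:
            return line
    return ''

# Hypothetical header layout resembling a NexisUni export
sample_header = "\n     The New York Times\n     May 24, 2021 Monday\n"
pub = extract_publication(sample_header)
print(pub)  # -> The New York Times
```

Inside `clean_doc`, the returned dictionary could then gain a `'Publication': extract_publication(header)` entry, so the column flows through to the Excel output like `Date` does.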
Reply
#4
Suggest that somewhere around line 104 you create a column in df that contains the filenames (files)
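To sketch that idea: once the results frame exists, a `Publication` column can be derived from the filenames in one vectorized step. The naming scheme below (outlet tag before an underscore) is an assumption for illustration, not something the script enforces, so adapt the split to however your files are actually named.

```python
import pandas as pd

# Stand-in rows mimicking the script's output frame (values are made up)
rows = [
    {'File': 'NYT_2021-05-01.txt', 'Sentiment': 0.4},
    {'File': 'WSJ_2021-05-02.txt', 'Sentiment': -0.2},
]
df = pd.DataFrame(rows)

# Assumed convention: "<OUTLET>_<date>.txt" -> take the part before "_"
df['Publication'] = df['File'].str.split('_').str[0]
print(df['Publication'].tolist())  # -> ['NYT', 'WSJ']
```

Because `results.to_excel(output_file, index=False)` writes every column of the frame, the new column would appear in the Excel sheet with no further changes.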
Reply


