Python Forum

Full Version: How to summarize an article that is stored in a word document on your laptop?
I'm new here. I wrote some code in PyCharm that summarizes online articles. The code is below and it works fine. What if I want to summarize an article that is stored in a Word document on my laptop? Can somebody help me with the code? I am using Anaconda Prompt and PyCharm.

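import nltk
from newspaper import Article

# article.nlp() relies on NLTK's sentence tokenizer data.
# If it is missing, download it once:
# nltk.download("punkt")

url = "https://www.news.com/index.html"

article = Article(url)

# Fetch the page, then extract the title, authors, date and body text.
article.download()
article.parse()

# Run keyword extraction and summarization on the parsed text.
article.nlp()

print(f'Title: {article.title}')
print(f'Authors: {article.authors}')
print(f'Publication Date: {article.publish_date}')
print(f'Summary: {article.summary}')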
Word documents carry metadata. You can access that, if that is what you are looking for.

I think a Word document will only have a title, author name, etc. if the author actually puts that data in the document metadata.

For personal stuff, I don't think many people will do that.

Maybe the created and modified dates are recorded automatically.

I copied this from Stack Overflow:

# if you don't have it, first install python-docx module: pip3 install python-docx
import docx

path2file = "/home/pedro/myStuff/mydocument1.docx"

def getMetaData(doc):
    metadata = {}
    prop = doc.core_properties
    metadata["author"] = prop.author
    metadata["category"] = prop.category
    metadata["comments"] = prop.comments
    metadata["content_status"] = prop.content_status
    metadata["created"] = prop.created
    metadata["identifier"] = prop.identifier
    metadata["keywords"] = prop.keywords
    metadata["last_modified_by"] = prop.last_modified_by
    metadata["language"] = prop.language
    metadata["modified"] = prop.modified
    metadata["subject"] = prop.subject
    metadata["title"] = prop.title
    metadata["version"] = prop.version
    return metadata

doc = docx.Document(path2file)
metadata_dict = getMetaData(doc)
for item in metadata_dict.items():
    print(item)
Sometimes I want to get the text from .docx files. I never needed the metadata!
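For getting the text rather than the metadata, something like the sketch below might do. It assumes python-docx is installed; the function names document_text and read_docx_text are made up for this example, and only doc.paragraphs and paragraph.text come from the library.

```python
def document_text(doc):
    """Join the non-empty paragraphs of an opened python-docx Document."""
    return "\n".join(p.text for p in doc.paragraphs if p.text.strip())


def read_docx_text(path):
    """Open a .docx file and return its body text as one string."""
    # Imported here so document_text() can be used on its own.
    import docx  # pip3 install python-docx

    return document_text(docx.Document(path))


# Example call (the path is the one used earlier in this thread):
# text = read_docx_text("/home/pedro/myStuff/mydocument1.docx")
# print(text[:500])
```

Note that doc.paragraphs only covers the body paragraphs, so text inside tables, headers and footers is not included by this sketch.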
This code basically pulls just high-level information. I will try to write new code and will post it when done. Thanks so much, Pedro!
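In case it helps as a starting point, here is a rough sketch of a summarizer that works on plain text, assuming the document's text is already in a string. It is not what newspaper does internally. It is a simple word-frequency ranking swapped in for illustration, and the names STOP_WORDS, split_sentences and summarize are invented for this example.

```python
import re
from collections import Counter

# A small hand-picked stop-word list (an assumption; NLTK ships a fuller one).
STOP_WORDS = {
    "a", "an", "and", "are", "as", "at", "be", "by", "for", "from", "in",
    "is", "it", "of", "on", "or", "that", "the", "this", "to", "was",
    "were", "with",
}


def split_sentences(text):
    """Naive split after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def summarize(text, max_sentences=3):
    """Return the highest-scoring sentences, kept in their original order."""
    sentences = split_sentences(text)
    words = re.findall(r"[a-z']+", text.lower())
    freq = Counter(w for w in words if w not in STOP_WORDS)

    def score(sentence):
        # A sentence scores the summed document frequency of its words.
        tokens = re.findall(r"[a-z']+", sentence.lower())
        return sum(freq[t] for t in tokens if t not in STOP_WORDS)

    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(sentences[i]), reverse=True)
    keep = sorted(ranked[:max_sentences])
    return " ".join(sentences[i] for i in keep)
```

Calling summarize(text, max_sentences=5) on the text read from the Word document would give a five-sentence extract. Long sentences tend to score higher with this scheme, so dividing the score by the sentence length might be worth trying.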