Need Help with Simple Text Reformatting Problem

MattTuck · Aug-14-2017, 06:05 AM

So, I'm trying to reformat the text of Shakespeare's plays so that every monologue is only one line long.

For example, this is what I am given:

"<STAGEDIR>Exeunt MARK ANTONY and CLEOPATRA with
their train</STAGEDIR>

<SPEECH>
<SPEAKER>DEMETRIUS</SPEAKER>
<LINE>Is Caesar with Antonius prized so slight?</LINE>
</SPEECH>

<SPEECH>
<SPEAKER>PHILO</SPEAKER>
<LINE>Sir, sometimes, when he is not Antony,</LINE>
<LINE>He comes too short of that great property</LINE>
<LINE>Which still should go with Antony.</LINE>
</SPEECH>

<SPEECH>
<SPEAKER>DEMETRIUS</SPEAKER>..."

And this is what I want:

"Is Caesar with Antonius prized so slight?

Sir, sometimes, when he is not Antony, he comes too short of that great property which still should go with Antony.

..."

So far I have been able to get this:

"Is Caesar with Antonius prized so slight?

Sir, sometimes, when he is not Antony,
He comes too short of that great property
Which still should go with Antony.

..."

And here is my code thus far...

import re

ss = ["a_and_c.txt", "dream.txt", "hamlet.txt", "j_caesar.txt", "macbeth.txt", "merchant.txt", "othello.txt",
      "r_and_j.txt"]
size = len(ss)

f = open(
    "/Users/Tuck/Documents/PyCharm_PythonPrograms/ChatBot_Test/corpus/shakespeare/ShakespearsDialogsPreProcessed.txt",
    'w')
print f

i = 0
while (i < 8):
    print ss[i]
    path = "/Users/Tuck/Documents/PyCharm_PythonPrograms/ChatBot_Test/corpus/shakespeare/" + ss[i]
    print ("working on: " + path)
    p = open(path, 'r')
    for line in p:
        if "</SPEECH>" in line:
            f.write('\n')
            print line
        if "<LINE>" in line:
            line = line.split("<LINE>")
            line = "".join(str(x) for x in line)
            _line = re.sub(r"</LINE>", " ", line)
            f.write(_line)
        else:
            print "moving to next line!"
    i += 1  # advance to the next file once, not once per line read

Any thoughts/suggestions?

Barrowman · Aug-14-2017, 08:19 AM

Where you have:

        if "<LINE>" in line:
            line = line.split("<LINE>")
            line = "".join(str(x) for x in line)
            _line = re.sub(r"</LINE>", " ", line)
            f.write(_line)

You have not stripped the end-of-line characters, so every <LINE> is still written out with its newline.
I have not created all the little files you have, but try this:

        if "<LINE>" in line:
            line = line.split("<LINE>")
            line = "".join(str(x) for x in line)
            _line = re.sub(r"</LINE>", " ", line)
            _line = _line.rstrip('\r\n')  # drop the trailing newline
            f.write(_line)

I just created the one file with all of your text in it so it might need some adjustment.
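For anyone following along, here is a minimal self-contained sketch of the same idea: collect the text of each <LINE>, and write it out as one joined line when </SPEECH> is reached. The function name speeches_to_lines and the sample text are made up for illustration, and it assumes each <LINE>...</LINE> sits on a single input line, as in the excerpt above. It does not lower-case the first word of each verse line, which the desired output in the first post does.

```python
import re

# Matches the text between <LINE> and </LINE> on a single input line.
LINE_RE = re.compile(r"<LINE>(.*?)</LINE>")


def speeches_to_lines(lines):
    """Yield one string per <SPEECH>, with its <LINE> texts joined by spaces."""
    parts = []
    for raw in lines:
        match = LINE_RE.search(raw)
        if match:
            # The trailing newline lies outside the match, so it is dropped here.
            parts.append(match.group(1).strip())
        elif "</SPEECH>" in raw:
            if parts:
                yield " ".join(parts)
            parts = []


# Hypothetical sample standing in for one of the play files.
sample = """<SPEECH>
<SPEAKER>DEMETRIUS</SPEAKER>
<LINE>Is Caesar with Antonius prized so slight?</LINE>
</SPEECH>

<SPEECH>
<SPEAKER>PHILO</SPEAKER>
<LINE>Sir, sometimes, when he is not Antony,</LINE>
<LINE>He comes too short of that great property</LINE>
<LINE>Which still should go with Antony.</LINE>
</SPEECH>
""".splitlines(True)

result = list(speeches_to_lines(sample))
for speech in result:
    print(speech)
```

Because the function accepts any iterable of lines, an open file object can be passed in directly in place of the sample list.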

wavic · Aug-14-2017, 09:58 AM

I'd use the BeautifulSoup4 module. You can get all the <SPEECH> tags, get all the <LINE> tags inside each one, join them, and print the result.
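The tag-by-tag approach described here (find every <SPEECH>, find every <LINE> inside it, join) can be sketched roughly as follows. Note that this sketch swaps in the standard library's re module in place of BeautifulSoup so that it runs without installing anything; it is not the BeautifulSoup API. The name flatten_speeches and the sample text are assumptions for illustration, and markup nested inside a <LINE> is not handled.

```python
import re

# DOTALL lets "." span newlines, so a whole multi-line speech is captured.
SPEECH_RE = re.compile(r"<SPEECH>(.*?)</SPEECH>", re.DOTALL)
LINE_RE = re.compile(r"<LINE>(.*?)</LINE>", re.DOTALL)


def flatten_speeches(text):
    """Return a list with one joined string per <SPEECH> block in text."""
    flattened = []
    for speech in SPEECH_RE.findall(text):
        # Normalise internal whitespace in each line, then join the lines.
        lines = [" ".join(line.split()) for line in LINE_RE.findall(speech)]
        if lines:
            flattened.append(" ".join(lines))
    return flattened


# Hypothetical sample; the stage direction is skipped because it is not a LINE.
text = """<STAGEDIR>Exeunt MARK ANTONY and CLEOPATRA with
their train</STAGEDIR>

<SPEECH>
<SPEAKER>DEMETRIUS</SPEAKER>
<LINE>Is Caesar with Antonius prized so slight?</LINE>
</SPEECH>

<SPEECH>
<SPEAKER>PHILO</SPEAKER>
<LINE>Sir, sometimes, when he is not Antony,</LINE>
<LINE>He comes too short of that great property</LINE>
<LINE>Which still should go with Antony.</LINE>
</SPEECH>
"""

speeches = flatten_speeches(text)
print("\n".join(speeches))
```

Unlike a line-by-line loop, this reads the whole file into memory first, which should be fine for files the size of a single play.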

MattTuck · Aug-14-2017, 09:56 PM

Sweet... so... new problem... related to this...

I now have two files of different formats:
1.)"LinesfromDialogs.txt"
2.)"ShakespearsDialogs.txt"

In #1 the format is as follows-
"no no it is my fault we did not have a proper introduction
cameron

cameron
the thing is cameron i am at the mercy of a particularly hideous breed of loser my sister i cannot date until she does

the thing is cameron i am at the mercy of a particularly hideous breed of loser my sister i cannot date until she does
seems like she could get a date easy enough"

In #2, we have the following format:
"Nay, but this dotage of our general's O'erflows the measure: those his goodly eyes, That o'er the files and musters of the war Have glow'd like plated Mars, now bend, now turn, The office and devotion of their view Upon a tawny front: his captain's heart, Which in the scuffles of great fights hath burst The buckles on his breast, reneges all temper, And is become the bellows and the fan To cool a gipsy's lust. Look, where they come: Take but good note, and you shall see in him. The triple pillar of the world transform'd Into a strumpet's fool: behold and see.
If it be love indeed, tell me how much."

I am now trying to merge them into a single .txt file, using the same format as #2:

f = open("/Users/Tuck/Documents/PyCharm_PythonPrograms/ChatBot_Test/srcs/trainingdata.txt", 'w')
f2 = open("/Users/Tuck/Documents/PyCharm_PythonPrograms/ChatBot_Test/corpus/cornell movie-dialogs corpus/"
          "LinesfromDialogs.txt", 'r')
f3 = open("/Users/Tuck/Documents/PyCharm_PythonPrograms/ChatBot_Test/corpus/shakespeare/ShakespearsDialogs.txt", 'r')

for line in f3:
    f.write(line)
for i, line in f2:
    if (i % 3 == 1):
        f.write(line)

BUT! I keep getting the following error:

Traceback (most recent call last):
  File "/Users/Tuck/Documents/PyCharm_PythonPrograms/ChatBot_Test/CleanUpCorpras.py", line 19, in <module>
    for i, line in f2:
ValueError: too many values to unpack

How can I fix this...?

Barrowman · Aug-14-2017, 10:05 PM

Not sure, as I don't use Windows, but

Quote:
f2 = open("/Users/Tuck/Documents/PyCharm_PythonPrograms/ChatBot_Test/corpus/cornell movie-dialogs corpus/"
          "LinesfromDialogs.txt", 'r')

contains a space between "cornell" and "movie",
so it may be taking them as two different values.

MattTuck · (This post was last modified: Aug-14-2017, 10:09 PM by MattTuck.)

Never mind. I found a workaround... I'm telling ya, I'm having a hell of a day with these trivial problems...

I went back to the .py I used to make "LinesfromDialogs.txt" and edited it to match the data from Shakespeare's.
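For anyone who lands here with the same ValueError: the traceback points at the for statement rather than at open(), so the space in the path is probably not the cause. Iterating over a file yields one string per line, and "for i, line in f2" tries to unpack each of those strings into two names. Wrapping the file in enumerate() supplies the index. Below is a minimal sketch; the helper name pick_every_third and the in-memory sample are made up for illustration, and the "i % 3 == 1" test is kept from the post above on the assumption that it selects the intended lines.

```python
import io


def pick_every_third(lines, offset=1):
    """Return the lines whose 0-based index i satisfies i % 3 == offset."""
    # enumerate() pairs each line with its index, so the two-name unpacking works.
    return [line for i, line in enumerate(lines) if i % 3 == offset]


# In-memory stand-in for "LinesfromDialogs.txt" (hypothetical content).
f2 = io.StringIO(
    u"no no it is my fault\n"
    u"cameron\n"
    u"\n"
    u"cameron\n"
    u"the thing is cameron\n"
    u"\n"
)

picked = pick_every_third(f2)
for line in picked:
    print(line.rstrip("\n"))
```

With a real file, the loop in the post would become "for i, line in enumerate(f2):" and the rest can stay as written.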
