Python Forum
Need Help with Simple Text Reformatting Problem
#1
So, I'm trying to reformat text from Shakespeare's plays so that every monologue is only one line long.

For example, this is what I am given:


"<STAGEDIR>Exeunt MARK ANTONY and CLEOPATRA with
their train</STAGEDIR>

<SPEECH>
<SPEAKER>DEMETRIUS</SPEAKER>
<LINE>Is Caesar with Antonius prized so slight?</LINE>
</SPEECH>

<SPEECH>
<SPEAKER>PHILO</SPEAKER>
<LINE>Sir, sometimes, when he is not Antony,</LINE>
<LINE>He comes too short of that great property</LINE>
<LINE>Which still should go with Antony.</LINE>
</SPEECH>

<SPEECH>
<SPEAKER>DEMETRIUS</SPEAKER>..."


And this is what I want:

"Is Caesar with Antonius prized so slight?

Sir, sometimes, when he is not Antony, he comes too short of that great property which still should go with Antony.

..."


So far I have been able to get this:

"Is Caesar with Antonius prized so slight?

Sir, sometimes, when he is not Antony,
He comes too short of that great property
Which still should go with Antony.

..."



And here is my code thus far...

import re

ss = ["a_and_c.txt", "dream.txt", "hamlet.txt", "j_caesar.txt", "macbeth.txt", "merchant.txt", "othello.txt",
      "r_and_j.txt"]
size = len(ss)

f = open(
    "/Users/Tuck/Documents/PyCharm_PythonPrograms/ChatBot_Test/corpus/shakespeare/ShakespearsDialogsPreProcessed.txt",
    'w')
print f

i = 0
while (i < 8):
    print ss[i]
    path = "/Users/Tuck/Documents/PyCharm_PythonPrograms/ChatBot_Test/corpus/shakespeare/" + ss[i]
    print ("working on: " + path)
    p = open(path, 'r')
    for line in p:
        if "</SPEECH>" in line:
            f.write('\n')
            print line
        if "<LINE>" in line:
            line = line.split("<LINE>")
            line = "".join(str(x) for x in line)
            _line = re.sub(r"</LINE>", " ", line)
            f.write(_line)
        else:
            print "moving to next line!"
    i += 1  # advance to the next file only after the current one has been fully read
Any thoughts/suggestions?
#2
Where you have:
        if "<LINE>" in line:
            line = line.split("<LINE>")
            line = "".join(str(x) for x in line)
            _line = re.sub(r"</LINE>", " ", line)
            f.write(_line)
You have not taken out the end-of-line characters, so every <LINE> still ends with a newline when it is written.
I have not created all the little files you have, but try this:
        if "<LINE>" in line:
            line = line.split("<LINE>")
            line = "".join(str(x) for x in line)
            _line = re.sub(r"</LINE>", " ", line)
            _line = _line.rstrip('\r\n')  # drop the end-of-line characters
            f.write(_line)
I just created the one file with all of your text in it so it might need some adjustment.
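To make the idea concrete, here is a minimal sketch of the same fix as a self-contained function. It is written for Python 3 (the code in this thread uses Python 2 print statements), and `flatten_speeches` is a made-up name for illustration, not part of the original script. It strips the end-of-line characters from every <LINE> and only emits a result when </SPEECH> is reached:

```python
def flatten_speeches(lines):
    """Yield one string per <SPEECH>, joining its <LINE> texts with spaces.

    `lines` is any iterable of text lines, e.g. an open file object.
    """
    current = []  # <LINE> texts collected for the speech being read
    for raw in lines:
        text = raw.rstrip("\r\n")  # drop the end-of-line characters
        if "<LINE>" in text:
            # Remove the tags and keep only the spoken words.
            cleaned = text.replace("<LINE>", "").replace("</LINE>", "")
            current.append(cleaned.strip())
        elif "</SPEECH>" in text:
            # End of the speech: emit everything collected as one line.
            if current:
                yield " ".join(current)
            current = []


sample = [
    "<SPEECH>\n",
    "<SPEAKER>PHILO</SPEAKER>\n",
    "<LINE>Sir, sometimes, when he is not Antony,</LINE>\n",
    "<LINE>He comes too short of that great property</LINE>\n",
    "<LINE>Which still should go with Antony.</LINE>\n",
    "</SPEECH>\n",
]
flattened = list(flatten_speeches(sample))
```

With a real file you would loop over the open file object instead of `sample` and write each yielded string followed by '\n'.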
#3
I'd use the BeautifulSoup4 module. You can get all <SPEECH> tags, and for each one get all its <LINE> tags, join them and print the result.
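A rough sketch of that "get every <SPEECH>, join its <LINE>s" structure follows. To keep it runnable without installing anything, it uses the standard library's xml.etree.ElementTree as a stand-in for BeautifulSoup4; this is a swapped-in parser, not the BeautifulSoup API. It assumes Python 3 and that each play file is well-formed XML with a single root element (the <PLAY> wrapper in the sample is an assumption), and `speeches_as_lines` is a hypothetical helper name:

```python
import xml.etree.ElementTree as ET

# Small stand-in for one play file; the <PLAY> root is an assumption.
SAMPLE = """<PLAY>
<SPEECH>
<SPEAKER>DEMETRIUS</SPEAKER>
<LINE>Is Caesar with Antonius prized so slight?</LINE>
</SPEECH>
<SPEECH>
<SPEAKER>PHILO</SPEAKER>
<LINE>Sir, sometimes, when he is not Antony,</LINE>
<LINE>He comes too short of that great property</LINE>
<LINE>Which still should go with Antony.</LINE>
</SPEECH>
</PLAY>"""


def speeches_as_lines(xml_text):
    """Return a list with one joined string per <SPEECH> element."""
    root = ET.fromstring(xml_text)
    result = []
    for speech in root.iter("SPEECH"):
        # itertext() also picks up text inside any tags nested in a <LINE>.
        parts = ["".join(line.itertext()).strip() for line in speech.iter("LINE")]
        result.append(" ".join(parts))
    return result


speeches = speeches_as_lines(SAMPLE)
```

Parsing the structure this way means the speaker names and stage directions are skipped without any string matching on the raw lines.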
#4
Sweet... so... new problem... related to this...

I now have two files of different formats:
1.)"LinesfromDialogs.txt"
2.)"ShakespearsDialogs.txt"

In #1 the format is as follows-

"no no it is my fault we did not have a proper introduction
cameron

cameron
the thing is cameron i am at the mercy of a particularly hideous breed of loser my sister i cannot date until she does

the thing is cameron i am at the mercy of a particularly hideous breed of loser my sister i cannot date until she does
seems like she could get a date easy enough"

In #2, we have the following format:
"Nay, but this dotage of our general's O'erflows the measure: those his goodly eyes, That o'er the files and musters of the war Have glow'd like plated Mars, now bend, now turn, The office and devotion of their view Upon a tawny front: his captain's heart, Which in the scuffles of great fights hath burst The buckles on his breast, reneges all temper, And is become the bellows and the fan To cool a gipsy's lust. Look, where they come: Take but good note, and you shall see in him. The triple pillar of the world transform'd Into a strumpet's fool: behold and see.
If it be love indeed, tell me how much."

I am now trying to get them into the same .txt and using the same format as #2:

f = open("/Users/Tuck/Documents/PyCharm_PythonPrograms/ChatBot_Test/srcs/trainingdata.txt", 'w')
f2 = open("/Users/Tuck/Documents/PyCharm_PythonPrograms/ChatBot_Test/corpus/cornell movie-dialogs corpus/"
          "LinesfromDialogs.txt", 'r')
f3 = open("/Users/Tuck/Documents/PyCharm_PythonPrograms/ChatBot_Test/corpus/shakespeare/ShakespearsDialogs.txt", 'r')

for line in f3:
    f.write(line)
for i, line in f2:
    if (i % 3 == 1):
        f.write(line)
BUT! I keep getting the following error:

Error:
Traceback (most recent call last):
  File "/Users/Tuck/Documents/PyCharm_PythonPrograms/ChatBot_Test/CleanUpCorpras.py", line 19, in <module>
    for i, line in f2:
ValueError: too many values to unpack
How can I fix this...?
#5
Not sure, as I don't use Windows, but
Quote:
f2 = open("/Users/Tuck/Documents/PyCharm_PythonPrograms/ChatBot_Test/corpus/cornell movie-dialogs corpus/"
          "LinesfromDialogs.txt", 'r')
contains a space between cornell and movie,
so it may be taking them as 2 different values.
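That said, the traceback points at the `for i, line in f2:` line rather than at the open(...) call, so the space in the path is unlikely to be the cause. Iterating over a file object yields one string per line, and `for i, line in ...` then tries to unpack each string into two names, which fails for any line longer than two characters. Wrapping the file in the built-in enumerate() supplies the index. A minimal sketch, written for Python 3, using io.StringIO in place of the real file; `every_third_line` is a made-up helper name, and the `i % 3 == 1` test is copied from the post on the assumption that it picks the lines wanted:

```python
import io


def every_third_line(handle, offset=1):
    """Return the lines whose zero-based index i satisfies i % 3 == offset."""
    # enumerate() pairs each line with its index, so the two-name
    # unpacking in the for clause now has exactly two values to unpack.
    return [line for i, line in enumerate(handle) if i % 3 == offset]


# Stand-in for LinesfromDialogs.txt: groups of two lines plus a blank line.
demo = io.StringIO("first\nsecond\n\nthird\nfourth\n\n")
picked = every_third_line(demo)
```

In the original loop this amounts to changing the header to `for i, line in enumerate(f2):` and leaving the body as it is.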
#6
Never mind. I found a workaround... I'm telling ya, I'm having a hell of a day with these trivial problems...

I went back to the .py I used to make "LinesfromDialogs.txt" and edited it to match the data from Shakespeare's.