Python Forum

#!/usr/bin/python

# Creating an output file in writing mode
output_file = open("newfile.txt", "w")

# write 3 header records
output_file.write('<?xml version="1.0" encoding="utf-8"?>\n')
output_file.write("<!DOCTYPE KMYMONEY-FILE>\n")
output_file.write("<KMYMONEY-FILE>\n")

write_flag = 0

# Open the file in read mode
with open('Australian-2024-11-30.xml', 'r') as file:
    # Read each line in the file
    for line in file:
        string = line
        sub_str1 = "<TRANSACTIONS"
        sub_str2 = " <SCHEDULES count"

        if sub_str1 in string:
            print("YES")
            write_flag = 1      #commence writing to newfile.txt
        elif sub_str2 in string:
            write_flag = 0      #stop writing when this string found
            print("schedules found")

        if write_flag:
            output_file.write(file.read())

# Close the output file
output_file.close()

The output has the "<TRANSACTIONS" tag and all its children, but it also has the "<SCHEDULES" tag plus all the data after that. Why is the variable "write_flag" not being turned off, even though the "schedules" tag is present?

In the data, there is only one occurrence each of "sub_str1" and "sub_str2", so the writes to the output should get turned ON at sub_str1 and then turned OFF at sub_str2. But once that flag is on, it stays on, which suggests the

elif sub_str2 in string:

is not being tested, or is being tested yet returns False.

Your code does not look for "<SCHEDULES"; it looks for " <SCHEDULES count", with a leading blank. Maybe remove the leading blank and "count" from sub_str2.
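To show why the leading blank matters, here is a minimal sketch of the two substring tests. The two sample lines are made up for illustration; whether the original pattern matches in the real file depends on how that file is indented.

```python
# Hypothetical sample lines; the real file's indentation may differ.
indented = ' <SCHEDULES count="3">'
flush_left = '<SCHEDULES count="3">'

strict = " <SCHEDULES count"   # pattern from the original script
loose = "<SCHEDULES"           # shorter pattern without the leading blank

# "in" is an exact substring test, so the leading blank must be present.
results = {
    "strict/indented": strict in indented,
    "strict/flush_left": strict in flush_left,
    "loose/indented": loose in indented,
    "loose/flush_left": loose in flush_left,
}

for name, matched in results.items():
    print(name, matched)
```

The shorter pattern matches either way, which makes it less sensitive to how the file happens to be formatted.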

But the real problem is using read(). The first time write_flag is set, output_file.write(file.read()) reads the remainder of the file and writes it all to the output file. It also moves the file pointer to the end of the file, which ends the loop, so the "<SCHEDULES" test is never reached. I think you might want to do this:

with open("input.txt", "r") as file, open("output.txt", "w") as output_file:
    writing = False
    for line in file:
        if "<TRANSACTIONS" in line:
            writing = True
        elif "<SCHEDULES" in line:
            writing = False
        elif writing:
            output_file.write(line)

When I run using this as the input.txt file:

A
<TRANSACTIONS
C
D
<SCHEDULES
F

I get this in the output.txt file:

C
D
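For comparison, here is a rough sketch of what the original loop does with the same sample input. It uses io.StringIO in place of a real file so it can be run as is. The function name copy_with_read is made up for this example, and the pattern is shortened to "<SCHEDULES" so that it would match if the loop ever got that far.

```python
import io


def copy_with_read(source):
    """Mimic the original loop: call read() once the flag is set."""
    out = []
    write_flag = False
    for line in source:
        if "<TRANSACTIONS" in line:
            write_flag = True
        elif "<SCHEDULES" in line:
            write_flag = False
        if write_flag:
            # read() returns everything that is left, not just one line,
            # and leaves the stream at end of file.
            out.append(source.read())
    return "".join(out)


sample = "A\n<TRANSACTIONS\nC\nD\n<SCHEDULES\nF\n"
result = copy_with_read(io.StringIO(sample))
print(repr(result))  # 'C\nD\n<SCHEDULES\nF\n'
```

The "<SCHEDULES" line and everything after it end up in the result because read() consumed them before the loop could test them.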

Thanks @deanhystad, that code works just fine. I only needed a few extra lines, because BeautifulSoup requires the XML header records. I have used the output file as input to other Python code, and the accounts now balance, which they didn't do before because the 'transactions' within schedules were altering the totals.

#!/usr/bin/python

# Re-write the XML file - issues with BeautifulSoup finding "TRANSACTIONS" within schedules

with open("Australian-2024-11-30.xml", "r") as file, open("output.txt", "w") as output_file:

    # write 3 header records, otherwise BeautifulSoup doesn't recognise the output file as XML
    output_file.write('<?xml version="1.0" encoding="utf-8"?>\n')
    output_file.write("<!DOCTYPE KMYMONEY-FILE>\n")
    output_file.write("<KMYMONEY-FILE>\n")
    writing = False

    for line in file:
        if "<TRANSACTIONS" in line:     # required
            writing = True
        elif "<SCHEDULES" in line:      # not required
            writing = False
        elif writing:
            output_file.write(line)

Quote: 'transactions' within schedules was altering totals

I think an xml parser would be a better choice for filtering out scheduled transactions.

(Dec-03-2024, 03:55 PM) deanhystad wrote: I think an xml parser would be a better choice for filtering out scheduled transactions.

Using a parser for this part of the project was the reason why I needed to re-write the file. The problem was a limiting one: to 'filter' effectively, I needed to 'chase' the parents, but the parent levels in the two sets of data were very different. The KIS (keep it simple) method was to first re-write the file as per the code above, and then use BeautifulSoup on the second parse.
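For anyone who wants to try the parser route, here is a minimal sketch using the standard library's xml.etree.ElementTree instead of BeautifulSoup. The sample document is made up and heavily simplified. It assumes the top-level TRANSACTIONS element is a direct child of the root and that scheduled transactions are nested somewhere under SCHEDULES. The SCHEDULED_TX name and the id values are only illustrative.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified document; a real file has many more elements.
SAMPLE = """<KMYMONEY-FILE>
 <TRANSACTIONS count="2">
  <TRANSACTION id="T1"/>
  <TRANSACTION id="T2"/>
 </TRANSACTIONS>
 <SCHEDULES count="1">
  <SCHEDULED_TX id="S1">
   <TRANSACTION id="X1"/>
  </SCHEDULED_TX>
 </SCHEDULES>
</KMYMONEY-FILE>
"""

root = ET.fromstring(SAMPLE)

# The path is relative to the root, so it only matches TRANSACTION elements
# whose parent is the TRANSACTIONS element directly under the root.
ledger_ids = [t.get("id") for t in root.findall("TRANSACTIONS/TRANSACTION")]

# ".//TRANSACTION" searches all descendants, including those under SCHEDULES.
all_count = len(root.findall(".//TRANSACTION"))

print(ledger_ids)  # ['T1', 'T2']
print(all_count)   # 3
```

Because the path names the parent explicitly, there is no need to chase parents afterwards. Whether this fits the real file depends on its actual layout.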

jehoshua

deanhystad

jehoshua

deanhystad

jehoshua