Python Forum
Why can't it extract the data from .txt well?
#1
Version 1 works well: the two word lists are compared, and the words with diacritics from the first list are substituted into the output.

import tkinter as tk
import re
from tkinter import messagebox, simpledialog
from unidecode import unidecode  # Import unidecode

# Assume we have the following lists:
dictionar = ["înţeleasă", "Eului", "misterului"]  # and so on
dictionar_2 = ["inteleasa", "Eului", "misterului"]  # and so on

text = "Fiind inteleasa identitate dintre planul Eului ... "  # and so on

# Go through each word in dictionar_2
for idx, cuvant in enumerate(dictionar_2):
    # Replace the word without diacritics with the one with diacritics
    text = text.replace(cuvant, dictionar[idx])

print(text)
In the following code I want to do the same thing as in the first version, but read the word lists from .txt files that contain the same words:

import tkinter as tk
import re
from tkinter import messagebox, simpledialog
from unidecode import unidecode  # Import unidecode

# Read the words from dictionar.txt
with open('dictionar.txt', 'r', encoding='utf-8') as f:
    dictionar = f.read().splitlines()

# Read the words from dictionar-2.txt
with open('dictionar-2.txt', 'r', encoding='utf-8') as f:
    dictionar_2 = f.read().splitlines()

text = "Fiind inteleasa identitate dintre planul Eului ... "  # and so on

# Go through each word in dictionar_2
for idx, cuvant in enumerate(dictionar_2):
    # Check whether the word without diacritics occurs in the text
    if cuvant in text:
        # Replace the word without diacritics with the one with diacritics
        text = text.replace(cuvant, dictionar[idx])
        print(f"Înlocuit {cuvant} cu {dictionar[idx]}")

print("Textul inițial:", "Fiind inteleasa identitate dintre planul Eului ... ")
print("Textul final:", text)
In dictionar.txt I have the words:

Fiind, înţeleasă, identitate, dintre, planul, Eului, cel, misterului, substanţa, creaţiei, întemeiază, proces, simbolizare, realităţii, cuprinse, specifice, zonei, aflu, scoici, fosile, melci, alge, aduse, ţărm, bucăţele, sticlă, mării, şlefuieşte, timp, şezlonguri, umbrele, vânzători, ambulanți, activități, nautice, și
In dictionar-2.txt I have the same words, but without diacritics:

Fiind, inteleasa, identitate, dintre, planul, Eului, cel, misterului, substanta, creatiei, intemeiaza, proces, simbolizare, realitatii, cuprinse, specifice, zonei, aflu, scoici, fosile, melci, alge, aduse, tarm, bucatele, sticla, marii, slefuieste, timp, sezlonguri, umbrele, vanzatori, ambulanti, activitati, nautice, si
The output should be (with the word "înţeleasă" carrying its diacritics):

Fiind înţeleasă identitate dintre planul Eului
#2
This seems to work:

import tkinter as tk
from unidecode import unidecode
import re

# Read the words from dictionar.txt
# (strip() drops a trailing newline so the last word matches too)
with open('dictionar.txt', 'r', encoding='utf-8') as f:
    dictionar = f.read().strip().split(', ')

# Read the words from dictionar-2.txt
with open('dictionar-2.txt', 'r', encoding='utf-8') as f:
    dictionar_2 = f.read().strip().split(', ')

def adauga_diacritice():
    # Get the text from the Text widget ("end-1c" skips the newline Tk appends)
    text = text_input.get("1.0", "end-1c")

    # Split the text into lines
    linii = text.split('\n')

    # Process each line separately
    linii_procesate = []
    for linie in linii:
        # The capturing group makes re.split return the words and the
        # punctuation/whitespace runs as separate items
        cuvinte_linie = re.split(r'(\W+)', linie)
        linie_finala = []
        for cuvant in cuvinte_linie:
            if cuvant and cuvant[0].isalpha():  # Check whether this is a word
                cuvant_fara_diacritice = unidecode(cuvant).lower()
                print(f"Verificăm cuvântul: {cuvant_fara_diacritice}")
                if cuvant_fara_diacritice in dictionar_2:
                    idx = dictionar_2.index(cuvant_fara_diacritice)
                    if cuvant[0].isupper():
                        linie_finala.append(dictionar[idx].capitalize())
                    else:
                        linie_finala.append(dictionar[idx])
                else:
                    linie_finala.append(cuvant)
            else:
                linie_finala.append(cuvant)  # Add punctuation unchanged
        linii_procesate.append(''.join(linie_finala))

    # Build the final text, keeping the line breaks
    text_final = '\n'.join(linii_procesate)

    # Delete the current content and insert the processed text
    text_input.delete("1.0", tk.END)
    text_input.insert(tk.END, text_final)

root = tk.Tk()
root.title("Adăugare Diacritice")

text_input = tk.Text(root, height=20, width=50)
text_input.pack(pady=20)

btn_diacritice = tk.Button(root, text="Diacritice", command=adauga_diacritice)
btn_diacritice.pack(side=tk.LEFT, padx=10)

root.mainloop()
#3
Hello,

And the question is...? It's missing from your original post. If something doesn't work as you expect, please describe what doesn't work and what result you get.

Regards, noisefloor
#4
What is going on here? You already answered these questions here:

https://python-forum.io/thread-40556.html

What am I missing? Is the problem that the words are separated by commas and whitespace? This is actually a much simpler problem than the one in the other thread. You can use the same mechanism as before with a different regex pattern. You could also treat the file as CSV (comma-separated values) and split it on commas. If you go the CSV route, pass skipinitialspace=True to csv.reader so the space after each comma is dropped.
import csv
from io import StringIO

dictionar_2 = StringIO("Fiind, inteleasa, identitate, dintre, planul, Eului, cel, misterului")

reader = csv.reader(dictionar_2, skipinitialspace=True)
print(*reader)
Output:
['Fiind', 'inteleasa', 'identitate', 'dintre', 'planul', 'Eului', 'cel', 'misterului']
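To make the difference concrete, here is a minimal sketch, assuming each file holds all of its words on one line separated by ", " as shown in the first post. Two short strings stand in for the file contents, and the names continut_1, continut_2, cu_diacritice, fara_diacritice, perechi and inlocuieste are made up for this illustration. It shows why splitlines() leaves the whole line as a single list item, then pairs the two lists in a dict and replaces whole words only.

```python
import re

# Stand-ins for the contents of dictionar.txt and dictionar-2.txt
# (assumed: one line, words separated by ", ").
continut_1 = "Fiind, înţeleasă, identitate, dintre, planul, Eului\n"
continut_2 = "Fiind, inteleasa, identitate, dintre, planul, Eului\n"

# splitlines() splits on line breaks only, so the whole line is one item.
print(len(continut_2.splitlines()))

# Strip the trailing newline, then split on the comma-plus-space separator.
cu_diacritice = continut_1.strip().split(", ")
fara_diacritice = continut_2.strip().split(", ")

# Map each plain word to its counterpart with diacritics.
perechi = dict(zip(fara_diacritice, cu_diacritice))


def inlocuieste(text):
    """Replace whole words found in the mapping; leave everything else alone."""
    return re.sub(r"\w+", lambda m: perechi.get(m.group(0), m.group(0)), text)


print(inlocuieste("Fiind inteleasa identitate dintre planul Eului"))
```

Output:
1
Fiind înţeleasă identitate dintre planul Eului

Matching whole words with a regex, rather than calling str.replace, also avoids rewriting a short entry that happens to sit inside a longer word.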
Be aware that "\W+" (one or more non-word characters) may not split text the way you expect:
import re

print(re.split(r"\W+", "This doesn't handle contractions or punctuation well."))
Output:
['This', 'doesn', 't', 'handle', 'contractions', 'or', 'punctuation', 'well', '']
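If contractions matter, one option is to match the words themselves instead of splitting on what lies between them. This is only a sketch: the pattern below is one possibility picked for illustration (the name WORD is made up), and it treats an apostrophe between two runs of word characters as part of the word.

```python
import re

# One or more word characters, optionally followed by groups of
# apostrophe + word characters, so "doesn't" stays in one piece.
WORD = re.compile(r"\w+(?:'\w+)*")

print(WORD.findall("This doesn't handle contractions or punctuation well."))
```

Output:
['This', "doesn't", 'handle', 'contractions', 'or', 'punctuation', 'well']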