Python Forum
Converting XML
#1
Hi

I have an XML file (350 MB) that is a dictionary of tagged Russian words. I want to use it to tag words, but the problem is that searching the XML file is very slow.

My question is: what should I convert the XML to in order to handle this data more efficiently? I am thinking of a trie: is it a good idea to create a "forest" of tries? Each word would look like this:

[Image: ZwqzBHG__mU.jpg]

// "КОТ" is the Russian word for a male cat
Reply
#2
Quote: is it a good idea to create a "forest" of tries?
I don't think it's a very efficient idea. The 350 MB can probably be reduced drastically. How many words are there in the dictionary? How many different tags? How many tags per word? Is there any other information?
Reply
#3
(Jul-21-2019, 08:38 PM)Gribouillis Wrote:
Quote: is it a good idea to create a "forest" of tries?
I don't think it's a very efficient idea. The 350 MB can probably be reduced drastically. How many words are there in the dictionary? How many different tags? How many tags per word? Is there any other information?

It contains about 150,000 words (lemmas), and each word has 7-11 forms; each form has 4-6 tags.
Reply
#4
lxml should handle this data.
It is built on the C library libxml2 and is very fast. Its lxml.etree API is largely compatible with the built-in xml.etree.ElementTree module.
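Whichever parser is used, a 350 MB file is usually easier to handle by streaming than by loading the whole tree. Below is a minimal sketch using iterparse from the standard library (lxml.etree provides an iterparse with a compatible interface). The element and attribute names (lemma, form, text, tag) and the sample data are assumptions for illustration only; the real file's layout was not given in this thread.

```python
import io
import xml.etree.ElementTree as ET  # lxml.etree offers a compatible iterparse

# Hypothetical layout: the real dictionary file may use different names.
SAMPLE = """<dictionary>
  <lemma id="1">
    <form text="кот"><tag>NOUN</tag><tag>masc</tag><tag>nomn</tag></form>
    <form text="кота"><tag>NOUN</tag><tag>masc</tag><tag>gent</tag></form>
  </lemma>
</dictionary>""".encode("utf-8")


def iter_forms(source):
    """Yield (word_form, tags) pairs while streaming through the XML."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "lemma":
            # The whole <lemma> subtree is complete at its "end" event.
            for form in elem.iter("form"):
                tags = tuple(t.text for t in form.iter("tag"))
                yield form.get("text"), tags
            # Drop the finished subtree so memory use stays low.
            elem.clear()


# In real use, pass the path of the XML file instead of a BytesIO object.
forms = list(iter_forms(io.BytesIO(SAMPLE)))
```

The pairs produced this way can be fed into a more compact lookup structure once, so the XML only has to be parsed a single time.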
Reply
#5
I ran an experiment: I created a sorted list of 150000 * 10 unique random Unicode words, each between 5 and 12 characters long. Pickling this list gives a 50-megabyte file. After loading the whole list, bisect.bisect() finds a word in about 0.4 microseconds (measured with the timeit module). You could then create a second list that holds, for each word, a reference to its tags (for example a Python tuple of integers, or a more compact structure such as a byte string containing the tag indices). This should give a fairly efficient way to find the tags, at the cost of some RAM. A Python dictionary could perhaps be used as well.
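The idea of a sorted word list plus a parallel list of tag references could be sketched roughly as follows. The sample words, the tag names, and the helper lookup() are hypothetical. The sketch uses bisect.bisect_left rather than bisect.bisect, so that an exact match sits at the returned index, and it assumes each word form appears only once and that there are fewer than 256 distinct tags (so one byte per tag is enough).

```python
from bisect import bisect_left

# Hypothetical sample data: (word form, tag names).
RAW = [
    ("кота", ("NOUN", "masc", "gent")),
    ("кот", ("NOUN", "masc", "nomn")),
    ("коту", ("NOUN", "masc", "datv")),
]

# Give each distinct tag a small integer index.
TAG_NAMES = sorted({tag for _, tags in RAW for tag in tags})
TAG_INDEX = {name: i for i, name in enumerate(TAG_NAMES)}

# Two parallel lists ordered by word: WORDS[i] is described by TAGS[i].
RAW_SORTED = sorted(RAW)
WORDS = [word for word, _ in RAW_SORTED]
# Each entry is a byte string of tag indices (assumes < 256 distinct tags).
TAGS = [bytes(TAG_INDEX[t] for t in tags) for _, tags in RAW_SORTED]


def lookup(word):
    """Return the tag names for word, or None if it is not in the list."""
    i = bisect_left(WORDS, word)
    if i < len(WORDS) and WORDS[i] == word:
        return tuple(TAG_NAMES[b] for b in TAGS[i])
    return None


print(lookup("кот"))
```

Both lists can be pickled once and reloaded on later runs. A plain dict mapping each word to its byte string would be the alternative mentioned above, probably using more RAM than the two lists.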
Reply

