Python Forum
Thai Text Segmentation Module


Thai Text Segmentation Module - draems - Feb-02-2017

Good day, everyone. I'm working on a project that processes Thai text. Do you have any suggestions for which module to use? I need to detect place names in the text. I'm considering the pythai module, but I can't get it running on my Ubuntu machine.
Thank you in advance and God bless.


RE: Thai Text Segmentation Module - j.crater - Feb-02-2017

Hello,
a quick search found this alternative, a Thai language NLP package:
https://pypi.python.org/pypi/pythainlp/1.0.0
If you would like us to help you get the pythai module running on your system, you will need to give us more details about the problems you encountered.
Another, more brute-force solution is to search and compare strings, if you can get a list of the place names you are after.
Good luck!
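
The brute-force idea above could look roughly like this. It is only a minimal sketch: the `PLACE_NAMES` tuple, the `find_places` helper and the sample sentence are made up for illustration, and a real project would load the names from a proper gazetteer. It uses `str.format` rather than f-strings so it also runs on Python 3.4.

```python
# Hypothetical place names for illustration only; replace with a real list.
PLACE_NAMES = ('กรุงเทพ', 'เชียงใหม่', 'ภูเก็ต')


def find_places(text, place_names=PLACE_NAMES):
    """Return (name, index) pairs for every occurrence of each name in text."""
    hits = []
    for name in place_names:
        start = text.find(name)
        while start != -1:
            hits.append((name, start))
            # Continue searching just past the start of the previous match.
            start = text.find(name, start + 1)
    # Order the hits by their position in the text.
    return sorted(hits, key=lambda hit: hit[1])


# Made-up sample sentence ("I travel from Bangkok to Chiang Mai").
sample = 'ฉันเดินทางจากกรุงเทพไปเชียงใหม่'
for name, index in find_places(sample):
    print('{} found at index {}'.format(name, index))
```

Because Thai is written without spaces between words, plain substring search works without segmenting the text first, but it can also match a name that happens to appear inside a longer word, so the hits should be treated as candidates.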


RE: Thai Text Segmentation Module - snippsat - Feb-02-2017

One of the biggest changes in Python 3 was Unicode.
I'm not going to talk about it; better to show it.
# Python 3.6
language = 'หลาม'
char = 'า'
if char in language:
    # print('Yes {} is in {}'.format(char, language))  # works on any Python 3
    print(f'Yes {char} is in {language}')  # the new way: f-strings need Python 3.6+
Output:
Yes า is in หลาม
Is UTF-8 okay going in and out?
>>> char = 'า'
>>> char
'า'
>>> e = char.encode()
>>> e
b'\xe0\xb8\xb2'
>>> e.decode('utf-8')
'า'
Yes it is, so now test writing it out to a file and reading it back in.
language = 'หลาม'
with open('thai.txt', 'w', encoding='utf-8') as f_out:
    f_out.write(language)
with open('thai.txt', encoding='utf-8') as f_in:
    f = f_in.read()
    print(f) #--> หลาม



RE: Thai Text Segmentation Module - draems - Feb-02-2017


Thank you for the suggestion, sir. Actually, I have been working with Python on Windows. I have already searched a lot for modules and never came across pythainlp. I don't know why, but I suspect it's because of my browser settings? I really don't know. If I can get pythai running on Windows, then I will use it. I was forced to try pythai and other solutions on Linux because all the modules suggested by my searches were built for the Linux platform. I'll give updates later. Thank you for the help, sir.


Thank you for the response, sir, and sorry for the very late reply. Sorry if this is a stupid question from a newbie. Will this work in Python 3.4? Currently I am using Python on Windows and fetching data from PostgreSQL, and from what I have read, the latest version supported by psycopg2 is Python 3.4. I'll try installing 3.6 and give updates. Thank you.
#tried this code in 3.4 IDLE give me error.
>>> language = 'หลาม'
>>> char = 'า'
>>> if char in language:
print(f'Yes {char} is in {language}')
SyntaxError: invalid syntax

I tried to install pythainlp with Python 2.7 and 3.4 on Windows, and it gives this error.

Error:
    File "C:\Python34\lib\subprocess.py", line 1112, in _execute_child
        startupinfo)
    FileNotFoundError: [WinError 2] The system cannot find the file specified
    ----------------------------------------
    Command "python setup.py egg_info" failed with error code 1 in C:\Users\DRMS~1\AppData\Local\Temp\pip-build-z8g9b7zi\pyicu\
I also tried it on Linux (Slackware and Ubuntu), and it gives this.

Error:
    File "/usr/local/lib/python3.4/subprocess.py", line 1460, in _execute_child
        raise child_exception_type(errno_num, err_msg)
    FileNotFoundError: [Errno 2] No such file or directory: 'icu-config'
    ----------------------------------------
    Command "python setup.py egg_info" failed with error code 1 in /tmp/pip-build-cjz6f7eb/pyicu/
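
The Linux traceback says the pyicu setup step tried to run a program called `icu-config` and could not find it. A quick way to confirm that before retrying the install is to check whether that program is on the PATH. This is only a diagnostic sketch, and `find_tool` is a made-up helper name. The likely cause (an assumption, not confirmed in this thread) is that the ICU development files are not installed; on Ubuntu these usually come from the `libicu-dev` package.

```python
import shutil


def find_tool(name='icu-config'):
    """Return the full path of an executable on PATH, or None if it is missing."""
    return shutil.which(name)


path = find_tool()
if path is None:
    # Assumption: the pyicu build needs icu-config, as the traceback suggests.
    print('icu-config not found on PATH; the pyicu build is likely to fail')
else:
    print('icu-config found at {}'.format(path))
```

`shutil.which` is available from Python 3.3, so this also runs on 3.4.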



RE: Thai Text Segmentation Module - snippsat - Feb-03-2017

Quote:#tried this code in 3.4 IDLE give me error.
You cannot use f-strings in Python 3.4; they were added in Python 3.6.
Use the line that I commented out instead.
You also need indentation, like I have here.
# Python 3.4
>>> language = 'หลาม'
>>> language
'หลาม'
>>> char = 'า'
>>> if char in language:
...     print('Yes {} is in {}'.format(char, language)) 
...     
Yes า is in หลาม