Posts: 1,090
Threads: 143
Joined: Jul 2017
According to the instructions, getting NLTK data packages is easy:
import nltk
nltk.download('punkt_tab')
But I get:
Output:
Traceback (most recent call last):
  File "<pyshell#4>", line 1, in <module>
    nltk.download('punkt_tab')
  File "/home/pedro/Python_Virtual_Environments/GP_env/lib/python3.10/site-packages/unstructured/nlp/tokenize.py", line 24, in _raise_on_nltk_download
    raise ValueError("NLTK download disabled. See CVE-2024-39705")
ValueError: NLTK download disabled. See CVE-2024-39705
Better to not go there?
Posts: 4,780
Threads: 76
Joined: Jan 2018
Try a search engine perhaps.
« We can solve any problem by introducing an extra level of indirection »
Posts: 1,090
Threads: 143
Joined: Jul 2017
I read that these "pickles" can be unsafe, which is why it is blocked!
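As a rough illustration of why loading an untrusted pickle is risky in general, here is a harmless sketch using only the standard library. The Payload class is hypothetical, and this is not the actual NLTK exploit; it only shows that pickle.loads will call whatever function the pickled data names.

```python
import pickle


class Payload:
    """Hypothetical object standing in for attacker-controlled data."""

    def __reduce__(self):
        # Tells pickle: "to rebuild this object, call eval('6 * 7')".
        # A malicious pickle could name any importable callable here.
        return (eval, ("6 * 7",))


blob = pickle.dumps(Payload())

# Unpickling does not give back a Payload; it runs the call stored in the data.
result = pickle.loads(blob)
print(result)  # 42
```

The expression here is harmless arithmetic, but the same mechanism would run any expression placed in the pickle, which is why downloading and loading pickled data from an untrusted source is treated as a security problem.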
I was only trying to use unstructured to get the Russian text from a PDF from another question.
In the end, I converted a.pdf to a JPG, then used tesseract, which worked fine. Don't know why I didn't think of that in the first place!
There didn't seem to be any way to get the text using various pdf modules.
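The convert-then-OCR route described above could be scripted roughly as follows. This is only a sketch under assumptions: the thread does not say which tool did the PDF to JPG conversion, so Poppler's pdftoppm is assumed here, and the Russian language data for tesseract ("rus") is assumed to be installed. The function names are made up for illustration.

```python
import subprocess


def pdftoppm_command(pdf_path, image_root, dpi=300):
    """Build the command that renders page 1 of the PDF to <image_root>.jpg."""
    return [
        "pdftoppm", "-jpeg", "-r", str(dpi),
        "-f", "1", "-l", "1", "-singlefile",
        pdf_path, image_root,
    ]


def tesseract_command(image_path, lang="rus"):
    """Build the command that OCRs the image and writes the text to stdout."""
    return ["tesseract", image_path, "stdout", "-l", lang]


def ocr_first_page(pdf_path, image_root="page", lang="rus"):
    """Render the first page to JPG, then return the text tesseract recognises.

    Requires the pdftoppm and tesseract executables to be on PATH.
    """
    subprocess.run(pdftoppm_command(pdf_path, image_root), check=True)
    done = subprocess.run(
        tesseract_command(image_root + ".jpg", lang),
        check=True, capture_output=True, text=True, encoding="utf-8",
    )
    return done.stdout


# Example call (not executed here because it needs the external tools):
# print(ocr_first_page("a.pdf"))
```

A higher dpi value usually helps OCR accuracy at the cost of a larger image.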
Posts: 4,780
Threads: 76
Joined: Jan 2018
(Aug-11-2024, 07:52 AM)Pedroski55 Wrote: I read that these "pickles" can be unsafe, which is why it is blocked!
I read that the vulnerability only affects nltk up to version 3.8.1, which means that if you can install 3.8.2, which is available on PyPI, it should work.
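To check whether the installed nltk is newer than the last affected release mentioned above (3.8.1), something like the following sketch could be used. It relies only on the standard library; the helper names are made up for illustration, and the version parsing is deliberately simplistic (it ignores pre-release suffixes such as "rc1").

```python
from importlib.metadata import PackageNotFoundError, version

LAST_AFFECTED = (3, 8, 1)  # last affected release, as stated in the thread


def parse_release(version_string):
    """Turn '3.8.2' into (3, 8, 2), keeping only the leading digits of each part."""
    parts = []
    for piece in version_string.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def is_newer_than_affected(version_string):
    """True if the version sorts after the last affected release."""
    return parse_release(version_string) > LAST_AFFECTED


def installed_nltk_version():
    """Return the installed nltk version string, or None if nltk is missing."""
    try:
        return version("nltk")
    except PackageNotFoundError:
        return None


found = installed_nltk_version()
if found is None:
    print("nltk is not installed in this environment")
else:
    print(found, "newer than 3.8.1:", is_newer_than_affected(found))
```

Run it with the same interpreter (for example the same virtual environment) that raises the error, otherwise it may report a different installation.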
Posts: 1,090
Threads: 143
Joined: Jul 2017
Thanks for the reply!
After installing unstructured, I had to spend about an hour installing other modules which were needed, like unstructured-inference, pillow-heif and quite a few others. I think, in my situation with a.pdf, converting to JPG and using tesseract was much, much easier and produced a good result!
I don't know if nltk is a builtin. I did not install it, maybe unstructured did? I only installed Python on this new laptop last week, so whatever I have should be fairly recent!
Posts: 4,780
Threads: 76
Joined: Jan 2018
Aug-11-2024, 08:54 AM
(This post was last modified: Aug-11-2024, 08:54 AM by Gribouillis.)
(Aug-11-2024, 08:48 AM)Pedroski55 Wrote: I don't know if nltk is a builtin. I did not install it, maybe unstructured did?
Some module may have installed nltk as a requirement. You could try to update nltk by running
python -m pip install -U nltk
then check the installed version.
Posts: 8
Threads: 0
Joined: Aug 2024
Aug-11-2024, 10:53 AM
(This post was last modified: Aug-11-2024, 10:54 AM by wewer.)
(Aug-11-2024, 08:48 AM)Pedroski55 Wrote: I only installed Python on this new laptop last week, so whatever I have should be fairly recent!
I think Gribouillis is referring to the version of nltk, while you are referring to the Python version.
You can check the nltk version like this:
import nltk
print(nltk.__version__)
You can install the correct version like this:
pip install nltk==3.8.2
Posts: 1,090
Threads: 143
Joined: Jul 2017
Thanks for the replies!
Well, I have:
import nltk
print(nltk.__version__)
Output: 3.8.2
unstructured seemed good when I read about it, but I think it just lumps other tools together, especially pdfminer.
The problem with the one-page Russian PDF was that it was not possible to get proper Cyrillic text from it.
It has two embedded Times New Roman fonts.
Quote:[(8, 'ttf', 'Type0', 'TimesNewRomanPS-BoldMT', 'G1', 'Identity-H', 0), (9, 'ttf', 'Type0', 'TimesNewRomanPSMT', 'G2', 'Identity-H', 0)]
No matter what I tried, I could not get sensible text from it, which was the question the OP was trying to solve. But it displays fine in the PDFViewer.
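One possible explanation, offered only as a guess: both fonts are Type0 fonts with Identity-H encoding, which means the text is stored as glyph codes, and extraction tools need a correct ToUnicode map in the font to turn those codes back into characters. Rendering does not need that map, which would fit a PDF that displays fine but extracts as garbage. The sketch below just labels the quoted tuples and picks out such fonts. The field names are an assumption, based on the layout resembling what PyMuPDF's Page.get_fonts(full=True) returns.

```python
from collections import namedtuple

# Assumed field layout (xref, ext, type, basefont, name, encoding, referencer).
FontEntry = namedtuple(
    "FontEntry", "xref ext type basefont name encoding referencer"
)

# The two tuples quoted in the post above.
fonts = [
    (8, 'ttf', 'Type0', 'TimesNewRomanPS-BoldMT', 'G1', 'Identity-H', 0),
    (9, 'ttf', 'Type0', 'TimesNewRomanPSMT', 'G2', 'Identity-H', 0),
]

entries = [FontEntry(*row) for row in fonts]

# Fonts whose text extraction depends on a usable ToUnicode map.
needs_tounicode = [
    entry.basefont
    for entry in entries
    if entry.type == "Type0" and entry.encoding.startswith("Identity")
]

print(needs_tounicode)
```

This does not prove the ToUnicode map is missing or wrong in a.pdf; it only flags the fonts for which that would matter.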
Maybe the PDF, a.pdf, is corrupt, but I do not think so, or it would not display properly. PDFs are very tricky things. It was created on macOS.
Quote:macOS Версия 10.15.7 (Выпуск 19H2026) Quartz PDFContext
I don't know enough about the internal storage structure of PDFs to find an answer, but converting to jpg and then using tesseract produced a good result!
Looking at the PDF in LibreOffice Draw, the parts of the text which were causing the problem showed up as small text fields with no visible content!
But I got the text in the end, so I'm happy, even if I could not get it how I wanted to get it!