Python Forum
get nltk data
#1
According to the instructions, getting nltk data packages is easy:

import nltk
nltk.download('punkt_tab')
But I get:
Output:
Traceback (most recent call last):
  File "<pyshell#4>", line 1, in <module>
    nltk.download('punkt_tab')
  File "/home/pedro/Python_Virtual_Environments/GP_env/lib/python3.10/site-packages/unstructured/nlp/tokenize.py", line 24, in _raise_on_nltk_download
    raise ValueError("NLTK download disabled. See CVE-2024-39705")
ValueError: NLTK download disabled. See CVE-2024-39705
Better to not go there?
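
One detail in the traceback may be worth noting: the ValueError is raised from unstructured/nlp/tokenize.py, not from inside nltk, so the name nltk.download appears to have been replaced at runtime. A minimal sketch for checking where a callable was really defined (the helper name defining_module is made up for this illustration):

```python
def defining_module(func):
    """Return the name of the module in which `func` was defined, if known."""
    return getattr(func, "__module__", None)


if __name__ == "__main__":
    try:
        import nltk
    except ImportError:
        print("nltk is not installed")
    else:
        # If another package has swapped in its own function, the module
        # printed here will belong to that package rather than to nltk.
        print(defining_module(nltk.download))
```

If the printed module does not start with nltk, the download call is being intercepted by whatever package is named there.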
Reply
#2
Try a search engine perhaps.
« We can solve any problem by introducing an extra level of indirection »
Reply
#3
I read that these "pickles" can be unsafe, which is why it is blocked!

I was only trying to use unstructured to get the Russian text from a PDF from another question.

In the end, I converted a.pdf to a jpg and then used tesseract, which worked fine. Don't know why I didn't think of that in the first place!

There didn't seem to be any way to get the text using various pdf modules.
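
For anyone wanting to repeat the image-then-OCR route, here is a rough sketch. It assumes Poppler's pdftoppm and the tesseract command-line program (with Russian language data) are installed; the post does not say which converter was actually used, and the function names are only illustrative.

```python
import subprocess
from pathlib import Path


def pdftoppm_command(pdf_path, image_prefix, dpi=300):
    """Build the command that renders the first page to <image_prefix>.jpg."""
    return ["pdftoppm", "-jpeg", "-r", str(dpi), "-singlefile",
            str(pdf_path), str(image_prefix)]


def tesseract_command(image_path, language="rus"):
    """Build the command that OCRs an image and writes the text to stdout."""
    return ["tesseract", str(image_path), "stdout", "-l", language]


def ocr_first_page(pdf_path, language="rus"):
    """Render the first page of a PDF to JPEG, then return the OCR text."""
    prefix = Path(pdf_path).with_suffix("")  # a.pdf -> a
    subprocess.run(pdftoppm_command(pdf_path, prefix), check=True)
    result = subprocess.run(
        tesseract_command(f"{prefix}.jpg", language),
        check=True, capture_output=True, text=True, encoding="utf-8",
    )
    return result.stdout
```

Calling ocr_first_page("a.pdf") would leave a.jpg next to the PDF and return the recognised text as a string.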
Reply
#4
(Aug-11-2024, 07:52 AM)Pedroski55 Wrote: I read that these "pickles" can be unsafe, which is why it is blocked!
I read that the vulnerability only affects nltk up to version 3.8.1, which means that if you can install 3.8.2, which is available on PyPI, it should work.
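
As a small illustration of that version boundary, here is a sketch that compares a version string against 3.8.1. The cut-off is simply the one quoted in this post and has not been checked independently, and the parsing assumes plain numeric versions such as 3.8.2.

```python
# Last affected version as stated in the post above (an assumption here).
LAST_AFFECTED = (3, 8, 1)


def version_tuple(version):
    """Turn '3.8.1' into (3, 8, 1); only handles purely numeric versions."""
    return tuple(int(part) for part in version.split("."))


def is_affected(version):
    """Return True if `version` is at or below the last affected release."""
    return version_tuple(version) <= LAST_AFFECTED
```

With this, is_affected(nltk.__version__) gives a quick yes or no for the installed copy.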
Reply
#5
Thanks for the reply!

After installing unstructured, I had to spend about an hour installing other modules it needed, like unstructured-inference, pillow-heif and quite a few others. I think, in my situation, converting a.pdf to jpg and using tesseract was much easier and produced a good result!

I don't know if nltk is a builtin. I did not install it, maybe unstructured did? I only installed python on this new laptop last week, so whatever I have should be fairly recent!
Reply
#6
(Aug-11-2024, 08:48 AM)Pedroski55 Wrote: I don't know if nltk is a builtin. I did not install it, maybe unstructured did?
Some module may have installed nltk as a requirement. You could try to update nltk by running
Output:
python -m pip install -U nltk
then check the installed version.
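
One way to do that check without importing nltk at all is to read the package metadata. This is just a sketch using the standard library's importlib.metadata (Python 3.8 or later); the helper name installed_version is made up here.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(distribution):
    """Return the installed version of a distribution, or None if absent."""
    try:
        return version(distribution)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    print(installed_version("nltk"))
```

A result of None would mean that no distribution named nltk is installed in the environment running the script.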
Reply
#7
(Aug-11-2024, 08:48 AM)Pedroski55 Wrote: I only installed python on this new laptop last week, so whatever I have should be fairly recent!

I think Gribouillis is referring to the version of nltk, while you are referring to the Python version.

You can check the version like this:

import nltk
print(nltk.__version__)
You can install the correct version like this:

pip install nltk==3.8.2
Reply
#8
Thanks for the replies!

Well, I have:

import nltk
print(nltk.__version__)
Output:
3.8.2
unstructured seemed good when I read about it, but I think it just bundles other tools together in one lump, especially pdfminer.

The problem with the one-page Russian PDF was that it was not possible to get proper Cyrillic text from it.

It has two embedded Times New Roman fonts:

Quote:[(8, 'ttf', 'Type0', 'TimesNewRomanPS-BoldMT', 'G1', 'Identity-H', 0), (9, 'ttf', 'Type0', 'TimesNewRomanPSMT', 'G2', 'Identity-H', 0)]

No matter what I tried, I could not get sensible text from it, which was the question the OP was trying to solve. But it displays fine in the PDFViewer.
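
A possible explanation, offered only as a guess and not as a diagnosis: with Type0 fonts that use the Identity-H encoding, the codes stored in the page are glyph identifiers rather than character codes, so text extractors depend on a separate mapping back to Unicode. If that mapping is missing or wrong, the page can still render correctly while extraction returns nonsense. The sketch below flags such fonts in the list quoted above; the field order (xref, extension, type, base font, name, encoding, ...) is assumed from the look of the tuples.

```python
# Font tuples copied from the listing quoted above.
fonts = [
    (8, 'ttf', 'Type0', 'TimesNewRomanPS-BoldMT', 'G1', 'Identity-H', 0),
    (9, 'ttf', 'Type0', 'TimesNewRomanPSMT', 'G2', 'Identity-H', 0),
]


def needs_unicode_map(font):
    """True for composite fonts whose codes are glyph IDs, not characters."""
    # Assumed positions: index 2 is the font type, index 5 the encoding.
    font_type, encoding = font[2], font[5]
    return font_type == "Type0" and encoding in ("Identity-H", "Identity-V")


# Base font names of the fonts that rely on a glyph-to-Unicode mapping.
suspects = [font[3] for font in fonts if needs_unicode_map(font)]
print(suspects)
```

Both fonts in this PDF fall into that category, which would be consistent with OCR succeeding where the text-extraction modules did not.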

Maybe the PDF, a.pdf, is corrupt, but I do not think so, or it would not display properly. PDFs are very tricky things. It was created in macOS.

Quote:macOS Версия 10.15.7 (Выпуск 19H2026) Quartz PDFContext

I don't know enough about the internal storage structure of PDFs to find an answer, but converting to jpg and then using tesseract produced a good result!

Looking at the PDF in LibreOffice Draw, the parts of the text which were causing the problem showed up as small text fields with no visible content!

But I got the text in the end, so I'm happy, even if I could not get it how I wanted to get it!
Reply