Python Forum
pdfminer package: can't find extract_text function
#1
Hello,

Using the pdfminer package, I ran into the following problem:

>>> from pdfminer import high_level
>>> extracted_text = high_level.extract_text(full_filename_inp, "", [4])
Traceback (most recent call last):
  File "<pyshell#1>", line 1, in <module>
    extracted_text = high_level.extract_text(full_filename_inp, "", [4])
AttributeError: module 'pdfminer.high_level' has no attribute 'extract_text'
But according to the documentation (link: pdfminer package), the function extract_text does exist in the pdfminer package.
Any suggestions?
Thanks
#2
The documentation that you point to is for pdfminer.six, not for the original pdfminer.

Since 2020, the original pdfminer has been dormant, and pdfminer.six is the fork that Euske recommends if you need an actively maintained version of pdfminer.

Which one do you have installed?

To install pdfminer.six, run: pip install pdfminer.six
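To answer the "which one is installed" question from inside Python, something like the sketch below may help. Both projects install an import package with the same name, pdfminer, so the import alone does not tell you which distribution you got. The helper names installed_version and describe_pdfminer are made up for this post, and importlib.metadata needs Python 3.8 or newer (on older interpreters, pip3 show pdfminer pdfminer.six gives similar information).

```python
import importlib
import importlib.util
from importlib import metadata  # standard library, Python 3.8+


def installed_version(dist_name):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def describe_pdfminer():
    """Report where 'pdfminer' would be imported from and whether
    pdfminer.high_level exposes extract_text (hypothetical helper)."""
    spec = importlib.util.find_spec("pdfminer")
    if spec is None:
        return {"installed": False, "location": None, "has_extract_text": False}
    try:
        high_level = importlib.import_module("pdfminer.high_level")
    except ImportError:
        # The installed package has no importable high_level module.
        return {"installed": True, "location": spec.origin, "has_extract_text": False}
    return {
        "installed": True,
        "location": spec.origin,
        "has_extract_text": hasattr(high_level, "extract_text"),
    }


if __name__ == "__main__":
    for name in ("pdfminer", "pdfminer.six"):
        print(name, "->", installed_version(name))
    print(describe_pdfminer())
```

If has_extract_text comes back False while a pdfminer package is found, the copy on disk is most likely the original pdfminer or an old pdfminer.six release.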
#3
First I installed pdfminer:

Output:
pavel@ALABAMA:~$ pip3 install pdfminer
Defaulting to user installation because normal site-packages is not writeable
Collecting pdfminer
  Downloading pdfminer-20191125.tar.gz (4.2 MB)
     |████████████████████████████████| 4.2 MB 3.4 MB/s
Requirement already satisfied: pycryptodome in ./.local/lib/python3.6/site-packages (from pdfminer) (3.9.6)
Building wheels for collected packages: pdfminer
  Building wheel for pdfminer (setup.py) ... done
  Created wheel for pdfminer: filename=pdfminer-20191125-py3-none-any.whl size=6141904 sha256=34b8374913c5a3d565629c16bdd698b7488acd2c912fbab3cbc1ffec783d2f59
  Stored in directory: /home/pavel/.cache/pip/wheels/c4/2c/33/fa5a7d524b90318c03454e176b442006c14ea2cfeb9337b308
Successfully built pdfminer
Installing collected packages: pdfminer
Successfully installed pdfminer-20191125
Then I saw this issue and installed pdfminer.six:

Output:
pavel@ALABAMA:~$ pip3 install pdfminer.six
Defaulting to user installation because normal site-packages is not writeable
Requirement already satisfied: pdfminer.six in ./.local/lib/python3.6/site-packages (20181108)
Requirement already satisfied: six in ./.local/lib/python3.6/site-packages (from pdfminer.six) (1.12.0)
Requirement already satisfied: pycryptodome in ./.local/lib/python3.6/site-packages (from pdfminer.six) (3.9.6)
Requirement already satisfied: sortedcontainers in ./.local/lib/python3.6/site-packages (from pdfminer.six) (2.1.0)
pavel@ALABAMA:~$
So I don't know what's really going on ... which one is imported.
#4
Uninstall both and install just pdfminer.six.
#5
Output:
pavel@ALABAMA:~$ pip3 install pdfminer.six
Defaulting to user installation because normal site-packages is not writeable
Collecting pdfminer.six
  Downloading pdfminer.six-20201018-py3-none-any.whl (5.6 MB)
     |████████████████████████████████| 5.6 MB 3.5 MB/s
Requirement already satisfied: chardet in ./.local/lib/python3.6/site-packages (from pdfminer.six) (3.0.4)
Requirement already satisfied: cryptography in ./.local/lib/python3.6/site-packages (from pdfminer.six) (2.9)
Requirement already satisfied: sortedcontainers in ./.local/lib/python3.6/site-packages (from pdfminer.six) (2.1.0)
Requirement already satisfied: cffi!=1.11.3,>=1.8 in ./.local/lib/python3.6/site-packages (from cryptography->pdfminer.six) (1.14.0)
Requirement already satisfied: six>=1.4.1 in ./.local/lib/python3.6/site-packages (from cryptography->pdfminer.six) (1.12.0)
Requirement already satisfied: pycparser in ./.local/lib/python3.6/site-packages (from cffi!=1.11.3,>=1.8->cryptography->pdfminer.six) (2.20)
Installing collected packages: pdfminer.six
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
textract 1.6.3 requires beautifulsoup4==4.8.0, but you have beautifulsoup4 4.8.1 which is incompatible.
textract 1.6.3 requires pdfminer.six==20181108, but you have pdfminer-six 20201018 which is incompatible.
Successfully installed pdfminer.six-20201018
pavel@ALABAMA:~$
Concerning the error message:
  1. Should I change beautifulsoup4 4.8.1 to beautifulsoup4 4.8.0?
  2. Should I install pdfminer.six 20181108 instead of 20201018?
#6
After pdfminer.six reinstall, the initial example works.
Thanks.
#7
Do not change anything; try whether it works, as it probably does now.
It's highly unlikely that a one-step difference in the BeautifulSoup version will break anything in this package,
as BS is not even among the required packages for pdfminer.six.
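The two conflict lines in the pip output name textract as the package with the exact pins, not pdfminer.six. If you want to see those pins from Python, a rough sketch could look like this. The helper name pinned_requirements is invented for this post, the package names are taken from the pip output above, and importlib.metadata assumes Python 3.8 or newer.

```python
from importlib import metadata  # standard library, Python 3.8+


def pinned_requirements(dist_name, wanted=("pdfminer", "beautifulsoup4")):
    """Return the declared requirement strings of dist_name that mention
    any of the names in wanted (hypothetical helper, plain substring match)."""
    try:
        # requires() returns None when the distribution declares no requirements.
        requirements = metadata.requires(dist_name) or []
    except metadata.PackageNotFoundError:
        # The distribution is not installed in this environment.
        return []
    return [
        req for req in requirements
        if any(name in req.lower() for name in wanted)
    ]


if __name__ == "__main__":
    for line in pinned_requirements("textract"):
        print(line)
```

An empty result means either that textract is not installed in the environment you ran this in, or that it declares no requirement matching those names.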

Here is a quick tutorial on using a virtual environment; it's built into Python and takes just a minute to set up.
This avoids dependency conflicts like the one above, as nothing you installed before is looked at or used; everything in the environment is new.
tom@tom-VirtualBox:~$ python -V
Python 3.9.1

# Make 
tom@tom-VirtualBox:~$ python -m venv pdf_env
# Cd in 
tom@tom-VirtualBox:~$ cd pdf_env/
# Activate
tom@tom-VirtualBox:~/pdf_env$ source bin/activate
# Install
(pdf_env) tom@tom-VirtualBox:~/pdf_env$ pip install pdfminer.six
Collecting pdfminer.six .....
Successfully installed cffi-1.14.4 chardet-4.0.0 cryptography-3.3.1 pdfminer.six-20201018 pycparser-2.20 six-1.15.0 sortedcontainers-2.3.0
Test that it works.
(pdf_env) tom@tom-VirtualBox:~/pdf_env$ python
Python 3.9.1 (default, Jan 25 2021, 15:34:59) 
[GCC 7.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from pdfminer import high_level
>>> 
>>> high_level.extract_text
<function extract_text at 0x7fe2273cc310>

>>> help(high_level.extract_text)
.....
When you run pip list, only the packages in this environment are shown, as it is isolated from what's installed at the OS level.
(pdf_env) tom@tom-VirtualBox:~/pdf_env$ pip list
Package          Version
---------------- --------
cffi             1.14.4
chardet          4.0.0
cryptography     3.3.1
pdfminer.six     20201018
pip              21.0
pycparser        2.20
setuptools       49.2.1
six              1.15.0
sortedcontainers 2.3.0
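If you are ever unsure whether a script is really running inside the environment, a small check like this sketch can tell you. The function name in_virtualenv is made up here. It relies on the fact that a venv-created environment sets sys.prefix to the environment directory while sys.base_prefix keeps pointing at the base Python installation.

```python
import sys


def in_virtualenv():
    """Return True when the interpreter runs inside a venv-created environment."""
    # Outside a virtual environment both attributes hold the same path.
    return sys.prefix != sys.base_prefix


if __name__ == "__main__":
    print("virtual environment active:", in_virtualenv())
    print("interpreter:", sys.executable)
```

Run from the activated pdf_env, the interpreter path printed should point into the pdf_env directory.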
#8
Ok, now it works.
Thanks.