Python Forum
pdfminer package: can't find extract_text function


pdfminer package: can't find extract_text function - Pavel_47 - Jan-25-2021

Hello,

Using the pdfminer package, I ran into the following problem:

>>> from pdfminer import high_level
>>> extracted_text = high_level.extract_text(full_filename_inp, "", [4])
Traceback (most recent call last):
  File "<pyshell#1>", line 1, in <module>
    extracted_text = high_level.extract_text(full_filename_inp, "", [4])
AttributeError: module 'pdfminer.high_level' has no attribute 'extract_text'
But according to the documentation, the function extract_text does exist in the pdfminer package.
Any suggestions?
Thanks


RE: pdfminer package: can't find extract_text function - Larz60+ - Jan-25-2021

The documentation that you point to is for pdfminer.six.

Since 2020, the original pdfminer has been dormant, and pdfminer.six is the fork that Euske recommends if you need an actively maintained version of pdfminer.

Which do you have installed?

To install pdfminer.six, run: pip install pdfminer.six
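
To answer the "which do you have installed?" question from Python itself, something like the following might help. It is only a sketch: it assumes Python 3.8 or newer (importlib.metadata is not in the standard library before that), and installed_version is a made-up helper name. Both distributions provide the same import package, pdfminer, so the distribution name is what tells them apart.

```python
from importlib import metadata  # standard library since Python 3.8


def installed_version(dist_name):
    """Return the installed version of a distribution, or None if it is absent.

    installed_version is a hypothetical helper, not part of pdfminer or pip.
    """
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


# Both distributions install an import package called "pdfminer",
# so check them by distribution name.
for name in ("pdfminer", "pdfminer.six"):
    print(name, "->", installed_version(name))
```

If both lines print a version, both distributions are installed for this interpreter.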


RE: pdfminer package: can't find extract_text function - Pavel_47 - Jan-25-2021

First I installed pdfminer:

Output:
pavel@ALABAMA:~$ pip3 install pdfminer
Defaulting to user installation because normal site-packages is not writeable
Collecting pdfminer
  Downloading pdfminer-20191125.tar.gz (4.2 MB)
     |████████████████████████████████| 4.2 MB 3.4 MB/s
Requirement already satisfied: pycryptodome in ./.local/lib/python3.6/site-packages (from pdfminer) (3.9.6)
Building wheels for collected packages: pdfminer
  Building wheel for pdfminer (setup.py) ... done
  Created wheel for pdfminer: filename=pdfminer-20191125-py3-none-any.whl size=6141904 sha256=34b8374913c5a3d565629c16bdd698b7488acd2c912fbab3cbc1ffec783d2f59
  Stored in directory: /home/pavel/.cache/pip/wheels/c4/2c/33/fa5a7d524b90318c03454e176b442006c14ea2cfeb9337b308
Successfully built pdfminer
Installing collected packages: pdfminer
Successfully installed pdfminer-20191125
Then I saw this issue and installed pdfminer.six:

Output:
pavel@ALABAMA:~$ pip3 install pdfminer.six
Defaulting to user installation because normal site-packages is not writeable
Requirement already satisfied: pdfminer.six in ./.local/lib/python3.6/site-packages (20181108)
Requirement already satisfied: six in ./.local/lib/python3.6/site-packages (from pdfminer.six) (1.12.0)
Requirement already satisfied: pycryptodome in ./.local/lib/python3.6/site-packages (from pdfminer.six) (3.9.6)
Requirement already satisfied: sortedcontainers in ./.local/lib/python3.6/site-packages (from pdfminer.six) (2.1.0)
pavel@ALABAMA:~$
So I don't know what's really going on, or which one is actually imported.
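
One way to see which copy of a package Python would actually import is to ask the import system where it finds the module. This is a rough sketch using only the standard library; module_origin is a made-up helper name, and the path it prints will depend on your machine.

```python
import importlib.util


def module_origin(module_name):
    """Return the file a top-level module would be loaded from, or None.

    module_origin is a hypothetical helper; it only locates the module
    and does not import it.
    """
    spec = importlib.util.find_spec(module_name)
    return None if spec is None else spec.origin


# Prints a path inside site-packages if some "pdfminer" package is
# importable, or None if nothing by that name can be found.
print(module_origin("pdfminer"))
```

The printed path shows which site-packages directory the package comes from, which helps when user-level and system-level installs are mixed.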


RE: pdfminer package: can't find extract_text function - buran - Jan-25-2021

Uninstall both and install just pdfminer.six.


RE: pdfminer package: can't find extract_text function - Pavel_47 - Jan-25-2021

Output:
pavel@ALABAMA:~$ pip3 install pdfminer.six
Defaulting to user installation because normal site-packages is not writeable
Collecting pdfminer.six
  Downloading pdfminer.six-20201018-py3-none-any.whl (5.6 MB)
     |████████████████████████████████| 5.6 MB 3.5 MB/s
Requirement already satisfied: chardet in ./.local/lib/python3.6/site-packages (from pdfminer.six) (3.0.4)
Requirement already satisfied: cryptography in ./.local/lib/python3.6/site-packages (from pdfminer.six) (2.9)
Requirement already satisfied: sortedcontainers in ./.local/lib/python3.6/site-packages (from pdfminer.six) (2.1.0)
Requirement already satisfied: cffi!=1.11.3,>=1.8 in ./.local/lib/python3.6/site-packages (from cryptography->pdfminer.six) (1.14.0)
Requirement already satisfied: six>=1.4.1 in ./.local/lib/python3.6/site-packages (from cryptography->pdfminer.six) (1.12.0)
Requirement already satisfied: pycparser in ./.local/lib/python3.6/site-packages (from cffi!=1.11.3,>=1.8->cryptography->pdfminer.six) (2.20)
Installing collected packages: pdfminer.six
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
textract 1.6.3 requires beautifulsoup4==4.8.0, but you have beautifulsoup4 4.8.1 which is incompatible.
textract 1.6.3 requires pdfminer.six==20181108, but you have pdfminer-six 20201018 which is incompatible.
Successfully installed pdfminer.six-20201018
pavel@ALABAMA:~$
Regarding the error message:
  1. Should I change beautifulsoup4 4.8.1 to beautifulsoup4 4.8.0?
  2. Should I install 20181108 instead of 20201018?



RE: pdfminer package: can't find extract_text function - Pavel_47 - Jan-25-2021

After reinstalling pdfminer.six, the initial example works.
Thanks.


RE: pdfminer package: can't find extract_text function - snippsat - Jan-25-2021

Do not change anything; just try it, as it probably works now.
It's highly unlikely that a one-step difference in the BeautifulSoup version will break anything in this package,
as BS is not even among the required packages for pdfminer.six.

Here is a quick tutorial on using a virtual environment; it's built into Python and takes just a minute to set up.
This avoids these dependency conflicts, as nothing you installed before is looked at or used; everything is new.
tom@tom-VirtualBox:~$ python -V
Python 3.9.1

# Make 
tom@tom-VirtualBox:~$ python -m venv pdf_env
# Cd in 
tom@tom-VirtualBox:~$ cd pdf_env/
# Activate
tom@tom-VirtualBox:~/pdf_env$ source bin/activate
# Install
(pdf_env) tom@tom-VirtualBox:~/pdf_env$ pip install pdfminer.six
Collecting pdfminer.six .....
Successfully installed cffi-1.14.4 chardet-4.0.0 cryptography-3.3.1 pdfminer.six-20201018 pycparser-2.20 six-1.15.0 sortedcontainers-2.3.0
Test that it works.
(pdf_env) tom@tom-VirtualBox:~/pdf_env$ python
Python 3.9.1 (default, Jan 25 2021, 15:34:59) 
[GCC 7.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from pdfminer import high_level
>>> 
>>> high_level.extract_text
<function extract_text at 0x7fe2273cc310>

>>> help(high_level.extract_text)
.....
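
The interactive check above could also be scripted, for example to confirm that the installed pdfminer exposes extract_text before relying on it. This is only a sketch; module_has is a made-up helper name, and it returns False both when the module is missing and when the attribute is missing.

```python
import importlib


def module_has(module_name, attr):
    """Return True if the module imports and exposes attr, else False.

    module_has is a hypothetical helper, not part of pdfminer.
    """
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        # The module (or its parent package) is not installed here.
        return False
    return hasattr(module, attr)


# False here corresponds to the AttributeError from the first post
# (or to pdfminer not being installed at all).
print(module_has("pdfminer.high_level", "extract_text"))
```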
When you run pip list, only the packages in this environment are shown, as it's isolated from what's installed at the OS level.
(pdf_env) tom@tom-VirtualBox:~/pdf_env$ pip list
Package          Version
---------------- --------
cffi             1.14.4
chardet          4.0.0
cryptography     3.3.1
pdfminer.six     20201018
pip              21.0
pycparser        2.20
setuptools       49.2.1
six              1.15.0
sortedcontainers 2.3.0
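
The same isolation can be checked from inside Python. The sketch below assumes Python 3.8+ and an environment created with the built-in venv module (older versions of the third-party virtualenv tool mark environments differently); in_virtualenv and installed_distributions are made-up helper names.

```python
import sys
from importlib import metadata  # standard library since Python 3.8


def in_virtualenv():
    """True if this interpreter runs inside a venv-style environment."""
    # In a venv, sys.prefix points at the environment while
    # sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


def installed_distributions():
    """Return sorted (name, version) pairs visible to this interpreter."""
    pairs = set()
    for dist in metadata.distributions():
        name = dist.metadata["Name"]
        if name:  # skip entries with broken or missing metadata
            pairs.add((name, dist.version))
    return sorted(pairs, key=lambda pair: pair[0].lower())


print("inside a virtual environment:", in_virtualenv())
for name, version in installed_distributions():
    print(name, version)
```

Run inside the activated pdf_env, the listing should roughly match what pip list shows there.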



RE: pdfminer package: can't find extract_text function - Pavel_47 - Jan-25-2021

Ok, now it works.
Thanks.