Hello,
Using the pdfminer package I faced the following problem:
>>> from pdfminer import high_level
>>> extracted_text = high_level.extract_text(full_filename_inp, "", [4])
Traceback (most recent call last):
  File "<pyshell#1>", line 1, in <module>
    extracted_text = high_level.extract_text(full_filename_inp, "", [4])
AttributeError: module 'pdfminer.high_level' has no attribute 'extract_text'
But according to the documentation, the function extract_text does exist in the pdfminer package.
Any suggestions?
Thanks
The documentation you point to is for pdfminer.six.
Since 2020, the original pdfminer has been dormant, and pdfminer.six is the fork that Euske recommends if you need an actively maintained version of pdfminer.
Which one do you have installed?
The install command for pdfminer.six is pip install pdfminer.six
First I installed pdfminer:
Output:
pavel@ALABAMA:~$ pip3 install pdfminer
Defaulting to user installation because normal site-packages is not writeable
Collecting pdfminer
Downloading pdfminer-20191125.tar.gz (4.2 MB)
|████████████████████████████████| 4.2 MB 3.4 MB/s
Requirement already satisfied: pycryptodome in ./.local/lib/python3.6/site-packages (from pdfminer) (3.9.6)
Building wheels for collected packages: pdfminer
Building wheel for pdfminer (setup.py) ... done
Created wheel for pdfminer: filename=pdfminer-20191125-py3-none-any.whl size=6141904 sha256=34b8374913c5a3d565629c16bdd698b7488acd2c912fbab3cbc1ffec783d2f59
Stored in directory: /home/pavel/.cache/pip/wheels/c4/2c/33/fa5a7d524b90318c03454e176b442006c14ea2cfeb9337b308
Successfully built pdfminer
Installing collected packages: pdfminer
Successfully installed pdfminer-20191125
Then I saw this issue and installed pdfminer.six:
Output:
pavel@ALABAMA:~$ pip3 install pdfminer.six
Defaulting to user installation because normal site-packages is not writeable
Requirement already satisfied: pdfminer.six in ./.local/lib/python3.6/site-packages (20181108)
Requirement already satisfied: six in ./.local/lib/python3.6/site-packages (from pdfminer.six) (1.12.0)
Requirement already satisfied: pycryptodome in ./.local/lib/python3.6/site-packages (from pdfminer.six) (3.9.6)
Requirement already satisfied: sortedcontainers in ./.local/lib/python3.6/site-packages (from pdfminer.six) (2.1.0)
pavel@ALABAMA:~$
So I don't know what's really going on ... which one is imported.
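One way to see which distribution is installed, and where the import actually comes from, is to ask Python itself. This is only a minimal sketch, assuming Python 3.8 or newer (importlib.metadata is not in the standard library of the 3.6 shown above); the helper names installed_versions and module_location are made up for this example.

```python
import importlib.util
from importlib import metadata  # standard library from Python 3.8 on


def installed_versions(names=("pdfminer", "pdfminer.six")):
    """Return {distribution name: version string, or None if not installed}."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


def module_location(module_name="pdfminer"):
    """Return the file the import system would load module_name from, or None."""
    spec = importlib.util.find_spec(module_name)
    return None if spec is None else spec.origin


if __name__ == "__main__":
    print(installed_versions())
    print(module_location())
```

Both distributions install a top-level package named pdfminer, so if both names report a version, the files of one may have overwritten the other.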
Uninstall both and install just pdfminer.six.
Output:
pavel@ALABAMA:~$ pip3 install pdfminer.six
Defaulting to user installation because normal site-packages is not writeable
Collecting pdfminer.six
Downloading pdfminer.six-20201018-py3-none-any.whl (5.6 MB)
|████████████████████████████████| 5.6 MB 3.5 MB/s
Requirement already satisfied: chardet in ./.local/lib/python3.6/site-packages (from pdfminer.six) (3.0.4)
Requirement already satisfied: cryptography in ./.local/lib/python3.6/site-packages (from pdfminer.six) (2.9)
Requirement already satisfied: sortedcontainers in ./.local/lib/python3.6/site-packages (from pdfminer.six) (2.1.0)
Requirement already satisfied: cffi!=1.11.3,>=1.8 in ./.local/lib/python3.6/site-packages (from cryptography->pdfminer.six) (1.14.0)
Requirement already satisfied: six>=1.4.1 in ./.local/lib/python3.6/site-packages (from cryptography->pdfminer.six) (1.12.0)
Requirement already satisfied: pycparser in ./.local/lib/python3.6/site-packages (from cffi!=1.11.3,>=1.8->cryptography->pdfminer.six) (2.20)
Installing collected packages: pdfminer.six
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
textract 1.6.3 requires beautifulsoup4==4.8.0, but you have beautifulsoup4 4.8.1 which is incompatible.
textract 1.6.3 requires pdfminer.six==20181108, but you have pdfminer-six 20201018 which is incompatible.
Successfully installed pdfminer.six-20201018
pavel@ALABAMA:~$
Concerning the error message:
- Should I change beautifulsoup4 4.8.1 to beautifulsoup4 4.8.0?
- Should I install 20181108 instead of 20201018?
After reinstalling pdfminer.six, the initial example works.
Thanks.
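For reference, the same call can be written with keyword arguments, which makes the positional "" and [4] easier to read. This is a minimal sketch, assuming pdfminer.six is the installed distribution; extract_pages is a hypothetical wrapper name, while password and page_numbers are parameters of pdfminer.six's extract_text (page numbers are zero-indexed, so [4] is the fifth page).

```python
def extract_pages(pdf_path, page_numbers, password=""):
    """Extract text from the given zero-indexed pages of a PDF file."""
    try:
        # Imported lazily so a missing or outdated install gives a clear message.
        from pdfminer.high_level import extract_text
    except ImportError as exc:
        raise RuntimeError(
            "pdfminer.high_level.extract_text is not available; "
            "install pdfminer.six (pip install pdfminer.six)"
        ) from exc
    return extract_text(pdf_path, password=password, page_numbers=page_numbers)


if __name__ == "__main__":
    import sys

    # Usage: python this_script.py some_file.pdf
    print(extract_pages(sys.argv[1], page_numbers=[4]))
```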
Do not change anything; try whether it works, as it probably does now.
It's highly unlikely that a one-patch-version difference in BeautifulSoup will break anything in this package,
as BS is not even among the required packages for pdfminer.six.
Here is a quick tutorial on using a virtual environment; it's built into Python and just takes a minute to set up.
This sidesteps dependency conflicts like the ones above, as nothing you installed before is looked at or used; everything is installed fresh.
tom@tom-VirtualBox:~$ python -V
Python 3.9.1
# Make
tom@tom-VirtualBox:~$ python -m venv pdf_env
# Cd in
tom@tom-VirtualBox:~$ cd pdf_env/
# Activate
tom@tom-VirtualBox:~/pdf_env$ source bin/activate
# Install
(pdf_env) tom@tom-VirtualBox:~/pdf_env$ pip install pdfminer.six
Collecting pdfminer.six .....
Successfully installed cffi-1.14.4 chardet-4.0.0 cryptography-3.3.1 pdfminer.six-20201018 pycparser-2.20 six-1.15.0 sortedcontainers-2.3.0
Test that it works.
(pdf_env) tom@tom-VirtualBox:~/pdf_env$ python
Python 3.9.1 (default, Jan 25 2021, 15:34:59)
[GCC 7.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from pdfminer import high_level
>>>
>>> high_level.extract_text
<function extract_text at 0x7fe2273cc310>
>>> help(high_level.extract_text)
.....
When you run pip list, only the packages in this environment are shown, as it's isolated from what's installed at the OS level.
(pdf_env) tom@tom-VirtualBox:~/pdf_env$ pip list
Package Version
---------------- --------
cffi 1.14.4
chardet 4.0.0
cryptography 3.3.1
pdfminer.six 20201018
pip 21.0
pycparser 2.20
setuptools 49.2.1
six 1.15.0
sortedcontainers 2.3.0
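The same check can be done from inside Python, which helps when it is unclear which interpreter a script runs under. This is a rough sketch, assuming Python 3.8 or newer; in_virtualenv and visible_packages are names invented for this example, and the listing is only an approximation of what pip list prints.

```python
import sys
from importlib import metadata  # standard library from Python 3.8 on


def in_virtualenv():
    """True when the interpreter runs inside a venv-style environment."""
    # In a venv, sys.prefix points at the environment and
    # sys.base_prefix at the Python it was created from.
    return sys.prefix != getattr(sys, "base_prefix", sys.prefix)


def visible_packages():
    """Return {name: version} for the distributions this interpreter can see."""
    packages = {}
    for dist in metadata.distributions():
        meta = dist.metadata
        name = meta.get("Name") if meta is not None else None
        if name:  # skip entries with missing or broken metadata
            packages[name] = dist.version
    return packages


if __name__ == "__main__":
    print("virtual environment:", in_virtualenv())
    for name, version in sorted(visible_packages().items()):
        print(name, version)
```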
Ok, now it works.
Thanks.