I have more than 4000 hateful Microsoft Office Word .doc files from which I need to extract some data (both numbers and words, although in most cases the fields are empty) and then convert it to a single .csv file in which every row corresponds to a single .doc file.
Here is a screenshot of one of those files; underlined in blue are some examples of what I need to extract:
![[Image: Immagine.jpg]](https://s17.postimg.org/gbtbrc0hr/Immagine.jpg)
I have uploaded the file here in case someone wants to test something:
https://ufile.io/vt2zq
Since my experience with Python is quite limited, I thought it would be useful to ask here before starting, to gather some ideas and hints.
From Google I know that there are not many options for working with this file format:
1) textract
2) convert the .doc to .docx with antiword and then use docx2txt
My idea was to:
1) open the folder and read the first .doc file
2) extract the data and handle the many empty values with a try/except
3) go to the next file
Right now I have no idea how to implement any of these steps. What would you do in my situation? How would you open the files? How would you proceed?
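A minimal sketch of the three steps above, under several assumptions: the `antiword` command-line tool is installed (as far as I can tell it prints the plain text of a .doc file rather than producing a .docx), and each value sits on a line of the form `Label: value`. The labels in `FIELDS` and the helper names `extract_text`, `parse_fields` and `build_csv` are made up for illustration, since the real labels are only visible in the screenshot. If textract is installed, `textract.process(path).decode()` could probably replace the body of `extract_text`.

```python
import csv
import re
import subprocess
from pathlib import Path

# Hypothetical labels: replace them with the ones that appear in the real documents.
FIELDS = ["Name", "Age", "Weight"]


def extract_text(doc_path):
    """Return the plain text of a .doc file by calling the antiword CLI (assumed installed)."""
    result = subprocess.run(
        ["antiword", str(doc_path)],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def parse_fields(text, fields=FIELDS):
    """Look for 'Label: value' lines; a missing label or blank value becomes ''."""
    row = {}
    for field in fields:
        pattern = rf"^\s*{re.escape(field)}\s*:[ \t]*(.*)$"
        match = re.search(pattern, text, re.MULTILINE)
        row[field] = match.group(1).strip() if match else ""
    return row


def build_csv(folder, out_path, extractor=extract_text, fields=FIELDS):
    """Write one CSV row per .doc file in folder and return the number of files processed."""
    doc_paths = sorted(Path(folder).glob("*.doc"))
    with open(out_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=["file"] + list(fields))
        writer.writeheader()
        for doc_path in doc_paths:
            try:
                text = extractor(doc_path)
            except subprocess.CalledProcessError:
                text = ""  # antiword failed on this file: keep a row with empty fields
            row = parse_fields(text, fields)
            row["file"] = doc_path.name
            writer.writerow(row)
    return len(doc_paths)
```

Calling `build_csv("docs_folder", "output.csv")` (both paths are placeholders) would then produce the single .csv file. Using a regular expression that returns an empty string for missing values avoids the need for a try/except around every field; the try/except is kept only for files that antiword cannot read.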