How can we decode files in a compressed format like
xxxxx.tar.gz?
I have downloaded files that appear to be encoded, but I want to read the text data in them for some special reasons.
Error:
�;'Kjl�7��Ť��!���p�����`��(�D��Y�+F\�t{���һ�Eb>݊���3^N�~�Z\RU+@��
c�!��&+>ݒ��4/�m�;Q���p�$�)m�����Q�a�)�1 �,�P�$��.�k��fT������� ���sG
It is a gzipped tar file. You can read it with the tarfile module from the standard library.
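As a rough illustration of the tarfile approach, here is a minimal sketch that lists the entries of a gzipped tar archive and decodes the start of one of them without unpacking anything to disk. The helper names `list_members` and `read_member_text`, the `max_chars` limit, and the default `utf-8` encoding are my own assumptions for the example, not something from the module or from this thread.

```python
import tarfile


def list_members(archive_path):
    """Return the names of all entries in a .tar.gz archive."""
    with tarfile.open(archive_path, "r:gz") as archive:
        return archive.getnames()


def read_member_text(archive_path, member_name, encoding="utf-8", max_chars=1000):
    """Decode and return at most max_chars characters from one archived file.

    The encoding default is an assumption; pass the real one if you know it.
    """
    with tarfile.open(archive_path, "r:gz") as archive:
        extracted = archive.extractfile(member_name)
        if extracted is None:  # directories and special files carry no data
            raise ValueError(f"{member_name!r} is not a regular file")
        with extracted:
            # Read a bounded number of bytes (up to 4 bytes per character)
            raw = extracted.read(max_chars * 4)
    # Undecodable bytes are replaced instead of raising an error
    return raw.decode(encoding, errors="replace")[:max_chars]


# Hypothetical usage (file names depend on your archive):
# names = list_members("some_archive.tar.gz")
# print(read_member_text("some_archive.tar.gz", names[0]))
```

Reading only a bounded chunk keeps memory use small even when the archived file is very large.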
@Gribouillis I have a file named
news_sohusite_xml.full.tar.gz. I just need to read the text data from the file with the help of software, not code.
There are many programs that can do this; I use 7-zip, for example.
From the command line you can use tar. On Windows you may need to download Tar for Windows, or just use cmder:
G:\div_code
λ tar -xvf holdem_calc-1.0.0.tar.gz
holdem_calc-1.0.0/
holdem_calc-1.0.0/PKG-INFO
.....
From Python, as posted in the link, it is not hard to do.
Extract all files to output_dir:
import tarfile
import os

# Create the target directory (no error if it already exists)
os.makedirs('output_dir', exist_ok=True)
with tarfile.open('holdem_calc-1.0.0.tar.gz', 'r') as t:
    t.extractall('output_dir')
print(os.listdir('output_dir'))
Get a specific file:
import tarfile
import os

os.makedirs('outdir', exist_ok=True)
with tarfile.open('holdem_calc-1.0.0.tar.gz', 'r') as t:
    # print(t.getmembers())  # uncomment to list the archive contents
    t.extractall(
        'outdir',
        members=[t.getmember('holdem_calc-1.0.0/README.md')],
    )
If your OS is Linux, simply run the following command in a terminal:
tar xvzf news_sohusite_xml.full.tar.gz
@snippsat and Gribouillis, the code and methods above do not work on this type of file: ['./news_sohusite_xml.dat']
The file is in _xml _url format. My goal is to read this type of file as text. It was downloaded from here:
http://www.sogou.com/labs/resource/cs.php
I cannot read the text data by applying the code above. I tried other methods as well, but the .tar.gz file is still not readable as text.
I understand that you uncompressed the .tar.gz file and obtained a .dat file. I tried to do the same with the short version of the file on the same site (the 110 KB file instead of the 600 MB one) and I obtained a .dat file as well. This file is simply an XML file containing a sequence of entries as described on the web site, that is to say:
Output:
<doc>
<url>page URL</url>
<docno>page ID</docno>
<contenttitle>page title</contenttitle>
<content>page content</content>
</doc>
You can open this .dat file with any application that can open an XML file, for example a text editor (I opened it with kwrite in Kubuntu Linux). On the other hand, as the complete file is large (more than 600 MB), it may be difficult for an editor to load and manipulate the whole content. You could perhaps cut the file by extracting a certain number of entries. For example, you could read the file up to the first </doc> line, which gives you the first entry, and so on. You can also process the file with a Python program that reads XML.
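The idea of cutting the file at each </doc> line could be sketched as follows. This reads the .dat file line by line, so the whole 600 MB never has to be in memory at once. The function names `iter_docs`, `get_field` and `first_entries` are made up for this example, and the `gb18030` default encoding is only my assumption, since the thread does not say which encoding the file uses. I used plain line matching instead of an XML parser so that the sketch does not depend on the file having a single root element.

```python
import itertools
import re


def iter_docs(path, encoding="gb18030"):
    """Yield each <doc>...</doc> entry of the file as one string.

    The default encoding is an assumption; change it if the text looks wrong.
    """
    buffer = []
    with open(path, encoding=encoding, errors="replace") as handle:
        for line in handle:
            buffer.append(line)
            if line.strip() == "</doc>":  # closing tag ends the current entry
                yield "".join(buffer)
                buffer = []


def get_field(entry, tag):
    """Return the text between <tag> and </tag>, or None if the tag is absent."""
    match = re.search(rf"<{tag}>(.*?)</{tag}>", entry, re.DOTALL)
    return match.group(1) if match else None


def first_entries(path, count, encoding="gb18030"):
    """Return the first `count` entries without reading the rest of the file."""
    return list(itertools.islice(iter_docs(path, encoding), count))


# Hypothetical usage on the extracted file:
# for entry in first_entries("news_sohusite_xml.dat", 3):
#     print(get_field(entry, "contenttitle"))
#     print(get_field(entry, "content"))
```

To save a smaller sample, you could write the strings returned by `first_entries` to a new file and open that in a text editor.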