Python Forum - How can we decode an encoded file in xml/url format

How can we transcode encoded files like xxxxx.tar.gz?
I have downloaded files that appear in an encoded format, but I want to read the text data for some special reasons.

Error: (the file shows only unreadable binary characters when opened as text)

It is a gzipped tar file. You can read it with the tarfile module from the standard library.

@Gribouillis please give me more details on how.

Follow the examples given in this blog page https://pymotw.com/3/tarfile/. Use the mode 'r:gz' to open your compressed archive file.
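
As a rough sketch of what opening with 'r:gz' looks like, the snippet below lists the name and size of every member of a gzipped tar archive. The helper name describe_archive and the sample file names are made up for illustration; the demo builds a tiny throwaway archive so it runs without the real download.

```python
import tarfile
import tempfile
from pathlib import Path


def describe_archive(archive_path):
    """Return (name, size_in_bytes) for every member of a .tar.gz archive."""
    # 'r:gz' opens a gzip-compressed tar archive for reading
    with tarfile.open(archive_path, "r:gz") as archive:
        return [(member.name, member.size) for member in archive.getmembers()]


# Demo on a throwaway archive (hypothetical names, not the real corpus file).
with tempfile.TemporaryDirectory() as tmp:
    data_file = Path(tmp) / "sample.dat"
    data_file.write_text("<doc>\n</doc>\n", encoding="utf-8")
    archive_path = Path(tmp) / "sample.tar.gz"
    with tarfile.open(archive_path, "w:gz") as archive:
        archive.add(str(data_file), arcname="sample.dat")
    for name, size in describe_archive(archive_path):
        print(f"{name}: {size} bytes")
```

For a real archive, the call would presumably be describe_archive('news_sohusite_xml.full.tar.gz').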

@Gribouillis I have a file named news_sohusite_xml.full.tar.gz. I just need to read the text data from the file with the help of software, not code.

(Jul-24-2021, 10:31 AM) Anldra12 wrote: @Gribouillis I have a file named news_sohusite_xml.full.tar.gz. I just need to read the text data from the file with the help of software, not code.

There is plenty of software for this; for example, I use 7-zip.
From the command line you can use tar. On Windows you may need to download Tar for Windows, or just use cmder.

G:\div_code
λ tar -xvf holdem_calc-1.0.0.tar.gz
holdem_calc-1.0.0/
holdem_calc-1.0.0/PKG-INFO
.....

From Python, the approach in the linked page is not hard to use.
Extract all files to output_dir:

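import os
import tarfile

# Create the target directory; no error if it already exists
os.makedirs('output_dir', exist_ok=True)
# 'r:gz' opens a gzip-compressed tar archive for reading
with tarfile.open('holdem_calc-1.0.0.tar.gz', 'r:gz') as t:
    t.extractall('output_dir')

# Show what was extracted
print(os.listdir('output_dir'))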

Get a specific file:

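import os
import tarfile

# Create the target directory; no error if it already exists
os.makedirs('outdir', exist_ok=True)
with tarfile.open('holdem_calc-1.0.0.tar.gz', 'r:gz') as t:
    # Uncomment to see every member of the archive
    #print(t.getmembers())
    # Extract only the README; getmember() raises KeyError if the name is missing
    t.extractall('outdir',
                 members=[t.getmember('holdem_calc-1.0.0/README.md')],
                 )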
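
If the goal is only to look at the text, a member can also be read straight from the archive without writing it to disk. The sketch below is one way to peek at the first bytes of a member; read_member_head and the sample names are hypothetical, and the encoding is a parameter because the thread does not say how the real data is encoded.

```python
import io
import tarfile
import tempfile
from pathlib import Path


def read_member_head(archive_path, member_name, max_bytes=2000, encoding="utf-8"):
    """Return the first max_bytes of one archive member, decoded as text."""
    with tarfile.open(archive_path, "r:gz") as archive:
        handle = archive.extractfile(member_name)  # KeyError if the name is unknown
        if handle is None:  # directories and other non-file members carry no data
            raise ValueError(f"{member_name!r} is not a regular file")
        with handle:
            raw = handle.read(max_bytes)
    # errors="replace" keeps going if the cut falls inside a multi-byte character
    return raw.decode(encoding, errors="replace")


# Demo on a throwaway archive so the sketch runs without the real download.
with tempfile.TemporaryDirectory() as tmp:
    archive_path = Path(tmp) / "sample.tar.gz"
    payload = "<doc>\n<url>http://example.com/</url>\n</doc>\n".encode("utf-8")
    info = tarfile.TarInfo(name="sample.dat")
    info.size = len(payload)
    with tarfile.open(archive_path, "w:gz") as archive:
        archive.addfile(info, io.BytesIO(payload))
    print(read_member_head(archive_path, "sample.dat", max_bytes=20))
```

Reading only a bounded number of bytes keeps memory use small even when the member itself is very large.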

If your OS is Linux, simply run the following command in a terminal

tar xvzf news_sohusite_xml.full.tar.gz

@snippsat and @Gribouillis, the code and methods above do not work for this type of file: ['./news_sohusite_xml.dat']
The file is in _xml / _url format, and my purpose is to read it as text. It was downloaded from here: http://www.sogou.com/labs/resource/cs.php
I cannot read the text data with the above code. I tried other methods as well, but the tar.gz file is not readable as text.

I understand that you uncompressed the .tar.gz file and obtained a .dat file. I tried the same with the short version of the file on the same site (the 110 KB file instead of the 600 MB one) and I obtained a .dat file as well. This file is simply an xml file containing a sequence of entries as described on the website, that is to say

Output:
<doc>
<url>page URL</url>
<docno>page ID</docno>
<contenttitle>page title</contenttitle>
<content>page content</content>
</doc>

You can open this .dat file with any application that can open an xml file, for example a text editor (I opened it with kwrite in kubuntu linux). On the other hand, as the complete file is large (more than 600 MB), it may be difficult for an editor to load and manipulate the whole content. You could perhaps cut the file by extracting a certain number of entries. For example you could read the file until the first line </doc> and that is the first entry, etc. You can also process the file with a Python program that reads xml.
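
A minimal sketch of that idea follows: it reads the .dat file line by line, treats each line </doc> as the end of an entry, and copies the first few entries into a smaller file. The function names are made up for illustration, and the gb18030 encoding is only a guess for Chinese text; the thread does not state the real encoding, so adjust it as needed. Because the file is a sequence of entries rather than one document with a single root element, the sketch picks fields out with a simple pattern instead of a strict xml parser.

```python
import itertools
import re
import tempfile
from pathlib import Path


def iter_entries(path, encoding):
    """Yield each <doc>...</doc> entry as one string, reading line by line."""
    buffer = []
    with open(path, encoding=encoding, errors="replace") as source:
        for line in source:
            buffer.append(line)
            if line.strip() == "</doc>":  # end of the current entry
                yield "".join(buffer)
                buffer = []


def field(entry, tag):
    """Return the text between <tag> and </tag>, or None if absent."""
    match = re.search(rf"<{tag}>(.*?)</{tag}>", entry, re.DOTALL)
    return match.group(1) if match else None


def save_first_entries(src, dst, count, encoding):
    """Copy the first `count` entries of src into the smaller file dst."""
    with open(dst, "w", encoding=encoding) as out:
        for entry in itertools.islice(iter_entries(src, encoding), count):
            out.write(entry)


# Demo on a small generated file (assumed encoding: gb18030).
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "sample.dat"
    template = (
        "<doc>\n<url>http://example.com/{n}</url>\n<docno>{n}</docno>\n"
        "<contenttitle>title {n}</contenttitle>\n<content>text {n}</content>\n</doc>\n"
    )
    src.write_text("".join(template.format(n=n) for n in range(5)), encoding="gb18030")
    small = Path(tmp) / "first_two.dat"
    save_first_entries(src, small, 2, "gb18030")
    for doc in iter_entries(small, "gb18030"):
        print(field(doc, "contenttitle"), "|", field(doc, "content"))
```

Since entries are produced one at a time, the whole 600 MB file never has to fit in memory, and the smaller output file should open comfortably in a text editor.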

@Gribouillis that is all I need. Thanks!