Copy xml content from webpage and save to locally without special characters

Nik1811 · Mar-21-2024, 03:45 PM

I land on an .xml webpage that is generated from my earlier inputs (<//aep2/xml/trace/NIK243164_AI_14652732.xml>). I want to read the entire content of this XML, copy it locally, and compare it with an existing baseline XML.
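For the comparison step, one option is to canonicalize both documents before comparing, so that indentation differences do not count as changes. This is only a sketch using the standard library (`xml.etree.ElementTree.canonicalize`, Python 3.8+); `same_xml` is a hypothetical helper name and the sample documents are made up.

```python
import xml.etree.ElementTree as ET


def same_xml(xml_a: str, xml_b: str) -> bool:
    """Return True when both documents are equal after canonicalization."""
    # strip_text=True trims whitespace around text content, so pretty-printed
    # and single-line versions of the same document compare equal.
    canon_a = ET.canonicalize(xml_a, strip_text=True)
    canon_b = ET.canonicalize(xml_b, strip_text=True)
    return canon_a == canon_b


baseline = "<Declaration><MsgType>H1</MsgType></Declaration>"
generated = """<Declaration>
    <MsgType>H1</MsgType>
</Declaration>"""
changed = "<Declaration><MsgType>H2</MsgType></Declaration>"

result_same = same_xml(baseline, generated)
result_changed = same_xml(baseline, changed)
print(result_same, result_changed)
```

For files on disk, `ET.canonicalize(from_file='test.xml', strip_text=True)` reads the document directly instead of taking a string.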

I've tried the options below, and each has its own problem:

1)

with open('test.xml', 'w') as f:
    f.write(driver.page_source)

This extracts the content from the webpage but adds HTML tags at the start and end, and also adds special characters to each tag. Is there a built-in function to automatically convert the page source to XML?

2)

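import shutil
import urllib.request

headers = {'User-Agent': 'Mozilla'}
request = urllib.request.Request(url1, headers=headers)
response = urllib.request.urlopen(request)
print(response.status)  # urllib responses expose .status, not .status_code

if response.status == 200:
    with open('test.xml', 'wb') as f:
        # copy the raw response stream straight into the file
        shutil.copyfileobj(response, f)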

This fails to read the XML. I get a user-defined error back: 'there was an error generating the xml'. However, the XML is definitely generated after each script run; I can see it because I'm not running the cases headless at the moment.

3)

urllib.request.urlretrieve(url1, "test.xml")

Similar issue as seen in 2)
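On option 1: if the special characters in the saved file are HTML entities such as `&lt;` and `&gt;` (an assumption, since the saved file is not shown here), the standard library's `html.unescape` can turn them back into markup. The sample string below is hypothetical, and removing the wrapper tags that the browser adds would still be a separate step.

```python
import html

# Hypothetical fragment of driver.page_source in which the XML was escaped.
escaped = "&lt;MsgType&gt;H1&lt;/MsgType&gt;"

# html.unescape converts named and numeric character references back to text.
restored = html.unescape(escaped)
print(restored)
```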

Please help with a solution to read a webpage that has an .xml extension, copy just the content, and write it to another file.

The webpage content looks like this:

<?xml version='1.0' encoding='UTF-8'?><IM415 xmlns="http://www.ros.ie/schemas/customs/IM415"> <Declaration> <MsgType>H1</MsgType> <DeclarationType_1_1>IM</DeclarationType_1_1> <AdditionalDeclarationType_1_2>A</AdditionalDeclarationType_1_2> <LRN_2_5>NIK243172_16K0O3</LRN_2_5> <ValuationInformation> <InvoiceCurrency_4_10>EUR</InvoiceCurrency_4_10> <InvoiceAmount_4_11>5000</InvoiceAmount_4_11> <InternalCurrency_4_12>EUR</InternalCurrency_4_12> </ValuationInformation> <GoodsInformation> <GrossMass_6_5>300</GrossMass_6_5> <TotalPackageNumber_6_18>15</TotalPackageNumber_6_18> </GoodsInformation> <TransportInformation> <BorderTransportMode_7_4>3</BorderTransportMode_7_4> <ActiveBorderTransportMeansNationality_7_15>IE</ActiveBorderTransportMeansNationality_7_15> </TransportInformation> <CustomsOffices> <PresentationCustomsOffice_5_26>IEDUB100</PresentationCustomsOffice_5_26> <CustomsOfficeLodgement>IEDUB100</CustomsOfficeLodgement> </CustomsOffices> <Parties> <Declarant> <Declarant_3_18 xmlns="">IE8218454B</Declarant_3_18> </Declarant> <Representative> <Representative_3_20 xmlns="">IE8218454B</Representative_3_20> </Representative> <PersonPayingCustomsDuty_3_46>IE9726356R</PersonPayingCustomsDuty_3_46> </Parties> <PreferredPaymentMethod_4_8>E</PreferredPaymentMethod_4_8> </Declaration> <GoodsShipment> <DocumentsAuthorisations> <AdditionalInformation_2_2> <AdditionalInformationCode xmlns="">00500</AdditionalInformationCode> </AdditionalInformation_2_2> <ProducedDocumentsWritingOff_2_03> <DocumentType xmlns="">1D24</DocumentType> <DocumentIdentifier xmlns="">202403211000</DocumentIdentifier> </ProducedDocumentsWritingOff_2_03> <ProducedDocumentsWritingOff_2_03> <DocumentType xmlns="">1D94</DocumentType> <DocumentIdentifier xmlns="">9214991</DocumentIdentifier> </ProducedDocumentsWritingOff_2_03> <ProducedDocumentsWritingOff_2_03>

***snippsat*** · Mar-21-2024, 04:44 PM

Here is an example; use Requests rather than urllib.

import requests

url = "https://www.w3schools.com/xml/plant_catalog.xml"
response = requests.get(url)
with open('plant_catalog.xml', 'wb') as fp:
    fp.write(response.content)

Beautiful Soup is commonly used in combination with this.

import requests
from bs4 import BeautifulSoup

url = "https://www.w3schools.com/xml/plant_catalog.xml"
response = requests.get(url)
soup = BeautifulSoup(response.content, 'xml')
first_common = soup.find('COMMON')
print(first_common.text)
# The whole plant_catalog.xml
#print(soup)

# Save to disk
with open('plant_catalog.xml', 'w') as fp:
    fp.write(soup.prettify())

Output:
Bloodroot

Nik1811 · Mar-21-2024, 05:17 PM

Many thanks for your reply.

I'm looking to print the entire plant-catalog and copy it to disk. With the current solution it's just printing the first line, with the below error.

C:\Users\Nikita\PycharmProjects\pythonProject\AIS_Import.py:90: MarkupResemblesLocatorWarning: The input looks more like a filename than markup. You may want to open this file and pass the filehandle into Beautiful Soup.
soup = BeautifulSoup(response.content, 'xml')
<?xml version="1.0" encoding="utf-8"?>

Can you please suggest how to print the entire content of the catalog with tags?

***snippsat*** · Mar-21-2024, 05:44 PM

(Mar-21-2024, 05:17 PM)Nik1811 Wrote: I'm looking to print the entire plant-catalog and copy it to disk. With the current solution it's just printing the first line, with the below error.

No, it does not print just the first line, so the error must be on your side.
Here is a notebook you can look at; for this task you only need Requests.

Nik1811 · Mar-21-2024, 06:27 PM

Yep, I suspect it has something to do with my xml.

I get https://www.********/aep2/xml/sad/NIK243179-AI.xml as my generated XML:

<?xml version='1.0' encoding='UTF-8'?><IM415 xmlns="http://www.ros.ie/schemas/customs/IM415">
<Declaration>
<MsgType>H1</MsgType>
<DeclarationType_1_1>IM</DeclarationType_1_1>
<AdditionalDeclarationType_1_2>A</AdditionalDeclarationType_1_2>
<LRN_2_5>NIK243179_16KqVM</LRN_2_5>
<ValuationInformation>
<InvoiceCurrency_4_10>EUR</InvoiceCurrency_4_10>
.
.

1) With the Beautiful Soup solution, I could print just the first line, with this warning:
MarkupResemblesLocatorWarning: The input looks more like a filename than markup. You may want to open this file and pass the filehandle into Beautiful Soup.
soup = BeautifulSoup(response.content, 'xml')
<?xml version="1.0" encoding="utf-8"?>

2) With requests, my request fails and I get an error saying:
'There was an issue generating the XML for sad/NIK243179-AI.xml'

It's strange that I get this error even though the actual XML gets generated. 'requests' is not reading the actual content; it gets a different response and prints the error instead.

Do you think there could be an encoding issue, or a problem with the way the XML is structured? If so, how could I parse such XML files, and what would be the best solution?
My assumption is that BeautifulSoup should have a solution, but I'm not sure how.
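One possible explanation, which nothing in this thread confirms, is that the server only generates the XML for the logged-in browser session, so a fresh request made without that session's cookies receives the error text instead. A minimal sketch of carrying the Selenium cookies over is shown below; `build_cookie_header` is a hypothetical helper and the sample cookie values are made up.

```python
def build_cookie_header(selenium_cookies):
    """Join Selenium-style cookie dicts into a single Cookie header value."""
    return "; ".join(f"{c['name']}={c['value']}" for c in selenium_cookies)


# driver.get_cookies() returns a list of dicts shaped like these (values made up).
sample_cookies = [
    {"name": "JSESSIONID", "value": "abc123"},
    {"name": "lang", "value": "en"},
]

headers = {"User-Agent": "Mozilla", "Cookie": build_cookie_header(sample_cookies)}
print(headers["Cookie"])

# With a live browser session (not run here):
# headers["Cookie"] = build_cookie_header(driver.get_cookies())
# response = requests.get(driver.current_url, headers=headers)
```

If the response still contains the error text with the cookies attached, the cause lies elsewhere and this assumption can be dropped.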

***snippsat*** · (This post was last modified: Mar-21-2024, 10:13 PM by snippsat.)

(Mar-21-2024, 06:27 PM)Nik1811 Wrote: I get ''https://www.********/aep2/xml/sad/NIK243179-AI.xml as my generated xml

Post your code. What are you generating this .xml from?
Then you just need to save the content of the .xml, not use a URL as in my demos.
If you run my code (no changes), does that work?

Pedroski55 · Mar-22-2024, 06:45 AM

@snippsat

I like to try out advice from experts, to see how it works. I tried your code, but what I get is an almost exact copy of my webpage.

The output has more lines, because every html tag is on a separate line.

Is that what is supposed to happen?

My html file has a few lines of PHP at the top, but is mainly html.

The output opens as .xml in my browser, not as a webpage.

Strangely, the </head> tag turns up at the bottom, between </body> and </html>

The <!DOCTYPE html> tag is not in the xml file, but there does not seem to be an xml root tag.

from bs4 import BeautifulSoup
 
URL = "/var/www/html/22BE1cw/22BE1sW1.html.php"
savepath = '/home/pedro/tmp/html2xml.xml'
with open(URL, "r") as localfile:
    html_content = localfile.read()
soup = BeautifulSoup(html_content, 'xml')
# Save to disk
with open(savepath, 'w') as fp:
    fp.write(soup.prettify())

Nik1811 · (This post was last modified: Mar-22-2024, 11:41 AM by Nik1811.)

Your code works absolutely fine with the w3schools link.

But for my .xml, I can read just the first line with the below error:

MarkupResemblesLocatorWarning: The input looks more like a filename than markup. You may want to open this file and pass the filehandle into Beautiful Soup.
soup = BeautifulSoup(response.content, 'xml')
<?xml version="1.0" encoding="utf-8"?>

A few points:

1. What is markup? How do we open this and resolve the error?
2. Am I not pointing to the root element? When I compare the w3schools XML with my XML, the first line differs: my XML header is extended (<?xml version='1.0' encoding='UTF-8'?><IM415 xmlns="http://www.ros.ie/schemas/customs/IM415">). See below.
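On the first point: "markup" means the tagged text itself (the XML or HTML). Beautiful Soup emits `MarkupResemblesLocatorWarning` when the string it receives looks more like a filename or URL than like tagged text. A plausible reading, not confirmed here, is that `response.content` held the server's error message rather than the XML. A small check before parsing, using only the standard library, might look like this; `is_well_formed_xml` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET


def is_well_formed_xml(data: bytes) -> bool:
    """Return True if data parses as a complete XML document."""
    try:
        ET.fromstring(data)
    except ET.ParseError:
        return False
    return True


# Error text of the kind the server returned, and a tiny XML document.
error_body = b"There was an issue generating the XML for sad/NIK243179-AI.xml"
xml_body = b"<?xml version='1.0' encoding='UTF-8'?><IM415><MsgType>H1</MsgType></IM415>"

error_ok = is_well_formed_xml(error_body)
xml_ok = is_well_formed_xml(xml_body)
print(error_ok, xml_ok)
```

Running such a check on `response.content` before handing it to Beautiful Soup would show whether the download itself is the problem.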

Here's my code:

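import requests
from bs4 import BeautifulSoup

# driver is the Selenium WebDriver already showing the generated XML page
response = requests.get(driver.current_url)
soup = BeautifulSoup(response.content, 'xml')
print(soup)

# Save to disk
with open('test.xml', 'w') as fp:
    fp.write(soup.prettify())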

My Xml looks like below:

<?xml version='1.0' encoding='UTF-8'?><IM415 xmlns="http://www.ros.ie/schemas/customs/IM415">
<Declaration>
<MsgType>H1</MsgType>
<DeclarationType_1_1>IM</DeclarationType_1_1>
<AdditionalDeclarationType_1_2>A</AdditionalDeclarationType_1_2>
<LRN_2_5>NIK243104_16nUlp</LRN_2_5>
<ValuationInformation>
<InvoiceCurrency_4_10>AFA</InvoiceCurrency_4_10>
<InvoiceAmount_4_11>5000</InvoiceAmount_4_11>
<InternalCurrency_4_12>AFA</InternalCurrency_4_12>
</ValuationInformation>
<GoodsInformation>
<GrossMass_6_5>33300</GrossMass_6_5>
<TotalPackageNumber_6_18>1665</TotalPackageNumber_6_18>
</GoodsInformation>

I'm expecting the following to be printed and copied to my test.xml:
<?xml version='1.0' encoding='UTF-8'?><IM415 xmlns="http://www.ros.ie/schemas/customs/IM415">
<Declaration>
<MsgType>H1</MsgType>
<DeclarationType_1_1>IM</DeclarationType_1_1>
<AdditionalDeclarationType_1_2>A</AdditionalDeclarationType_1_2>
<LRN_2_5>NIK243104_16nUlp</LRN_2_5>
<ValuationInformation>
<InvoiceCurrency_4_10>AFA</InvoiceCurrency_4_10>
<InvoiceAmount_4_11>5000</InvoiceAmount_4_11>
<InternalCurrency_4_12>AFA</InternalCurrency_4_12>
</ValuationInformation>
<GoodsInformation>
<GrossMass_6_5>33300</GrossMass_6_5>
<TotalPackageNumber_6_18>1665</TotalPackageNumber_6_18>
</GoodsInformation>

***snippsat*** · (This post was last modified: Mar-22-2024, 04:25 PM by snippsat.)

Quote:My html file has a few lines of PHP at the top, but is mainly html.

The output opens as .xml in my browser, not as a webpage

If the .xml is generated, it will be a string.
Then you need to pass that string to BS, not the URL.
Again, there is nothing I can run and test with what you have posted.

from bs4 import BeautifulSoup

# Your XML content as a string
xml_content = """
<?xml version='1.0' encoding='UTF-8'?><IM415 xmlns="http://www.ros.ie/schemas/customs/IM415">
  <Declaration>
    <MsgType>H1</MsgType>
    <DeclarationType_1_1>IM</DeclarationType_1_1>
    <AdditionalDeclarationType_1_2>A</AdditionalDeclarationType_1_2>
    <LRN_2_5>NIK243104_16nUlp</LRN_2_5>
    <ValuationInformation>
      <InvoiceCurrency_4_10>AFA</InvoiceCurrency_4_10>
      <InvoiceAmount_4_11>5000</InvoiceAmount_4_11>
      <InternalCurrency_4_12>AFA</InternalCurrency_4_12>
    </ValuationInformation>
    <GoodsInformation>
      <GrossMass_6_5>33300</GrossMass_6_5>
      <TotalPackageNumber_6_18>1665</TotalPackageNumber_6_18>
    </GoodsInformation>
</Declaration>
</IM415>
"""

# Now use lxml as parser
soup = BeautifulSoup(xml_content, 'lxml')
print(soup.prettify())

# Save the modified XML to a file
with open('test.xml', 'w', encoding='utf-8') as fp:
    fp.write(soup.prettify())
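A note on the parser choice, plus a standard-library alternative. In Beautiful Soup, `'lxml'` selects lxml's HTML parser, which lowercases tag names and wraps the content in `<html>` and `<body>`, while `'xml'` selects the XML parser. When the XML is already available as a string and should be saved as it is, parsing may not be needed at all. The sketch below only checks the string with `xml.dom.minidom` and then writes the original text; the output path and the shortened sample content are illustrative.

```python
import os
import tempfile
import xml.dom.minidom

xml_content = """
<?xml version='1.0' encoding='UTF-8'?><IM415 xmlns="http://www.ros.ie/schemas/customs/IM415">
  <Declaration>
    <MsgType>H1</MsgType>
  </Declaration>
</IM415>
"""

# The XML declaration must be the very first thing in the document,
# so strip the newline that the triple-quoted string puts before it.
cleaned = xml_content.strip()

# parseString raises xml.parsers.expat.ExpatError if the text is not well formed.
document = xml.dom.minidom.parseString(cleaned)
root_name = document.documentElement.tagName

# Write the text unchanged, so tag case and namespaces are preserved.
out_path = os.path.join(tempfile.gettempdir(), "test.xml")
with open(out_path, "w", encoding="utf-8") as fp:
    fp.write(cleaned)

print(root_name)
```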

Nik1811 · (This post was last modified: Mar-22-2024, 07:06 PM by snippsat.)

Super thanks @snippsat.

Could you please help with how to turn the URL into an XML file? Apologies for not sharing the code earlier; I'm attaching the code and the generated XML file to this reply.
