Python Forum
Copy xml content from webpage and save to locally without special characters
#1
I land on an .xml web page that is generated from my earlier inputs (<//aep2/xml/trace/NIK243164_AI_14652732.xml>). I want to read the entire content of this XML, save it locally, and compare it with an existing baseline XML.

I've tried the options below, and each has its own problems:

1)

with open('test.xml', 'w') as f:
    f.write(driver.page_source)

This extracts the content from the web page, but it adds HTML tags at the start and end and special characters to each tag. Is there a built-in function to convert the page source to XML automatically?

2)

headers = {'User-Agent': 'Mozilla'}
request = urllib.Request(url1, headers=headers)
response = urllib.urlopen(request)
print(response.status_code)

if response.status_code == 200:
    with open('test.xml', 'wb') as f:
        shutil.copyfileobj(response.text, f)

This is unable to read the XML. What comes back is an application-defined error, 'there was an error generating the xml'. However, the XML is definitely generated after each run of my script, as I'm not running the cases headless at the moment.

3)

urllib.request.urlretrieve(url1, "test.xml")

Similar issue as seen in 2)

Please help with a solution to read a webpage that has .xml extension, copy just the content and write it to another file.

Webpage content looks like below:

<?xml version='1.0' encoding='UTF-8'?><IM415 xmlns="http://www.ros.ie/schemas/customs/IM415"> <Declaration> <MsgType>H1</MsgType> <DeclarationType_1_1>IM</DeclarationType_1_1> <AdditionalDeclarationType_1_2>A</AdditionalDeclarationType_1_2> <LRN_2_5>NIK243172_16K0O3</LRN_2_5> <ValuationInformation> <InvoiceCurrency_4_10>EUR</InvoiceCurrency_4_10> <InvoiceAmount_4_11>5000</InvoiceAmount_4_11> <InternalCurrency_4_12>EUR</InternalCurrency_4_12> </ValuationInformation> <GoodsInformation> <GrossMass_6_5>300</GrossMass_6_5> <TotalPackageNumber_6_18>15</TotalPackageNumber_6_18> </GoodsInformation> <TransportInformation> <BorderTransportMode_7_4>3</BorderTransportMode_7_4> <ActiveBorderTransportMeansNationality_7_15>IE</ActiveBorderTransportMeansNationality_7_15> </TransportInformation> <CustomsOffices> <PresentationCustomsOffice_5_26>IEDUB100</PresentationCustomsOffice_5_26> <CustomsOfficeLodgement>IEDUB100</CustomsOfficeLodgement> </CustomsOffices> <Parties> <Declarant> <Declarant_3_18 xmlns="">IE8218454B</Declarant_3_18> </Declarant> <Representative> <Representative_3_20 xmlns="">IE8218454B</Representative_3_20> </Representative> <PersonPayingCustomsDuty_3_46>IE9726356R</PersonPayingCustomsDuty_3_46> </Parties> <PreferredPaymentMethod_4_8>E</PreferredPaymentMethod_4_8> </Declaration> <GoodsShipment> <DocumentsAuthorisations> <AdditionalInformation_2_2> <AdditionalInformationCode xmlns="">00500</AdditionalInformationCode> </AdditionalInformation_2_2> <ProducedDocumentsWritingOff_2_03> <DocumentType xmlns="">1D24</DocumentType> <DocumentIdentifier xmlns="">202403211000</DocumentIdentifier> </ProducedDocumentsWritingOff_2_03> <ProducedDocumentsWritingOff_2_03> <DocumentType xmlns="">1D94</DocumentType> <DocumentIdentifier xmlns="">9214991</DocumentIdentifier> </ProducedDocumentsWritingOff_2_03> <ProducedDocumentsWritingOff_2_03>
Reply
#2
Here is an example; use Requests rather than urllib.
import requests

url = "https://www.w3schools.com/xml/plant_catalog.xml"
response = requests.get(url)
with open('plant_catalog.xml', 'wb') as fp:
    fp.write(response.content)
It is also common to combine Requests with Beautiful Soup for this.
import requests
from bs4 import BeautifulSoup

url = "https://www.w3schools.com/xml/plant_catalog.xml"
response = requests.get(url)
soup = BeautifulSoup(response.content, 'xml')
first_common = soup.find('COMMON')
print(first_common.text)
# The whole plant_catalog.xml
#print(soup)

# Save to disk
with open('plant_catalog.xml', 'w') as fp:
    fp.write(soup.prettify())
Output:
Bloodroot
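Before saving, it can help to check that the server really sent XML and not an error page. The sketch below is only an illustration using the standard library; the helper name is_well_formed_xml is made up for this example.

```python
import xml.etree.ElementTree as ET


def is_well_formed_xml(data: bytes) -> bool:
    """Return True if data parses as XML, False otherwise."""
    try:
        ET.fromstring(data)
        return True
    except ET.ParseError:
        # Plain-text error messages and empty responses end up here
        return False


print(is_well_formed_xml(b"<CATALOG><PLANT><COMMON>Bloodroot</COMMON></PLANT></CATALOG>"))
print(is_well_formed_xml(b"There was an issue generating the XML"))
```

With the code above it could be called as is_well_formed_xml(response.content) before writing the file. If it returns False, printing response.text[:200] shows what the server actually sent.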
Reply
#3
Many thanks for your reply.

I'm looking to print the entire plant-catalog and copy it to disk. With the current solution it's just printing the first line, with the below error.

C:\Users\Nikita\PycharmProjects\pythonProject\AIS_Import.py:90: MarkupResemblesLocatorWarning: The input looks more like a filename than markup. You may want to open this file and pass the filehandle into Beautiful Soup.
soup = BeautifulSoup(response.content, 'xml')
<?xml version="1.0" encoding="utf-8"?>


Can you please suggest how to print the entire content of the catalog with tags?


Reply
#4
(Mar-21-2024, 05:17 PM)Nik1811 Wrote: I'm looking to print the entire plant-catalog and copy it to disk. With the current solution it's just printing the first line, with the below error.
No, it does not print just the first line, so the error must be on your side.
Here is a NoteBook you can look at; for this task you only need Requests.
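One possible cause, which is only a guess since the real site cannot be tested here: if the XML page is only available inside the logged-in browser session, a plain requests.get() starts a fresh session without the browser's cookies, and the server may answer with an error message instead of the XML. A rough sketch of copying the cookies over, assuming the browser is driven by Selenium (the helper name session_from_browser_cookies is made up):

```python
import requests


def session_from_browser_cookies(browser_cookies):
    """Build a requests.Session carrying the cookies of a browser session.

    browser_cookies is a list of dicts with at least 'name' and 'value' keys,
    which is the shape returned by Selenium's driver.get_cookies().
    """
    session = requests.Session()
    for cookie in browser_cookies:
        session.cookies.set(cookie['name'], cookie['value'])
    return session


# Hypothetical usage with an existing Selenium driver (not run here):
# session = session_from_browser_cookies(driver.get_cookies())
# response = session.get(driver.current_url)
# with open('test.xml', 'wb') as fp:
#     fp.write(response.content)
```

If the site also checks headers or uses another kind of authentication, copying cookies alone may not be enough.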
Reply
#5
Yep, I suspect it has something to do with my xml.

I get https://www.********/aep2/xml/sad/NIK243179-AI.xml as my generated XML:

<?xml version='1.0' encoding='UTF-8'?><IM415 xmlns="http://www.ros.ie/schemas/customs/IM415">
<Declaration>
<MsgType>H1</MsgType>
<DeclarationType_1_1>IM</DeclarationType_1_1>
<AdditionalDeclarationType_1_2>A</AdditionalDeclarationType_1_2>
<LRN_2_5>NIK243179_16KqVM</LRN_2_5>
<ValuationInformation>
<InvoiceCurrency_4_10>EUR</InvoiceCurrency_4_10>
.
.

1) With the Beautiful Soup solution, I could print just the first line, with this warning:
MarkupResemblesLocatorWarning: The input looks more like a filename than markup. You may want to open this file and pass the filehandle into Beautiful Soup.
soup = BeautifulSoup(response.content, 'xml')
<?xml version="1.0" encoding="utf-8"?>



2) With requests, my request fails and I get an error saying:
'There was an issue generating the XML for sad/NIK243179-AI.xml'

It's strange that I get this error although the actual XML gets generated. 'requests' is not reading the actual content; it receives a different response and prints the error instead.


Do you think there could be an encoding issue, or a problem with the way the XML is structured? If so, how could I parse such XML files, and what would be the best solution?
My assumption is that BeautifulSoup should have a solution, but I'm not sure how.


Reply
#6
(Mar-21-2024, 06:27 PM)Nik1811 Wrote: I get https://www.********/aep2/xml/sad/NIK243179-AI.xml as my generated XML
Post your code; what are you generating this .xml from?
Then you just need to save the content of the .xml, and not use a URL as in my demos.
If you run my code (no changes), does that work?
Reply
#7
@snippsat

I like to try out advice from experts, to see how it works. I tried your code, but what I get is an almost exact copy of my webpage.

The output has more lines, because every html tag is on a separate line.

Is that what is supposed to happen?

My html file has a few lines of PHP at the top, but is mainly html.

The output opens as .xml in my browser, not as a webpage.

Strangely, the </head> tag turns up at the bottom, between </body> and </html>

The <!DOCTYPE html> tag is not in the xml file, but there does not seem to be an xml root tag.

from bs4 import BeautifulSoup
 
URL = "/var/www/html/22BE1cw/22BE1sW1.html.php"
savepath = '/home/pedro/tmp/html2xml.xml'
with open(URL, "r") as localfile:
    html_content = localfile.read()
soup = BeautifulSoup(html_content, 'xml')
# Save to disk
with open(savepath, 'w') as fp:
    fp.write(soup.prettify())
Reply
#8
Your code works absolutely fine with the w3schools link.

But for my .xml, I can read just the first line with the below error:

MarkupResemblesLocatorWarning: The input looks more like a filename than markup. You may want to open this file and pass the filehandle into Beautiful Soup.
soup = BeautifulSoup(response.content, 'xml')
<?xml version="1.0" encoding="utf-8"?>


Few points:

1. What is markup? How do we open this and resolve the error?
2. Am I not pointing to the root element? When I compare the w3schools XML with mine, the first line differs. My XML header is extended (<?xml version='1.0' encoding='UTF-8'?><IM415 xmlns="http://www.ros.ie/schemas/customs/IM415">). See below.

Here's my code:
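import requests
from bs4 import BeautifulSoup

# driver is the browser driver (e.g. Selenium WebDriver) created earlier in the script
response = requests.get(driver.current_url)
soup = BeautifulSoup(response.content, 'xml')
print(soup)

# Save to disk
with open('test.xml', 'w') as fp:
    fp.write(soup.prettify())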
My Xml looks like below:

<?xml version='1.0' encoding='UTF-8'?><IM415 xmlns="http://www.ros.ie/schemas/customs/IM415">
<Declaration>
<MsgType>H1</MsgType>
<DeclarationType_1_1>IM</DeclarationType_1_1>
<AdditionalDeclarationType_1_2>A</AdditionalDeclarationType_1_2>
<LRN_2_5>NIK243104_16nUlp</LRN_2_5>
<ValuationInformation>
<InvoiceCurrency_4_10>AFA</InvoiceCurrency_4_10>
<InvoiceAmount_4_11>5000</InvoiceAmount_4_11>
<InternalCurrency_4_12>AFA</InternalCurrency_4_12>
</ValuationInformation>
<GoodsInformation>
<GrossMass_6_5>33300</GrossMass_6_5>
<TotalPackageNumber_6_18>1665</TotalPackageNumber_6_18>
</GoodsInformation>


I'm expecting the below to be printed and copied to my test.xml:
<?xml version='1.0' encoding='UTF-8'?><IM415 xmlns="http://www.ros.ie/schemas/customs/IM415">
<Declaration>
<MsgType>H1</MsgType>
<DeclarationType_1_1>IM</DeclarationType_1_1>
<AdditionalDeclarationType_1_2>A</AdditionalDeclarationType_1_2>
<LRN_2_5>NIK243104_16nUlp</LRN_2_5>
<ValuationInformation>
<InvoiceCurrency_4_10>AFA</InvoiceCurrency_4_10>
<InvoiceAmount_4_11>5000</InvoiceAmount_4_11>
<InternalCurrency_4_12>AFA</InternalCurrency_4_12>
</ValuationInformation>
<GoodsInformation>
<GrossMass_6_5>33300</GrossMass_6_5>
<TotalPackageNumber_6_18>1665</TotalPackageNumber_6_18>
</GoodsInformation>
<GrossMass_6_5>33300</GrossMass_6_5>
<TotalPackageNumber_6_18>1665</TotalPackageNumber_6_18>
</GoodsInformation>


Reply
#9
Quote:My html file has a few lines of PHP at the top, but is mainly html.

The output opens as .xml in my browser, not as a webpage
If the .xml is generated, it will be a string.
Then you need to pass that string to BS, and not the URL.
Again, there is nothing I can run and test with what you post.
from bs4 import BeautifulSoup

# Your XML content as a string
xml_content = """
<?xml version='1.0' encoding='UTF-8'?><IM415 xmlns="http://www.ros.ie/schemas/customs/IM415">
  <Declaration>
    <MsgType>H1</MsgType>
    <DeclarationType_1_1>IM</DeclarationType_1_1>
    <AdditionalDeclarationType_1_2>A</AdditionalDeclarationType_1_2>
    <LRN_2_5>NIK243104_16nUlp</LRN_2_5>
    <ValuationInformation>
      <InvoiceCurrency_4_10>AFA</InvoiceCurrency_4_10>
      <InvoiceAmount_4_11>5000</InvoiceAmount_4_11>
      <InternalCurrency_4_12>AFA</InternalCurrency_4_12>
    </ValuationInformation>
    <GoodsInformation>
      <GrossMass_6_5>33300</GrossMass_6_5>
      <TotalPackageNumber_6_18>1665</TotalPackageNumber_6_18>
    </GoodsInformation>
</Declaration>
</IM415>
"""

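# Parse with the XML parser ('xml' requires lxml to be installed). The 'lxml'
# HTML parser would lowercase the tag names and wrap the result in <html><body>.
# strip() removes the newline in front of the XML declaration.
soup = BeautifulSoup(xml_content.strip(), 'xml')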
print(soup.prettify())

# Save the modified XML to a file
with open('test.xml', 'w', encoding='utf-8') as fp:
    fp.write(soup.prettify())
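For the comparison with the baseline mentioned in the first post, prettify() changes the whitespace, so a plain text diff may report differences that are only formatting. One possible approach, sketched here with the standard library (Python 3.8 or newer), is to compare canonical forms. The helper name same_xml and the two sample strings are made up for illustration.

```python
from xml.etree.ElementTree import canonicalize


def same_xml(xml_a: str, xml_b: str) -> bool:
    """Compare two XML documents, ignoring indentation and attribute order."""
    # strip_text=True drops whitespace-only text and trims text values
    return (canonicalize(xml_data=xml_a, strip_text=True)
            == canonicalize(xml_data=xml_b, strip_text=True))


baseline = '<IM415><Declaration><MsgType>H1</MsgType></Declaration></IM415>'
generated = """<IM415>
  <Declaration>
    <MsgType>H1</MsgType>
  </Declaration>
</IM415>"""
print(same_xml(baseline, generated))
```

For files on disk, canonicalize(from_file='test.xml', strip_text=True) can be used in the same way. Note that strip_text also trims leading and trailing spaces inside values, which may or may not be wanted for a baseline check.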
Reply
#10
Super thanks @snippsat.

Can you please help with how to turn the URL into an XML file? Apologies for not sharing the code earlier; I'm attaching the code and the generated XML file to this reply.


Attached Files

.xml   NIK243111-AI.xml (Size: 355.4 KB / Downloads: 13)
Reply

