Python Forum

Apologies for the long post, and thanks in advance to anyone who takes the time to help. Any constructive guidance is appreciated.

Version: Python 2.7.10
OS: OSX 10.11.6

I have tried with lxml, but I do not have the experience to do much more than basic operations. I also tried regex, but it proved overly complicated given the nature of the task. If anyone thinks regex is the best approach here, I'd be happy to post my code.

Consider the following:

<source>foo<x id="1">&ndash;</x> bar</source>
<target></target>

For each source element in an XML file, if any "internal tag" (x, y, z) is present, I need to take the content of that tag and write it to a note element placed above the source element (position is relevant), like so:

<note>&ndash;</note>

- Please note that there can be multiple tags in each source segment, meaning multiple note nodes.
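
A minimal sketch of this extraction step, under a few assumptions: it uses the standard library's xml.etree.ElementTree (lxml.etree offers largely the same API for these calls), the wrapper element `unit` is made up for the example, and `&#8211;` stands in for `&ndash;`, which a plain XML parser rejects unless a DTD declares it. The function name `add_notes` is hypothetical.

```python
import xml.etree.ElementTree as ET

INTERNAL_TAGS = ("x", "y", "z")  # internal tag names taken from the example


def add_notes(root):
    """Insert a <note> before each <source> for every internal tag it contains."""
    # Snapshot the elements first so inserting notes does not disturb iteration.
    for parent in list(root.iter()):
        for source in [child for child in parent if child.tag == "source"]:
            for inner in [e for e in source.iter() if e.tag in INTERNAL_TAGS]:
                note = ET.Element("note")
                note.text = inner.text
                # Inserting at the current index of <source> keeps the notes
                # directly above it and in document order.
                parent.insert(list(parent).index(source), note)


# Hypothetical sample document; <unit> is only a wrapper for the demo.
doc = '<unit><source>foo<x id="1">&#8211;</x> bar</source><target></target></unit>'
root = ET.fromstring(doc)
add_notes(root)
print([child.tag for child in root])
```

This only copies the tag content; it does not yet remove the tag from the source or record where it came from.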

The problem I'm having is understanding the best way to handle the following:

1. How to record which tag came from which position. There can be multiple internal tags in a single source segment, and after removing them initially I need to write them back during post-processing. Perhaps attributes like these are the best approach?

<note type="x" id="1" position="1">&ndash;</note>
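
One way the attribute idea could look in code, as a sketch only: `position` is assumed to be the 1-based order of the internal tag within its segment, and `build_notes` is a hypothetical helper name. It uses xml.etree.ElementTree from the standard library, with `&#8211;` again standing in for `&ndash;`.

```python
import xml.etree.ElementTree as ET

INTERNAL_TAGS = ("x", "y", "z")  # internal tag names taken from the example


def build_notes(source):
    """Return one <note> per internal tag, recording its type, id and position."""
    notes = []
    internal = [e for e in source.iter() if e.tag in INTERNAL_TAGS]
    for position, inner in enumerate(internal, 1):
        note = ET.Element("note")
        note.set("type", inner.tag)            # which internal tag it was
        note.set("id", inner.get("id", ""))    # original id attribute, if any
        note.set("position", str(position))    # assumed: order within the segment
        note.text = inner.text
        notes.append(note)
    return notes


# Hypothetical segment with two internal tags.
source = ET.fromstring('<source>foo<x id="1">&#8211;</x> bar<y id="2">*</y></source>')
notes = build_notes(source)
for note in notes:
    print(sorted(note.attrib.items()))
```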

2. Where an internal tag is missing whitespace on either side of it, I need to insert that whitespace, and remove it again during post-processing. For example:

<note type="x" id="1" position="1" spacing="rsb">&ndash;</note>

rsb - Remove space beginning
rse - Remove space end
drs - Do not remove space

Given that there was no whitespace before the tag in the source segment initially, this gives:

<note type="x" id="1" position="1" spacing="rsb">&ndash;</note>
<source>foo 1_rsb bar</source>
<target></target>
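
The placeholder and spacing logic might be sketched as follows. The assumptions: internal tags are direct children of source, a tag at the very start or end of a segment needs no added space, and the placeholder format is `position_code` as in `1_rsb` above. The codes defined above do not cover a tag that lacks whitespace on both sides, so the sketch uses a made-up combined code `rsb+rse` for that case. The function names are hypothetical.

```python
import xml.etree.ElementTree as ET


def spacing_code(before, after):
    """Pick the spacing code from the text on each side of an internal tag."""
    add_before = bool(before) and not before[-1].isspace()
    add_after = bool(after) and not after[0].isspace()
    if add_before and add_after:
        return "rsb+rse"  # hypothetical: not one of the three codes in the post
    if add_before:
        return "rsb"      # a space is added before, remove it later
    if add_after:
        return "rse"      # a space is added after, remove it later
    return "drs"          # whitespace was already there, do not remove


def flatten_source(source):
    """Replace each internal tag in <source> with a placeholder; return the codes."""
    text = source.text or ""
    codes = []
    for position, inner in enumerate(list(source), 1):
        after = inner.tail or ""
        code = spacing_code(text, after)
        if "rsb" in code:
            text += " "
        text += "%d_%s" % (position, code)
        if "rse" in code:
            text += " "
        text += after
        codes.append(code)
        source.remove(inner)  # also drops the tail, which was copied into text
    source.text = text
    return codes


# &#8211; stands in for &ndash;, which needs a DTD to parse.
source = ET.fromstring('<source>foo<x id="1">&#8211;</x> bar</source>')
codes = flatten_source(source)
print(source.text)
```

For the example segment this produces the `foo 1_rsb bar` text shown above, and the returned code can be written to the `spacing` attribute of the matching note.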

It's imperative that the whitespace is handled this way, because the translated text will be matched against its source counterpart at a later stage. If the whitespace were disregarded, it would result in a lower match percentage.

Post-processing

After export, the translation will be present in target, but:

1. The full tag from the original source needs to be added back to the corresponding source and target elements in the correct position.
2. The note elements need to be removed.

This hopefully should lead to:

<note type="x" id="1" position="1" spacing="rsb">&ndash;</note>
<source>foo<x id="1">&ndash;</x> bar</source>
<target>spam<x id="1">&ndash;</x> eggs</target>
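
The restore step could be sketched like this, assuming the flattened segments contain placeholders of the form `position_spacing` (for example `1_rsb`) and that each note carries the type, id, position and spacing attributes. `restore_tags` is a hypothetical name, and the sketch uses xml.etree.ElementTree from the standard library.

```python
import xml.etree.ElementTree as ET


def restore_tags(segment, notes):
    """Put the internal tags described by `notes` back into a flat segment."""
    # Work from the last placeholder to the first, so the text that is still
    # unprocessed always sits in segment.text.
    ordered = sorted(notes, key=lambda n: int(n.get("position")), reverse=True)
    for note in ordered:
        spacing = note.get("spacing")
        placeholder = "%s_%s" % (note.get("position"), spacing)
        before, found, after = (segment.text or "").partition(placeholder)
        if not found:
            continue  # placeholder missing, e.g. dropped during translation
        # Remove the space that was added during pre-processing.
        if "rsb" in spacing and before.endswith(" "):
            before = before[:-1]
        if "rse" in spacing and after.startswith(" "):
            after = after[1:]
        inner = ET.Element(note.get("type"), {"id": note.get("id")})
        inner.text = note.text
        inner.tail = after
        segment.insert(0, inner)
        segment.text = before


# Note and segments rebuilt by hand from the example above.
note = ET.Element("note", {"type": "x", "id": "1", "position": "1", "spacing": "rsb"})
note.text = u"\u2013"
source = ET.fromstring("<source>foo 1_rsb bar</source>")
target = ET.fromstring("<target>spam 1_rsb eggs</target>")
for segment in (source, target):
    restore_tags(segment, [note])
print(target.text, target[0].tag, target[0].tail)
```

Once both segments are restored, each note can be dropped with `parent.remove(note)` on its parent element.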

I imagine the best approach is with lxml but my knowledge of that is limited. If anyone has any advice whatsoever as to the best way to approach this problem, I'd greatly appreciate it.

Looking forward/hoping for replies :)

vasilysmyslov