XML Generator, Entity referenced, but not declared

I’m having trouble using the XML Generator with data that contains special symbols and characters. All the usual suspects (<, >, &, ", ') have been taken care of by using the HTML notations &lt;, &gt;, &amp;, etc., but I’m having trouble with others that I feel should work. For instance, the symbol for the registered sign, ®, should be notated as &reg;. However, the XML Generator gives me an error stating that “the entity, reg, is referenced but not declared.” How do I go about declaring these? I have quite a number of symbols and accented/umlauted letters that I need to use. Whether I use the rendered symbol or the HTML notation, the XML Generator fails.

How can I successfully include these symbols in my XML messages so that the XML Generator doesn’t see them as errors? See pics below for examples of my issue.


Replacing &reg; with the decimal equivalent may work (i.e., &#174;). There is
also an "Escape Special Characters" checkbox which will escape many special
characters.

For example, < -> &lt; 

However, the result may not be correct if the text is already escaped; for example, &lt; -> &amp;lt;
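
To make the decimal-equivalent suggestion concrete, here is a minimal Python sketch that rewrites HTML named entities as decimal references before the text reaches an XML parser. It runs outside the snaps and is only an illustration: `named_to_numeric` and `XML_PREDEFINED` are names made up for this example, and the lookup table is Python's standard `html.entities.name2codepoint`.

```python
import re
from html.entities import name2codepoint

# The five entities that XML itself predefines; these are left untouched.
XML_PREDEFINED = {"lt", "gt", "amp", "quot", "apos"}

_ENTITY = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")


def named_to_numeric(text: str) -> str:
    """Rewrite HTML named entities (e.g. &reg;) as decimal references (&#174;)."""
    def _swap(match):
        name = match.group(1)
        if name in XML_PREDEFINED or name not in name2codepoint:
            return match.group(0)  # leave XML built-ins and unknown names alone
        return f"&#{name2codepoint[name]};"
    return _ENTITY.sub(_swap, text)


print(named_to_numeric("Acme&reg; caf&eacute; &lt;b&gt;"))
# Acme&#174; caf&#233; &lt;b&gt;
```

Unknown names are passed through unchanged rather than guessed at, so a parser downstream would still reject them.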

I tried playing with the “escape special characters” option, but it just dropped my special character data instead of leaving the escaped notations as they were.

Hmmm… Frustrating indeed. Let me conduct an experiment too.

I’m wondering if there’s a way to make the XML Generator and the XML Formatter ignore escaped codes and leave them as they are instead of rendering them. It would be nice if &reg; and &#174; behaved the same way as &lt; and &gt;.

Thanks for helping me look at this!

I still need to investigate this. For the Formatter, do you check the “Format as canonical XML” checkbox?

Hi, no I am not checking “Format as canonical XML.” I do check “Strict XSD Output.”

Here are my results of testing entity reference escaping.

XML Generator (Escape Special Characters: unchecked)
    Input: &#174;
    Output: Registered Sign (decimal code point: 174, utf-8 hex bytes: C2 AE)
...
XML Formatter: 
    Input: Registered Sign
    Output: Registered Sign

Running pipeline with “escape” checked.

XML Generator (Escape Special Characters: checked)
    Input: &#174;
    Output: &amp;#174; 
...
XML Formatter: 
    Input: &#174;
    Output: &amp;#174;

As you know, if we try to use the &reg; reference, XML Generator will fail with “SAXParseException The entity ‘reg’ was referenced, but not declared”. However, if I check “escape special chars”, I get this.

XML Generator (Escape Special Characters: checked)
    Input: &reg;
    Output: &amp;reg; 
...
XML Formatter: 
    Input: &reg;
    Output: &amp;reg;

Admittedly, the escaping and transformations are a bit tricky in spots. I assume you would like to see &reg; or &#174; output from the XML Formatter? I can attach my sample pipeline after updating the WSDL endpoint if helpful. Where is it failing for you, and what would you like to see?
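
The results above match how a generic XML parser treats these references. As a point of comparison only (this uses Python's standard `xml.etree.ElementTree`, not the parser inside the snaps), a small sketch of the three cases:

```python
import xml.etree.ElementTree as ET

# 1. A numeric character reference is resolved to the character it names.
numeric = ET.fromstring("<desc>Acme&#174;</desc>").text
print(repr(numeric))  # the text now holds the actual registered sign

# 2. A named HTML entity is unknown to XML unless a DTD declares it.
failed = False
try:
    ET.fromstring("<desc>Acme&reg;</desc>")
except ET.ParseError as err:
    failed = True
    print("parse failed:", err)

# 3. Pre-escaping the ampersand keeps the reference as literal text.
literal = ET.fromstring("<desc>Acme&amp;reg;</desc>").text
print(repr(literal))  # 'Acme&reg;'
```

Case 3 is why the "escape" option yields &amp;reg; in the serialized output: once parsed, the ampersand is an ordinary character and has to be escaped again when it is written back out.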

Hi Matthew

Yes, your results matched what I’m experiencing too.

Ideally, I would like to see &reg;, &#174;, and &#xAE; go into the XML Generator (with Escape Special Characters unchecked) and come out in the exact same format, the same way as &lt;, &gt;, &apos;, and &quot; do. The same goes for the XML Formatter: I want to see any input that has been pre-escaped come out in the same format as it went in. This would make the behavior of the snaps consistent for all escaped codes.

As a retailer, our vendors pass to us product descriptions that are HTML formatted and friendly. I am attempting to place these product descriptions in XML messages so that they can be imported into our Salesforce hosted website and mobile app. Aside from <, >, ', ", and ®, I have around 17-20 additional escape codes that I need to use. However, the way the XML Generator and Formatter are working now, I am unable to do this.

Can I request for a future release that these snaps be changed so that the behavior is consistent for all escape codes when the Escape Special Characters option is turned off, or how would I go about making a custom snap to behave this way? Without a change like this, I’m pretty much dead in the water.

Thanks, Alex

p.s. Answering your question, where is it failing for me?
When I use &reg; as input to either the Generator or the Formatter, they both fail due to “reg” not being declared.
When I use &#174; and &#xAE;, they go through the Generator snap successfully, however, as you’ve experienced, they come out of Generator snap as the registered sign. Downstream, when I am running this data through the XML Formatter (in preparation to do a file write), the Formatter fails because it does not like the registered sign.
I cannot use the Escape Special Characters option because I have so many codes that I need to use and this option only accounts for the less than sign, the greater than sign, the apostrophe, and quotation marks. No other symbols or special characters are recognized by this option.

Sorry. Got caught up with another issue today. I will reply on Monday (6/7). Have a nice weekend.

Thanks, Matt. I’m looking forward to your response and what I can do to work around this issue. Many of our vendors/brands have branding policies associated with the symbols and special characters that are being identified as undeclared by the XML Generator and the XML Formatter, thus resulting in pipeline failure. Thanks again!

Hi Kuya,

Good day, here’s a PoC pipeline that escapes the unicode string value

when parsed, the XML-encoded value will become the actual value…

so when it is rewritten as XML (XML Formatter), the actual character will be set

XML Generator - special char_2021_06_07.slp (9.3 KB)

~EmEm

Well, well! What a nice surprise. Great to hear from you Michael. I’ve missed you. Thanks for your response although I’m not sure it’s going to help me in the long run. If I were having problems with just this one special character, it would be easy to convert the escape code, however, considering the data that our vendors are supplying us, I need for the XML Generator and Formatter to be able to accept any UTF-8 compliant escape code.

The following is a list of what I have found in our data so far, and none of these will pass through the XML Generator and XML Formatter snaps. And no doubt, vendors can, will, and should be able to add whatever they want to this list, as long as it’s HTML compliant. The snaps also sometimes do not recognize the character codes at all, and I have had to convert them to either decimal or hexadecimal for the snaps to recognize them (for instance, I had to convert &reg; to &#174; and &#xAE;. I see in your example you converted it to #00xAE;).

® &reg;
  &nbsp;
“ &ldquo;
” &rdquo;
’ &rsquo;
– &ndash;
… &hellip;
— &mdash;
é &eacute;
É &Eacute;
™ &trade;
ó &oacute;
í &iacute;
ü &uuml;
Ü &Uuml;

xml_SpecialCharactersSample.zip (9.8 KB)
@mbowen and @alchemiz I’m sharing a little sample that illustrates the problem that I’m having.

Take note of the mapper after the XML Generator. In the mapper I use the string.replace() method. If I don’t do this, then the pipeline fails once it gets to the XML Parser that is downstream. By doing this, it also allows the XML formatter to complete successfully, however, I don’t wish to have the rendered symbol, but rather, I want to have the escape character code, just like the &lt; and &gt; remain intact.
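
As a possible alternative to one `replace()` call per symbol, every non-ASCII character can be turned back into a numeric reference in a single pass. The sketch below is plain Python run outside the pipeline, not a snap expression, and `to_char_refs` is a name invented for this example. It would be applied to the final serialized text; if the result were fed through another XML writer, the ampersands would be escaped again.

```python
import xml.etree.ElementTree as ET


def to_char_refs(text: str) -> str:
    """Replace every non-ASCII character with a decimal character reference."""
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")


print(to_char_refs("Acme® café"))
# Acme&#174; caf&#233;

# ElementTree does the same thing when it serializes with its default
# us-ascii encoding: the parsed glyph goes back out as &#174;.
root = ET.fromstring("<desc>Acme&#174;</desc>")
serialized = ET.tostring(root)
print(serialized)
# b'<desc>Acme&#174;</desc>'
```

This produces numeric references only. It does not restore named forms such as &reg;, which an XML consumer would still reject unless a DTD declares them.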

How can I get these 3 snaps (XML Generator, XML Parser, and XML Formatter) to treat all escape codes the same as they do < and >, or how can I create customized snaps that are based on these 3 snaps? I am working against a deadline of June 16th, 2021, and this predicament is a big blocker for me.

Thanks, Alex

Thank you @alchemiz for the wonderful little pipeline.

@alexpanganiban, you are correct that the Escape Special Characters option in the XML Generator snap only escapes these five entities: gt, lt, quot, amp, apos.
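
For readers unfamiliar with what a minimal escape handler covers, here is a rough Python equivalent built on the standard `xml.sax.saxutils.escape`. It is not the snap's actual handler; `FIVE_ENTITIES` and `escape_minimal` are hypothetical names used only for this sketch.

```python
from xml.sax.saxutils import escape

# escape() handles &, < and > by default; the extra mapping adds " and '.
FIVE_ENTITIES = {'"': "&quot;", "'": "&apos;"}


def escape_minimal(text: str) -> str:
    """Escape only the five characters that have predefined XML entities."""
    return escape(text, FIVE_ENTITIES)


print(escape_minimal("<b>Tom & \"Jerry's\"</b>"))
# &lt;b&gt;Tom &amp; &quot;Jerry&apos;s&quot;&lt;/b&gt;

print(escape_minimal("Acme® &reg; &lt;"))
# Acme® &amp;reg; &amp;lt;
```

The second call shows both effects discussed in this thread: the registered sign is left alone, and text that was already escaped gets escaped a second time.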

The snap is currently using a minimal XML escape handler; however, other snaps use a custom handler. So, this snap could also be updated to recognize/escape more XML.

Under the hood, the XML is parsed and transformed. It’s during the transformation that the registered mark (&#174;) gets converted to a glyph (symbol). We may be able to control this behavior too.

Creating a support ticket would allow for this work to get started; however, it likely would not be completed and available by June 16.

Let me evaluate your sample pipeline and see if we can get a workaround in place that can help you sooner!

@mbowen Thanks for your response and it makes me feel very hopeful. I just submitted a support ticket. My ticket number is #41732. I have my fingers crossed and remain hopeful.

FYI, in my sample, the input document contains a field called “additional_romance_copy” for sku 228407. The value for this field should get placed in an output field called, “long-description,” in product-id 11110525. This is where the final results will be. In my experiments, I modified additional_romance_copy so that it contained a variety of special characters that I know my vendors use. I outlined the top 18 of them earlier in this thread. Hope this info helps.

Thanks again.

I tried to set the DOCTYPE to HTML and include the XHTML strict DTD, but I’m getting an error saying that DOCTYPE is not allowed in the XML Generator snap

So, I tried to validate/parse the XML from an online tool… just something to explore

Without the DOCTYPE, the XML is invalid because reg is not declared

With the DOCTYPE added, reg and copy are now valid since they are declared in the DTD
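
The effect of declaring the entities can be reproduced with a small Python sketch. Note the swap: instead of the external XHTML strict DTD, this example declares `reg` and `copy` in an internal DOCTYPE subset, and it uses Python's `xml.etree.ElementTree` rather than the online tool or the snaps (which, as noted, reject a DOCTYPE). The `product` element name is made up for the example.

```python
import xml.etree.ElementTree as ET

# The internal subset declares the two entities the document uses.
doc = """<!DOCTYPE product [
  <!ENTITY reg "&#174;">
  <!ENTITY copy "&#169;">
]>
<product>Acme&reg; &copy; 2021</product>"""

root = ET.fromstring(doc)
print(repr(root.text))  # both references are expanded to their characters

# The same body without the DOCTYPE is rejected as an undeclared entity.
rejected = False
try:
    ET.fromstring("<product>Acme&reg; &copy; 2021</product>")
except ET.ParseError:
    rejected = True
print("rejected without DOCTYPE:", rejected)
```

Even with the declaration in place, the parser expands the references to the actual characters, so this explains why the document validates but does not by itself keep &reg; intact in the output.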