Forum Discussion

alex_panganiban
Contributor
5 years ago

XML Generator, Entity referenced, but not declared

I’m having trouble using the XML Generator with data that contains special symbols and characters. All the usual suspects (<, >, &, ", ') have been taken care of by using the HTML notations &lt;, &gt;, &amp;, etc., but I’m having trouble with others that I feel should work. For instance, the symbol for the registered sign, ®, should be notated as &reg;. However, the XML Generator gives me an error stating that “the entity, reg, is referenced but not declared.” How do I go about declaring these? I have quite a number of symbols and accented/umlauted letters that I need to use. Whether I use the rendered symbol or the HTML notation, the XML Generator fails.

How can I successfully include these symbols in my XML messages so that the XML Generator doesn’t see them as errors? See pics below for examples of my issue.


25 Replies

  • Replacing &reg; with the decimal equivalent may work (i.e., &#174;). There is
    also an "Escape Special Characters" checkbox, which will escape many special
    characters.
    
    For example, < -> &lt; 
    
    Note that text that is already escaped may get escaped a second time and no longer be represented correctly, for example, &lt; -> &amp;lt;
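
To make the double-escaping risk concrete, here is a minimal JavaScript sketch. The escapeXml helper is hypothetical and only mimics escaping of the five predefined XML entities; it is not the snap's actual implementation.

```javascript
// Hypothetical helper: escapes only the five characters XML predefines entities for.
function escapeXml(text) {
  const map = { "&": "&amp;", "<": "&lt;", ">": "&gt;", '"': "&quot;", "'": "&apos;" };
  return text.replace(/[&<>"']/g, (ch) => map[ch]);
}

const once = escapeXml("a < b");  // "a &lt; b"
const twice = escapeXml(once);    // "a &amp;lt; b" (the & of &lt; was escaped again)
console.log(once);
console.log(twice);
```

If the input is already escaped, running it through an escaper like this a second time is what produces &amp;lt;.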
    
      alex_panganiban
      Contributor

      I tried playing with the “escape special characters” option, but it just dropped my special character data instead of leaving the escaped notations as they were.

        mbowen
        Employee

        Hmmm… Frustrating indeed. Let me conduct an experiment too.

  • Thanks, Matt. I’m looking forward to your response and what I can do to work around this issue. Many of our vendors/brands have branding policies associated with the symbols and special characters that are being identified as undeclared by the XML Generator and the XML Formatter, thus resulting in pipeline failure. Thanks again!

  • Well, well! What a nice surprise. Great to hear from you Michael. I’ve missed you. Thanks for your response although I’m not sure it’s going to help me in the long run. If I were having problems with just this one special character, it would be easy to convert the escape code, however, considering the data that our vendors are supplying us, I need for the XML Generator and Formatter to be able to accept any UTF-8 compliant escape code.

    The following is a list of what I have found in our data so far; none of these will pass through the XML Generator and XML Formatter snaps. No doubt vendors can, will, and should be able to add whatever they want to this list, as long as it’s HTML compliant. The snaps also sometimes do not recognize the character codes at all, and I have had to convert them to either decimal or hexadecimal for the snaps to recognize them (for instance, I had to convert &reg; to &#174; and &#xAE;. I see in your example you converted it to #00xAE;).

    ® &reg;
      &nbsp;
    “ &ldquo;
    ” &rdquo;
    ’ &rsquo;
    – &ndash;
    … &hellip;
    — &mdash;
    é &eacute;
    É &Eacute;
    ™ &trade;
    ó &oacute;
    í &iacute;
    ü &uuml;
    Ü &Uuml;
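
For a list like the one above, the named-to-decimal conversion could be done in one pass rather than one replace per entity. The following is only a sketch in plain JavaScript: the NAMED_ENTITIES table and the namedToNumeric function are hypothetical names, the table covers just the entities listed here (using their standard Unicode code points), and whether the same expression runs unchanged inside a Mapper is an assumption.

```javascript
// Illustrative subset: entity name -> Unicode code point (decimal).
const NAMED_ENTITIES = new Map([
  ["reg", 174], ["nbsp", 160], ["ldquo", 8220], ["rdquo", 8221], ["rsquo", 8217],
  ["ndash", 8211], ["hellip", 8230], ["mdash", 8212], ["eacute", 233], ["Eacute", 201],
  ["trade", 8482], ["oacute", 243], ["iacute", 237], ["uuml", 252], ["Uuml", 220],
]);

// The five entities XML itself predefines are left untouched.
const XML_PREDEFINED = new Set(["lt", "gt", "amp", "quot", "apos"]);

function namedToNumeric(text) {
  return text.replace(/&([A-Za-z][A-Za-z0-9]*);/g, (match, name) => {
    if (XML_PREDEFINED.has(name) || !NAMED_ENTITIES.has(name)) {
      return match; // unknown or predefined: keep as written
    }
    return "&#" + NAMED_ENTITIES.get(name) + ";";
  });
}

console.log(namedToNumeric("Acme&reg; caf&eacute; &lt;b&gt;"));
// prints: Acme&#174; caf&#233; &lt;b&gt;
```

Names missing from the table pass through unchanged, so they would still raise the "referenced but not declared" error and show which entries need to be added.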

  • xml_SpecialCharactersSample.zip (9.8 KB)
    @mbowen and @alchemiz I’m sharing a little sample that illustrates the problem that I’m having.

    Take note of the Mapper after the XML Generator. In the Mapper I use the string.replace() method. If I don’t do this, the pipeline fails once it gets to the XML Parser that is downstream. Doing this also allows the XML Formatter to complete successfully. However, I don’t want the rendered symbol; I want the escape code to remain intact, just as &lt; and &gt; do.

    How can I get these 3 snaps (XML Generator, XML Parser, and XML Formatter) to treat all escape codes the same way they treat < and >, or how can I create customized snaps that are based on these 3 snaps? I am working against a deadline of June 16th, 2021, and this predicament is a big blocker for me.

    Thanks, Alex

      mbowen
      Employee

      Thank you @alchemiz for the wonderful little pipeline.

      @alex.panganiban.guild, you are correct that the Escape Special Characters option in the XML Generator snap only escapes these five entities: gt, lt, quot, amp, apos.

      The snap is currently using a minimal XML escape handler; however, other snaps use a custom handler. So, this snap could also be updated to recognize/escape more XML.

      Under the hood, the XML is parsed and transformed. It’s during the transformation that the registered mark (&#174;) gets converted to a glyph (symbol). We may be able to control this behavior too.

      Creating a support ticket would allow for this work to get started; however, it likely would not be completed and available by June 16.

      Let me evaluate your sample pipeline and see if we can get a workaround in place that can help you sooner!
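
One possible stopgap for the glyph conversion described above, sketched in plain JavaScript: after the XML has been serialized, turn every non-ASCII character back into a decimal character reference. The function name toNumericReferences is hypothetical, and applying it to the formatted output (for example in a Mapper) is an assumption, not a confirmed snap feature.

```javascript
// Replaces each non-ASCII character with its decimal character reference.
function toNumericReferences(text) {
  // The "u" flag makes the class match whole code points, so characters
  // outside the Basic Multilingual Plane are not split into surrogate halves.
  return text.replace(/[^\x00-\x7F]/gu, (ch) => "&#" + ch.codePointAt(0) + ";");
}

console.log(toNumericReferences("Acme\u00AE caf\u00E9"));
// prints: Acme&#174; caf&#233;
```

This yields numeric references rather than the original named ones (&#174; instead of &reg;), so it only helps if the consumer of the file accepts the numeric form.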

  • @mbowen Thanks for your response and it makes me feel very hopeful. I just submitted a support ticket. My ticket number is #41732. I have my fingers crossed and remain hopeful.

    FYI, in my sample, the input document contains a field called “additional_romance_copy” for sku 228407. The value for this field should get placed in an output field called, “long-description,” in product-id 11110525. This is where the final results will be. In my experiments, I modified additional_romance_copy so that it contained a variety of special characters that I know my vendors use. I outlined the top 18 of them earlier in this thread. Hope this info helps.

    Thanks again.

      alchemiz
      Contributor III

      I tried to set the DOCTYPE to HTML and include the XHTML strict DTD, but I’m getting an error that DOCTYPE is not allowed in the XML Generator snap.

      So, I tried to validate/parse the XML from an online tool… just something to explore

      Without the DOCTYPE, the XML is invalid because reg is not declared.

      After adding the DOCTYPE, reg and copy are valid since they are declared in the DTD.
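
To illustrate what "declared" means here, the sketch below builds a DOCTYPE with an internal subset that declares reg and copy directly, instead of pulling in the whole XHTML DTD. The buildDoctype helper and the product root element are made up for the example, and since the XML Generator snap rejected a DOCTYPE in the test above, this only applies to parsers or validators that accept one.

```javascript
// Hypothetical helper: declares each entity as a numeric character reference.
function buildDoctype(rootName, entities) {
  const declarations = Object.entries(entities)
    .map(([name, codePoint]) => '  <!ENTITY ' + name + ' "&#' + codePoint + ';">')
    .join("\n");
  return "<!DOCTYPE " + rootName + " [\n" + declarations + "\n]>";
}

const doctype = buildDoctype("product", { reg: 174, copy: 169 });
console.log(doctype + "\n<product>Acme&reg; &copy; 2021</product>");
// prints:
// <!DOCTYPE product [
//   <!ENTITY reg "&#174;">
//   <!ENTITY copy "&#169;">
// ]>
// <product>Acme&reg; &copy; 2021</product>
```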

        alex_panganiban
        Contributor

        @mbowen Hi Matthew. I hope you are well. Earlier this summer you offered me a solution to get past my XML HTML entity declaration issues. We basically Base64 encoded the value of the XML element that contained my HTML entity codes, passed it through both the XML Generator and XML Formatter snaps, and then, just before we wrote the XML document to file, we used a Mapper snap with the following expression in it.

        $content.replace(/<(short|long)-description>(.+?)<\/\1-description>/g,
        (_matched, prefix, b64enc) => "<"+prefix+"-description>"+Base64.decode(b64enc)+"</"+prefix+"-description>")

        I now have another element in the same XML document that I wish to do a Base64.decode() on, however, I’m having trouble understanding how your replace method and the callback function within it works. Could you help me to understand it so that I can string an additional replace on this same expression?

        I’m basically trying to decode the value of the custom-attribute below, but when I try to emulate the pattern that you established, I can’t get it to work.

        <custom-attribute attribute-id="productFeatures">PCFbQ0RBVEFbPHA+RmVhdHVyZXMgTGlzdDwvcD5dXT4=</custom-attribute>
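
For anyone following along, here is a rough sketch of how the callback form of replace works and how a second replace could be chained for the custom-attribute element. It is written as plain JavaScript for Node.js so it can be run outside a pipeline: decode is a stand-in for the expression language's Base64.decode() (assumed here to return UTF-8 text), decodeEmbedded is a made-up wrapper name, and the second pattern is a guess at what would fit this element, not a confirmed fix.

```javascript
// Stand-in for Base64.decode(); assumes the payload is UTF-8 text.
const decode = (b64) => Buffer.from(b64, "base64").toString("utf8");

function decodeEmbedded(content) {
  return content
    // Capture group 1 is "short" or "long"; group 2 is the Base64 payload.
    // \1 makes the closing tag repeat whatever group 1 matched.
    // The callback receives the whole match first, then one argument per group,
    // and its return value replaces the whole match.
    .replace(/<(short|long)-description>(.+?)<\/\1-description>/g,
      (_matched, prefix, b64enc) =>
        "<" + prefix + "-description>" + decode(b64enc) + "</" + prefix + "-description>")
    // Chained replace: capture the opening tag, the payload, and the closing tag,
    // then rebuild the element around the decoded payload.
    .replace(/(<custom-attribute attribute-id="productFeatures">)(.+?)(<\/custom-attribute>)/g,
      (_matched, open, b64enc, close) => open + decode(b64enc) + close);
}

const sample =
  '<custom-attribute attribute-id="productFeatures">' +
  "PCFbQ0RBVEFbPHA+RmVhdHVyZXMgTGlzdDwvcD5dXT4=" +
  "</custom-attribute>";
console.log(decodeEmbedded(sample));
// prints: <custom-attribute attribute-id="productFeatures"><![CDATA[<p>Features List</p>]]></custom-attribute>
```

Because each replace returns a new string, the second call simply runs on the output of the first. The second pattern only matches when the attribute is written exactly as attribute-id="productFeatures", so other attribute ids or extra attributes would need a looser pattern.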