Forum Discussion

Drew
New Contributor
7 years ago

Convert HTML to Plain Text

I’m using the Email Reader snap and finding that the messages it pulls have only an htmlBody and no textBody. I need the messages properly formatted in plain text. Does SnapLogic have an easy way to do this?

I have found and worked with several regexes that strip out the HTML tags, but what is left is formatted inconsistently in most cases, which can make the plain text messy.

Other options?
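
One way to tame the leftover formatting described above is to turn line-breaking tags into real line breaks before stripping everything else, then collapse the blank-line runs. The following is only a sketch in plain JavaScript, not a SnapLogic built-in: `htmlToText` is a hypothetical helper name, the list of block-level tags is an assumption, and it treats whitespace in the HTML source as insignificant (so it would not suit `<pre>` content).

```javascript
// Hypothetical helper; a rough sketch, not a full HTML parser.
function htmlToText(html) {
  return html
    .replace(/<(style|script)[^>]*>[\s\S]*?<\/\1>/gi, '') // drop CSS/JS bodies entirely
    .replace(/\s+/g, ' ')                                  // source whitespace becomes single spaces
    .replace(/<br\s*\/?>/gi, '\n')                         // <br> becomes a line break
    .replace(/<\/(p|div|li|tr|h[1-6])>/gi, '\n')           // assumed set of block-closing tags
    .replace(/<[^>]+>/g, '')                               // strip every remaining tag
    .replace(/ *\n */g, '\n')                              // trim spaces around line breaks
    .replace(/\n{3,}/g, '\n\n')                            // collapse runs of blank lines
    .trim();
}

console.log(htmlToText('<p>Hello <b>world</b></p><p>Second<br>line</p>'));
```

HTML entities such as `&nbsp;` are left untouched here; decoding them would be a separate step.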

8 Replies

  • Hi

    You can try the function below 🙂

    replace(/(<([^>]+)>)/ig, '')

    Thanks

  • dcarlson
    New Contributor

    That doesn’t do the trick. What the originator Drew and I are looking for is a way to transform this…

    \r\n\r\n\r\n\r\n\r\n\r\n
    \r\n

    \r\n
    This
    \r\nis
    \r\na
    \r\nplain
    \r\ntext
    \r\nbody?
    \r\n

    \r\n


    \r\n

    \r\n
    \r\n\r\n\r\n

    …into this…

    This\r\nis\r\na\r\nplain\r\ntext\r\nbody?\r\n

    In other words, we don’t need to decode… we need to strip all of the HTML tags.

    Does anyone know an easy way to do this in SnapLogic? Otherwise, the only option seems to be to build a library that knows how to do it. But someone else must have already invented that mousetrap, eh? Thanks - Davey

    • dcarlson
      New Contributor

      Ha… this website is rich text, and renders the HTML tags instead of showing them to you. 😉

      Let’s try this: Convert THIS… <html>\r\n<head>\r\n<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">\r\n<style type="text/css" style="display:none;"><!-- P {margin-top:0;margin-bottom:0;} --></style>\r\n</head>\r\n<body dir="ltr">\r\n<div id="divtagdefaultwrapper" style="font-size:12pt;color:#000000;font-family:Calibri,Helvetica,sans-serif;" dir="ltr">\r\n<p></p>\r\n<div>This<br>\r\nis<br>\r\na<br>\r\nplain<br>\r\ntext<br>\r\nbody?</div>\r\n<p></p>\r\n<p><br>\r\n</p>\r\n</div>\r\n</body>\r\n</html>\r\n
      … to plain text.

      • dmiller
        Former Employee

        When writing a post in the Community, there is a Preformatted text option you can use with scripts to prevent it from being converted.

  • dcarlson
    New Contributor

    Yup, that’s what I did on my 2nd attempt.

    For anyone else experiencing this problem… this is the expression I wrote for our library to strip (at least) the HTML that I am seeing; might need more tags added to it.

    {
    stripHTML: x => x.replaceAll(x.slice(x.indexOf("<b"),x.indexOf(">",x.indexOf("<b"))+1),"").replaceAll("</b>","").replaceAll(x.slice(x.indexOf("<span"),x.indexOf(">",x.indexOf("<span"))+1),"").replaceAll("</span>","").replaceAll(x.slice(x.indexOf("<i"),x.indexOf(">",x.indexOf("<i"))+1),"").replaceAll("</i>","").replaceAll(x.slice(x.indexOf("<u"),x.indexOf(">",x.indexOf("<u"))+1),"").replaceAll("</u>","").replaceAll(x.slice(x.indexOf("<br"),x.indexOf(">",x.indexOf("<br"))+1),"").replaceAll("</br>","").replaceAll("<br>","").replaceAll(x.slice(x.indexOf("<font"),x.indexOf(">",x.indexOf("<font"))+1),"").replaceAll("</font>","").replaceAll(x.slice(x.indexOf("<html"),x.indexOf(">",x.indexOf("<html"))+1),"").replaceAll("</html>","").replaceAll(x.slice(x.indexOf("<body"),x.indexOf(">",x.indexOf("<body"))+1),"").replaceAll("</body>","").replaceAll(x.slice(x.indexOf("<head"),x.indexOf(">",x.indexOf("<head"))+1),"").replaceAll("</head>","").replaceAll(x.slice(x.indexOf("<meta"),x.indexOf(">",x.indexOf("<meta"))+1),"").replaceAll("</meta>","").replaceAll(x.slice(x.indexOf("<style"),x.indexOf(">",x.indexOf("<style"))+1),"").replaceAll("</style>","").replaceAll(x.slice(x.indexOf("<!"),x.indexOf(">",x.indexOf("<!"))+1),"").replaceAll("</!>","").replaceAll(x.slice(x.indexOf("<div"),x.indexOf(">",x.indexOf("<div"))+1),"").replaceAll("</div>","").replaceAll(x.slice(x.indexOf("<p"),x.indexOf(">",x.indexOf("<p"))+1),"").replaceAll(x.slice(x.indexOf("<div"),x.indexOf(">",x.indexOf("<div"))+1),"").replaceAll("</div>","").replaceAll("</p>","").replaceAll("//r","").replaceAll("<div>","").trim()
    }
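
A caveat worth noting about the slice/indexOf pattern above, sketched in plain JavaScript (the SnapLogic expression language may behave differently in detail): each `slice` captures only the exact text of the first opening tag of that kind, so a later tag with different attributes is left in place. The sample string is made up for illustration.

```javascript
// Two <span> tags with different attributes (made-up sample input).
const x = '<span style="a">one</span> <span style="b">two</span>';

// Same pattern as in the library expression: grab the first "<span ...>" verbatim.
const firstOpen = x.slice(x.indexOf('<span'), x.indexOf('>', x.indexOf('<span')) + 1);

// Only tags identical to the first one are removed; the second opening tag survives.
const result = x.replaceAll(firstOpen, '').replaceAll('</span>', '');

console.log(firstOpen); // <span style="a">
console.log(result);    // one <span style="b">two
```

That is the reason the pattern needs "more tags added to it" over time, and why a single pattern that matches any tag scales better.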
    
  • @dcarlson - you could also just use a simple regular expression to strip away all the HTML tags. Assuming that your HTML string is in a single variable called “content”:

    $content.replace(/<(.|\n)*?>/ig,'')
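
To see what this expression does with the sample body posted earlier in the thread, here is a sketch in plain JavaScript rather than a SnapLogic pipeline. It assumes the `\r\n` sequences in the post stand for real CR/LF characters; `stripTags` is a hypothetical helper name, and the final `trim()` is an extra step to drop the leftover blank lines.

```javascript
// Sample body from earlier in the thread, rebuilt with real CR/LF line breaks.
const htmlBody = [
  '<html>',
  '<head>',
  '<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">',
  '<style type="text/css" style="display:none;"><!-- P {margin-top:0;margin-bottom:0;} --></style>',
  '</head>',
  '<body dir="ltr">',
  '<div id="divtagdefaultwrapper" style="font-size:12pt;color:#000000;font-family:Calibri,Helvetica,sans-serif;" dir="ltr">',
  '<p></p>',
  '<div>This<br>',
  'is<br>',
  'a<br>',
  'plain<br>',
  'text<br>',
  'body?</div>',
  '<p></p>',
  '<p><br>',
  '</p>',
  '</div>',
  '</body>',
  '</html>',
  ''
].join('\r\n');

// Hypothetical helper; the regex is the one from the reply above.
function stripTags(html) {
  // The g flag applies the replacement to every match, not just the first.
  return html.replace(/<(.|\n)*?>/ig, '');
}

// Stripping leaves runs of empty lines at both ends; trim() removes them.
const textBody = stripTags(htmlBody).trim();
console.log(JSON.stringify(textBody)); // "This\r\nis\r\na\r\nplain\r\ntext\r\nbody?"
```

The CSS inside the `<style>` element disappears only because it is wrapped in an HTML comment, which the pattern treats as one long tag; uncommented CSS would remain in the output.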
    
    • dcarlson
      New Contributor

      Shucks… vaidyarm had it right all this time. I had tried that with .replaceAll() and it didn’t work; but with .replace() like you both said… works. I would not have expected to use .replace() to strip ALL the tags. Live and learn.
      Thank you both - Davey
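
On the surprise that `.replace()` strips all the tags: in plain JavaScript, which these expressions resemble, `replace()` with a string pattern touches only the first occurrence, while a regular expression carrying the `g` flag is applied to every match. The sketch below shows the difference; whether SnapLogic's own `replace` and `replaceAll` follow exactly the same rules is an assumption here, not something confirmed in this thread.

```javascript
const s = '<b>one</b> and <b>two</b>';

// String pattern: only the first occurrence is replaced.
const firstOnly = s.replace('<b>', '');

// Regex with the g flag: every match is replaced.
const everyTag = s.replace(/<[^>]+>/g, '');

console.log(firstOnly); // one</b> and <b>two</b>
console.log(everyTag);  // one and two
```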