Parse HTML file and map the contents therein

Hi all,

I am trying to read an HTML file, stored in .txt format, using the File Reader. I cannot use third-party libraries in the Script Snap to parse it because I am on a Cloudplex. Please let me know if anyone knows how this can be done.

Regards,
Lidiya

Hi @Lidiya_Thomas ,

You can try with the following:

image

In Binary to Document set:
image

Let me know if this helps you.

BR,
Marjan

Thank you, but I have already tried this approach and it doesn't work in my case.

Hi @Lidiya_Thomas ,

Can you please attach a sample of the file and describe what you want to achieve?

Thanks.

Hey @Lidiya_Thomas,

I think this is a very complex request, especially if the HTML contains scripts and other embedded content. You could investigate some APIs. I tested one manually on its web page and it works:

HTML 2 JSON

Remember to read its security terms and policies before sending any data to it.

Another way would be to use a script. Check out this link, which leads to the documentation of a Python (Jython) library for parsing HTML:

Jython HTML Parser
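
To give an idea of how that parser works, here is a minimal sketch using only the Python standard library. It is event-driven: you subclass `HTMLParser` and override the handlers for start tags, end tags and text. The names `TitleParser` and `extract_title` are made up for illustration, and I am assuming the standard library module is importable in your environment. The wiring into the Script Snap (reading the input document, writing the output) is left out on purpose.

```python
try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser   # Python 2 / Jython 2.7


class TitleParser(HTMLParser):
    """Hypothetical example class: captures the text inside <title>."""

    def __init__(self):
        # Explicit base-class call so this also works on Python 2 / Jython.
        HTMLParser.__init__(self)
        self.in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lowercased, so <TITLE> matches "title".
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        # Text between tags; only keep it while inside <title>.
        if self.in_title:
            self.title += data


def extract_title(html_text):
    """Feed the whole document to the parser and return the title text."""
    parser = TitleParser()
    parser.feed(html_text)
    parser.close()
    return parser.title.strip()
```

Calling `extract_title(...)` on the text of the file returns whatever sits between the opening and closing title tags. The same pattern extends to any other tag you care about.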

I surely hope that you’ll find this helpful,
Bojan

Here is a sample file:

<HTML><HEAD><TITLE>Here goes the title</TITLE>
<STYLE>
body, td { font: normal 10pt sans-serif; margin-left: 5%; margin-right: 5%;}
.articletitle { font: bold 14pt; }
.articleinfo { text-align: right; font-size: 9pt; }
.countrytitle, .citationtitle, .descriptorstitle { font: bold; vertical-align: top; }
.copyright { font: bold 8pt; text-align: right; }
.authorcomment { font: bold; }
.tables caption { font: bold; font: 11px; }
.tables thead td { font: bold; }
.footnotes { margin-top: 10px; font: 9pt; }
</STYLE></HEAD>
<BODY>
<TABLE width="100%"><TR><TD class="articletitle">Title [S]</TD><TD class="articleinfo" rowspan="2">Release Date: 03 Jan 2002<BR/>336</TD></TR><TR><TD class="articlesubtitle">Subject</TD></TR></TABLE>
<HR class="full"/>
<TABLE width="100%"><TR><TD class="text">Description of the case <BR/>sample text[<EM>sample text</EM>] <BR/><SPAN class="authorcomment">Author Comment</SPAN><BR/>&quot;Comment&quot;.</TD></TR></TABLE>
<HR class="light"/>
<TABLE width="100%"><TR><TD class="countrytitle">Country</TD><TD class="country">USA</TD></TR><TR><TD class="citationtitle">Citation</TD><TD class="citation">citation(Spec. issue): http://i.org/10/j.735 (English)</TD></TR><TR><TD class="descriptorstitle">Descriptors</TD><TD class="descriptors"><TABLE width="100%"><TR><TD width="40%" class="descriptor">Elderly</TD><TD width="40%" class="descriptor">Description</TD></TR><TR><TD width="40%" class="descriptor">adverse reactions (serious)</TD><TD width="40%" class="descriptor">errors</TD></TR><TR><TD width="40%" class="descriptor">induced</TD><TD width="40%" class="descriptor">Off-label-use</TD></TR><TR><TD width="40%" class="descriptor">induced</TD></TR></TABLE></TD></TR></TABLE>
<HR class="full"/>
<DIV class="copyright">(c)XYZ</DIV>
</BODY></HTML>

Hi @Lidiya_Thomas ,

Thanks for the sample input.
What would you like the output to look like?

Hi @marjan.karafiloski, thanks for helping. I want all the data in the form of JSON. Please let me know if you have any ideas on how that could be done.
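
One possible way to get JSON out of a file like the sample, sketched with only the Python standard library: collect the text of every `<TD>` and group it by the cell's `class` attribute. This is an assumption about the desired output shape, not a confirmed solution. `CellCollector` and `html_to_json` are hypothetical names, and the code is not wired into the Script Snap. On Python 2 / Jython, entities such as `&quot;` are reported through a separate `handle_entityref` callback, so they would need extra handling there.

```python
import json

try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser   # Python 2 / Jython 2.7


class CellCollector(HTMLParser):
    """Hypothetical helper: collects the text of each <td> with its class."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.cells = []   # finished cells as (class, text) tuples
        self._open = []   # stack of [class, text parts] for open <td>s

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self._open.append([dict(attrs).get("class", ""), []])
        elif tag == "br" and self._open:
            # Treat a line break as whitespace inside the current cell.
            self._open[-1][1].append(" ")

    def handle_endtag(self, tag):
        if tag == "td" and self._open:
            css_class, parts = self._open.pop()
            text = " ".join("".join(parts).split())  # normalize whitespace
            if text:  # skip cells that only wrap a nested table
                self.cells.append((css_class, text))

    def handle_data(self, data):
        # Text belongs to the innermost open cell, if any.
        if self._open:
            self._open[-1][1].append(data)


def html_to_json(html_text):
    """Return a JSON string mapping each td class to a list of its texts."""
    parser = CellCollector()
    parser.feed(html_text)
    parser.close()
    grouped = {}
    for css_class, text in parser.cells:
        grouped.setdefault(css_class, []).append(text)
    return json.dumps(grouped, indent=2, sort_keys=True)
```

Each class maps to a list because some classes, such as `descriptor`, occur several times in the sample. If a flat key/value layout is preferred (for example pairing `countrytitle` with `country`), the grouping step at the end is the part to change.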