11-30-2023 04:58 AM
I would like to create a pipeline that downloads an Excel file from this webpage https://www.eia.gov/electricity/data/eia861m/#salesrevenue every month. I am currently using web scraping with Python to download the Excel file. Can we do web scraping in a pipeline, or is there a special Snap for this?
11-30-2023 05:36 AM
@dawna14 - are you looking for only the current Excel from the page, or all of them?
12-02-2023 03:34 PM
@koryknick I am looking to download only one specific Excel link. Please see the attachment.
12-02-2023 05:44 AM
@dawna14 - I went for the more generic approach. Please download the attached zip file, decompress it, and import the pipeline from the SLP file.
The HTTP Client is simply getting the webpage contents.
In the "Scrape html for excel file links" snap, the match() method uses a regular expression to find the link anchors, and the resulting array of strings is then filtered for those that contain ".xls".
The "Split anchor references" snap is, I believe, self-explanatory.
The "Map filePath" snap again uses the match() method, this time to extract the file reference. It returns an array of two strings, and we want the last one, hence the pop() method.
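For anyone more comfortable reading Python, here is a rough sketch of the same scraping logic outside the pipeline. It is not the expression from the attached SLP file: the two regular expressions, the function names, and the sample HTML (including its file paths) are made up for illustration, and the real page markup may need a different pattern.

```python
import re

# Hypothetical markup standing in for the page body returned by the HTTP Client.
SAMPLE_HTML = """
<a href="/data/example/report_a.xlsx">Report A</a>
<a href="/data/example/index.html">Overview</a>
<a class="ico" href="archive/report_b.xls">Report B</a>
"""

# Assumed patterns: one for whole opening anchor tags, one for the href value.
ANCHOR_RE = re.compile(r"<a\s[^>]*>", re.IGNORECASE)
HREF_RE = re.compile(r'href="([^"]+)"', re.IGNORECASE)


def excel_anchors(html):
    """Find every opening anchor tag, then keep the ones that mention '.xls'."""
    return [tag for tag in ANCHOR_RE.findall(html) if ".xls" in tag.lower()]


def file_path(anchor):
    """Extract the href value from a single anchor tag.

    The match holds the full matched text plus the captured path; taking the
    last captured group plays the role of pop() on the two-element array.
    """
    found = HREF_RE.search(anchor)
    return found.groups()[-1] if found else None


paths = [file_path(tag) for tag in excel_anchors(SAMPLE_HTML)]
print(paths)
```

Note that filtering on ".xls" also keeps ".xlsx" links, which is usually what you want here.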
The "Get file" snap is another HTTP Client snap that gets the file contents from the relative path. Note that this time, we use a Binary output view on the snap to return the data as a binary stream.
Finally, "Write output file" will write the file out to the SLDB.
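If it helps to compare with a Python script, the last two steps (fetch the file as binary, then write it out) could look roughly like the sketch below. The function names and the destination folder are hypothetical, the page URL is the one from the original question without the fragment, and a local folder stands in for the SLDB.

```python
import urllib.request
from pathlib import Path
from urllib.parse import urljoin

# Page URL from the question, minus the "#salesrevenue" fragment.
PAGE_URL = "https://www.eia.gov/electricity/data/eia861m/"


def resolve(page_url, relative_path):
    """Turn a relative href scraped from the page into an absolute URL."""
    return urljoin(page_url, relative_path)


def download_binary(url, dest_dir):
    """Fetch the URL as raw bytes and write them to dest_dir unchanged."""
    dest = Path(dest_dir) / url.rsplit("/", 1)[-1]
    with urllib.request.urlopen(url) as response:
        # Bytes in, bytes out: no text decoding, so the workbook stays intact.
        dest.write_bytes(response.read())
    return dest


if __name__ == "__main__":
    # Hypothetical relative path; in practice this comes from the scraping step.
    target = resolve(PAGE_URL, "xls/example_report.xlsx")
    print(download_binary(target, "."))
```

The important part is the same as in the pipeline: the file has to be handled as binary from download to write, otherwise the Excel file ends up corrupted.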
Hope this helps!