Setting up Kerberos on SnapLogic groundplex for authentication to Cloudera - Hive account and snap pack
I'm looking for information on how to set up Kerberos on a SnapLogic Groundplex for authentication to Cloudera. I want to use the Hive account and Snap Pack. I see this documentation: https://docs-snaplogic.atlassian.net/wiki/spaces/SD/pages/2015960/How+to+Configure+a+Groundplex+for+CDH+with+Kerberos+Authentication

Is that all there is to it, or are there more steps?
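In addition to the steps in that document, a quick sanity check from a Groundplex node is to confirm that a ticket can be obtained from the keytab and that HiveServer2 accepts a Kerberized connection. The keytab path, principal, hostname, and realm below are placeholders, and Beeline is used only as a convenient test client if it happens to be installed on the node:

    # Obtain and inspect a ticket from the keytab (placeholder path and principal)
    kinit -kt /etc/security/keytabs/snaplogic.keytab snaplogic@EXAMPLE.COM
    klist

    # Test a Kerberized HiveServer2 connection; the ";principal=" value must match
    # the HiveServer2 service principal, typically hive/_HOST@<REALM>
    beeline -u "jdbc:hive2://hive-host.example.com:10000/default;principal=hive/_HOST@EXAMPLE.COM" -e "show databases;"

If both succeed from every Groundplex node, any remaining work is on the account/Snap side rather than the Kerberos client setup.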
KERBEROS issues - HDFS cluster

Hi, we've implemented Kerberos in our Hadoop cluster, but we are having issues with our pipelines. We are getting this error:

    Reason: org.apache.hadoop.security.AccessControlException: SIMPLE authentication is not enabled. Available:[TOKEN, KERBEROS]

    Caused by: java.util.concurrent.ExecutionException: org.apache.hadoop.security.AccessControlException: SIMPLE authentication is not enabled. Available:[TOKEN, KERBEROS]
        at java.util.concurrent.FutureTask.report(FutureTask.java:122)
        at java.util.concurrent.FutureTask.get(FutureTask.java:206)
        at com.snaplogic.snap.api.binary.SimpleReader.doRead(SimpleReader.java:332)
        ... 30 more
    Caused by: org.apache.hadoop.security.AccessControlException: SIMPLE authentication is not enabled. Available:[TOKEN, KERBEROS]
        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)

Everything was configured correctly: the keytabs are OK, I'm able to do a kinit from any Groundplex node, and the time on the KDC servers and Kerberos clients is synced. Pipelines worked fine for a while, but after we restarted the JCC process on the Groundplex nodes, pipelines started to fail. The JCE extension was installed as per https://docs-snaplogic.atlassian.net/wiki/spaces/SD/pages/2015960/How+to+Configure+a+Groundplex+for+CDH+with+Kerberos+Authentication

Any ideas? Thank you, Petrica
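The "SIMPLE authentication is not enabled. Available:[TOKEN, KERBEROS]" message typically means the client attempted simple authentication against a Kerberized service. Since the trouble began right after the JCC restart, it is worth confirming that the configuration the JCC process itself reads (as opposed to your interactive shell) still points at Kerberos. The paths and restart command below are assumptions about a typical Linux Groundplex install:

    # The interactive kinit proves the OS-level setup; also confirm the keytab
    # and Hadoop client config that the JCC itself reads (placeholder paths)
    klist -kt /etc/security/keytabs/snaplogic.keytab
    grep -A 1 "hadoop.security.authentication" /etc/hadoop/conf/core-site.xml

    # If the config had to be corrected, restart the JCC so it is re-read
    # (script location varies by install; this path is an assumption)
    /opt/snaplogic/bin/jcc.sh restart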
The Data Lake will get Cloudy in 2017

Written by Chief Enterprise Architect, Ravi Dharnikota. Originally posted in SandHill Magazine.

In 2016, we saw a lot of interesting discussions around the data lake. In the spirit of reflecting on the past and looking at the future of this technology, here are the significant themes I foresee in 2017 for the enterprise data lake.

The data lake is no longer just Hadoop

Until recently, the enterprise data lake has largely been discussed, designed and implemented around Hadoop. Doug Cutting, creator of Hadoop, commented that Hadoop at its core is HDFS, YARN and MapReduce. He went on to say that Spark, for many, is now favored over MapReduce. So that leaves a distributed storage system (HDFS) and a resource negotiator (YARN). HDFS has alternatives in S3, Azure Blob/Azure Data Lake Store and Google Cloud Storage. YARN has alternatives like Mesos. Also, if you want to run Spark, it has its own resource management and job scheduling built in when run in standalone mode. This brings other alternatives, such as S3/EMR/Redshift/Spark/Azure, which are outside of traditional Hadoop, into the enterprise data lake discussion.

Separation of storage and compute

Elasticity is among the most attractive promises of the cloud: pay for what you consume when you need it, thus keeping your costs under control. It also leads to simple operational models where you can scale and manage your storage and compute independently. An example is a marketing team that buys terabytes of data to merge with other internal data, but quickly realizes that the quality of this data leaves a lot to be desired. Their internal on-premises enterprise data lake is built for regular, predictable batch loads, and investing in more nodes in their Hadoop cluster is not an option. Such a burst load can be handled well by using elastic compute that is separate from storage.

Data lake "cloudification"

Hadoop has proven a challenge for enterprises, especially from an admin and skill-set perspective. This has resulted in data lakes with the right intent, but poor execution, stuck in labs. At this point, the cloud seems like a great alternative, with the promise of low to no maintenance and removing the skill set required for management, maintenance and administration of the data lake platform infrastructure. The cloud also allows the freedom to research new technologies (which are changing rapidly in the big data world) and experiment with solutions without a large up-front investment.

Avoiding vendor lock-in

Various cloud vendors today provide solutions for the data lake. Each vendor has its own advantage with respect to technology, ecosystem, price, ease of use, etc. Multiple options for cloud data lake platforms give the enterprise exposure to the latest innovations on the one hand, while on the other hand they make it hard to insulate the consumers of these platforms from change. Like most technology on-ramps, vendors and customers will predictably try many solutions until they settle on one. Vendor changes will come frequently and significantly as their functionality and market focus improve. Enterprises will likely have expert infrastructure teams trying to implement and learning to tweak the interfaces of several vendor offerings concurrently. All this is a good and healthy practice. Application teams will be subjected to all this vendor and solution shakeout.
Hundreds of users who have better things to do will naturally become inhibitors to the infrastructure team's need to make continuous improvements in the cloud infrastructure selection. There needs to be a strategy to protect the application teams in this multi-cloud data lake environment against frequent changes to their data pipelines.

Operationalizing the data lake

The enterprise is a retrofit job. Nothing can be introduced into the enterprise in isolation. Any new system needs to work well with the technology, skill sets and processes that already exist. As data lakes graduate from the labs and become ready for consumption by the enterprise, one of the biggest challenges to adoption and success is retrofitting and operationalizing them. A well-oiled process in an enterprise is mostly automated. How Continuous Integration and Continuous Deployment (CI/CD), versioning and collaboration are handled will define data lake success in an enterprise. Self-service usage, driven by ease of use, API access and a security framework that supports authentication and authorization in the enterprise with well-defined access controls, streamlines consumption of data and services and prevents chaos.

As the demand for non-traditional data integration increases, the themes that have developed over the last year should improve the success of enterprise data lake implementation and adoption.
Updating Databricks/Delta Lake tables in Standard Pipelines

Hi community, I was wondering if anyone has tried doing inserts/updates to a Databricks/Delta Lake table using the generic JDBC Snaps (or any other workaround if you have one). Using the latest Databricks JDBC driver, I was able to set up an account successfully, but I only seem to be able to run SELECT statements using the Generic JDBC - Execute Snap. If I try any other statement, such as an INSERT or UPDATE, I get the following error:

    Failure: SQL operation failed, Reason: SQL [null]; Error message not found: NOT_IMPLEMENTED. Can't find resource for bundle java.util.PropertyResourceBundle, key NOT_IMPLEMENTED, Resolution: Please check for valid Snap properties and input data.
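For reference, Delta Lake does support INSERT, UPDATE and MERGE through Spark SQL, so one way to narrow this down is to run the same statements in a Databricks notebook or SQL endpoint and compare the result with the Execute Snap; if they succeed there, the limitation is on the JDBC driver/Snap side rather than the table. The catalog, table and column names below are made up for illustration:

    -- Hypothetical table and columns, only to illustrate Delta SQL syntax
    INSERT INTO demo.customers (id, name, updated_at)
    VALUES (42, 'Acme Corp', current_timestamp());

    UPDATE demo.customers
    SET name = 'Acme Corporation'
    WHERE id = 42;

    -- Upserts are usually expressed as a MERGE from a staging table or view
    MERGE INTO demo.customers AS t
    USING demo.customers_staging AS s
      ON t.id = s.id
    WHEN MATCHED THEN UPDATE SET t.name = s.name, t.updated_at = s.updated_at
    WHEN NOT MATCHED THEN INSERT (id, name, updated_at) VALUES (s.id, s.name, s.updated_at);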
Configuring SnapLogic Hive Account for Hive on HortonWorks HDP (No Kerberos)

Below are the steps to configure the Hive account in SnapLogic to connect to a Hive database on HortonWorks HDP.

Step 1: Go to Product Downloads | Cloudera and download the HortonWorks JDBC Driver for Apache Hive.

Step 2: If Kerberos is not enabled on HortonWorks HDP, create a Hive account and use the sample below as a reference.
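As a quick way to verify the connection details before plugging them into the account, the standard non-Kerberos HiveServer2 JDBC URL has the form shown below; the hostname, database and credentials are placeholders, and 10000 is just the default HiveServer2 port:

    # Placeholder host, port, database and credentials
    beeline -u "jdbc:hive2://hdp-hive-host.example.com:10000/default" -n hive_user -p hive_password -e "show databases;"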
Duplicate data using SQL offset

We are attempting to load transactional tables in a pipeline we created.

Issue faced: While loading data from a transactional table using limit and offset in the SQL Server Select Snap (partitioning and loading), we are getting duplicate records in the destination table.

Suspected cause: While the data is being loaded, the source table is also being updated (inserts, deletes and updates), so the rows behind a given limit/offset window may shift, resulting in duplicate reads.

We looked at the documentation to understand how the offset feature works, but there is not much detail there. Can someone share further insight into this feature and how it works, and what may be contributing to the above issue?
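For what it's worth, in SQL Server OFFSET/FETCH is only deterministic when the ORDER BY uses a unique, stable key, and even then rows inserted or deleted between page reads shift the window, which produces exactly this kind of duplication (or gaps). A rough sketch of the two usual patterns, with placeholder table and column names:

    -- Offset paging: requires ORDER BY, and is only stable on a unique key;
    -- still vulnerable to rows added/removed between pages
    SELECT id, payload
    FROM dbo.transactions
    ORDER BY id
    OFFSET 10000 ROWS FETCH NEXT 5000 ROWS ONLY;

    -- Keyset (seek) paging: carry the last key read into the next query,
    -- so concurrent inserts/deletes cannot shift the window
    DECLARE @last_id_read BIGINT = 15000;
    SELECT TOP (5000) id, payload
    FROM dbo.transactions
    WHERE id > @last_id_read
    ORDER BY id;

If the Snap only exposes limit/offset, ordering on a unique key and filtering on an immutable watermark (such as an identity or created-at column) are the usual ways to keep the pages consistent while the source keeps changing.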
HDFS Reader - Reading a partition with space

I am using the HDFS Reader Snap to read data from Hadoop for ETL. I am unable to read a few partitions whose names contain a space character. I am able to list the same partitions using the hadoop fs -ls command by putting an escape character before the space (\ ). I have tried replacing the space with the escape character and with %20, but it didn't work. Can anyone suggest a workaround for reading partitions that contain a space?
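Not a Snap-side answer, but one thing worth ruling out first is that the character in the partition name really is a plain ASCII space and not something like a non-breaking space, since the two look identical in a normal listing. A quick check from any cluster node (the paths below are placeholders):

    # Show the raw bytes of the entry names under the parent directory;
    # a plain space prints as-is, unusual characters show up as escape sequences
    hadoop fs -ls "/warehouse/events/dt=2021-01-01" | cat -A

    # On the CLI, quoting the whole path is equivalent to escaping the space
    hadoop fs -ls "/warehouse/events/dt=2021-01-01/region=us east"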