Category Archives: Big Data

Install ZooKeeper on Linux

Installing ZooKeeper for a dev environment takes only a few steps.

  1. Download ZooKeeper from http://zookeeper.apache.org and extract it, for example to /opt/zookeeper/
  2. Create a simple zoo.cfg by copying the sample config /opt/zookeeper/conf/zoo_sample.cfg to /opt/zookeeper/conf/zoo.cfg
  3. Start ZooKeeper:
    /opt/zookeeper/bin/zkServer.sh start
  4. Stop ZooKeeper:
    /opt/zookeeper/bin/zkServer.sh stop
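
Step 2 is easy to script. Here is a minimal sketch, assuming ZooKeeper is already extracted to /opt/zookeeper. The ZK_HOME variable and the prepare_zk_config function are names made up for this example and are not part of ZooKeeper.

```shell
#!/bin/sh
# Sketch only: ZK_HOME and prepare_zk_config are illustrative names.
ZK_HOME="${ZK_HOME:-/opt/zookeeper}"

# Create conf/zoo.cfg from the bundled sample unless it already exists,
# so an existing configuration is never overwritten.
prepare_zk_config() {
    zk_home="${1:-$ZK_HOME}"
    if [ ! -f "$zk_home/conf/zoo.cfg" ]; then
        cp "$zk_home/conf/zoo_sample.cfg" "$zk_home/conf/zoo.cfg"
    fi
}
```

After calling prepare_zk_config, start the server with zkServer.sh start as in step 3. Running zkServer.sh status then reports whether it is up.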

HBase shell on AWS EMR cluster quickstart

How to start the HBase shell on an AWS EMR cluster and query an external HBase database

  1. Create an EMR cluster with the HBase application enabled, either manually or from the command line
  2. Establish an SSH connection to the cluster
  3. To work with an external database, set the ZooKeeper quorum (the hbase.zookeeper.quorum property) in /etc/hbase/conf/hbase-site.xml
  4. Start the HBase shell
  5. Run queries in the shell
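
The last two steps can also be run non-interactively by feeding commands to the HBase shell on standard input. This is a rough sketch, not the original commands from this post. The hbase_peek function, the HBASE_SHELL variable and the table name passed in are assumptions for illustration; list and scan are standard HBase shell commands.

```shell
#!/bin/sh
# Sketch only: HBASE_SHELL and hbase_peek are illustrative names.
# HBASE_SHELL defaults to the "hbase shell" launcher, assumed to be on the PATH.
HBASE_SHELL="${HBASE_SHELL:-hbase shell}"

# Feed two read-only HBase shell commands on stdin:
# list all tables, then scan the first rows of one table.
hbase_peek() {
    table="$1"
    limit="${2:-5}"
    # $HBASE_SHELL is intentionally unquoted so "hbase shell" splits into two words.
    $HBASE_SHELL <<EOF
list
scan '$table', {LIMIT => $limit}
EOF
}
```

For example, hbase_peek mytable 10 lists the tables and prints up to 10 rows of a (hypothetical) table named mytable.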

 

s3cmd WARNING: Retrying failed request

I used the s3cmd cp and s3cmd sync commands to copy files from one S3 folder to another. Copying was very slow, and I saw many retries after requests timed out, each reported as "WARNING: Retrying failed request".

In the ~/.s3cfg file I found the socket_timeout setting, which is 100 by default. Raising it to 1000 (socket_timeout = 1000) helped me.
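
The same change can be made from the command line. This is a small sketch, assuming the config file already contains a socket_timeout line. The set_socket_timeout helper is a name made up for this example and is not part of s3cmd.

```shell
#!/bin/sh
# Sketch only: set_socket_timeout is an illustrative helper, not part of s3cmd.

# Rewrite the existing socket_timeout line of an s3cmd config file.
# If the file has no such line, it is left unchanged.
set_socket_timeout() {
    cfg="$1"
    timeout="$2"
    tmp="$(mktemp)" || return 1
    status=0
    # Edit into a temp file, then write back through the original file
    # so its permissions and ownership are kept.
    sed "s/^socket_timeout[[:space:]]*=.*/socket_timeout = $timeout/" "$cfg" > "$tmp" \
        && cat "$tmp" > "$cfg" || status=1
    rm -f "$tmp"
    return "$status"
}
```

Usage: set_socket_timeout ~/.s3cfg 1000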

 

Apache Spark error on start: java.lang.NoClassDefFoundError: org/apache/spark/SparkConf

Starting a simple Spark project in IntelliJ IDEA fails with an exception:

Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/spark/SparkConf
at …
Caused by: java.lang.ClassNotFoundException: org.apache.spark.SparkConf
at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
… 2 more

Solution:

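Change the scope of the Spark dependencies from provided to compile in pom.xml. With provided scope, Maven expects the runtime environment to supply the classes, so they are missing from the classpath when the application is launched directly from the IDE.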

Spark SQL: load parquet in Spark Shell

My data is stored in Parquet format, and I want to load and query it with Spark SQL.

Start the Spark shell, load the Parquet folder, and register it as a table. The table can then be used in SQL queries.
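
The whole sequence can be driven from a shell script by piping Scala statements into the Spark shell. This is a sketch rather than the original commands from this post. It assumes Spark 2.0 or later, where the shell provides a spark session; on Spark 1.4 to 1.6 the equivalents are sqlContext.read.parquet and registerTempTable. The query_parquet function and the SPARK_SHELL variable are names made up for this example, as are the path, table name and query in the usage line.

```shell
#!/bin/sh
# Sketch only: SPARK_SHELL and query_parquet are illustrative names.
SPARK_SHELL="${SPARK_SHELL:-spark-shell}"

# Pipe Scala statements into the Spark shell: load a Parquet folder,
# register it as a temporary view, and run one SQL query against it.
# The arguments must not contain double quotes or dollar signs,
# because they are pasted into Scala string literals.
query_parquet() {
    parquet_dir="$1"
    table="$2"
    query="$3"
    $SPARK_SHELL <<EOF
val df = spark.read.parquet("$parquet_dir")
df.createOrReplaceTempView("$table")
spark.sql("$query").show()
EOF
}
```

Usage: query_parquet /data/events events "SELECT COUNT(*) FROM events"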

 

 

Apache Spark Feature http://apache.org/xml/features/xinclude is not recognized

Problem: an Apache Spark 1.3.1 application fails with the error "Feature http://apache.org/xml/features/xinclude is not recognized".

Fix: edit pom.xml to use an older version of xercesImpl.