Category Archives: Big Data

Install HBase on Linux dev

My dev environment

Ubuntu 14.04 LTS, HBase 1.2.1, Java 8, Hadoop 2.2.0

Prerequisites:

Java. How to install: http://dmitrypukhov.pro/install-java8-on-ubuntu/

Hadoop. How to install: http://dmitrypukhov.pro/install-hadoop-on-ubuntu/ Continue reading →

Install Zookeeper on Linux

Installing Zookeeper for a dev environment takes only a few steps.

Download Zookeeper from http://zookeeper.apache.org and extract it somewhere, say /opt/zookeeper/
Create a simple zoo.cfg file by copying the sample config /opt/zookeeper/conf/zoo_sample.cfg to /opt/zookeeper/conf/zoo.cfg
Start zookeeper
/opt/zookeeper/bin/zkServer.sh start
Stop zookeeper
/opt/zookeeper/bin/zkServer.sh stop
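
The steps above can be sketched as a small shell helper. This is only a sketch under the same assumption as the steps: Zookeeper is unpacked to /opt/zookeeper/. The ZK_HOME variable and the zk_init_config function are names made up for this example and are not part of Zookeeper.

```shell
#!/bin/sh
# Sketch: prepare zoo.cfg for a dev install of Zookeeper.
# ZK_HOME (hypothetical variable) defaults to /opt/zookeeper, the path used above.
ZK_HOME="${ZK_HOME:-/opt/zookeeper}"

# Copy the sample config to zoo.cfg unless zoo.cfg already exists,
# so an existing configuration is never overwritten.
zk_init_config() {
  if [ ! -f "$ZK_HOME/conf/zoo.cfg" ]; then
    cp "$ZK_HOME/conf/zoo_sample.cfg" "$ZK_HOME/conf/zoo.cfg"
  fi
}

# Typical dev session (uncomment to run against a real install):
# zk_init_config
# "$ZK_HOME/bin/zkServer.sh" start
# "$ZK_HOME/bin/zkServer.sh" status
# "$ZK_HOME/bin/zkServer.sh" stop
```

The status call is a quick way to confirm the server actually came up before pointing HBase or anything else at it.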

HBase shell on AWS EMR cluster quickstart

How to start the HBase client on AWS EMR and query an external HBase database.

1. Create an EMR cluster with the HBase application enabled, either manually or with a command like this:

aws emr create-cluster \
	  --release-label emr-4.6.0 \
	  --name "my-hbase-cluster" \
	  --instance-type r3.2xlarge \
	  --instance-count 2 \
	  --enable-debugging \
	  --ec2-attributes KeyName=MyKeyPair \
	  --use-default-roles \
	  --applications Name=Hadoop Name=HBase

2. Establish ssh connection to the cluster

ssh -i ~/MyKeyPair.pem hadoop@<cluster-ip-address>.us-west-2.compute.amazonaws.com

3. To work with an external database, set the Zookeeper quorum in /etc/hbase/conf/hbase-site.xml:

<configuration>
   <property>
      <name>hbase.zookeeper.quorum</name>
      <value>my-hbase-zookeeper-address</value>
   </property>
....
</configuration>

4. Start the HBase shell:

hbase shell

5. In the shell, run queries like these:

create 'person', {NAME=>'name'}, {NAME=>'addr'}
put   'person',  '1',  'name:firstName', 'John'
put   'person',  '1',  'name:lastName', 'Smith'
put   'person',  '1',  'addr:planet', 'Earth'
put   'person',  '1',  'addr:continent', 'Australia'

list
describe 'person'
scan 'person'
get 'person', '1', {COLUMNS => ['name']}
get 'person', '1', {COLUMNS => ['addr:planet']}
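
Before starting the HBase shell against an external database, it can help to confirm which quorum the client will actually use. The sketch below reads the value back from hbase-site.xml. It assumes the name and value elements sit on consecutive lines, as in the configuration snippet above, and hbase_quorum is a hypothetical helper name, not an HBase command.

```shell
#!/bin/sh
# Print the hbase.zookeeper.quorum value from an hbase-site.xml file.
# Assumes <name> and <value> are on consecutive lines; this is a simple
# line-based check, not a real XML parser.
hbase_quorum() {
  grep -A1 '<name>hbase.zookeeper.quorum</name>' "$1" \
    | sed -n 's:.*<value>\(.*\)</value>.*:\1:p'
}

# Example (on the EMR master node):
# hbase_quorum /etc/hbase/conf/hbase-site.xml
```

If the function prints nothing, the property is either missing or laid out differently from the snippet, and the file is worth opening by hand.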

s3-dist-cp is missing in EMR 4

I ran into an issue with the s3-dist-cp command on a Spark AWS EMR 4.5 cluster.

The issue: the s3-dist-cp command step fails with the error: java.lang.RuntimeException: java.io.IOException: Cannot run program "s3-dist-cp" (in directory "."): error=2, No such file or directory Continue reading →

Create Spark cluster on AWS

Create a Spark cluster and run a custom jar on it. Continue reading →

s3cmd WARNING: Retrying failed request

I used the s3cmd cp and s3cmd sync commands to copy files from a source S3 folder to an S3 destination. Copying was too slow, and I saw many attempts that failed on timeout, like this:

WARNING: Retrying failed request: ... (timeout)
WARNING: Waiting 3 sec...

In the ~/.s3cfg file I found the socket_timeout setting, which is 100 by default. Setting it to 1000 helped me:

socket_timeout = 1000
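
The same change can be scripted, which is handy when several machines need it. This is a minimal sketch: set_s3cmd_timeout is a hypothetical helper name, and the append branch assumes the last section of the config file is the one s3cmd reads (normally [default]).

```shell
#!/bin/sh
# Set socket_timeout in an s3cmd config file, keeping a .bak copy of the original.
# Usage: set_s3cmd_timeout <config-file> <seconds>
set_s3cmd_timeout() {
  cfg="$1"
  seconds="$2"
  if grep -q '^socket_timeout[[:space:]]*=' "$cfg"; then
    # Replace the existing line in place (sed -i.bak works with GNU and BSD sed).
    sed -i.bak "s/^socket_timeout[[:space:]]*=.*/socket_timeout = $seconds/" "$cfg"
  else
    # No such line yet: back up the file and append the setting.
    cp "$cfg" "$cfg.bak"
    echo "socket_timeout = $seconds" >> "$cfg"
  fi
}

# Example:
# set_s3cmd_timeout ~/.s3cfg 1000
```

The backup copy makes it easy to revert if a larger timeout does not help.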

Apache Spark error on start: java.lang.NoClassDefFoundError: org/apache/spark/SparkConf

When starting a simple Spark project in IntelliJ IDEA, I get an exception:

Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/spark/SparkConf
at …
Caused by: java.lang.ClassNotFoundException: org.apache.spark.SparkConf
at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
… 2 more

Solution:

Change the scope of the Spark dependencies from provided to compile in pom.xml:

<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-core_2.10</artifactId>
  <version>1.5.2</version>
  <scope>compile</scope>
</dependency>

Install Apache Spark on Ubuntu

Simple steps to run Spark 1.5.2 on Ubuntu 14.04 with YARN (Hadoop 2.7.1)

Install Hadoop (see http://dmitrypukhov.pro/install-hadoop-on-ubuntu/)

Download pre-built Spark from http://spark.apache.org/downloads.html and unpack it, say to /opt/spark/ Continue reading →

Spark SQL: load parquet in Spark Shell

I have my data in parquet format and want to load and query it using Spark SQL.

Start Spark shell

spark-shell

Load the parquet folder into a table

val myDataFrame=sqlContext.load("s3://my-bucket/my-parquet-folder/")
myDataFrame.registerTempTable("myTable")

Now we can use this table for SQL queries:

sqlContext.sql("select * from myTable").first()

Apache Spark Feature http://apache.org/xml/features/xinclude is not recognized

Problem: Apache Spark 1.3.1 application produces the following error:

Exception in thread "main" java.lang.RuntimeException: javax.xml.parsers.ParserConfigurationException: Feature 'http://apache.org/xml/features/xinclude' is not recognized.

Fix: edit pom.xml to use an older version of xercesImpl Continue reading →

Dmitry Pukhov

Software Developer, Architect
