Tag Archives: Big Data

s3cmd WARNING: Retrying failed request

I used the s3cmd cp and s3cmd sync commands to copy files from a source S3 folder to an S3 destination. Copying was very slow, and many requests failed by timeout and were retried with the warning from the title.
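The copy commands looked roughly like this (the bucket names and prefixes here are placeholders, not the real ones):

s3cmd cp --recursive s3://source-bucket/data/ s3://destination-bucket/data/
s3cmd sync s3://source-bucket/data/ s3://destination-bucket/data/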

In the ~/.s3cfg file I found the socket_timeout setting, which is 100 by default. Raising it to 1000 helped me:
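The relevant line in ~/.s3cfg after the change (the rest of the file stays as s3cmd --configure generated it):

socket_timeout = 1000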


Spark SQL: load parquet in Spark Shell

I have my data in parquet format and want to load and query it using Spark SQL.

Start the Spark shell
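From the Spark installation directory this is just the following (a local master is enough for trying things out; a 1.3.x shell also creates a ready-to-use sqlContext):

./bin/spark-shell --master local[*]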

Load the parquet folder into a table
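A minimal sketch, assuming a Spark 1.3.x shell and Parquet files under a hypothetical /data/events folder; the table name is also just an example:

// load the Parquet folder into a DataFrame and expose it as a temporary table
val df = sqlContext.parquetFile("/data/events")   // on Spark 1.4+ use sqlContext.read.parquet(...)
df.printSchema()                                  // verify the columns were picked up
df.registerTempTable("events")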

Now we can use this table for SQL queries:
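For example, against the hypothetical events table from above:

sqlContext.sql("SELECT COUNT(*) FROM events").show()
sqlContext.sql("SELECT * FROM events LIMIT 10").show()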


Apache Spark: Feature http://apache.org/xml/features/xinclude is not recognized

Problem: an Apache Spark 1.3.1 application fails with the error from the title, Feature http://apache.org/xml/features/xinclude is not recognized.

Fix: edit pom.xml to pin an older version of xercesImpl.
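One possible shape of the override in pom.xml; xerces:xercesImpl are the real Maven coordinates, but the version below is only an illustration, not necessarily the one the original fix settled on:

<!-- force a specific xercesImpl instead of the transitively resolved one; version is an example -->
<dependency>
  <groupId>xerces</groupId>
  <artifactId>xercesImpl</artifactId>
  <version>2.9.1</version>
</dependency>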

Clear Apache Storm cluster remotely

These are my bash scripts for clearing a Storm and ZooKeeper cluster remotely over SSH.

Main idea:

1. Connect to every ZooKeeper server over SSH, stop ZooKeeper and delete its data folder.
2. Connect to every Storm node over SSH, kill the Storm processes and delete the data folder.
3. Connect to the ZooKeeper servers again and start them.
4. Connect to the Storm nodes again and start them.
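A condensed sketch of that flow in one script. The hostnames, data directories and the assumption that zkServer.sh and the storm command are on the PATH are all mine; match them to your zoo.cfg dataDir and storm.yaml storm.local.dir. Storm 0.x daemons live under the backtype.storm package; for Storm 1.x the prefix is org.apache.storm.

#!/usr/bin/env bash
# Sketch only: adjust hosts, paths and package prefix to your cluster.
ZK_HOSTS="zk1 zk2 zk3"
STORM_HOSTS="storm1 storm2 storm3"
NIMBUS_HOST="storm1"
ZK_DATA="/var/zookeeper/data"
STORM_DATA="/var/storm"

# 1. Stop ZooKeeper on every server and wipe its data folder
for h in $ZK_HOSTS; do
  ssh "$h" "zkServer.sh stop; rm -rf $ZK_DATA/*"
done

# 2. Kill Storm daemons on every node and wipe the local state
#    (the [.] trick keeps pkill from matching this very command line)
for h in $STORM_HOSTS; do
  ssh "$h" "pkill -f 'backtype[.]storm' || true; rm -rf $STORM_DATA/*"
done

# 3. Start ZooKeeper again
for h in $ZK_HOSTS; do
  ssh "$h" "zkServer.sh start"
done

# 4. Start Storm again: nimbus and ui on one node, a supervisor on every node
ssh "$NIMBUS_HOST" "nohup storm nimbus > /dev/null 2>&1 & nohup storm ui > /dev/null 2>&1 &"
for h in $STORM_HOSTS; do
  ssh "$h" "nohup storm supervisor > /dev/null 2>&1 &"
done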