I ran into an issue with the s3-dist-cp command on a Spark AWS EMR 4.5 cluster.
The issue: the s3-dist-cp step fails with the error: java.lang.RuntimeException: java.io.IOException: Cannot run program "s3-dist-cp" (in directory "."): error=2, No such file or directory
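The full stack trace lands in the step's stderr log in the S3 log bucket configured at cluster creation (shown below). Something like this retrieves it, assuming EMR's usual log layout (the cluster and step IDs here are placeholders):

# Fetch the failing step's stderr from the S3 log bucket.
# j-XXXXXXXXXXXXX and s-XXXXXXXXXXXXX are placeholder IDs; take the real
# ones from the EMR console or `aws emr list-steps`.
aws s3 cp s3://my-bucket/logs/j-XXXXXXXXXXXXX/steps/s-XXXXXXXXXXXXX/stderr.gz - | gunzip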
The cluster is created by this script:
aws emr create-cluster \
  --release-label emr-4.5.0 \
  --name mySparkCluster \
  --instance-type r3.2xlarge \
  --instance-count 3 \
  --tags MyTag1:MyValue1 MyTag2:MyValue2 \
  --enable-debugging \
  --ec2-attributes KeyName=MyKeyPair \
  --use-default-roles \
  --log-uri s3://my-bucket/logs \
  --applications Name=Spark \
  --configurations file:///home/dmitry/projects/mySparkApp/spark-config.json \
  --steps Name=CopyFromS3ToHdfs,Type=CUSTOM_JAR,ActionOnFailure=TERMINATE_CLUSTER,Jar="command-runner.jar",Args=["s3-dist-cp","--src=s3://my-bucket/source/","--dest=hdfs:///destination"] \
  --auto-terminate
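The --configurations flag points at a local spark-config.json whose contents aren't shown here. For reference, an EMR 4.x configurations file is a JSON array of classification objects; a minimal sketch, with the maximizeResourceAllocation setting only as an assumed example:

# Write a minimal EMR configurations file (contents assumed; the
# original spark-config.json is not shown in this post).
cat > spark-config.json <<'EOF'
[
  {
    "Classification": "spark",
    "Properties": { "maximizeResourceAllocation": "true" }
  }
]
EOF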
Solution:
Change the --applications parameter to install the Hadoop application alongside Spark: --applications Name=Hadoop Name=Spark. On EMR 4.x, s3-dist-cp is installed as part of Hadoop, so a cluster created with Spark alone does not have the executable on its path. The corrected creation script is:
aws emr create-cluster \
  --release-label emr-4.5.0 \
  --name mySparkCluster \
  --instance-type r3.2xlarge \
  --instance-count 3 \
  --tags MyTag1:MyValue1 MyTag2:MyValue2 \
  --enable-debugging \
  --ec2-attributes KeyName=MyKeyPair \
  --use-default-roles \
  --log-uri s3://my-bucket/logs \
  --applications Name=Hadoop Name=Spark \
  --configurations file:///home/dmitry/projects/mySparkApp/spark-config.json \
  --steps Name=CopyFromS3ToHdfs,Type=CUSTOM_JAR,ActionOnFailure=TERMINATE_CLUSTER,Jar="command-runner.jar",Args=["s3-dist-cp","--src=s3://my-bucket/source/","--dest=hdfs:///destination"] \
  --auto-terminate
With Hadoop installed, the s3-dist-cp executable is present on the cluster and the step runs successfully.
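If the cluster (created with Hadoop) is already running, the same copy can also be submitted to it with add-steps instead of recreating the cluster; a minimal sketch, with the cluster ID as a placeholder:

# Submit the same s3-dist-cp copy as a step on a running cluster.
# j-XXXXXXXXXXXXX is a placeholder for the real cluster ID.
aws emr add-steps \
  --cluster-id j-XXXXXXXXXXXXX \
  --steps Name=CopyFromS3ToHdfs,Type=CUSTOM_JAR,ActionOnFailure=CONTINUE,Jar="command-runner.jar",Args=["s3-dist-cp","--src=s3://my-bucket/source/","--dest=hdfs:///destination"]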