Spark job submitted in YARN client mode never starts running
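I am using Spark 1.3.1 and submitting jobs through the YARN client to a Hadoop cluster secured with Kerberos.
After submission, the application keeps printing the following message in a loop and never leaves the ACCEPTED state: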
15/03/31 09:00:45 INFO yarn.Client: Application report for application_1427763283312_0001 (state: ACCEPTED)
15/03/31 09:00:46 INFO yarn.Client: Application report for application_1427763283312_0001 (state: ACCEPTED)
15/03/31 09:00:47 INFO yarn.Client: Application report for application_1427763283312_0001 (state: ACCEPTED)
15/03/31 09:00:48 INFO yarn.Client: Application report for application_1427763283312_0001 (state: ACCEPTED)
15/03/31 09:00:49 INFO yarn.Client: Application report for application_1427763283312_0001 (state: ACCEPTED)
15/03/31 09:00:50 INFO yarn.Client: Application report for application_1427763283312_0001 (state: ACCEPTED)
15/03/31 09:00:51 INFO yarn.Client: Application report for application_1427763283312_0001 (state: ACCEPTED)
15/03/31 09:00:52 INFO yarn.Client: Application report for application_1427763283312_0001 (state: ACCEPTED)
15/03/31 09:00:53 INFO yarn.Client: Application report for application_1427763283312_0001 (state: ACCEPTED)
15/03/31 09:00:54 INFO yarn.Client: Application report for application_1427763283312_0001 (state: ACCEPTED)
15/03/31 09:00:55 INFO yarn.Client: Application report for application_1427763283312_0001 (state: ACCEPTED)
15/03/31 09:00:56 INFO yarn.Client: Application report for application_1427763283312_0001 (state: ACCEPTED)
15/03/31 09:00:57 INFO yarn.Client: Application report for application_1427763283312_0001 (state: ACCEPTED)
15/03/31 09:00:58 INFO yarn.Client: Application report for application_1427763283312_0001 (state: ACCEPTED)
15/03/31 09:00:59 INFO yarn.Client: Application report for application_1427763283312_0001 (state: ACCEPTED)
15/03/31 09:01:00 INFO yarn.Client: Application report for application_1427763283312_0001 (state: ACCEPTED)
15/03/31 09:01:02 INFO yarn.Client: Application report for application_1427763283312_0001 (state: ACCEPTED)
15/03/31 09:01:03 INFO yarn.Client: Application report for application_1427763283312_0001 (state: ACCEPTED)
15/03/31 09:01:04 INFO yarn.Client: Application report for application_1427763283312_0001 (state: ACCEPTED)
15/03/31 09:01:05 INFO yarn.Client: Application report for application_1427763283312_0001 (state: ACCEPTED)
...
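Here is a summary of how I tracked it down:
1> The first suspect is a shortage of memory in the cluster. Check whether each machine has enough free memory (free -g). Other Spark applications that were submitted earlier may also be holding most of the resources.
2> If 1> checks out, verify that the YARN cluster actually started correctly. This is where the pitfall may be: even if the NodeManager process exists on every slave, check the ResourceManager log to confirm that each NodeManager came up successfully. That was my problem: the processes were running, but the log showed the NodeManagers in state UNHEALTHY, so the total memory YARN could see was 0.
I looked into why the nodes were UNHEALTHY: a directory under /tmp had been marked as bad. Since it is only a temporary directory, I deleted it on every NodeManager and restarted the YARN cluster, which finally fixed the problem.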
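
The check in step 2> can also be done without reading the ResourceManager log by hand. Below is a minimal Python sketch, not part of the original fix, that queries the ResourceManager REST endpoints /ws/v1/cluster/metrics and /ws/v1/cluster/nodes and lists nodes reported as UNHEALTHY. It assumes the ResourceManager web UI is reachable over plain HTTP without authentication; on a Kerberos-secured cluster such as the one above, the web endpoints may require SPNEGO, which this sketch does not handle. The function names and the host in the usage note are hypothetical.

```python
import json
from urllib.request import urlopen


def unhealthy_nodes(nodes_payload):
    """Return (node id, health report) pairs for nodes in state UNHEALTHY.

    nodes_payload is the parsed JSON body of /ws/v1/cluster/nodes.
    """
    # The "nodes" value can be null when no NodeManager has registered.
    wrapper = nodes_payload.get("nodes") or {}
    node_list = wrapper.get("node") or []
    return [
        (node.get("id"), node.get("healthReport", ""))
        for node in node_list
        if node.get("state") == "UNHEALTHY"
    ]


def fetch_json(url, timeout=10):
    """GET a URL and parse the response body as JSON."""
    with urlopen(url, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


def check_cluster(rm_base_url):
    """Print total/available memory and any UNHEALTHY nodes.

    rm_base_url is an assumption, e.g. "http://<resourcemanager-host>:8088".
    """
    metrics = fetch_json(rm_base_url + "/ws/v1/cluster/metrics")["clusterMetrics"]
    print("total MB:", metrics["totalMB"], "available MB:", metrics["availableMB"])
    nodes = fetch_json(rm_base_url + "/ws/v1/cluster/nodes")
    for node_id, report in unhealthy_nodes(nodes):
        print("UNHEALTHY:", node_id, "-", report)
```

Calling check_cluster("http://rm-host:8088"), where rm-host is a placeholder for the ResourceManager host, would print a total of 0 MB in the situation described above, together with the health report of each affected node. The command yarn node -list -all gives a similar per-node view from the shell.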