Scenario 1) Log4j’s RollingFileAppender
Spark uses log4j as its logging facility. The default configuration writes all logs to standard error, which is fine for batch jobs. For streaming jobs, however, it is better to use a rolling-file appender, which cuts log files by size and keeps only a few recent files.

log4j.rootLogger=INFO, rolling
log4j.appender.rolling=org.apache.log4j.RollingFileAppender
log4j.appender.rolling.layout=org.apache.log4j.PatternLayout
log4j.appender.rolling.layout.conversionPattern=[%d] %p %m (%c)%n
log4j.appender.rolling.maxFileSize=50MB
log4j.appender.rolling.maxBackupIndex=5
log4j.appender.rolling.file=/var/log/spark/${vm.logging.name}.log
log4j.appender.rolling.encoding=UTF-8
log4j.logger.org.apache.spark=WARN
log4j.logger.org.eclipse.jetty=WARN
log4j.logger.com.vmeg.code=${vm.logging.level}
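As a side note, the size-based appender above could also roll by time instead; a hedged sketch using log4j's DailyRollingFileAppender (the file path and layout are carried over from the configuration above, and note this appender does not support maxBackupIndex):

```properties
# Variant (not in the original setup): roll the file once per day instead of by size
log4j.appender.rolling=org.apache.log4j.DailyRollingFileAppender
log4j.appender.rolling.datePattern='.'yyyy-MM-dd
log4j.appender.rolling.file=/var/log/spark/${vm.logging.name}.log
log4j.appender.rolling.layout=org.apache.log4j.PatternLayout
log4j.appender.rolling.layout.conversionPattern=[%d] %p %m (%c)%n
```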
This means log4j will roll the log file at 50MB and keep only the 5 most recent files. These files are saved in the /var/log/spark directory, with the filename taken from the system property vm.logging.name. We also set the logging level of our package com.vmeg.code according to the vm.logging.level property. One more thing to mention: we set org.apache.spark to level WARN so as to ignore verbose logs from Spark.

Scenario 2) Standalone Mode
In standalone mode, the Spark driver runs on the machine where you submit the job, and each Spark worker node runs an executor for this job, so you need to set up log4j for both the driver and the executors.

spark-submit --master spark://127.0.0.1:7077 \
  --driver-java-options "-Dlog4j.configuration=file:/path/to/log4j-driver.properties -Dvm.logging.level=DEBUG" \
  --conf "spark.executor.extraJavaOptions=-Dlog4j.configuration=file:/path/to/log4j-executor.properties -Dvm.logging.name=myapp -Dvm.logging.level=DEBUG"
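The two properties files referenced on the command line are not shown in the original post; a plausible sketch for log4j-driver.properties, reusing the Scenario 1 appender (the driver.log filename is an assumption):

```properties
# log4j-driver.properties (sketch; modeled on the Scenario 1 "rolling" appender)
log4j.rootLogger=INFO, rolling
log4j.appender.rolling=org.apache.log4j.RollingFileAppender
log4j.appender.rolling.layout=org.apache.log4j.PatternLayout
log4j.appender.rolling.layout.conversionPattern=[%d] %p %m (%c)%n
log4j.appender.rolling.maxFileSize=50MB
log4j.appender.rolling.maxBackupIndex=5
log4j.appender.rolling.file=/var/log/spark/driver.log
log4j.appender.rolling.encoding=UTF-8
log4j.logger.com.vmeg.code=${vm.logging.level}
```

log4j-executor.properties could be identical except for the file setting, e.g. log4j.appender.rolling.file=/var/log/spark/${vm.logging.name}-executor.log, so each node's executor logs stay separate from the driver's.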
Scenario 3) Spark on YARN
In yarn-cluster mode the driver itself runs as a container in YARN, so the driver and executors can use the same configuration file, shipped to every container with --files:

spark-submit --master yarn-cluster \
  --files /path/to/log4j-spark.properties \
  --conf "spark.driver.extraJavaOptions=-Dlog4j.configuration=log4j-spark.properties" \
  --conf "spark.executor.extraJavaOptions=-Dlog4j.configuration=log4j-spark.properties"
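Because the properties file below sends everything to the console appender, driver and executor output lands in the YARN container logs rather than in local files. If log aggregation is enabled on the cluster, they can be retrieved afterwards with Hadoop's yarn CLI (the application id here is a placeholder):

```shell
# Fetch aggregated container logs for a finished application (id is hypothetical)
yarn logs -applicationId application_1518000000000_0001
```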
log4j-spark.properties
# Set everything to be logged to the console
log4j.rootCategory=WARN, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n
# Set the default spark-shell log level to WARN. When running the spark-shell, the
# log level for this class is used to overwrite the root logger's log level, so that
# the user can have different defaults for the shell and regular Spark apps.
log4j.logger.org.apache.spark.repl.Main=WARN
# Settings to quiet third party logs that are too verbose
log4j.logger.org.spark_project.jetty=WARN
log4j.logger.org.spark_project.jetty.util.component.AbstractLifeCycle=ERROR
log4j.logger.org.apache.spark.repl.SparkIMain$exprTyper=INFO
log4j.logger.org.apache.spark.repl.SparkILoop$SparkILoopInterpreter=INFO
log4j.logger.org.apache.parquet=ERROR
log4j.logger.parquet=ERROR
# SPARK-9183: Settings to avoid annoying messages when looking up nonexistent UDFs in SparkSQL with Hive support
log4j.logger.org.apache.hadoop.hive.metastore.RetryingHMSHandler=FATAL
log4j.logger.org.apache.hadoop.hive.ql.exec.FunctionRegistry=ERROR
# Any custom class debug
log4j.logger.com.vmeg.code=DEBUG
# Netty classes
log4j.logger.org.apache.spark.rpc.netty.NettyRpcEndpointRef=WARN,RollingAppender
log4j.logger.org.apache.spark.rpc.RpcEndpointRef=WARN,RollingAppender
log4j.logger.org.apache.spark.ExecutorAllocationManager=WARN,RollingAppender
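Note that the three loggers above route to an appender named RollingAppender that this file never defines, so log4j will complain at startup that the appender cannot be found. A minimal definition, modeled on the Scenario 1 appender (the file path is an assumption), might look like:

```properties
# Assumed definition for the RollingAppender referenced above (not in the original post)
log4j.appender.RollingAppender=org.apache.log4j.RollingFileAppender
log4j.appender.RollingAppender.file=/var/log/spark/${vm.logging.name}.log
log4j.appender.RollingAppender.maxFileSize=50MB
log4j.appender.RollingAppender.maxBackupIndex=5
log4j.appender.RollingAppender.layout=org.apache.log4j.PatternLayout
log4j.appender.RollingAppender.layout.conversionPattern=[%d] %p %m (%c)%n
```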