Top 5 Mistakes to Avoid When Writing Apache Spark Applications

Be careful when managing the DAG

People often make mistakes when managing the DAG. To avoid them, follow these guidelines:
Always prefer reduceByKey over groupByKey: the two perform similar aggregations, but groupByKey shuffles every raw value across the network, while reduceByKey combines values on each partition before the shuffle, so far less data moves around. Use reduceByKey wherever possible; the sketch below shows the difference.
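For example, a minimal Scala sketch of the difference, assuming an existing SparkContext named sc (the input path and variable names are illustrative):
// Assumes an existing SparkContext `sc`; the path is illustrative.
val pairs = sc.textFile("hdfs:///data/words.txt")
  .flatMap(_.split("\\s+"))
  .map(word => (word, 1))
// groupByKey ships every raw (word, 1) pair across the network before summing.
val viaGroupByKey = pairs.groupByKey().mapValues(_.sum)
// reduceByKey combines values on each partition first (map-side combine),
// so only one partial sum per key per partition is shuffled.
val viaReduceByKey = pairs.reduceByKey(_ + _)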
Stay away from shuffles as much as possible:
  • Keep the amount of data emitted on the map side as small as possible
  • Do not waste time on unnecessary repartitioning
  • Shuffle as little data as you can
  • Watch out for skewed keys and skewed partitions
Prefer treeReduce over reduce: treeReduce performs more of the aggregation on the executors in intermediate stages, whereas reduce pulls every partition's partial result straight back to the driver.
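As a small sketch of the same idea, again assuming an existing SparkContext sc (the data is illustrative):
// reduce sends every partition's partial result straight to the driver.
val numbers = sc.parallelize(1 to 1000000)
val total = numbers.reduce(_ + _)
// treeReduce combines partial results in intermediate stages on the
// executors (depth 2 here), so the driver receives far less data.
val totalTree = numbers.treeReduce(_ + _, depth = 2)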

Maintain the required size of the shuffle blocks

In a shuffle operation, the task that emits the data on the source executor is the “mapper”, the task that consumes the data on the target executor is the “reducer”, and what happens between them is the “shuffle”.
The chunk of data a mapper writes for a single reducer is called a shuffle block. Spark applications often fail because a shuffle block grows larger than 2 GB, which Spark cannot handle. Shuffles commonly run with the default of around 200 partitions, which is usually too few, so the shuffle blocks grow in size; once one exceeds 2 GB, the application fails. Increasing the number of partitions also helps spread out skewed data. As a rule of thumb, aim for roughly 128 MB of data per partition: if partitions are too small, tasks become slow from scheduling overhead; if they are too large, you hit the 2 GB limit. One more detail: if the partition count you need is close to 2000, push it just past 2000, because Spark tracks shuffle statistics with a more compressed format when there are more than 2000 partitions.
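As a rough sketch of where the partition count is set, assuming an existing SparkSession named spark, a DataFrame named df with a column called key, and the pairs RDD from the earlier sketch (all illustrative):
// Default is 200; 2001 is an illustrative value pushed just past 2000.
spark.conf.set("spark.sql.shuffle.partitions", "2001")
val counts = df.groupBy("key").count()
// For the RDD API the partition count is passed to the shuffle itself.
val rddCounts = pairs.reduceByKey(_ + _, 2001)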

Do not let the jobs slow down

When a shuffle involves skewed data, a job that should finish quickly can drag on for hours (around 4 hours, say), because a few straggler tasks handle most of the data and slow the whole application down.
The fix is a two-stage aggregation:
  • aggregate on the salted keys
  • aggregate again on the unsalted keys
Salting spreads a hot key across many partitions; the first aggregation on the salted keys shrinks the data, and the second aggregation on the original, unsalted keys produces the final result, so far less data has to be shuffled. A sketch of the idea follows.
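A minimal sketch of the salting idea, assuming an existing pair RDD named skewedPairs with numeric values and a salt range of 100 (both illustrative):
import scala.util.Random
// Stage 1: attach a random salt to each key so a hot key is spread across
// many partitions, then aggregate on the salted keys.
val salted = skewedPairs.map { case (k, v) => ((k, Random.nextInt(100)), v) }
val partial = salted.reduceByKey(_ + _)
// Stage 2: drop the salt and aggregate the much smaller partial results
// on the original, unsalted keys.
val result = partial
  .map { case ((k, _), v) => (k, v) }
  .reduceByKey(_ + _)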

Perform shading to avoid dependency errors

When writing an Apache Spark application, errors can occur even though Guava is already included in the application's Maven dependencies, because the Guava version used by the application does not match the Guava version that ships with Spark. To avoid the conflict, shade (relocate) the conflicting packages with the maven-shade-plugin, for example:
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <version>2.2</version>
  <executions>
    <execution>
      <phase>package</phase>
      <goals><goal>shade</goal></goals>
    </execution>
  </executions>
  <configuration>
    <!-- relocate the conflicting packages into a private namespace -->
    <relocations>
      <relocation>
        <pattern>com.google.protobuf</pattern>
        <shadedPattern>com.company.my.protobuf</shadedPattern>
      </relocation>
    </relocations>
  </configuration>
</plugin>
Always shade conflicting dependencies in this way; otherwise classpath conflicts will undo all of your effort.

Avoid wrongly sized executors

In a Spark job, executors are the worker processes responsible for running the individual tasks. They provide in-memory storage for RDDs that are cached by user programs through the Block Manager, are launched at the start of the Spark application, and stay alive for its entire lifetime. After finishing their tasks, they send the results to the driver. The usual mistake when writing a Spark application is sizing the executors wrongly, i.e. getting the following wrong:
  • Number of Executors
  • Cores of each executor
  • Memory for each executor
Consider a typical cluster of 6 nodes, each with 16 cores and 64 GB of RAM, i.e. 6*16 = 96 cores in total.
At the most granular extreme, one core per executor, each node runs 16 tiny executors with 64/16 = 4 GB each, and we lose the benefit of running multiple tasks in the same Java virtual machine. At the least granular extreme, a single 16-core, 64 GB executor per node, nothing is left over for the OS and Hadoop daemons; and even 15 cores per executor still gives poor throughput. The sweet spot is about 5 cores per executor. Leaving one core per node for the OS and Hadoop daemons gives 6*15 = 90 usable cores, so 90/5 = 18 executors; reserving one executor for the ApplicationMaster leaves 17, i.e. about 3 executors per node. Leaving 1 GB per node for the OS gives 63 GB, so RAM per executor = 63/3 = 21 GB, and after about 7% off-heap overhead, 21 x (1-0.07) ~ 19 GB. Therefore, for a cluster like this, use 17 executors with 5 cores and 19 GB of RAM each.
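For the 6-node, 16-core, 64 GB cluster above, those numbers could be supplied when the session is created, as in this sketch (the application name is illustrative):
import org.apache.spark.sql.SparkSession
// Sizing from the calculation above: 17 executors, 5 cores and 19 GB each.
val spark = SparkSession.builder()
  .appName("right-sized-app")
  .config("spark.executor.instances", "17")
  .config("spark.executor.cores", "5")
  .config("spark.executor.memory", "19g")
  .getOrCreate()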

Following the tips above will help you avoid common mistakes during Apache Spark application development.
