22 Jan 2015

Understanding Spark Caching

Spark excels at processing data in memory. We are going to look at the available caching options and their effects, and (hopefully) provide some tips for optimizing Spark memory caching.

When caching data in Spark, there are two options:

  1. Raw storage
  2. Serialized

Here are the main differences between the two options:

Raw caching

  • Pretty fast to process
  • Can take up 2x-4x more space (for example, 100MB of data cached could consume roughly 350MB of memory)
  • Can put pressure on the JVM and JVM garbage collection
  • Usage: rdd.persist(StorageLevel.MEMORY_ONLY) or rdd.cache()

Serialized caching

  • Slower to process than raw caching
  • Memory overhead is minimal
  • Puts less pressure on the JVM and garbage collection
  • Usage: rdd.persist(StorageLevel.MEMORY_ONLY_SER)
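
In the Spark shell, the two options look like this (a quick sketch; the S3 path is just a placeholder):

import org.apache.spark.storage.StorageLevel

// Option 1: raw caching (deserialized objects kept in memory)
// rdd.cache() is shorthand for persist(StorageLevel.MEMORY_ONLY)
val rawCached = sc.textFile("s3n://bucket_path/data").persist(StorageLevel.MEMORY_ONLY)

// Option 2: serialized caching (serialized bytes kept in memory)
val serCached = sc.textFile("s3n://bucket_path/data").persist(StorageLevel.MEMORY_ONLY_SER)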

So what are the trade-offs?

Here is a quick experiment. I cache a bunch of RDDs using both options and measure the memory footprint and processing time. The RDDs range in size from 100MB to 1GB.

Testing environment:

  • 3-node Spark cluster running on Amazon EC2 (m1.large instances with 8G of memory per node)
  • Reading data files from an S3 bucket


Testing method:

$ ./bin/spark-shell --driver-memory 8g
> val f = sc.textFile("s3n://bucket_path/1G.data")
> f.persist(org.apache.spark.storage.StorageLevel.MEMORY_ONLY)  // specify the cache option
> f.count()  // do this a few times and measure times
// also look at RDD memory size from Spark application UI, under 'Storage' tab
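
The same numbers can also be read from the shell instead of the UI. Here is a minimal sketch using sc.getRDDStorageInfo (a developer API, so field names may differ slightly between Spark versions):

// print the in-memory size the block manager reports for each cached RDD
sc.getRDDStorageInfo.foreach { info =>
  println(s"${info.name}: ${info.memSize / (1024 * 1024)} MB in memory, " +
    s"${info.numCachedPartitions}/${info.numPartitions} partitions cached")
}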

On to the results:

Data size                  100M       500M       1000M (1G)

Memory footprint (MB)
  raw                      373.8      1,869.2    3,788.8
  serialized               107.5      537.6      1,075.1

count() time (ms)
  cached raw               90         130        178
  cached serialized        610        1,802      3,448
  before caching           3,220      27,063     105,618

(Chart: spark-caching-1)

(Chart: spark-caching-2)

Conclusions

  1. Raw caching has a bigger memory footprint: about 2x-4x the data size (e.g. a 100MB RDD takes about 370MB of memory)
  2. Serialized caching consumes almost the same amount of memory as the data itself (plus a small overhead)
  3. Raw cache is very fast to process, and it scales pretty well
  4. Processing serialized cached data takes longer, since each record has to be deserialized before it can be processed

So what does all this mean?

  1. For small data sets (a few hundred megabytes) we can use raw caching. Even though this consumes more memory, the small size won’t put too much pressure on Java garbage collection.
  2. Raw caching is also good for iterative workloads (say we are doing a bunch of iterations over the same data), because processing the raw cache is very fast.
  3. For medium / large data sets (tens or hundreds of gigabytes), serialized caching is the better choice: it does not consume much extra memory, and garbage collecting many gigabytes of raw cache can be taxing. A rough sketch of this rule of thumb follows below.
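
As an illustration of that rule of thumb, the snippet below picks a storage level from an estimated data size. This is only a sketch: the cacheBySize helper, the estimatedSizeBytes parameter, and the 1GB cutoff are made up for this example, and the right threshold depends on your cluster and workload.

import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

// Hypothetical helper: pick a cache level from an estimated data size.
// The 1GB cutoff is an assumption, not a hard rule.
def cacheBySize[T](rdd: RDD[T], estimatedSizeBytes: Long): RDD[T] = {
  val oneGB = 1024L * 1024 * 1024
  if (estimatedSizeBytes < oneGB)
    rdd.persist(StorageLevel.MEMORY_ONLY)       // small data: raw caching, fastest to process
  else
    rdd.persist(StorageLevel.MEMORY_ONLY_SER)   // bigger data: serialized caching, smaller footprint, less GC pressure
}

// Usage (path and size estimate are placeholders):
// val f = cacheBySize(sc.textFile("s3n://bucket_path/1G.data"), estimatedSizeBytes = 1024L * 1024 * 1024)
// f.count()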

 

Sujee Maniyam
Sujee is a founder and principal at Elephant Scale, where he provides consulting and training on Big Data technologies.

6 Comments:


  • By zeromem 06 Apr 2015

    Hi, I found this blog via a Stack Overflow question: Apache spark in memory caching (http://stackoverflow.com/questions/26858193/apache-spark-in-memory-caching), where you suggest three ways to cache an RDD across jobs.
    I have run into the same question: I want some RDDs cached across jobs in a Spark cluster, but I do not understand your first suggestion: "Cache the RDD using the same context and re-use the context for other jobs."
    Does the “context” here mean a SparkContext? In my mind, each application uses a new SparkContext, and when the application is over, the sc will stop.
    Can you give me more information about how to implement this? Thank you.

  • By Understanding Spark caching 11 Apr 2015

    […] Read More […]

  • By SparkSQL serialized caching - TecHub 01 Oct 2015

    […] core supports both raw storage and serialized RDD caching. The good article explains this. If you use persist() – you may specify any of levels of caching, but if you’re […]

  • By Ramesh 28 Aug 2016

    Hi Maniyam,

    I would like to know: if we have performed a .cache() on an RDD, I am assuming that the RDD is cached only on those nodes where the RDD was actually computed initially.
    Meaning, if there is a cluster of 100 nodes, and the RDD is computed in partitions on the first and second nodes, then if we cache this RDD, Spark is going to cache its value only on the first and second nodes.
    So when this Spark application tries to use this RDD in later stages, the Spark driver has to get the value from the first/second nodes.
    Am I correct?
    (OR)

    Is it that the RDD value is persisted in driver memory and not on the nodes?

  • By SP Singh 23 Mar 2017

    Very informative…


Copyright 2015 Sujee Maniyam