Posts


Extending an Amazon S3 Integration to Google Cloud Storage With the Interop API

Getting Started With The S3-Interop API for GCS

To start the process, enable the Google Cloud Storage service in the Google Cloud console and create a project and bucket for testing. You can then enable the S3-interoperable API in the Interoperability tab within Project Settings. Google enables the S3-interoperability API on a per-user basis for each project, so you'll want a unique credential or user account for each end-user or service if you want more meaningful access logs. While you're in the interoperability settings, create an access key and save it locally for reference. As the Interoperability settings page describes, the secret key lets you authorize requests with HMAC authentication. You can calculate the signatures for each request manually in your shell or REPL, or use a library. Read full article >>>
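Rather than computing HMAC signatures by hand, you can point an existing S3 client library at the GCS endpoint. Here is a minimal sketch assuming the boto3 library and HMAC credentials created in the Interoperability tab; the bucket name and keys below are placeholders, not values from the article.

```python
import boto3

# Point a standard S3 client at the GCS interoperability endpoint and sign
# requests with the HMAC access key / secret created in the Cloud console.
# (Placeholder credentials: substitute your own.)
s3 = boto3.client(
    "s3",
    endpoint_url="https://storage.googleapis.com",
    aws_access_key_id="GOOG1EXAMPLEACCESSKEY",
    aws_secret_access_key="example-hmac-secret",
)

# The usual S3 calls then work against the GCS bucket.
s3.put_object(Bucket="my-test-bucket", Key="hello.txt", Body=b"hello from the interop API")
for obj in s3.list_objects_v2(Bucket="my-test-bucket").get("Contents", []):
    print(obj["Key"], obj["Size"])
```

boto3 handles the request signing itself, so the secret never has to be embedded in hand-built signature strings.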

Effective Spark DataFrames With Alluxio

Many organizations deploy Alluxio together with Spark for performance gains and data manageability benefits. Qunar recently deployed Alluxio in production, and their Spark streaming jobs sped up by 15x on average and up to 300x during peak times. They had noticed that some Spark jobs would slow down or fail to finish, but with Alluxio those jobs finished quickly. In this blog post, we investigate how Alluxio helps Spark be more effective: Alluxio increases the performance of Spark jobs, helps Spark jobs perform more predictably, and enables multiple Spark jobs to share the same data from memory. Previously, we investigated how Alluxio is used for Spark RDDs; in this article, we look at how to use Spark DataFrames effectively with Alluxio.

Alluxio and Spark Cache

Storing Spark DataFrames in Alluxio memory is very simple: it only requires saving the DataFrame as a file to Alluxio with the Spark DataFrame write API. DataFrames are commonly written as parquet fi...
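As a minimal sketch of that write/read pattern, assuming PySpark, an Alluxio master at alluxio://localhost:19998, and a placeholder path:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("alluxio-dataframe-demo").getOrCreate()

# A small example DataFrame; any DataFrame works the same way.
df = spark.range(0, 1000).withColumnRenamed("id", "value")

# Save the DataFrame to Alluxio as a Parquet file. Alluxio keeps the data in
# memory, so other Spark jobs can read it without recomputation or a trip to
# the underlying storage system.
df.write.mode("overwrite").parquet("alluxio://localhost:19998/tmp/demo_parquet")

# Any Spark job, in this application or a different one, can now load the
# same data back from Alluxio memory.
df2 = spark.read.parquet("alluxio://localhost:19998/tmp/demo_parquet")
df2.show(5)
```

This assumes the Alluxio client jar is on the Spark classpath so that the alluxio:// scheme resolves.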

HDFS scalability: the limits to growth

Some time ago I came across a very interesting article by Konstantin V. Shvachko (now a Senior Staff Software Engineer at LinkedIn) concerning the limits of Hadoop scalability. Its main conclusion is that "a 10,000 node HDFS cluster with a single name-node is expected to handle well a workload of 100,000 readers, but even 10,000 writers can produce enough workload to saturate the name-node, making it a bottleneck for linear scaling. Such a large difference in performance is attributed to get block locations (read workload) being a memory-only operation, while creates (write workload) require journaling, which is bounded by the local hard drive performance. There are ways to improve the single name-node performance, but any solution intended for single namespace server optimization lacks scalability." Konstantin continues: "The most promising solutions seem to be based on distributing the namespace server itself both for workload balancing and for reducing the si...

File Format Benchmark - Avro, JSON, ORC & Parquet (Owen O’Malley, 2016 Strata + Hadoop World)

Benchmarks from Owen O’Malley, presented at the 2016 Strata + Hadoop World conference in New York.

3 datasets:
1. NYC Taxi Data (every taxi cab ride in NYC from 2009)
2. Github Logs (all actions on Github public repositories)
3. Sales (generated data)

3 compression mechanisms:
1. None
2. Snappy
3. Zlib (Gzip)

3 use cases:
1. Full Table Scans
2. Column Projection
3. Predicate Pushdown

Compression results

Use cases results

Full Table Scans

Column Projection

Dataset  Format   Compression  Full scan (us/row)  Projection (us/row)  Percent of full-scan time
github   orc      zlib         21.319              0.185                0.87%
github   parquet  zlib         72.494              0.585                0.81%
sales    orc      zlib          1.866              0.056                3.00%
sales    parquet  zlib         12.893              0.329                2.55%
taxi     orc      zlib          2.766              0.063                2.28%
taxi     parquet  zlib          3.496              ...
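As a rough illustration of what the column-projection and predicate-pushdown use cases look like in practice, here is a PySpark sketch; the paths and column names are placeholders, not those used in the benchmark.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("file-format-usecases").getOrCreate()

# Column projection: with columnar formats such as ORC and Parquet, selecting a
# few columns means only those columns are read and decoded, which is why the
# projection times above are a small percentage of the full-scan times.
taxi = spark.read.parquet("/data/taxi_parquet_zlib")
taxi.select("pickup_datetime", "fare_amount").agg(F.avg("fare_amount")).show()

# Predicate pushdown: the filter is pushed down to the file reader, which can
# use per-stripe / per-row-group statistics to skip data before decoding rows.
(spark.read.orc("/data/taxi_orc_zlib")
    .filter(F.col("fare_amount") > 100.0)
    .select("pickup_datetime", "fare_amount")
    .show(10))
```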