Posts

Showing posts with the label Kubernetes

Building a Large-scale Distributed Storage System Based on Raft

In recent years, building a large-scale distributed storage system has become a hot topic. Distributed consensus algorithms like Paxos and Raft are the focus of many technical articles. But those articles tend to be introductory, describing the basics of the algorithm and log replication. They seldom cover how to build a large-scale distributed storage system based on the distributed consensus algorithm. Since April 2015, we at PingCAP have been building TiKV, a large-scale open source distributed database based on Raft. It’s the core storage component of TiDB, an open source distributed NewSQL database that supports Hybrid Transactional and Analytical Processing (HTAP) workloads. Earlier in 2019, we conducted an official Jepsen test on TiDB, and the Jepsen test report was published in June 2019. In July of the same year, we announced that TiDB 3.0 reached general availability, delivering stability at scale and a performance boost. In this article, I’d like to share some of our firs...

What’s Behind Lyft’s Choices in Big Data Tech

Lyft was a late entrant to the ride-sharing business model, at least compared to its competitor Uber, which pioneered the concept and remains the largest provider. That delay in starting out actually gave Lyft a bit of an advantage in terms of architecting its big data infrastructure in the cloud, as it was able to sidestep some of the challenges that Uber faced in building out its on-prem system. Lyft and Uber, like many of the young Silicon Valley companies shaking up established business models, aren’t shy about sharing information about their computer infrastructure. They both share an ethos of openness with regard to using and developing technology. That openness is also pervasive at Google, Facebook, Twitter, and other Valley outfits that created much of the big data ecosystem, most of which is, of course, open source. So when the folks at Lyft were blueprinting how to construct a system that could do all the things that a ride-sharing app has to do – tracking and connectin...

Cloud Migration Best Practices: How to Move Your Project to Kubernetes

Moving your app or web services to the cloud is more of a must than an option these days. The cloud infrastructure we have today is not only more capable and more stable, but also more scalable. By moving to the cloud, you gain a lot of benefits while also significantly reducing operational stress and costs. A more popular option now is to move to a container-based cloud environment, with Kubernetes being the most popular method to do so — and the most scalable in the long run. There are three approaches you can take and certain best practices to follow, which we are going to look into in this article. Why Kubernetes? Before we get to the best practices and methods you can use to migrate to Kubernetes, it is worth taking the time to understand why the container-based environment provided by Kubernetes is the way to go. For starters, Kubernetes offers the most flexibility when it comes to setting up your cloud environment. Kubernetes has two major parts: the master cluster and no...

Google announces Kubernetes Operator for Apache Spark

The beta release of "Spark Operator" allows native execution of Spark applications on Kubernetes clusters -- no Hadoop or Mesos required. Apache Spark is a hugely popular execution framework for running data engineering and machine learning workloads. It powers the Databricks platform and is available in both on-premises and cloud-based Hadoop services, like Azure HDInsight, Amazon EMR and Google Cloud Dataproc. It can run on Mesos clusters too. But what if you just want to run your Spark workloads on a Kubernetes (k8s) cluster sans Mesos, and without the Hadoop YARN strings attached? While Spark first added Kubernetes-specific features in its 2.3 release, and improved them in 2.4, getting Spark to run natively on k8s, in a fully integrated fashion, can still be a challenge. KUBE OPERATOR Today, Google, which created Kubernetes in the first place, is announcing the beta release of the Kubernetes Operator for Apache Spark -- "Spark Operator" for short. Sp...

The Top Tech Skills of 2018

The Top Tech Skills of 2018: Kotlin & Kubernetes Made Their Mark

Progress for big data in Kubernetes

Kubernetes is really cool because managing services as flocks of little containers is a really cool way to make computing happen. We can get away from the idea that the computer will run the program and get into the idea that a service happens because a lot of little computing just happens. This idea is crucial to making reliable services that don’t require a ton of heroism to stand up or keep running. But there is a dark side here. Containers want to be agile because that is the point of containers in the first place. We want containers because we want to make computing more like a gas made up of indistinguishable atoms instead of like a few billiard balls with colors and numbers on their sides. Stopping or restarting containers should be cheap so we can push flocks of containers around easily and upgrade processes incrementally. If ever a container becomes heavy enough that we start thinking about that specific container, the whole metaphor kind of dissolves. So that metap...

Processing streams of data with Apache Kafka and Spark

Data. Data is produced every second; it comes from millions of sources and is constantly growing. Have you ever thought about how much data you personally generate every day? Data: direct result of our actions. There’s data generated as a direct result of our actions and activities: browsing Twitter, using mobile apps, performing financial transactions, using a navigator in your car, booking a train ticket, creating an online document, starting a YouTube live stream. Obviously, that’s not all. Data: produced as a side effect. For example, performing a purchase where it seems like we’re buying just one thing – might generat...