Posts

Showing posts with the label algorithms

Building a Large-scale Distributed Storage System Based on Raft

In recent years, building a large-scale distributed storage system has become a hot topic. Distributed consensus algorithms like Paxos and Raft are the focus of many technical articles, but those articles tend to be introductory, describing the basics of the algorithm and log replication. They seldom cover how to build a large-scale distributed storage system on top of a distributed consensus algorithm. Since April 2015, we at PingCAP have been building TiKV, a large-scale open source distributed database based on Raft. It’s the core storage component of TiDB, an open source distributed NewSQL database that supports Hybrid Transactional and Analytical Processing (HTAP) workloads. Earlier in 2019, we conducted an official Jepsen test on TiDB, and the Jepsen test report was published in June 2019. In July of the same year, we announced that TiDB 3.0 had reached general availability, delivering stability at scale and a performance boost. In this article, I’d like to share some of our firs...
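Those introductory articles usually stop at log replication, so as a point of reference, here is a minimal, illustrative sketch of the follower-side AppendEntries consistency check at the heart of Raft log replication. It is written in Python purely for exposition (TiKV itself is implemented in Rust), and the Entry type and function signature are assumptions for illustration, not TiKV's actual API.

```python
# Illustrative sketch of Raft's follower-side AppendEntries consistency check.
# Types and names are hypothetical; this is not TiKV's implementation.
from dataclasses import dataclass
from typing import List

@dataclass
class Entry:
    term: int      # term in which the leader created the entry
    command: str   # opaque state-machine command

def append_entries(log: List[Entry], prev_index: int, prev_term: int,
                   entries: List[Entry]) -> bool:
    """Accept the leader's entries only if the follower's log already
    contains an entry at prev_index whose term matches prev_term."""
    if prev_index >= 0:
        if prev_index >= len(log) or log[prev_index].term != prev_term:
            return False  # reject; the leader retries with an earlier prev_index
    # Drop any conflicting suffix, then append the new entries.
    del log[prev_index + 1:]
    log.extend(entries)
    return True

# Example: a follower that is one entry behind catches up.
follower_log = [Entry(1, "set x=1")]
ok = append_entries(follower_log, prev_index=0, prev_term=1,
                    entries=[Entry(2, "set y=2")])
print(ok, [(e.term, e.command) for e in follower_log])
```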
Here’s a curated list of resources for data engineers, with sections for algorithms and data structures, SQL, databases, programming, tools, distributed systems, and more. Useful articles The AI Hierarchy of Needs The Rise of Data Engineer The Downfall of the Data Engineer A Beginner’s Guide to Data Engineering Part I Part II Part III Functional Data Engineering — a modern paradigm for batch data processing How to become a Data Engineer (in Russian) Talks Data Engineering Principles - Build frameworks not pipelines by Gatis Seja Functional Data Engineering - A Set of Best Practices by Maxime Beauchemin Advanced Data Engineering Patterns with Apache Airflow by Maxime Beauchemin Creating a Data Engineering Culture by Jesse Anderson Algorithms & Data Structures Algorithmic Toolbox in Russian Data Structures in Russian Data Structures & Algorithms Specialization on Coursera Algorithms Specialization from Stanford on Coursera SQL Com...

Looking Back at Google’s Research Efforts in 2018

2018 was an exciting year for Google's research teams, with our work advancing technology in many ways, including fundamental computer science research results and publications, the application of our research to emerging areas new to Google (such as healthcare and robotics), open source software contributions, and strong collaborations with Google product teams, all aimed at providing useful tools and services. Below, we highlight just some of our efforts from 2018, and we look forward to what will come in the new year: Ethical Principles and AI, AI for Social Good, Assistive Technology, Quantum Computing, Natural Language Understanding, Perception, Computational Photography, Algorithms and Theory, Software Systems, AutoML, Tensor Processing Units (TPUs), Open Source Software and Datasets, Robotics, and Applications of AI to Other Fields. Read more >>>

Machine Learning Platforms For Developers

Machine learning platforms are not the wave of the future; they are here now, and developers need to know how and when to harness their power. Working within the ML landscape with the right tools, such as Filestack, can make it easier for developers to build effective algorithms that tap into that power. The following machine learning platforms and tools, listed in no particular order, are available now as resources for seamlessly integrating ML into daily tasks. 1.  H2O H2O was designed for the Python, R, and Java programming languages by H2O.ai. By using these familiar languages, this open source software makes it easy for developers to apply both predictive analytics and machine learning to a variety of situations. Available on Mac, Windows, and Linux operating systems, H2O provides developers with the tools they need to analyze data sets in the Apache Hadoop file system as well as those in the cloud. 2.  Apache PredictionIO Developer...
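Since H2O exposes a Python API, a minimal sketch of the typical flow might look like the following. The CSV path, column names, and model choice are illustrative assumptions, not taken from the article.

```python
# Minimal sketch of training a model with H2O's Python API.
# The dataset path and column names below are hypothetical placeholders.
import h2o
from h2o.estimators import H2OGradientBoostingEstimator

h2o.init()  # start (or connect to) a local H2O cluster

# Load a dataset into an H2OFrame; H2O can also read from HDFS or cloud storage.
frame = h2o.import_file("churn.csv")
train, test = frame.split_frame(ratios=[0.8], seed=42)

target = "churned"                               # assumed label column
features = [c for c in frame.columns if c != target]
train[target] = train[target].asfactor()         # treat the label as categorical
test[target] = test[target].asfactor()

model = H2OGradientBoostingEstimator(ntrees=50)
model.train(x=features, y=target, training_frame=train)
print(model.model_performance(test).auc())
```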

Self-Imitation Learning (SIL)

This paper proposes Self-Imitation Learning (SIL), a simple off-policy actor-critic algorithm that learns to reproduce the agent's past good decisions. The algorithm is designed to verify our hypothesis that exploiting past good experiences can indirectly drive deep exploration. Our empirical results show that SIL significantly improves advantage actor-critic (A2C) on several hard-exploration Atari games and is competitive with state-of-the-art count-based exploration methods. We also show that SIL improves proximal policy optimization (PPO) on MuJoCo tasks. Details >>
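The core of SIL is an imitation loss applied only when a stored return exceeds the current value estimate. A compact sketch of that loss in PyTorch is below; the tensor shapes, the value coefficient, and the helper name are illustrative assumptions rather than the paper's reference code.

```python
# Sketch of the self-imitation learning loss: the policy imitates stored
# actions only when the observed return R exceeds the critic's estimate V(s).
import torch

def sil_loss(log_probs, values, returns, value_coef=0.01):
    """log_probs: log pi(a|s) for stored actions, shape [batch]
    values:    V(s) predicted by the critic,      shape [batch]
    returns:   discounted returns from replay,    shape [batch]"""
    advantage = (returns - values).clamp(min=0.0)            # (R - V)_+, zero when V >= R
    policy_loss = -(log_probs * advantage.detach()).mean()   # imitate only good actions
    value_loss = 0.5 * (advantage ** 2).mean()                # push V(s) up toward R
    return policy_loss + value_coef * value_loss

# Example with dummy tensors.
lp = torch.log(torch.tensor([0.3, 0.6, 0.1]))
v = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
R = torch.tensor([2.0, 1.5, 3.5])
print(sil_loss(lp, v, R))
```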

An Introduction to Deep Learning for Tabular Data

There is a powerful technique that is winning Kaggle competitions and is widely used at Google (according to Jeff Dean), Pinterest, and Instacart, yet many people don’t even realize it is possible: the use of deep learning for tabular data, and in particular, the creation of embeddings for categorical variables. Despite what you may have heard, you can use deep learning for the type of data you might keep in a SQL database, a Pandas DataFrame, or an Excel spreadsheet (including time-series data). I will refer to this as tabular data, although it is also known as relational data, structured data, or other terms (see my Twitter poll and comments for more discussion).
From the Pinterest blog post 'Applying deep learning to Related Pins'
Tabular data is the most commonly used type of data in industry, but deep learning on tabular data receives far less attention than deep learning for computer vision and natural language processing. Details ...
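The technique in question, entity embeddings for categorical variables, boils down to replacing each categorical level with a learned dense vector that feeds an ordinary feed-forward network. A minimal illustrative sketch in PyTorch follows; the column cardinalities, layer sizes, and the "half the cardinality, capped at 50" rule of thumb for embedding width are assumptions for the example, not a prescribed recipe.

```python
# Sketch: entity embeddings for categorical columns in a tabular model.
# Cardinalities and embedding sizes below are illustrative assumptions.
import torch
import torch.nn as nn

class TabularNet(nn.Module):
    def __init__(self, cat_cardinalities, n_continuous, n_out=1):
        super().__init__()
        # One embedding table per categorical column.
        self.embeddings = nn.ModuleList(
            nn.Embedding(card, min(50, (card + 1) // 2))
            for card in cat_cardinalities
        )
        emb_dim = sum(e.embedding_dim for e in self.embeddings)
        self.mlp = nn.Sequential(
            nn.Linear(emb_dim + n_continuous, 64),
            nn.ReLU(),
            nn.Linear(64, n_out),
        )

    def forward(self, x_cat, x_cont):
        # x_cat:  [batch, n_cat_columns] of integer category codes
        # x_cont: [batch, n_continuous] of normalized continuous features
        embs = [emb(x_cat[:, i]) for i, emb in enumerate(self.embeddings)]
        return self.mlp(torch.cat(embs + [x_cont], dim=1))

# Example: two categorical columns (say, store id and day of week) plus 3 numerics.
model = TabularNet(cat_cardinalities=[1000, 7], n_continuous=3)
out = model(torch.randint(0, 7, (4, 2)), torch.randn(4, 3))
print(out.shape)  # torch.Size([4, 1])
```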

Comparing production-grade NLP libraries

A comparison of the accuracy and performance of Spark-NLP vs. spaCy, and some use case recommendations: https://www.oreilly.com/ideas/comparing-production-grade-nlp-libraries-accuracy-performance-and-scalability
A step-by-step guide to building and running a natural language processing pipeline: https://www.oreilly.com/ideas/comparing-production-grade-nlp-libraries-running-spark-nlp-and-spacy-pipelines
A step-by-step guide to initializing the libraries, loading the data, and training a tokenizer model using Spark-NLP and spaCy: https://www.oreilly.com/ideas/comparing-production-grade-nlp-libraries-training-spark-nlp-and-spacy-pipelines
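As a taste of what those pipeline guides walk through on the spaCy side, here is a minimal tokenization sketch. The en_core_web_sm model and the sample sentence are assumptions for the example; the articles above cover the full Spark-NLP and spaCy pipelines.

```python
# Minimal spaCy sketch: tokenize a sentence and inspect the tokens.
# Requires the small English model: python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Spark-NLP and spaCy take different approaches to NLP pipelines.")

for token in doc:
    # token.text is the surface form, token.pos_ the coarse part of speech.
    print(token.text, token.pos_)
```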

27 Great Resources About Logistic Regression

27 Great Resources About Logistic Regression: Customer Churn – Logistic Regression with R Predicting Flights Delay Using Supervised Learning, Logistic Regression Logistic Regression vs Decision Trees vs SVM: Part II Logistic Regression Vs Decision Trees Vs SVM: Part I Making data science accessible – Logistic Regression Logistic Regression using python Logistic Regression and Maximum Entropy explained with examples Decision tree vs Logistic Regression Excluding variables from a logistic regression model based on correlation Regression, Logistic Regression and Maximum Entropy  + Oversampling/Undersampling in Logistic Regression Fraud Detection using logistic regression Explaining variability in logistic regression Handling Imbalanced data when building regression models Multiple logistic Regression Power Analysis Model Accuracy - In logistic Regression Outliers in Logistic Regression Logistic Regression - Hosmer Lemeshow test Logistic regression intercept term n...
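Several of the articles above deal with fitting a logistic regression and coping with imbalanced classes. For readers who want a hands-on starting point, here is a small illustrative scikit-learn sketch; the synthetic dataset and the class_weight choice are assumptions, not drawn from any of the linked posts.

```python
# Illustrative scikit-learn example touching two recurring themes above:
# fitting a logistic regression and handling class imbalance via class_weight.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# Synthetic, imbalanced two-class data (90% / 10%) as a stand-in dataset.
X, y = make_classification(n_samples=2000, n_features=10,
                           weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# class_weight="balanced" reweights the minority class (cf. the
# oversampling/undersampling and imbalanced-data articles above).
clf = LogisticRegression(class_weight="balanced", max_iter=1000)
clf.fit(X_train, y_train)
print(classification_report(y_test, clf.predict(X_test)))
```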