Posts

Showing posts from July, 2015

Weak transaction isolation

Concurrency bugs caused by weak transaction isolation are not just a theoretical problem: they can corrupt customer data. Many popular relational databases, which are usually considered ACID, use weak isolation, so they would not necessarily have prevented these bugs from occurring. The SQL standard defines what the Isolation guarantee means in terms of what it calls “read phenomena”. There are three types of phenomena: Dirty reads – If another transaction writes, but does not commit, during your transaction, is it possible that you will see its data? Non-repeatable reads – If you read the same row twice, is it possible that you might get different data the second time? Phantom reads – If you read a collection of rows twice, is it possible that different rows will be returned the second time? In the SQL standard, there are four levels of transaction isolation, based on which of these phenomena they prevent (from weakest to strongest): Read Uncommitted – A tra...
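The non-repeatable read described above can be illustrated with a toy model. This is only a sketch: `ToyStore`, `Transaction`, `commit_write`, and the level names `"read_committed"` and `"repeatable_read"` are hypothetical names invented for the illustration, not any database's API. Taking a snapshot at transaction start is just one possible way to get repeatable reads; real databases may use locking or multi-version schemes instead.

```python
class ToyStore:
    """Hypothetical in-memory store holding only committed values."""

    def __init__(self):
        self.committed = {}


def commit_write(store, key, value):
    """Stand-in for another transaction that writes and commits immediately."""
    store.committed[key] = value


class Transaction:
    """Hypothetical reader whose behaviour depends on the isolation level."""

    def __init__(self, store, level):
        self.store = store
        # Assumption: "repeatable_read" is modelled by copying the committed
        # state once, when the transaction starts.
        self.snapshot = dict(store.committed) if level == "repeatable_read" else None

    def read(self, key):
        if self.snapshot is not None:
            return self.snapshot.get(key)
        # Weaker level: always return the latest committed value.
        return self.store.committed.get(key)


def demo(level):
    store = ToyStore()
    commit_write(store, "balance", 100)
    txn = Transaction(store, level)
    first = txn.read("balance")
    commit_write(store, "balance", 50)  # a concurrent transaction commits
    second = txn.read("balance")        # same row, read a second time
    return first, second
```

In this model, `demo("read_committed")` returns two different values for the same row, which is the non-repeatable read, while `demo("repeatable_read")` returns the same value twice.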

Data loss in replicated systems

What happens if the data on disk is corrupted, or the data is wiped out due to hardware error or misconfiguration? Here is the problem that losing disk state really induces in an ensemble of ZooKeeper servers: https://fpj.me/2015/05/28/dude-wheres-my-metadata/

To Schema On Read or to Schema On Write, That is the Hadoop Data Lake Question

The Hadoop data lake concept can be summed up as, “Store it all in one place, figure out what to do with it later.” But while this might be the general idea of your Hadoop data lake, you won’t get any real value out of that data until you figure out a logical structure for it. And you’d better keep track of your metadata one way or another. It does no good to have a lake full of data if you have no idea what lies under the shiny surface. At some point, you have to give that data a schema, especially if you want to query it with SQL or something like it. The eternal Hadoop question is whether to apply the brave new strategy of schema on read, or to stick with the tried and true method of schema on write. What is Schema on Write? Schema on write has been the standard for many years in relational databases. Before any data is written to the database, the structure of that data is strictly defined, and that metadata is stored and tracked. Irrelevant data is discarded, data types, lengths and...
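The contrast between the two strategies can be sketched in a few lines. Everything here is an assumption made for illustration: the `SCHEMA` fields, the function names, and the use of JSON lines as the raw format are hypothetical and do not come from Hadoop or any particular tool.

```python
import json

# Hypothetical schema: field name -> Python type used to coerce the value.
SCHEMA = {"user_id": int, "country": str}


def write_with_schema(table, record):
    """Schema on write: enforce the structure before anything is stored.

    Fields outside the schema are discarded; missing fields are rejected.
    """
    row = {}
    for field, typ in SCHEMA.items():
        if field not in record:
            raise ValueError(f"missing field: {field}")
        row[field] = typ(record[field])
    table.append(row)


def read_with_schema(raw_lines, schema):
    """Schema on read: raw lines are stored untouched and a schema is
    applied only when the data is queried."""
    for line in raw_lines:
        doc = json.loads(line)
        yield {f: typ(doc[f]) for f, typ in schema.items() if f in doc}


# Schema on write: the extra "debug" field is gone once the row is stored.
table = []
write_with_schema(table, {"user_id": "7", "country": "DE", "debug": "x"})

# Schema on read: the raw line keeps everything, so different queries can
# apply different schemas to the same stored data.
raw = ['{"user_id": "7", "country": "DE", "debug": "x"}']
ids_only = list(read_with_schema(raw, {"user_id": int}))
debug_only = list(read_with_schema(raw, {"debug": str}))
```

The trade-off visible even in this toy version is that the written table is clean and typed but has lost the `debug` field for good, whereas the raw lines can still answer a question nobody planned for, at the cost of parsing on every read.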