The Hadoop data lake concept can be summed up as, “Store it all in one place, figure out what to do with it later.” But while this might be the general idea behind your Hadoop data lake, you won’t get any real value out of that data until you impose a logical structure on it, and you’d better keep track of your metadata one way or another. It does no good to have a lake full of data if you have no idea what lies beneath the shiny surface. At some point, you have to give that data a schema, especially if you want to query it with SQL or something like it. The eternal Hadoop question is whether to apply the brave new strategy of schema on read, or to stick with the tried and true method of schema on write.

What is Schema on Write?

Schema on write has been the standard in relational databases for many years. Before any data is written to the database, the structure of that data is strictly defined, and that metadata is stored and tracked. Irrelevant data is discarded; data types, lengths and...
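To make that concrete, here is a minimal sketch of schema on write using Python’s built-in sqlite3 module. The web_events table and its columns are made up for illustration, and the STRICT keyword assumes SQLite 3.37 or newer; the point is simply that the structure is declared before any data arrives, and a write that violates it fails on the spot:

```python
import sqlite3

# Schema on write: column names, types, and constraints are declared
# up front, before a single row of data exists. (Table and column
# names here are hypothetical; STRICT needs SQLite 3.37+.)
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE web_events (
        event_id   INTEGER PRIMARY KEY,
        user_id    INTEGER NOT NULL,
        event_type TEXT    NOT NULL,
        occurred   TEXT    NOT NULL   -- ISO-8601 timestamp
    ) STRICT
""")

# A row that matches the declared schema is accepted...
conn.execute(
    "INSERT INTO web_events VALUES (?, ?, ?, ?)",
    (1, 42, "page_view", "2024-01-15T09:30:00"),
)

# ...while a row that violates it is rejected at write time.
try:
    conn.execute(
        "INSERT INTO web_events VALUES (?, ?, ?, ?)",
        (2, "not-a-number", "click", "2024-01-15T09:31:00"),
    )
except sqlite3.Error as err:
    print("write rejected:", err)  # the schema is enforced before the data lands
```

With schema on read, by contrast, that second row would land in the lake untouched, and the mismatch would only surface later, whenever someone tried to query the data against a schema.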