Posts

Showing posts from May, 2021

Automated Data Wrangling

  A growing array of techniques apply machine learning directly to the problems of data wrangling. They often start out as open research projects but then become proprietary. How can we build automated data wrangling systems for open data? We work with a lot of messy public data. In theory it’s already “structured” and published in machine readable forms like Microsoft Excel spreadsheets, poorly designed databases, and CSV files with no associated schema. In practice it ranges from almost unstructured to… almost structured. Someone working on one of our take-home questions for the data wrangler & analyst position recently noted of the FERC Form 1: “This database is not really a database – more like a bespoke digitization of a paper form that happened to be built using a database.” And I mean, yeah. Pretty much. The more messy datasets I look at, the more I’ve started to question Hadley Wickham’s famous Tolstoy quip about the uniqueness of messy data. There’s a taxonomy of diffe...

What is a Vector Database?

  The meteoric rise of machine learning in the last few years has led to increasing use of vector embeddings. They are fundamental to many models and approaches, and are a potent tool for applications such as semantic search, similarity search, and anomaly detection. The unique nature, growing volume, and rising importance of vector embeddings make it necessary to find new methods of storage and retrieval. We need a new kind of database.