Announcing Great Expectations v0.4 (We have SQL…!)
Based on feedback from the past month, we’ve revised, improved, and extended Great Expectations. 284 commits, 103 files changed, and 7 new contributors later, we’ve just released v0.4!
Here’s what’s new.
#1 Native SQL
By far the most common request we received was the ability to run expectations natively in SQL. This was always on the roadmap. The community response made it our top priority.
We’ve introduced a new class called SQLAlchemyDataset. It contains all* the same expectations as the original PandasDataset class, but instead of executing them against a DataFrame in local memory, it executes them against a database table using the SQLAlchemy core API.
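For instance, here’s a minimal sketch of wiring up a SQLAlchemyDataset; the connection URL and table name are placeholders, and the exact import path and constructor arguments may vary by version:

```python
import sqlalchemy as sa
from great_expectations.dataset import SQLAlchemyDataset

# Any database SQLAlchemy can bind to; the URL below is a placeholder.
engine = sa.create_engine("postgresql://user:pass@localhost:5432/mydb")

# Wrap an existing table so that expect_* methods execute as SQL against it.
orders = SQLAlchemyDataset(table_name="orders", engine=engine)

orders.expect_column_values_to_not_be_null("order_id")
```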
This gets us several wins, all at once:
- Since SQLAlchemy binds to most popular databases, we get immediate integration with all of those systems. We’ve already heard from teams developing against PostgreSQL, Presto/Hive, and SQL Server. We expect to see lots more adoption on this front soon.
- Since the SQLAlchemy API is consistent across databases, we can maintain compatibility with many databases with a minimum of new code in Great Expectations. (Note: it’s not unlikely that we will eventually have to include some non-standard code for specific databases. In that case, we can subclass WeirdDBDataset from SQLAlchemyDataset to keep the code footprint to a minimum; see the sketch after this list.)
- This approach takes the compute to the data. For pipeline testing to work in practice, expectations must be able to execute natively within whatever data processing systems people are already working with. Almost everybody uses SQL somewhere in their stack. Now Great Expectations can live there, too.
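To make that note concrete, here’s roughly what such a subclass could look like; WeirdDBDataset and its override are purely illustrative:

```python
from great_expectations.dataset import SQLAlchemyDataset

class WeirdDBDataset(SQLAlchemyDataset):
    """Illustrative subclass: override only the expectations that need
    database-specific SQL; inherit everything else unchanged."""

    def expect_column_values_to_not_be_null(self, column, **kwargs):
        # Hypothetical: dialect-specific handling would go here before
        # (or instead of) delegating to the generic implementation.
        return super().expect_column_values_to_not_be_null(column, **kwargs)
```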
Practically speaking, this means that teams that manage most of their pipelines in SQL can apply pipeline testing using the same expectation syntax that the Pandas version uses, without copying tables out of the database all the time.
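Here’s what that parity looks like in practice; the data, file, and connection details below are placeholders:

```python
import great_expectations as ge
import pandas as pd
import sqlalchemy as sa
from great_expectations.dataset import SQLAlchemyDataset

# The same check against an in-memory DataFrame...
df = ge.from_pandas(pd.DataFrame({"latency_ms": [12, 480, 3100]}))
df.expect_column_values_to_be_between("latency_ms", 0, 30000)

# ...and against a database table, with no change in syntax.
engine = sa.create_engine("sqlite:///pipeline.db")  # placeholder database
events = SQLAlchemyDataset(table_name="events", engine=engine)
events.expect_column_values_to_be_between("latency_ms", 0, 30000)
```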
Caveat: A moment ago, we said that SQLAlchemyDataset “contains all* the same expectations as the original PandasDataset class.” That’s technically true. However, they’re not all implemented yet. (See the release notes for the full list.)
We hope that some of you will find it in your hearts to help finish these NotYetImplemented expectations. (Because of the magic of decorators like @column_map_expectation, implementing a new expectation is often just a couple of lines of code; see the sketch below.) If not, the core team will continue to chip away at them.
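To give a flavor of what that looks like, here’s a sketch of a custom column-map expectation; the class names and decorator placement are assumptions about the v0.4-era API, so check the docs for your version:

```python
from great_expectations.dataset import MetaPandasDataset, PandasDataset

class MyCustomDataset(PandasDataset):
    @MetaPandasDataset.column_map_expectation
    def expect_column_values_to_be_lowercase(self, column):
        # Return a boolean Series: True wherever the value passes the check.
        # The decorator handles result formatting and the rest of the
        # expectation plumbing.
        return column.map(lambda value: value == value.lower())
```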
#2 A cleaner expectations result API
In Great Expectations v0.3.*, there were several subtle but pervasive inconsistencies in the result_objs returned from expectations. The biggest complaints we heard from power users of Great Expectations revolved around these inconsistencies. (You can find details in Issue 175 and the release notes.)
These are fixed in v0.4. This API cleanup puts the project on much firmer footing for future releases. For teams that have been using Great Expectations extensively: thanks for surfacing points of confusion and helping us resolve them. For teams that are just starting to use Great Expectations: trust us, we’ve just saved you a bunch of headaches down the road.
That said, we know that introducing a change of this kind will cause headaches of its own: it will break downstream code that consumes expectation results. Fixing that code shouldn’t require anything more complicated than unpacking JSON objects differently. And you can always pin to version v0.3.2 as a temporary fix.
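(If you need a moment before migrating, pinning is a one-liner: pip install great_expectations==0.3.2.)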
We promise we won’t do this too often. Please get in touch via Issue #247 if this migration gives you any trouble.
Other notable changes
Thanks, @dlwhite5, for diving deep into the pandas internals so that operations on a PandasDataset now return another PandasDataset (instead of a regular pandas.DataFrame).
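A sketch of what this enables (the data here is illustrative):

```python
import great_expectations as ge
import pandas as pd

ge_df = ge.from_pandas(pd.DataFrame({"amount": [1.0, 2.0, None]}))

# dropna() used to hand back a plain pandas.DataFrame; now the result
# keeps the expect_* interface.
nonnull = ge_df.dropna()
nonnull.expect_column_values_to_not_be_null("amount")
```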
Thanks to @ccnobbli for implementing expect_column_parameterized_distribution_ks_test_p_value_to_be_greater_than! This expectation allows you to compare a column against parameterized continuous distributions implemented in scipy (e.g. normal, Poisson, beta, etc.)
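A sketch of how a call might look; the file is a placeholder, the distribution name follows scipy’s conventions, and the keyword arguments are assumptions about the released signature:

```python
import great_expectations as ge

df = ge.read_csv("measurements.csv")  # placeholder file

# Hypothetical call: test whether `temperature` is consistent with a
# normal distribution at the given p-value threshold. Argument names
# are illustrative; check the docs for the exact signature.
df.expect_column_parameterized_distribution_ks_test_p_value_to_be_greater_than(
    "temperature",
    distribution="norm",
    p_value=0.05,
)
```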
Thanks @schrockn for suggesting and implementing ge.from_pandas(), to make pandas-to-great_expectations conversion more discoverable and user-friendly. We also implemented a top-level ge.validate() for the same reason.
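A quick sketch of both helpers; the commented-out ge.validate() arguments are an assumption about its signature, not gospel:

```python
import great_expectations as ge
import pandas as pd

df = pd.DataFrame({"user_id": [1, 2, 3]})

# Wrap an existing DataFrame so the expect_* methods become available.
ge_df = ge.from_pandas(df)
ge_df.expect_column_values_to_be_unique("user_id")

# Hypothetical usage: validate a plain DataFrame against a saved
# expectations configuration in one call; check the docs for the
# exact signature.
# ge.validate(df, expectations_config)
```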
Thanks @louispotok for adding a column_index parameter to expect_column_to_exist, so that users can test column orderings.
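For example (assuming column_index is the expected zero-based position):

```python
import great_expectations as ge
import pandas as pd

ge_df = ge.from_pandas(pd.DataFrame({"user_id": [1], "amount": [9.99]}))

# Passes only if "user_id" exists and sits at position 0.
ge_df.expect_column_to_exist("user_id", column_index=0)
```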
Thanks to @rjurney for suggesting a ge.read_json() helper function to read files that contain json lines.
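A sketch, where events.json is a hypothetical newline-delimited JSON file:

```python
import great_expectations as ge

# Each line of the file is a standalone JSON record.
events = ge.read_json("events.json")
events.expect_column_to_exist("event_type")
```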
We also made some deep, behind-the-scenes improvements to the Great Expectations testing framework to ensure parity across data contexts. This is a big enough deal that it will probably get its own blog post soon.
Full release notes are here.
Onward and upward
Thanks again to everyone who contributed feedback and code to this release. Please keep it coming!
We’re excited to make Great Expectations more useful. Together, we will obliterate pipeline debt, once and for all.