Posts

Showing posts with the label Python
Here’s a curated list of resources for data engineers, with sections for algorithms and data structures, SQL, databases, programming, tools, distributed systems, and more.
Useful articles: The AI Hierarchy of Needs; The Rise of Data Engineer; The Downfall of the Data Engineer; A Beginner’s Guide to Data Engineering (Part I, Part II, Part III); Functional Data Engineering — a modern paradigm for batch data processing; How to become a Data Engineer (in Russian)
Talks: Data Engineering Principles - Build frameworks not pipelines by Gatis Seja; Functional Data Engineering - A Set of Best Practices by Maxime Beauchemin; Advanced Data Engineering Patterns with Apache Airflow by Maxime Beauchemin; Creating a Data Engineering Culture by Jesse Anderson
Algorithms & Data Structures: Algorithmic Toolbox in Russian; Data Structures in Russian; Data Structures & Algorithms Specialization on Coursera; Algorithms Specialization from Stanford on Coursera
SQL Com...

Python at Netflix

As many of us prepare to go to PyCon, we wanted to share a sampling of how Python is used at Netflix. We use Python through the full content lifecycle, from deciding which content to fund all the way to operating the CDN that serves the final video to 148 million members. We use and contribute to many open-source Python packages, some of which are mentioned below. If any of this interests you, check out the jobs site or find us at PyCon. We have donated a few Netflix Originals posters to the PyLadies Auction and look forward to seeing you all there.
Open Connect: Open Connect is Netflix’s content delivery network (CDN). An easy, though imprecise, way of thinking about Netflix infrastructure is that everything that happens before you press Play on your remote control (e.g., are you logged in? what plan do you have? what have you watched so we can recommend new titles to you? what do you want to watch?) takes place in Amazon Web Services (AWS), whereas everything that happens after...

Awesome ML interpretability resources

A curated list of awesome machine learning interpretability resources.
Comprehensive Software Examples and Tutorials
Explainability- or Fairness-Enhancing Software Packages (Browser, Python, R)
Free Books
Other Interpretability and Fairness Resources and Lists
Review and General Papers
Limitations of Interpretability
Teaching Resources
Interpretable ("Whitebox") or Fair Modeling Packages (C/C++, Python, R)

50 of the most popular Python libraries and frameworks that are used in data science

This article introduces a landscape diagram which shows 50 or so of the most popular Python libraries and frameworks used in data science. Landscape diagrams illustrate components within a technology stack alongside their complementary technologies. In other words, “How do the parts fit together?” Landscape diagrams provide useful learning materials, helping people conceptualize and discuss complex technology topics. Of course, it’s important to keep this diagram curated and updated as the Python ecosystem evolves. We’ll do that. One caveat: trying to fit lots of complex, interconnected parts into a neatly formatted 2D grid is a challenge. Any diagram must “blur the lines” of definitions to simplify the illustration, and those definitions could be debated at length. On the one hand, the diagram does not include an exhaustive list. We chose popular libraries among widely used categories, but had to skip some. For example, we didn’t go into the varied universe of audio processing lib...

Mozilla releases Iodide, an open source browser tool for publishing dynamic data science

Mozilla wants to make it easier to create, view, and replicate data visualizations on the web, and toward that end, it today unveiled Iodide, an “experimental tool” meant to help scientists and engineers write and share interactive documents using an iterative workflow. It’s currently in alpha, and available from GitHub in open source. “In the last ten years, there has been an explosion of interest in ‘scientific computing’ and ‘data science’: that is, the application of computation to answer questions and analyze data in the natural and social sciences,” Brendan Colloran, staff data scientist at Mozilla, wrote in a blog post. “To address these needs, we’ve seen a renaissance in programming languages, tools, and techniques that help scientists and researchers explore and understand data and scientific concepts, and to communicate their findings. But to date, very few tools have focused on helping scientists gain unfiltered access to the full communication potential of modern w...

Data scientist salaries and jobs in Europe - 2018 snapshot

Glassdoor names “Data Scientist” as the best job in the United States for 2019 and LinkedIn ranks it number one among the top 10. Topping the list for four years in a row, Data Scientist has a job score of 4.7 and a job satisfaction rating of 4.3, with 6,510 open positions paying a median base salary of $108,000 in the U.S. But what is the scenario for Data Scientists in Europe? What is the demand and supply? Which countries in the EU are the best destinations for Data Scientists and what salaries can they expect? A recent report titled Data Science Salary Report 2019 Europe by Big Cloud answers some of these critical questions. First, a little flashback: According to a report by the European Commission in 2017, the number of data workers in Europe will increase up to 10.43 million, with a compound average growth rate of 14.1% by 2020. The EU is forecast to face a data skills gap corresponding to 769,000 unfilled positions by 2020 in the baseline scenario and being concen...

TensorFlow Privacy - training machine learning models with privacy for training data

Google has released TensorFlow Privacy, a free Python library that lets people train TensorFlow models compliant with more stringent user data privacy standards. It uses differential privacy, a technique for training machine learning systems that increases user privacy by letting developers set various trade-offs relating to the amount of noise applied to the user data being processed. This repository contains the source code for TensorFlow Privacy, a Python library that includes implementations of TensorFlow optimizers for training machine learning models with differential privacy. The library comes with tutorials and analysis tools for computing the privacy guarantees provided. The TensorFlow Privacy library is under continual development, always welcoming contributions. In particular, we always welcome help towards resolving the issues currently open.
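To make the noise trade-off concrete, here is a minimal standard-library sketch of the general idea behind differentially private gradient descent: clip each example's gradient, sum the clipped gradients, add Gaussian noise, and average. This is not TensorFlow Privacy's API. The function names and the `max_norm` and `noise_multiplier` parameters are hypothetical, chosen only for illustration, and the sketch does not compute any privacy guarantee.

```python
import math
import random


def clip_by_l2_norm(grad, max_norm):
    """Scale grad down so that its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm <= max_norm or norm == 0.0:
        return list(grad)
    scale = max_norm / norm
    return [g * scale for g in grad]


def noisy_mean_gradient(per_example_grads, max_norm, noise_multiplier, rng):
    """Clip each per-example gradient, sum, add Gaussian noise, then average.

    Hypothetical helper for illustration only; a larger noise_multiplier
    means more noise (more privacy, less accuracy).
    """
    dim = len(per_example_grads[0])
    total = [0.0] * dim
    for grad in per_example_grads:
        clipped = clip_by_l2_norm(grad, max_norm)
        for i, g in enumerate(clipped):
            total[i] += g
    # Noise scale is tied to the clipping bound, which limits how much
    # any single example can influence the sum.
    stddev = noise_multiplier * max_norm
    n = len(per_example_grads)
    return [(t + rng.gauss(0.0, stddev)) / n for t in total]


if __name__ == "__main__":
    rng = random.Random(0)
    grads = [[3.0, 4.0], [0.3, 0.4]]
    print(noisy_mean_gradient(grads, max_norm=1.0, noise_multiplier=1.1, rng=rng))
```

Raising `noise_multiplier` or lowering `max_norm` hides individual examples more strongly at the cost of a noisier update, which is the trade-off the excerpt describes.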

What's the future of the pandas library?

Pandas is a powerful, open source Python library for data analysis, manipulation, and visualization. I've been teaching data scientists to use pandas since 2014, and in the years since, it has grown in popularity to an estimated 5 to 10 million users and become a "must-use" tool in the Python data science toolkit. I started using pandas around version 0.14.0, and I've followed the library as it has significantly matured to its current version, 0.23.4. But numerous data scientists have asked me questions like these over the years: "Is pandas reliable?" "Will it keep working in the future?" "Is it buggy? They haven't even released version 1.0!" Version numbers can be used to signal the maturity of a product, and so I understand why someone might be hesitant to rely on "pre-1.0" software. But in the world of open source, version numbers don't necessarily tell you anything about the maturity or reliability ...

Best Machine Learning Tools

The best-trained soldiers can’t fulfill their mission empty-handed. Data scientists have their own weapons: machine learning (ML) software. There is already a cornucopia of articles listing reliable machine learning tools with in-depth descriptions of their functionality. Our goal, however, was to get the feedback of industry experts. And that’s why we interviewed data science practitioners (gurus, really) regarding the useful tools they choose for their projects. The specialists we contacted have various fields of expertise and are working in such companies as Facebook and Samsung. Some of them represent AI startups (Objection Co, NEAR.AI, and Respeecher); some teach at universities (Kharkiv National University of Radioelectronics). The AltexSoft data science team joined the discussion, too. And if you’re looking for a particular type of tool, just skip to your sector of interest: Languages used in machine learning Data analytics an...

Amazing Infographics and Other Visual Tutorials

Data Science Summarized in One Picture
R for Big Data in One Picture
A Cheat Sheet on Probability
Data Science in Python: Pandas Cheat Sheet
Cheat Sheet: Data Visualisation in Python
Machine Learning Cheat Sheet
The Periodic Table Of AI
Three Periodic Tables
40 maps that explain the Internet
A Guide to the Internet of Things
IoT Tectonics
13 Great Data Science Infographics
Unstructured Data: InfoGraphics
Great Machine Learning Infographics
What is Hadoop? Infog...

Gartner Hype Cycle for Data Science and Machine Learning, 2017

The hype around data science and machine learning has increased from already high levels in the past year. Data and analytics leaders should use this Hype Cycle to understand technologies generating excitement and inflated expectations, as well as significant movements in adoption and maturity.
The Hype Cycle: The Peak of Inflated Expectations is crowded and the Trough of Disillusionment remains sparse, though several highly hyped technologies are beginning to hear the first disillusioned rumblings from the market. In general, the faster a technology moves from the innovation trigger to the peak, the faster the technology moves into the trough as organizations quickly see it as just another passing fad. This Hype Cycle is especially relevant to data and analytics leaders, chief data officers, and heads of data science teams who are implementing machine-learning programs and looking to understand the next-generation innovations. Technology provider product marketers and strategists...