A full day of workshops dedicated to understanding the basics and the challenges of big data, led by the Criteo Data Reliability Engineering team. What is data engineering all about? How can we leverage Scala, Python and SQL to do data transformations? Which language is right in which circumstances? What about ad-hoc data exploration and the data engineering development cycle? How can we manage truly interactive reporting on top of terabytes of data? And finally, what are the best practices for monitoring and organizing all of a company’s datasets?
These are core questions a budding data engineer should ask herself. Once armed with answers, she’ll be better prepared not only to be a more efficient data engineer, but also to confront the FUD and misinformation surrounding data engineering and big data systems in general.
Make sure we know the basics of the Scala type system.
Non-strict evaluation is also a thing.
Demystify Functor, Monad, map and flatMap.
Understand the different uses of implicits.
Ad hoc polymorphism in functional programming.
Learn what Cats can bring to your projects.
There are lots of conferences these days and almost all of them promise new and magical insights that will surely revolutionize the way you work. This is not one of them. NABD’s main conference is first and foremost about engineers solving problems and sharing those resolutions with others–and we encourage our speakers to share the bruises they’ve accumulated along the way, because even the best of us have seen some pretty spectacular failures.
So, come join us for a day of sharing, learning, fun and yes, perhaps some group therapy!
Stories of Warp10 use cases from OVHcloud and CleverCloud
At OVHcloud and CleverCloud we make extensive use of time series. Our usage has grown over the years, from monitoring to machine learning, and now extends to billing and IoT.
We will show why we chose Warp10, how it can be your best friend, and how it saves lives!
Two years ago, Spotify introduced Scio, an open-source Scala framework to develop data pipelines and deploy them on Google Dataflow. In this talk, we will discuss the evolution of Scio and share the highlights of running Scio in production for two years. We will showcase several interesting data processing workflows run at Spotify, what we learned from running them in production, and how we leveraged that knowledge to make Scio faster, safer, and easier to use.
Oftentimes what we expect from a dataset doesn't match what's actually there. If you don't know the accuracy of the data, it's difficult to trust any metrics, insights, or models downstream. I work on a team at Spotify that aims to solve this problem, and in this talk we will cover the libraries, infrastructure, and organizational processes we've implemented to address it.
In this talk, we will present Cirrus, a new system in the RISELab (UC Berkeley) that aims to facilitate the development of ML workflows on serverless platforms. During the presentation, we will discuss the challenges of building large-scale systems on existing serverless platforms and propose ways to address those challenges.
Machine learning (ML) workflows are complex. The typical workflow consists of distinct stages of user interaction, such as preprocessing, training, and tuning, that are repeatedly executed by users but have heterogeneous computational requirements. Serverless computing is a compelling model for addressing the resource management problem in general, but adopting it for existing ML frameworks poses numerous challenges because of significant restrictions on local resources.
In this talk, we will present the Cirrus system design and API and discuss the mechanisms it uses to efficiently preprocess data, train models, and tune model parameters at scale. At the end, we will propose a new serverless architecture that better supports data-intensive workloads.
What does a state-of-the-art data platform look like? What services does my data platform provide, or should it provide, to my users? Where should I focus the effort?
Together we will go through six topics: Platform, Operations, Discovery, Monitoring, Lineage and Business value, and define criteria to evaluate the platform on a scale of 1 (you can do better) to 5 (it rocks!). This method has been applied to Criteo's use cases and challenges, with hundreds of individual contributors and 200 TB of new data arriving every day. We use it to evaluate our services and build our roadmap for the coming years.
ML engineers can spend many cycles on iterative model development: manual, ad-hoc experimentation to improve a model's performance over an established baseline. We've seen them labor for weeks, months, and sometimes years to improve model performance—in many cases, this is an engineer's entire job. In this talk, we introduce the paradigm of search-driven model development by developing search spaces instead of developing models. In practice, we have applied this paradigm to reproduce the results from months of iterative work in 24 hours.