A full day of workshops led by the Criteo Data Reliability Engineering team, dedicated to understanding the basics and the challenges of big data. What is data engineering all about? How can we leverage Scala, Python and SQL to do data transformations? Which language is right in which circumstances? What about ad-hoc data exploration and the data engineering development cycle? How can we manage truly interactive reporting on top of TBs of data? And finally, what are the best practices for monitoring and organizing all of a company’s datasets?
These are core questions a budding data engineer should ask herself; once armed with answers, she’ll be better prepared not only to be a more efficient data engineer, but also to confront the FUD and general misinformation surrounding data engineering and big data systems in general.
Register now: the first 40 registrants are eligible to attend “Functional Programming with Scala”.
Just ensure that we know the basics of the Scala type system.
Understand non-strict evaluation.
Demystify Functor, Monad, map and flatMap (a short sketch follows this list).
Understand the different usages of implicit.
Ad-hoc polymorphism in functional programming.
Learn what Cats can bring to your projects.
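To give a flavor of these topics, here is a minimal Scala sketch touching on non-strict evaluation, ad-hoc polymorphism via a type class and implicits, and map/flatMap through Cats’ Functor. It assumes Cats on the classpath and Scala 2.12+; the names (Pretty, show, orElse) are invented for illustration and are not the workshop’s actual exercises.

```scala
// Minimal sketch of the workshop themes, assuming Cats 2.x and Scala 2.12+.
// Names and examples are illustrative only.
import cats.Functor
import cats.implicits._

object ScalaWorkshopSketch extends App {

  // Non-strict evaluation: the by-name parameter is only evaluated if needed.
  def orElse[A](primary: Option[A], fallback: => A): A =
    primary.getOrElse(fallback)

  // Ad-hoc polymorphism with a type class plus implicit instances.
  trait Pretty[A] { def pretty(a: A): String }
  implicit val prettyInt: Pretty[Int] = (a: Int) => s"Int($a)"
  implicit val prettyString: Pretty[String] = (a: String) => s"String($a)"
  def show[A](a: A)(implicit p: Pretty[A]): String = p.pretty(a)

  // map comes from Functor, flatMap from Monad; Cats provides both for Option.
  val doubled: Option[Int] = Functor[Option].map(Some(21))(_ * 2)
  val chained: Option[Int] = Some(2).flatMap(x => Some(x + 40))

  println(show(42))        // Int(42)
  println(show("hello"))   // String(hello)
  println(orElse(None, 7)) // 7: the fallback, evaluated only because primary is empty
  println(doubled)         // Some(42)
  println(chained)         // Some(42)
}
```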
There are lots of conferences these days and almost all of them promise new and magical insights that will surely revolutionize the way you work. This is not one of them. NABD’s main conference is first and foremost about engineers solving problems and sharing those solutions with others, and we encourage our speakers to share the bruises they’ve accumulated along the way, because even the best of us have seen some pretty spectacular failures.
So, come join us for a day of sharing, learning, fun and yes, perhaps some group therapy!
Danai Koutra
Yongjoo Park
Jie Song
Rui Liu
Mark Heimann
Jiliang Tang
Jason Harper
Jeremy McMinis
Sean Law
Fletcher Liverance
Piyush Narang
Ivan Lobov
Justin Coffey
Guillaume Bort
Criteo is using machine learning to tackle a wide variety of use-cases in online advertising such as click prediction, recommendation systems, auction theory and many others. These problems and challenges become much more interesting at our scale. We train our models on datasets that are hundreds of terabytes in size and need to perform inference in less than a millisecond. Building scalable infrastructure for ML practitioners within Criteo is a crucial part of helping them develop, productionize and iterate on their models. This talk covers how we tackle some of these challenges related to machine learning infrastructure at Criteo. In particular, we focus on how we can increase the speed of iteration when it comes to feature engineering, feature management and building ML datasets. We plan to talk about the various problems we’re trying to solve in this space as well as how we leverage various open source and internal components to address them.
Networks naturally capture a host of real-world interactions, from social interactions and email communication to web browsing to brain activity. Over the past few years, representation learning over networks has been shown to be successful in a variety of downstream tasks, such as classification, link prediction, and visualization. Most existing approaches seek to learn node representations that capture node proximity. In this talk, I will discuss our recent work on a different class of node representations that aim to preserve the structural similarity between the nodes. I will present the lessons learned from designing efficient structural embedding methods for large-scale heterogeneous data, including ways to overcome the computational challenges and massive storage requirements that many existing techniques face. Throughout the talk, I will discuss applications to professional role discovery, entity resolution, entity linking across data sources, and more.
We all know how hard Big Data stacks can be to build, use and maintain. Gartner estimates that 85% of big data projects are killed before production release. In this talk, engineering leaders from Criteo's Data Reliability Engineering team will show how widespread use of SQL addressed the two biggest issues in data engineering: systems efficiency and developer productivity. Criteo has hundreds of PBs of data under management, with over 100K cores and 1PB+ of main memory available for processing it. In addition to the pure scale of the system, there are 500+ developers from around the world interacting with the system directly, the vast majority of whom have at one point or another pushed data transformation code into production. The unique challenges of truly huge scale, highly concurrent workloads and geographic distribution of users required an equally unique approach (and quite a lot of serious engineering and good old-fashioned elbow grease). One doesn't have to look very far back to realize that the RDBMS paradigm of a referentially transparent, lazily evaluated, declarative (and highly expressive) language executing on top of a separately optimizable and easily abstracted-away run-time could reap huge benefits. With the advent of technologies like Hive, Spark-SQL and Presto we are clearly not the first engineers to think of the problem in these terms, but we decided to see just how far we could push SQL by leveraging it in every nook and cranny of our data infrastructure.
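As a rough illustration of this SQL-first style, here is a minimal Spark-SQL sketch in Scala (Spark-SQL being one of the engines named above). The table, columns and values are invented for the example and are not Criteo's actual schemas or pipelines.

```scala
// Minimal sketch of a declarative SQL transformation running on Spark-SQL.
// Table and column names are purely illustrative.
import org.apache.spark.sql.SparkSession

object SqlEverywhereSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("sql-everywhere-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Register some input data as a view; in production this would be a
    // partitioned table in the metastore rather than an in-memory dataset.
    Seq(("2019-10-01", "us", 12L), ("2019-10-01", "fr", 7L))
      .toDF("day", "country", "clicks")
      .createOrReplaceTempView("clicks_raw")

    // The transformation itself is declarative SQL: the engine (Hive,
    // Spark-SQL, Presto, ...) is free to optimize and schedule it.
    val daily = spark.sql(
      """
        |SELECT day, country, SUM(clicks) AS clicks
        |FROM clicks_raw
        |GROUP BY day, country
      """.stripMargin)

    daily.show()
    spark.stop()
  }
}
```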
Graphs provide a universal representation of data of numerous types, while deep learning has demonstrated immense ability in representation learning. Thus, bridging deep learning with graphs presents astounding opportunities to enable general solutions for a variety of real-world problems. However, traditional deep learning techniques that were disruptive for regular grid data such as images and sequences are not immediately applicable to graph-structured data. Therefore, marrying these two areas faces tremendous challenges. In this talk, I will first discuss these opportunities and challenges, then share a series of research projects on deep learning on graphs from my group, and finally discuss promising research directions.
In this talk, you’ll learn about a brand-new, scalable approach to exploring time series or sequential data. If anybody has ever asked you to analyze time series data and look for new insights, then this is definitely the open source tool that you’ll want to add to your arsenal.
The problem of building a recommender system from implicit feedback is well studied and has many proposed solutions, from BPR to VAE-CF. But in real-world applications it may face significant constraints: billions of products and hundreds of millions of users, biased data sampling, user cold-start problems, and more. No state-of-the-art model is able to solve all of these, and we needed to implement something reliable and working in a relatively short time. In this talk we’ll walk you through the options we explored and how we managed to solve most of these challenging problems in a “good enough” way.
Answering crucial socioeconomic questions often requires combining and comparing data across two or more independently collected data sets. However, these data sets are often reported as aggregates over data collection units, such as geographical units, which may differ across data sets. Examples of geographical units include counties, zip codes and school districts, and these units are often incongruent across data sets. To be able to compare these data, it is necessary to realign the aggregates from the source units to a set of spatially congruent target geographical units. Existing intelligent areal interpolation/realignment methods, however, make strong assumptions about the spatial properties of the attribute of interest based on domain knowledge of its distribution. A more practical approach is to use available reference data sources to aid in this alignment; the selection of the references is vital to the quality of prediction.
In this paper, we devise GeoAlign, a novel multi-reference crosswalk algorithm that estimates aggregates in the desired target units. GeoAlign adapts to new attributes while requiring neither distribution-related domain knowledge of the attribute of interest nor knowledge of its spatial properties in a Geographic Information System (GIS). We show that GeoAlign can easily be extended to perform aggregate realignment in multi-dimensional space for general use. Experiments on real, public government datasets show that GeoAlign achieves equal or better accuracy in root mean square error (RMSE) than the leading state-of-the-art approach without sacrificing scalability and robustness.
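As a toy illustration of the realignment task itself (and not of GeoAlign's algorithm), the sketch below redistributes an aggregate from source units to target units using hand-specified overlap weights; GeoAlign's point is precisely to avoid such hand-crafted spatial assumptions by relying on reference data sources instead. All unit names and numbers are made up.

```scala
// Toy illustration of areal realignment (NOT GeoAlign's algorithm):
// redistribute an aggregate reported on source units onto target units in
// proportion to how much of each source unit overlaps each target unit.
object ArealRealignmentSketch extends App {

  // Aggregate of interest per source unit (e.g. population per zip code).
  val sourceAggregates: Map[String, Double] = Map("zipA" -> 100.0, "zipB" -> 60.0)

  // weight((source, target)) = fraction of the source unit lying in the
  // target unit; for each source these fractions sum to 1.
  val overlap: Map[(String, String), Double] = Map(
    ("zipA", "district1") -> 0.7, ("zipA", "district2") -> 0.3,
    ("zipB", "district1") -> 0.2, ("zipB", "district2") -> 0.8)

  // Reallocate each source aggregate, then sum per target unit.
  val targetAggregates: Map[String, Double] = overlap.toSeq
    .map { case ((src, tgt), w) => tgt -> sourceAggregates(src) * w }
    .groupBy(_._1)
    .map { case (tgt, xs) => tgt -> xs.map(_._2).sum }

  println(targetAggregates) // district1 -> 82.0, district2 -> 78.0
}
```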
Problems involving multiple networks are prevalent in many scientific and other domains. In particular, network alignment, or the task of identifying corresponding nodes in different networks, has applications across the social and natural sciences. Motivated by recent advancements in node representation learning for single-graph tasks, we propose REGAL (REpresentation learning-based Graph ALignment), a framework that leverages the power of automatically learned node representations to match nodes across different graphs. Within REGAL we devise xNetMF, an elegant and principled node embedding formulation that uniquely generalizes to multi-network problems. Our results demonstrate the utility and promise of unsupervised representation learning-based network alignment in terms of both speed and accuracy. REGAL runs up to 30× faster in the representation learning stage than comparable methods, outperforms existing network alignment methods by 20 to 30% accuracy on average, and scales to networks with millions of nodes each.
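For readers new to network alignment, the toy sketch below shows only the generic matching step: given node embeddings for two graphs (however they were learned), pair each node with its most similar counterpart in the other graph. It is not xNetMF or REGAL's actual, far more scalable implementation; node names and embedding values are made up.

```scala
// Toy sketch of the alignment step: match each node of graph 1 to its most
// similar node in graph 2 by embedding similarity. Illustrative only.
object EmbeddingAlignmentSketch extends App {

  def cosine(a: Array[Double], b: Array[Double]): Double = {
    val dot = a.zip(b).map { case (x, y) => x * y }.sum
    val na  = math.sqrt(a.map(x => x * x).sum)
    val nb  = math.sqrt(b.map(x => x * x).sum)
    if (na == 0 || nb == 0) 0.0 else dot / (na * nb)
  }

  // Embeddings indexed by node id; values are invented for illustration.
  val graph1: Map[String, Array[Double]] = Map(
    "a1" -> Array(0.9, 0.1), "b1" -> Array(0.2, 0.8))
  val graph2: Map[String, Array[Double]] = Map(
    "a2" -> Array(0.85, 0.15), "b2" -> Array(0.1, 0.9))

  // Greedy alignment: each node in graph 1 picks its nearest node in graph 2.
  val alignment: Map[String, String] = graph1.map { case (n1, e1) =>
    n1 -> graph2.maxBy { case (_, e2) => cosine(e1, e2) }._1
  }

  println(alignment) // a1 -> a2, b1 -> b2
}
```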
Ever need to process eight hundred and fifty million records per second? Ever need to support thousands of publishers/subscribers and tens of thousands of streams/tables? Come learn how to build an exabyte scale data lake for the next generation of engineers, analysts and scientists and see what it takes to build analytics systems that unify batch ETL, streaming and machine learning.
Maximum Inner Product Search (MIPS) is an important component in many machine learning applications, including recommendation systems. There has been substantial research on sub-linear time approximate algorithms for MIPS. To achieve fast query time, state-of-the-art techniques require significant preprocessing, which can be a burden when the number of subsequent queries is not sufficiently large to amortize the cost. Furthermore, existing methods do not have the ability to directly control the suboptimality of their approximate results with theoretical guarantees. In this paper, we propose the first approximate algorithm for MIPS that does not require any preprocessing, and allows users to control and bound the suboptimality of the results. We cast MIPS as a Best Arm Identification problem, and introduce a new bandit setting that can fully exploit the special structure of MIPS. Our approach outperforms state-of-the-art methods on both synthetic and real-world datasets.
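To make the problem concrete, here is a small Scala sketch of the exact, linear-scan MIPS baseline. Roughly speaking, the bandit formulation described above treats each candidate item as an arm and adaptively refines estimates of these inner products so that weak candidates can be discarded before a full computation; that adaptive part is not shown here. Item names and values are invented.

```scala
// Sketch of the MIPS objective: given a query q and a set of item vectors,
// return the item maximising the inner product <q, x>. Exact baseline only.
object MipsSketch extends App {

  def innerProduct(q: Array[Double], x: Array[Double]): Double =
    q.zip(x).map { case (a, b) => a * b }.sum

  // Exact MIPS: O(number_of_items * dimension).
  def exactMips(query: Array[Double], items: Map[String, Array[Double]]): String =
    items.maxBy { case (_, x) => innerProduct(query, x) }._1

  val items = Map(
    "item-1" -> Array(0.2, 0.9, 0.1),
    "item-2" -> Array(0.8, 0.3, 0.5),
    "item-3" -> Array(0.4, 0.4, 0.4))
  val query = Array(1.0, 0.0, 0.5)

  println(exactMips(query, items)) // item-2 (0.8 + 0.0 + 0.25 = 1.05)
}
```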
Groundspeed Analytics works with partners in the insurance industry to extract and structure information from business documents. We process everything from PDFs and Excel files to TIFFs faxed in 1982! After extracting the text from these documents, we run it through our data pipeline to identify entities, terms, and details in the documents, and structure them in the format required by our customers (often spreadsheets, sometimes relational databases). In this talk, I will give an overview of the kinds of problems we're tackling (OCR, NLP, and similarity scoring) as well as the infrastructure and projects we're leveraging to solve them (Kubernetes, Kubeflow, and Seldon).
BBB
Computer Science Building
2260 Hayward St
Ann Arbor, MI 48109