Docker in a Nutshell

I want to start by tackling two very important questions that we are going to answer throughout this blog post: What is Docker? And why do we use Docker? Let’s first answer why we use Docker by going through a quick little demo right now. Let’s have a look at this … Continue reading Docker in a Nutshell

Spark study notes: core concepts visualized

Learning Spark is not easy for a person with little background knowledge of distributed systems. Even though I have been using Spark for quite some time, I find it time-consuming to get a comprehensive grasp of all the core concepts in Spark. The official Spark documentation provides a very detailed explanation, yet it focuses more … Continue reading Spark study notes: core concepts visualized

How to Create an ARIMA Model for Time Series Forecasting in Python

A popular and widely used statistical method for time series forecasting is the ARIMA model. ARIMA is an acronym for AutoRegressive Integrated Moving Average. It is a class of models that captures a suite of standard temporal structures in time series data. In this tutorial, you will discover how to develop an … Continue reading How to Create an ARIMA Model for Time Series Forecasting in Python
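As a minimal sketch of the idea behind ARIMA’s autoregressive component (the tutorial itself would normally use a full ARIMA implementation such as statsmodels’ `ARIMA` class; the data and coefficient below are made up for illustration), an AR(1) model can be fit with plain NumPy least squares:

```python
import numpy as np

# Illustrative sketch only: estimate the AR(1) coefficient of an
# ARIMA-style model by least squares on synthetic data. A real ARIMA
# fit also handles differencing (I) and moving-average (MA) terms.

rng = np.random.default_rng(42)

# Simulate y[t] = 0.7 * y[t-1] + noise
n = 500
y = np.zeros(n)
for t in range(1, n):
    y[t] = 0.7 * y[t - 1] + rng.normal(scale=0.5)

# Regress y[t] on y[t-1]: phi_hat = sum(y[t] * y[t-1]) / sum(y[t-1]^2)
x, target = y[:-1], y[1:]
phi_hat = float(x @ target / (x @ x))
print(round(phi_hat, 2))  # should recover roughly 0.7

# One-step-ahead forecast from the last observation
forecast = phi_hat * y[-1]
```

The same fit-then-forecast workflow is what a library ARIMA model automates, with model order (p, d, q) chosen by the user.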

Introduction to the Apache Pulsar pub-sub messaging platform

Apache Pulsar (incubating) is an enterprise-grade publish-subscribe (aka pub-sub) messaging system that was originally developed at Yahoo. Pulsar was first open-sourced in late 2016, and is now undergoing incubation under the auspices of the Apache Software Foundation. At Yahoo, Pulsar has been in production for over three years, powering major applications like Yahoo! Mail, Yahoo! Finance, Yahoo! Sports, Flickr, the … Continue reading Introduction to the Apache Pulsar pub-sub messaging platform
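The publish-subscribe pattern that Pulsar implements (producers publish messages to named topics; consumers subscribe and receive them, decoupled from the producers) can be sketched with a toy in-memory broker. This is not the Pulsar client API (that would be the `pulsar-client` package); it only shows the topic/producer/subscriber relationship:

```python
from collections import defaultdict
from typing import Callable

class Broker:
    """Toy in-memory pub-sub broker, for illustration only."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, topic: str, callback: Callable[[str], None]) -> None:
        # A consumer registers interest in a topic.
        self._subscribers[topic].append(callback)

    def publish(self, topic: str, message: str) -> None:
        # A producer sends a message; every subscriber on the topic receives it.
        for callback in self._subscribers[topic]:
            callback(message)

broker = Broker()
received = []
broker.subscribe("mail-events", received.append)
broker.publish("mail-events", "new message for user 42")
print(received)  # -> ['new message for user 42']
```

A real system like Pulsar adds persistence, ordering guarantees, multiple subscription modes, and horizontal scaling on top of this basic shape.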

How we built a data pipeline with Lambda Architecture using Spark/Spark Streaming

Walmart Labs is a data-driven company. Many business and product decisions are based on the insights derived from data analysis. I work on Expo, which is the A/B testing platform for Walmart. As part of the platform, we built a data ingestion and reporting pipeline which is used by the experimentation team to identify how … Continue reading How we built a data pipeline with Lambda Architecture using Spark/Spark Streaming

Introduction to Apache Spark

What is the need for Spark? Hadoop MapReduce is limited to batch processing. Apache Storm/S4 is limited to real-time stream processing. Apache Impala/Tez is limited to interactive processing. Neo4j/Apache Giraph is limited to graph processing. Hence there was no powerful engine in the industry that could process data in real time (streaming) as well as … Continue reading Introduction to Apache Spark
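The batch map/reduce computation that Hadoop MapReduce handles, and that Spark generalizes alongside streaming, interactive, and graph workloads, can be sketched in plain Python. In PySpark this word count would be an RDD pipeline (`flatMap`, `map`, `reduceByKey`); the plain-Python version below, with invented sample lines, just shows the map and reduce-by-key phases:

```python
from itertools import groupby

# Sample input lines (made up for illustration).
lines = ["spark unifies batch", "and streaming", "batch and graph"]

# Map phase: each line -> (word, 1) pairs.
pairs = [(word, 1) for line in lines for word in line.split()]

# Shuffle: group pairs by key (sorting brings equal keys together).
pairs.sort()

# Reduce phase: sum the counts for each word.
counts = {word: sum(c for _, c in group)
          for word, group in groupby(pairs, key=lambda kv: kv[0])}

print(counts)  # e.g. counts["batch"] == 2, counts["and"] == 2
```

Spark’s contribution is running this same dataflow on a cluster, in memory, with the same engine also serving streaming and interactive queries.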