Spark study notes: core concepts visualized

Learning Spark is not an easy thing for a person with less background knowledge on distributed systems. Even though I have been using Spark for quite some time, I find it time-consuming to get a comprehensive grasp of all the core concepts in Spark. The official Spark documentation provides a very detailed explanation, yet it focuses more … Continue reading Spark study notes: core concepts visualized

The basis of Azure Data Factory

In the world of big data, raw, unorganized data is often stored in relational, non-relational, and other storage systems. However, on its own, raw data doesn’t have the proper context or meaning to provide meaningful insights to analysts, data scientists, or business decision makers. Big data requires service that can orchestrate and operationalize processes to … Continue reading The basis of Azure Data Factory

How to Create an ARIMA Model for Time Series Forecasting in Python

A popular and widely used statistical method for time series forecasting is the ARIMA model. ARIMA is an acronym that stands for AutoRegressive Integrated Moving Average. It is a class of model that captures a suite of different standard temporal structures in time series data. In this tutorial, you will discover how to develop an … Continue reading How to Create an ARIMA Model for Time Series Forecasting in Python

Comparing Pulsar and Kafka: unified queuing and streaming

In previous blog posts, we described several reasons why Apache Pulsar is an enterprise-grade streaming and messaging system that you should consider for your real-time use cases. We also took a deep dive into enterprise features like real-time durable storage to prevent data loss, multi-tenancy, geo-replication, encryption, and security. These blog posts helped develop an understanding of Pulsar and generated a lot … Continue reading Comparing Pulsar and Kafka: unified queuing and streaming

Why Apache Pulsar? Part 2

This is part 2 of a series of blog posts that highlights key features of Apache Pulsar (incubating). Apache Pulsar is a next-generation pub-sub messaging system developed at Yahoo. In part 1 of the series, we discussed how Pulsar supports a flexible messaging model, multi-tenancy, geo-replication, and durability. In this post, we’ll continue the discussion by showing how … Continue reading Why Apache Pulsar? Part 2

Why Apache Pulsar? Part 1

Apache Pulsar (incubating) is a next-generation pub/sub messaging system developed at Yahoo. Pulsar was developed from the ground up to address several shortcomings of existing open source messaging systems and has been running in production for three years, powering critical applications like Yahoo! Mail, Yahoo! Finance, Yahoo! Sports, Flickr, the Gemini Ads Platform, and Sherpa, Yahoo’s … Continue reading Why Apache Pulsar? Part 1

Introduction to the Apache Pulsar pub-sub messaging platform

Apache Pulsar (incubating) is an enterprise-grade publish-subscribe (aka pub-sub) messaging system that was originally developed at Yahoo. Pulsar was first open-sourced in late 2016, and is now undergoing incubation under the auspices of the Apache Software Foundation. At Yahoo, Pulsar has been in production for over three years, powering major applications like Yahoo! Mail, Yahoo! Finance, Yahoo! Sports, Flickr, the … Continue reading Introduction to the Apache Pulsar pub-sub messaging platform