Designing NRT (Near-Real-Time) stream processing systems: Using Storm

Scale
06/01/2015 - 12:50 to 13:30
Stage 2
long talk (40 min)
Beginner

Session abstract: 

The essence of near-real-time stream processing is to process huge volumes of data as they are received. This talk will focus on building a pipeline that collects huge volumes of data and processes them in near-real time using Storm.

Storm is a high-volume, continuous, reliable stream processing system developed at BackType and open-sourced by Twitter. It is widely used across many organizations and has a variety of use cases, such as:

* Realtime analytics     

* Distributed RPC     

* ETL etc.   

 

  Over the course of 40 minutes, using real-time Wikipedia edits as an example, we will try to understand:

 * Basic concepts of stream processing.

 * A high-level understanding of the components involved in Storm.

 * Writing a producer in Python that pushes the real-time edit feed from Wikipedia into a queue.

 * Writing Storm topologies in Python to consume the feed and compute real-time metrics such as:

      * Number of articles edited.

      * Category-wise count of articles being edited.

      * Distinct people editing the articles.

      * Geolocation counters, etc.

 * Technological challenges around near-real-time stream processing systems:

      * Achieving low latency compared to batch processing.

      * State management in workers to maintain aggregated counts, such as the number of edits per article category.

      * Handling failures and crashes.

      * Deployment strategies.
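
To give a flavor of the metrics and the state-management challenge listed above, here is a minimal sketch in plain Python of the running aggregates a worker might keep. It deliberately does not use any Storm API. The class name `EditMetrics`, the helper `aggregate`, and the event fields (`title`, `category`, `user`, `country`) are assumptions made for illustration, not the actual format of the Wikipedia feed or the code shown in the talk.

```python
from collections import Counter


class EditMetrics:
    """In-memory running aggregates over a stream of edit events.

    Hypothetical sketch: in a real topology this logic would live inside
    a worker, and this state would be lost if the worker crashed.
    """

    def __init__(self):
        self.articles = set()         # distinct article titles edited
        self.editors = set()          # distinct users who made edits
        self.by_category = Counter()  # number of edits per category
        self.by_country = Counter()   # number of edits per country code

    def update(self, event):
        """Fold one edit event (a dict with assumed field names) into the state."""
        self.articles.add(event["title"])
        self.editors.add(event["user"])
        self.by_category[event.get("category", "unknown")] += 1
        self.by_country[event.get("country", "unknown")] += 1

    def snapshot(self):
        """Return the current metrics as plain Python values."""
        return {
            "articles_edited": len(self.articles),
            "distinct_editors": len(self.editors),
            "edits_by_category": dict(self.by_category),
            "edits_by_country": dict(self.by_country),
        }


def aggregate(events):
    """Process an iterable of events one at a time, as a stream consumer would."""
    metrics = EditMetrics()
    for event in events:
        metrics.update(event)
    return metrics.snapshot()
```

Because every count lives only in the worker's memory, a crash would reset it to zero, which is why state management and failure handling appear together in the list of challenges.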