
BigData Streaming Platform

#kafka #bigdata #hadoop #streaming #mllib

We have seen how Kafka has transformed the whole landscape of data platforms. On one side it provides smooth, transparent data management across the platform; on the other, it poses multiple challenges for data owners.

Let's first dig into some real-world challenges before delving into potential solutions and design approaches.

Unconventional data growth and the nightmare of schema management

Days were happier when data analysts only had to work through limited, structured data to run their reports and do some predictive analysis. However, with exponential data growth, mostly unstructured thanks to social media, those same analysts now have to comb through vast amounts of data to tap into different aspects of the market.

Legacy System migration

Love it or hate it, legacy systems are here to stay; you can't just switch off the mainframes. You have to live with them and, moreover, use their data in conjunction with new data models too.

Batch or Stream

Most organizations have not even finished understanding and reaping the benefits of Hadoop and other batch systems, and here it comes with a boom: the Kafka-based streaming solution.

Nightmare of Schema Management

As heterogeneous data sources keep growing, so do business requirements and demands. That leads to an obvious problem: schema management. The most common approaches to schema management are:

  • Schema evolution management, either by storing the schema alongside the data or by tagging it
  • A schema registry, another smart way to manage a changing schema
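The core job of either approach is deciding whether a new schema version can still read old records. Here is a minimal sketch of such a backward-compatibility check, assuming schemas are represented as simple field-name-to-type dicts (the names `v1`, `v2`, and the rule "added fields must carry defaults" are illustrative, not a specific registry's API):

```python
# Minimal sketch of a backward-compatibility check between two schema
# versions, each represented as a {field_name: type} dict. The rule here:
# existing fields must keep their types, and newly added fields must have
# defaults so that old records can still be read.

def is_backward_compatible(old_schema, new_schema, defaults=None):
    defaults = defaults or {}
    # Every field the old schema had must still exist with the same type.
    for name, ftype in old_schema.items():
        if new_schema.get(name) != ftype:
            return False
    # Any newly added field must have a default so old records still parse.
    added = set(new_schema) - set(old_schema)
    return all(name in defaults for name in added)

v1 = {"user_id": "long", "event": "string"}
v2 = {"user_id": "long", "event": "string", "region": "string"}

print(is_backward_compatible(v1, v2))                          # False: no default for "region"
print(is_backward_compatible(v1, v2, defaults={"region": "EU"}))  # True
```

A real schema registry enforces exactly this kind of rule centrally, rejecting incompatible schema versions at publish time instead of letting them break consumers downstream.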

Auto Schema detection/generation

Getting rid of the schema registry altogether could be the best solution. Rather than maintaining a registry, the schemas could be detected from the data itself, and as the schema evolves, those observations could be fed to a machine learning library to predict the schema of the incoming stream.
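The detection step described above can be sketched very simply: fold each incoming record's fields and inferred types into a running schema. This is a naive illustration (field names and the type mapping are made up for the example), not a production inference engine:

```python
# Naive sketch of schema detection from a stream of JSON records:
# each record's fields and inferred Python types are merged into a
# running {field_name: type_name} schema.

import json

TYPE_NAMES = {bool: "boolean", int: "long", float: "double", str: "string"}

def infer_schema(records, schema=None):
    schema = dict(schema or {})
    for raw in records:
        record = json.loads(raw)
        for name, value in record.items():
            # First type seen for a field wins in this simple version.
            schema.setdefault(name, TYPE_NAMES.get(type(value), "string"))
    return schema

batch = ['{"user_id": 42, "event": "click"}',
         '{"user_id": 7, "event": "view", "price": 9.99}']
print(infer_schema(batch))
# {'user_id': 'long', 'event': 'string', 'price': 'double'}
```

In a streaming setup the returned schema would be passed back in as the seed for the next micro-batch, so the schema grows as new fields appear in the data.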

Unified Serialization approach

Adopting Avro as the standard serialization protocol can remove a lot of interoperability friction between various open-source systems. All data, no matter the originating source, should be converted into Avro before landing in Kafka.
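For illustration, a canonical Avro record schema for such a pipeline might look like the following (the record name, namespace, and fields are hypothetical; the nullable-with-default pattern on `region` is what keeps the schema evolvable):

```json
{
  "type": "record",
  "name": "PlatformEvent",
  "namespace": "com.example.streaming",
  "fields": [
    {"name": "user_id", "type": "long"},
    {"name": "event",   "type": "string"},
    {"name": "region",  "type": ["null", "string"], "default": null}
  ]
}
```

Every producer, whatever its source format, would serialize into this record before writing to Kafka, so all downstream consumers share one contract.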

Kafka and Spark for ETLs

It is time to say goodbye to the earlier generation of ETL tools and embrace Spark for all heavy computation, with Kafka pushing the data across the platform.
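The transform stage of such a pipeline is ordinary parse-filter-aggregate logic. Here is a pure-Python sketch of what one micro-batch would do (field names and the per-user aggregation are illustrative); in production this same logic would live in a Spark Structured Streaming job consuming the records from a Kafka topic:

```python
# Pure-Python sketch of the transform step in a Kafka -> Spark ETL:
# parse incoming JSON messages, drop malformed or incomplete ones,
# and aggregate an amount per user key.

import json
from collections import defaultdict

def etl_batch(raw_messages):
    totals = defaultdict(float)
    for raw in raw_messages:
        try:
            event = json.loads(raw)
        except json.JSONDecodeError:
            continue  # drop malformed messages instead of failing the batch
        if "user_id" not in event:
            continue  # skip records missing the aggregation key
        totals[event["user_id"]] += event.get("amount", 0.0)
    return dict(totals)

messages = ['{"user_id": 1, "amount": 10.0}',
            '{"user_id": 2, "amount": 5.5}',
            '{"user_id": 1, "amount": 2.5}',
            'not-json']
print(etl_batch(messages))
# {1: 12.5, 2: 5.5}
```

The point of moving this to Spark is not the logic itself but that the same code scales out across partitions of the Kafka topic, with checkpointing and exactly-once semantics handled by the framework.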

DevOps is not optional

In my experience, people from a legacy-system background tend to believe that DevOps and automation are optional, a second step, or something to be done once application development is finished. A DevOps solution should be developed alongside the application design and should be treated as a software development project rather than ad hoc automation.


