Stream Stack

Stream Stack lets you move your data to and from the cloud seamlessly.

Stream Stack: Solving Data Management Bottlenecks

#kafka #bigdata #hadoop #streaming #mllib

We have seen how Kafka has transformed the whole landscape of data platforms. On one hand, it provides lucid and smooth data management across the platform; on the other, it poses multiple challenges for data owners.

Let's first dig into some of the real-world challenges before delving into potential solutions and design approaches.

Unconventional data growth and the nightmare of schema management

Days were happier when data analysts had to go through only limited, structured data to run their reports and do some predictive analysis. However, with exponential data growth, mostly unstructured thanks to social media, the same people now have to run through vast amounts of data to tap into different aspects of the market.

Legacy System migration

Love it or hate it, legacy systems are going to stay; you can't just switch off the mainframes. You have to live with them and, moreover, use their data in conjunction with new data models too.

Batch or Stream

Most organizations have not even finished understanding and reaping the benefits of Hadoop and other batch systems, and here it comes, boom: Kafka-based streaming solutions.

Nightmare of Schema Management

As heterogeneous data sources keep growing, so do business requirements and demands. That leads to an obvious problem: schema management. The most common approaches to schema management are (see the sketch after this list):

  • Managing schema evolution by storing the schema alongside the data or tagging each record with a schema version
  • Using a schema registry, another smart way of managing a changing schema
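As a rough illustration of the tagging approach, here is a minimal Python sketch. The `SCHEMAS` map, `tag_record`, and `read_record` are hypothetical names used only for this example; in practice the version-to-schema lookup would live in a registry service or a versioned store rather than in memory.

```python
import json

# Hypothetical in-memory map of known schema versions; in a real system this
# lookup would be backed by a schema registry or a versioned schema store.
SCHEMAS = {
    1: {"fields": ["user_id", "event"]},
    2: {"fields": ["user_id", "event", "region"]},
}

def tag_record(record: dict, schema_version: int) -> bytes:
    """Wrap a record in an envelope that carries its schema version."""
    envelope = {"schema_version": schema_version, "payload": record}
    return json.dumps(envelope).encode("utf-8")

def read_record(raw: bytes) -> tuple[dict, dict]:
    """Resolve the schema for an incoming record from its version tag."""
    envelope = json.loads(raw)
    schema = SCHEMAS[envelope["schema_version"]]
    return schema, envelope["payload"]

# Produce with one schema version, then consume with the resolved schema.
msg = tag_record({"user_id": 42, "event": "click", "region": "EU"}, schema_version=2)
schema, payload = read_record(msg)
print(schema["fields"], payload)
```

Because every record carries its version, consumers can keep reading old and new records side by side while the schema evolves.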

Auto Schema detection/generation

Getting rid of the schema registry altogether could be the best solution. Rather than having a registry, the schema could be detected from the data itself, and as the schema evolves, it could be fed to a machine learning library to predict the schema based on the incoming stream.
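A minimal sketch of the detection step, assuming the incoming stream carries JSON records: scan a sample of messages and merge the observed field names and types into a schema. The function name `infer_schema` and the sample data are illustrative only, and the machine-learning prediction step the article mentions is not shown here.

```python
import json

def infer_schema(records: list[bytes]) -> dict:
    """Infer a field -> type mapping by scanning a sample of incoming records."""
    schema: dict[str, set[str]] = {}
    for raw in records:
        for field, value in json.loads(raw).items():
            schema.setdefault(field, set()).add(type(value).__name__)
    # A field seen with several types is kept as a union of those types.
    return {field: sorted(types) for field, types in schema.items()}

sample = [
    b'{"user_id": 1, "event": "click"}',
    b'{"user_id": 2, "event": "view", "region": "EU"}',
]
print(infer_schema(sample))
# {'event': ['str'], 'region': ['str'], 'user_id': ['int']}
```

Re-running this over sliding windows of the stream is one way to notice that a new field such as `region` has started appearing.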

Unified Serialization approach

Having Avro as a standard serialization protocol can get rid of a lot of interoperability issues between various open-source systems. All data, no matter the originating source, should be converted into Avro before landing in Kafka.
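A minimal sketch of that conversion step in Python, assuming the `fastavro` and `confluent-kafka` packages are available; the `user_event` schema, topic name, and broker address are made up for the example.

```python
import io
from fastavro import parse_schema, schemaless_writer  # assumes fastavro is installed
from confluent_kafka import Producer                   # assumes confluent-kafka is installed

# Hypothetical Avro schema for an example "user_event" record.
SCHEMA = parse_schema({
    "type": "record",
    "name": "user_event",
    "fields": [
        {"name": "user_id", "type": "long"},
        {"name": "event", "type": "string"},
    ],
})

def to_avro(record: dict) -> bytes:
    """Serialize a record to Avro binary before it lands in Kafka."""
    buf = io.BytesIO()
    schemaless_writer(buf, SCHEMA, record)
    return buf.getvalue()

producer = Producer({"bootstrap.servers": "localhost:9092"})
producer.produce("events", value=to_avro({"user_id": 42, "event": "click"}))
producer.flush()
```

The same pattern applies regardless of the originating source: normalize to Avro at the edge, so every downstream consumer speaks one serialization format.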

Kafka and Spark for ETLs

It is time to say goodbye to earlier ETL tools and embrace Spark for all heavy computation, with Kafka pushing the data across the platform.
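A rough sketch of that pattern with Spark Structured Streaming reading from and writing back to Kafka. It assumes PySpark with the Kafka connector package on the classpath; the topic names, event schema, and checkpoint path are placeholders.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StructType, StructField, StringType, LongType

spark = SparkSession.builder.appName("kafka-etl-sketch").getOrCreate()

# Hypothetical schema of the JSON events arriving on the "events" topic.
event_schema = StructType([
    StructField("user_id", LongType()),
    StructField("event", StringType()),
])

# Extract: read the raw stream from Kafka and decode the payload.
events = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "events")
    .load()
    .select(from_json(col("value").cast("string"), event_schema).alias("e"))
    .select("e.*")
    .filter(col("event") == "click")  # Transform: keep only click events.
)

# Load: push the transformed stream back onto Kafka for downstream consumers.
query = (
    events.selectExpr("CAST(user_id AS STRING) AS key", "to_json(struct(*)) AS value")
    .writeStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("topic", "clicks")
    .option("checkpointLocation", "/tmp/checkpoints/clicks")
    .start()
)
query.awaitTermination()
```

Spark handles the heavy computation and fault tolerance, while Kafka remains the transport that moves data between stages of the platform.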

DevOps is not optional

In my experience, people from a legacy system background tend to believe that DevOps and automation are optional, a second step, or something to be done once application development is finished. The DevOps solution should be developed alongside the application design and treated as a software development project rather than ad-hoc automation.