BigData Streaming Platform
#kafka #bigdata #hadoop #streaming #mllib
We have seen how Kafka has transformed the whole landscape of data platforms. On one side it provides lucid and smooth data management across the platform; on the other side it poses multiple challenges for data owners.
Let's first dig into some of the real-world challenges before delving into potential solutions and design approaches.
Unconventional data growth
Days were happier when data analysts had to work through only limited, structured data to run their reports and do some predictive analysis. However, with exponential data growth, mostly unstructured thanks to social media, the same people now have to sift through vast amounts of data to tap into different aspects of the market.
Legacy System migration
Love it or hate it, legacy systems are going to stay; you can't just switch off the mainframes. You have to live with them and, moreover, use their data in conjunction with new data models too.
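To make "using mainframe data in conjunction with new models" concrete, the first step is usually decoding EBCDIC fixed-width records into structures the new platform understands. A minimal sketch, assuming a hypothetical record layout (the field names and widths are invented for illustration) and the cp037 (EBCDIC-US) codec that ships in Python's standard library:

```python
# Sketch: decode a fixed-width EBCDIC mainframe record into a dict.
# The record layout below is a hypothetical example, not a real copybook.
LAYOUT = [("name", 10), ("account", 8), ("balance", 9)]

def parse_record(raw: bytes) -> dict:
    """Decode an EBCDIC (cp037) fixed-width record into named string fields."""
    text = raw.decode("cp037")  # EBCDIC-US codec from the stdlib
    fields, pos = {}, 0
    for name, width in LAYOUT:
        fields[name] = text[pos:pos + width].strip()
        pos += width
    return fields

# Simulate a mainframe record by encoding sample text to EBCDIC.
sample = "JANE DOE  00012345000123.45".encode("cp037")
print(parse_record(sample))
# {'name': 'JANE DOE', 'account': '00012345', 'balance': '000123.45'}
```

Once parsed, such records can be serialized into the platform's common format and published to Kafka like any other source.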
Batch or Stream
Most organizations have not even finished understanding and reaping the benefits of Hadoop and other batch systems, and then, boom, here comes the Kafka-based streaming solution.
Nightmare of Schema Management
As heterogeneous data sources keep growing, so do the requirements and demands of the business. That leads to an obvious problem: schema management. The most common approaches to schema management are:
- Managing schema evolution by storing the schema with the data or tagging each record with a schema version
- Using a schema registry, another smart way of managing a changing schema
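To make the tagging approach concrete, here is a minimal sketch in Python of wrapping each message in an envelope that carries its schema version, so consumers can validate the payload against the right schema at read time. The version table and field names are invented for illustration:

```python
import json

# Hypothetical schema versions, keyed by version number (illustrative only).
SCHEMAS = {
    1: ["user_id", "event"],
    2: ["user_id", "event", "timestamp"],  # evolved: a field was added
}

def tag(record: dict, version: int) -> bytes:
    """Wrap a record in an envelope carrying its schema version."""
    return json.dumps({"schema_version": version, "payload": record}).encode()

def read(message: bytes) -> dict:
    """Validate the payload against the schema its envelope declares."""
    envelope = json.loads(message)
    expected = SCHEMAS[envelope["schema_version"]]
    missing = [f for f in expected if f not in envelope["payload"]]
    if missing:
        raise ValueError(f"missing fields {missing}")
    return envelope["payload"]

msg = tag({"user_id": 42, "event": "login", "timestamp": 1700000000}, version=2)
print(read(msg))  # {'user_id': 42, 'event': 'login', 'timestamp': 1700000000}
```

A schema registry replaces the inline version table with a central service, so the envelope only needs to carry an ID rather than the schema itself.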
Auto Schema detection/generation
Getting rid of the schema registry altogether could be the best solution. Rather than maintaining a registry, the schema could be detected from the data itself, and as the schema evolves, that history could be fed to a machine learning library to predict the schema of the incoming stream.
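The detection step can be sketched without any ML at all: infer field types from each record and merge them into a running schema as the stream evolves. This is a minimal illustration in plain Python (the record fields are invented); the predictive part would sit on top of the schema history this produces:

```python
def infer_schema(record: dict) -> dict:
    """Map each field in a record to the name of its JSON/Python type."""
    return {field: type(value).__name__ for field, value in record.items()}

def merge(schema: dict, record: dict) -> dict:
    """Evolve a running schema as new fields appear in the stream."""
    merged = dict(schema)
    merged.update(infer_schema(record))
    return merged

schema = {}
for rec in [{"user": "a", "clicks": 3},
            {"user": "b", "clicks": 5, "referrer": "ad"}]:  # new field appears
    schema = merge(schema, rec)
print(schema)  # {'user': 'str', 'clicks': 'int', 'referrer': 'str'}
```

Each merged snapshot becomes a data point in the schema's evolution, which is exactly the history a learning component would train on.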
Unified Serialization approach
Adopting Avro as the standard serialization protocol can remove a lot of interoperability friction between various open-source systems. All data, no matter the originating source, should be converted into Avro before landing in Kafka.
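In practice, Avro messages on Kafka are usually framed so consumers know which schema encoded them. The sketch below mirrors the common wire format popularized by Confluent's schema registry (a magic byte, a 4-byte big-endian schema ID, then the Avro payload); the Avro encoding itself is stubbed out here, since producing it needs an Avro library such as fastavro:

```python
import struct

MAGIC_BYTE = 0  # framing convention popularized by Confluent's wire format

def frame(schema_id: int, avro_payload: bytes) -> bytes:
    """Prefix an Avro-encoded payload with magic byte + 4-byte schema ID."""
    return struct.pack(">bI", MAGIC_BYTE, schema_id) + avro_payload

def unframe(message: bytes) -> tuple:
    """Split a framed Kafka message back into (schema_id, avro_payload)."""
    magic, schema_id = struct.unpack(">bI", message[:5])
    if magic != MAGIC_BYTE:
        raise ValueError("not an Avro-framed message")
    return schema_id, message[5:]

# The payload bytes would come from an Avro encoder; stubbed for the sketch.
framed = frame(7, b"\x02\x06foo")
print(unframe(framed))  # (7, b'\x02\x06foo')
```

With every source emitting this one format, downstream consumers need exactly one deserialization path regardless of where the data originated.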
Kafka and Spark for ETLs
It is time to say goodbye to legacy ETL tools and embrace Spark for all heavy computation, with Kafka pushing the data across the platform.
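The shape of such a job is extract from a Kafka topic, transform in Spark, and load into a sink. The sketch below mimics that flow with plain Python generators rather than an actual Spark job, just to show the pattern; the topic contents and record fields are invented for illustration:

```python
import json

def extract(kafka_messages):
    """Extract: parse raw Kafka message values (here, JSON bytes)."""
    for value in kafka_messages:
        yield json.loads(value)

def transform(records):
    """Transform: the heavy lifting Spark would do, e.g. filter + enrich."""
    for rec in records:
        if rec["amount"] > 0:                    # drop bad rows
            rec["amount_cents"] = rec["amount"] * 100
            yield rec

def load(records):
    """Load: a real job would write to a sink topic or the data lake."""
    return list(records)

raw = [b'{"order": 1, "amount": 9.5}', b'{"order": 2, "amount": -1}']
result = load(transform(extract(raw)))
print(result)  # [{'order': 1, 'amount': 9.5, 'amount_cents': 950.0}]
```

In Spark the same stages map onto reading the Kafka source as a streaming DataFrame, applying the transformations, and writing the stream out, with Kafka again carrying the result to the next consumer.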
DevOps is not optional
In my experience, people from a legacy-system background tend to believe that DevOps and automation are optional, a second step, or something to be done once application development is finished. The DevOps solution should be developed alongside the application design and treated as a software development project in its own right rather than ad-hoc automation.