The life-cycle of every analytics project consists of four key elements: Data Acquisition, Processing, Surfacing, and Action. Each element plays an essential role in the value chain of analytics.
Element #1: Data Acquisition
Performing data acquisition effectively requires a wide range of systems and technology knowledge, although the ideal skill set for an analytics professional depends heavily on the domain.
The four different types of data sources are Clickstream, Databases, APIs, and Logs, each of which has its own way of handling data collection.
Clickstream
Clickstream data is typically obtained through integration with tools like Google Analytics or Adobe Analytics. The role of clickstream data is to provide an understanding of user behaviour on the sites or applications being measured.
Analytics practitioners trying to acquire new clickstream data normally define new events or attributes to be collected within the tag management system.
What you need to know:
Tag management systems – Useful for defining what types of events or attributes need to be ingested by the system
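As an illustration, a custom clickstream event might be assembled as a simple payload before a tag management system picks it up (the field names and event name here are hypothetical, not any vendor's schema):

```python
import json
import time

def build_click_event(user_id, event_name, attributes):
    """Assemble a clickstream event payload of the kind a tag
    management system would be configured to collect."""
    return {
        "user_id": user_id,
        "event": event_name,
        "timestamp": int(time.time()),  # when the interaction happened
        "attributes": attributes,       # event-specific attributes
    }

event = build_click_event("u-123", "add_to_cart", {"sku": "A1", "price": 19.99})
print(json.dumps(event))
```

Defining a new event then amounts to agreeing on a name and its attributes, and configuring the tag manager to collect it.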
Databases
Databases are generally the source of internal system information that needs to be persisted. They frequently contain transactional information, relationships between various objects, and profile information. The traditional method for extracting data out of databases is SQL.
Internal servers store information within a database as part of their processes, and development teams need to make sure the right information is stored within them.
What you need to know:
SQL – In order to extract data from a database
ETL tools, such as Airflow – For more advanced operations
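A minimal sketch of SQL-based extraction, using an in-memory SQLite database to stand in for an internal system's store (the schema and rows are invented for illustration):

```python
import sqlite3

# Hypothetical in-memory database standing in for an internal system's store.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "alice", 40.0), (2, "bob", 25.5), (3, "alice", 10.0)],
)

# The traditional extraction method: plain SQL.
rows = conn.execute(
    "SELECT customer, SUM(amount) AS total FROM orders "
    "GROUP BY customer ORDER BY customer"
).fetchall()
print(rows)  # [('alice', 50.0), ('bob', 25.5)]
```

An ETL tool like Airflow would schedule and orchestrate queries like this one, but the extraction itself stays plain SQL.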
APIs
APIs are usually the way to acquire data when dealing with external systems. Prime examples include operating a webshop on Amazon, getting metrics related to your ad spend on Facebook, or retrieving information from any other external provider.
In this scenario, calls to external servers need to be made: a worker needs to be built that calls the different API endpoints and structures the data for ingestion. Thereafter, the data is placed either in a database, typically a data mart, or onto a big data file system.
What you need to know:
How to interact with APIs, including authorization and both SOAP and REST styles. Programming knowledge is also needed to interface with this type of data source.
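As a sketch of the "structure the data for ingestion" step, the worker below parses a sample JSON payload shaped like a hypothetical ad-platform metrics endpoint into flat rows, rather than calling a live API (the payload shape and field names are assumptions):

```python
import json

def structure_response(payload: str) -> list:
    """Flatten a hypothetical API response into rows ready for
    ingestion into a data mart or big data file system."""
    body = json.loads(payload)
    return [
        {"campaign": item["name"], "spend": item["metrics"]["spend"]}
        for item in body["data"]
    ]

# Sample payload shaped like an ad-platform metrics endpoint (illustrative only).
sample = '{"data": [{"name": "summer_sale", "metrics": {"spend": 120.5}}]}'
rows = structure_response(sample)
```

In a real worker, the payload would come from an authenticated HTTP call, with pagination and retries around it.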
Logs
Logs are another source of incredibly valuable data, typically gathered within internal systems to store and analyze event data. They are normally handled as data streams, with terms such as "Data Firehose", and stored in big data platforms.
Log data is collected either as part of the process of putting events onto an event bus from an API, or through an internal log collection process. Once on an event bus, events can be pushed to a big data platform, or potentially to a database, through a data sink connector.
What you need to know:
Spark and Kafka – The typical frameworks for operating on log data.
Hadoop platform – For pushing log data to long-term storage or offline processing.
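The event-bus pattern can be sketched in miniature, with a plain in-process queue standing in for Kafka and a list standing in for the big data platform (both stand-ins are assumptions for illustration):

```python
from queue import Queue

# A minimal stand-in for an event bus such as Kafka: producers put log
# events on the bus, a sink connector drains them into storage.
bus = Queue()

def produce(event: dict) -> None:
    """A producer places a log event onto the bus."""
    bus.put(event)

def sink_to_store(store: list) -> None:
    """A sink connector drains the bus into long-term storage
    (here, just a list in place of a big data platform)."""
    while not bus.empty():
        store.append(bus.get())

produce({"level": "INFO", "msg": "user login"})
produce({"level": "ERROR", "msg": "payment failed"})

storage = []
sink_to_store(storage)
```

A real deployment replaces the queue with Kafka topics and the sink with a connector writing to Hadoop or a database.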
Element #2: Processing
This crucial element is responsible for transforming raw data and refining it into useful data. It consists of the different sub-tasks that need to be performed on data-sets: cleansing, combining and structuring them, handling aggregation, and conducting any additional advanced analytics processing on top of the data.
Data Cleansing
Data cleansing is a task everybody working in the field of analytics must do. It requires a deep dive straight into the data, looking for possible gaps or anomalies, and structuring the data in a way that can handle most of these problems.
Data cleansing covers a few types of issues: missing values, text normalization, categorization, ID matching, deduplication, and mis-attribution. Identifying and cleaning each of these cases is a time-consuming effort that needs to be performed to a certain level within each data source.
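A small sketch of a cleansing pass covering three of these cases, missing values, text normalization, and deduplication, over invented records:

```python
def cleanse(records):
    """Illustrative cleansing pass: fill missing values, normalize
    text, and deduplicate on a record's id."""
    seen = set()
    out = []
    for rec in records:
        if rec["id"] in seen:  # deduplication: skip repeated ids
            continue
        seen.add(rec["id"])
        rec = dict(rec)
        # Missing-value handling plus text normalization.
        rec["name"] = (rec.get("name") or "unknown").strip().lower()
        out.append(rec)
    return out

raw = [
    {"id": 1, "name": " Alice "},
    {"id": 1, "name": "ALICE"},   # duplicate id
    {"id": 2, "name": None},      # missing value
]
clean = cleanse(raw)
```

Categorization, ID matching, and mis-attribution follow the same shape but usually need reference data or business rules.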
Merging and Denormalizing
This step combines different data-sets into more actionable, easily queryable ones. The goal of merging and denormalization is to create a data-set, available for further uses, that contains the information needed for further processing in an easily accessible way.
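A minimal example of denormalization: attaching customer profile fields onto each order row so the resulting data-set is queryable without further joins (the data is illustrative):

```python
customers = {1: {"name": "alice", "segment": "premium"}}
orders = [{"order_id": 10, "customer_id": 1, "amount": 40.0}]

# Denormalize: merge the customer profile into each order row so the
# result can be queried directly, without joining at read time.
denormalized = [
    {**order, **customers.get(order["customer_id"], {})}
    for order in orders
]
```

The trade-off is duplication of profile fields across rows in exchange for simpler downstream queries.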
Aggregations
There are different levels of aggregation required for different purposes.
Temporary Processing: Fundamentally consists of materialized subqueries that can be used to provide additional information downstream in an efficient manner.
Full Data Extracts: Presenting key metrics for reporting or slice and dice purposes.
Customer Level Aggregations: Used for analysis and additional processing.
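The customer-level aggregation case can be sketched as a simple group-by over an invented orders data-set:

```python
from collections import defaultdict

orders = [
    {"customer": "alice", "amount": 40.0},
    {"customer": "alice", "amount": 10.0},
    {"customer": "bob", "amount": 25.5},
]

# Customer-level aggregation: order count and total spend per customer.
totals = defaultdict(lambda: {"orders": 0, "spend": 0.0})
for o in orders:
    agg = totals[o["customer"]]
    agg["orders"] += 1
    agg["spend"] += o["amount"]
```

At scale this same group-by would run as SQL or a Spark job, but the shape of the output, one row per customer, is the same.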
Advanced Analytics Processing
Various advanced analytics and machine learning methods can be applied on top of the aggregates that have been computed. The purpose of advanced analytics is to create derived data that has predictive power and a purpose in decision making.
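As one toy example of predictive processing on top of an aggregate, a straight-line trend can be fitted to a customer's monthly spend (pure-Python least squares; the data is invented):

```python
def fit_trend(values):
    """Ordinary least-squares fit of a straight line to a sequence,
    a minimal example of adding predictive power on top of aggregates."""
    n = len(values)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(values) / n
    slope = (
        sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, values))
        / sum((x - mean_x) ** 2 for x in xs)
    )
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Monthly spend aggregate for a hypothetical customer.
slope, intercept = fit_trend([10.0, 12.0, 14.0, 16.0])
forecast = slope * 4 + intercept  # predict month 5
```

Real pipelines would reach for a modelling library, but the principle is the same: the model consumes aggregates and emits new, derived columns.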
Element #3: Surfacing
Informative data must be surfaced in an adequate way to be meaningful. There is a multitude of data surfacing methods, ranging from making data available in a dashboard or standard report, to an analysis deck, an OLAP cube, or simply opening up the data as a service.
Dashboard and standard reports
This tends to be the first way to share processed information. It ordinarily sits with the performance-measurement part of an analytics professional's role.
Analysis decks
Analysis decks are another way to share the insights gathered during the different phases of the analytical process. These reports tend to be shared as a PowerPoint, a Word document, or a plain Jupyter notebook.
OLAP cubes
OLAP cubes allow for slice-and-dice processing of data. They are a powerful tool for highly dimensional data-sets, and open source tools such as Druid enable this type of processing.
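Slice-and-dice can be illustrated over a tiny invented fact table: slicing filters on one dimension, dicing regroups by another:

```python
from collections import Counter

facts = [
    {"country": "US", "device": "mobile", "sales": 3},
    {"country": "US", "device": "desktop", "sales": 5},
    {"country": "DE", "device": "mobile", "sales": 2},
]

# Slice: filter the cube on one dimension (device = mobile).
# Dice: regroup the remaining rows by another dimension (country).
mobile_by_country = Counter()
for f in facts:
    if f["device"] == "mobile":
        mobile_by_country[f["country"]] += f["sales"]
```

An OLAP engine like Druid pre-indexes the dimensions so queries of this shape stay fast even across billions of rows.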
Data as a service
While the other channels of data surfacing are focused on delivering data and information directly to people, this one is intended for machines: integrating aggregates and predictions into production systems, be it by offering an API, storing them in database tables, or surfacing file exports to be used directly within products or processes.
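A sketch of the database-table variant, persisting invented model scores into a SQLite table that a downstream production process could query like any other (table and column names are illustrative):

```python
import sqlite3

# Surfacing for machines: persist model outputs into a table that
# production systems can read directly.
predictions = [("u-1", 0.82), ("u-2", 0.17)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE churn_scores (user_id TEXT PRIMARY KEY, score REAL)")
conn.executemany("INSERT INTO churn_scores VALUES (?, ?)", predictions)

# A downstream product or process queries the scores like any other table.
high_risk = conn.execute(
    "SELECT user_id FROM churn_scores WHERE score > 0.5"
).fetchall()
```

Exposing the same table behind an HTTP endpoint would turn this into a data-as-a-service API.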
Element #4: Action
In certain scenarios, analytics is divided into three sub-domains: Descriptive, Predictive, and Prescriptive Analytics. This separation is fairly restrictive; experts argue that for analytics to be useful it should be prescriptive, even while relying on statistical or modelling techniques that are, for instance, descriptive or predictive.
Analytics without subsequent action would be considered just research. There has been a growing discussion regarding analytics and the need to provide actionable insights. From a critical point of view, the real key to analytics is not actionable insights but the conversion of these insights into effective actions.