Data lifecycle – phases and challenges

The end-to-end lifecycle of data involves multiple phases as represented below, with challenges at each phase. Big Data arrives as a haphazard, heterogeneous mix and needs to be managed in context. There are also questions of credibility, uncertainty and error when judging the relevance of data. Managing data therefore requires smart systems as well as better human collaboration and user interaction. Much of the debate on Big Data focuses on technology. However, setting up a sound basis for managing data analysis takes far more than technology. It does not mean throwing away existing structures, warehouses and analytics. Instead, it means building on existing capabilities in data quality, Master Data Management and data protection frameworks. Data management needs to be seen from a business perspective, by prioritizing business needs and taking realistic actions.

Data Acquisition & Data Warehousing

Data always has a source. It doesn’t come out of nowhere, and the sources are as vast and varied as the data itself; together they can produce up to 1 million terabytes of raw data every day. Data this enormous and dispersed is of little use unless it is filtered and compressed according to several criteria. The foremost challenge is to define these filter criteria so that no valuable information is lost. For instance, customer preference data can be sourced from the information customers share on key social media channels. But how does one reach customers who do not use social media and who might also be an important segment? What are the data sources for them?

Data reduction is a science that needs substantial research to establish an intelligent process that brings raw data down to a manageable size without losing the small but relevant pieces of information. This must happen in real time, because storing the data first and reducing it later would be expensive and arduous.
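The idea of reducing data as it arrives, rather than storing everything first, can be sketched as a streaming filter. This is a minimal Python sketch, not a production design: the record fields (`customer_id`, `event`, `value`), the `is_relevant` rule and the threshold are hypothetical placeholders for whatever filter criteria a business actually defines.

```python
from typing import Dict, Iterable, Iterator

# Hypothetical filter criteria; a real system would derive these from business needs.
RELEVANT_EVENTS = {"purchase", "complaint"}
MIN_VALUE = 10.0


def is_relevant(record: Dict) -> bool:
    """Keep only records that match the (assumed) filter criteria."""
    return record.get("event") in RELEVANT_EVENTS and record.get("value", 0.0) >= MIN_VALUE


def reduce_stream(records: Iterable[Dict]) -> Iterator[Dict]:
    """Filter and slim down records one at a time, without storing the raw stream."""
    for record in records:
        if is_relevant(record):
            # Keep only the fields needed downstream (a simple form of reduction).
            yield {
                "customer_id": record["customer_id"],
                "event": record["event"],
                "value": record["value"],
            }


raw = [
    {"customer_id": 1, "event": "purchase", "value": 25.0, "user_agent": "..."},
    {"customer_id": 2, "event": "page_view", "value": 0.0, "user_agent": "..."},
    {"customer_id": 3, "event": "complaint", "value": 12.5, "user_agent": "..."},
    {"customer_id": 4, "event": "purchase", "value": 3.0, "user_agent": "..."},
]
reduced = list(reduce_stream(raw))
```

Because `reduce_stream` is a generator, each record is examined once and discarded if it fails the criteria, so the raw stream never has to be held in full. The risk described above remains: a criterion that is too strict silently drops information that may matter later.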

An important part of building a robust Data Warehousing platform is the consolidation of data across various sources to create a good repository of master data, which will help in providing consistent information across the organization.
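Consolidating records from several systems into one master record per entity might look like the sketch below. The source names (`crm`, `billing`), the `customer_id` key and the "first trusted source wins" policy are assumptions made for illustration; real Master Data Management tools apply much richer matching and survivorship rules.

```python
from typing import Dict, List


def consolidate(sources: List[List[Dict]], key: str = "customer_id") -> Dict[object, Dict]:
    """Merge records from several sources into one master record per key.

    Later sources only fill in fields that are still missing, so the order of
    `sources` encodes which system is trusted most (an assumed policy).
    """
    master: Dict[object, Dict] = {}
    for source in sources:
        for record in source:
            entry = master.setdefault(record[key], {})
            for field, value in record.items():
                if value is not None and entry.get(field) is None:
                    entry[field] = value
    return master


crm = [{"customer_id": 1, "name": "A. Rao", "email": None}]
billing = [
    {"customer_id": 1, "name": "Rao, A.", "email": "rao@example.com"},
    {"customer_id": 2, "name": "B. Lee", "email": "lee@example.com"},
]

master_data = consolidate([crm, billing])
```

Here customer 1 keeps the name from the first source and gains the email address from the second, which gives every downstream consumer the same consolidated view.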

Data Extraction & Structuring

Data that has been collected, even after filtering, is not in a format ready for analysis. It has multiple modes of content, such as text, pictures and videos, and comes from multiple sources with different file formats. This calls for a Data Extraction Strategy that integrates data from diverse enterprise information repositories and transforms it into a consumable format.

Data broadly falls into two categories: structured and unstructured. Structured data is available in a preset format, such as row- and column-based databases. It is easy to enter, store and analyze. This type of data is mostly factual and transactional.

Unstructured data, on the other hand, is free-form, attitudinal and behavioral. It does not come in traditional formats. It is heterogeneous and variable, and comes in multiple formats such as text, documents, images and video. Unstructured data is growing very fast. A 2011 IDC study stated that 90 percent of all data in the next decade would be unstructured. From a business benefit perspective, however, true value and insights reside in this massive volume of unstructured data, which is rather difficult to tame and channel.

Extract-Transform-Load (ETL) is the process that covers the entire stage of getting data from the source into the target data warehouse in a proper, cleaned format. Several ETL tools are available; the principles for selecting the right one are the same as those for deciding the right course of a Big Data implementation.
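The three ETL steps can be illustrated with Python's standard `csv` and `sqlite3` modules. This is only a sketch under stated assumptions: the sample CSV, the `orders` table and the cleaning rules are hypothetical, and an in-memory SQLite database stands in for a real data warehouse.

```python
import csv
import io
import sqlite3

# Hypothetical source extract; in practice this would come from a file or an API.
SOURCE_CSV = """order_id,customer,amount
1, alice ,19.99
2,BOB,
3,carol,5.00
"""


def extract(text):
    """Extract: read raw rows from the CSV source."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: clean names, convert types, and drop rows with no amount."""
    cleaned = []
    for row in rows:
        amount = row["amount"].strip()
        if not amount:
            continue  # incomplete record, skipped in this sketch
        cleaned.append((int(row["order_id"]), row["customer"].strip().title(), float(amount)))
    return cleaned


def load(rows, conn):
    """Load: write the cleaned rows into the target table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders "
        "(order_id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")  # stand-in for the target data warehouse
load(transform(extract(SOURCE_CSV)), conn)
loaded = conn.execute(
    "SELECT order_id, customer, amount FROM orders ORDER BY order_id"
).fetchall()
```

Keeping the three steps as separate functions mirrors how ETL tools separate connectors, transformation rules and target loaders, which makes each step easier to test and replace on its own.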

Data Modeling & Data Analysis

Once a proper mechanism for creating a data repository is established, the rather complex procedure of Data Analysis sets in. Big Data Analytics is one of the most crucial aspects of the data industry and one with the most room for development. Data analysis is not only about locating, identifying, understanding and presenting data. Industries demand large-scale analysis that is entirely automated, which requires processing different data structures and semantics in a format that is both understandable and machine-readable.

Technological advancements in this direction are making this kind of analytics of unstructured data possible and cost-effective. A distributed grid of computing resources that uses an easily scalable architecture, a processing framework and non-relational or parallel-relational databases is redefining data management and governance. Many databases today have shifted to non-relational models to meet the complexity of unstructured data. NoSQL database solutions can work without fixed table schemas, avoid join operations and scale horizontally.
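The three NoSQL properties mentioned above (no fixed schema, no joins, horizontal scaling) can be illustrated with a toy in-memory store. The class name `ShardedDocumentStore` and the simple hash-modulo placement are hypothetical and chosen for brevity; this does not describe how any particular NoSQL product works internally.

```python
import hashlib
from typing import Dict, List, Optional


class ShardedDocumentStore:
    """Toy key-document store: no fixed schema, no joins, hash-based sharding."""

    def __init__(self, shard_count: int) -> None:
        # Each dict stands in for a separate node in a cluster.
        self.shards: List[Dict[str, dict]] = [{} for _ in range(shard_count)]

    def _shard_for(self, key: str) -> Dict[str, dict]:
        # A stable hash keeps a key on the same shard across runs.
        digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
        return self.shards[int(digest, 16) % len(self.shards)]

    def put(self, key: str, document: dict) -> None:
        self._shard_for(key)[key] = document

    def get(self, key: str) -> Optional[dict]:
        return self._shard_for(key).get(key)


store = ShardedDocumentStore(shard_count=3)
# Documents in the same store can have different fields (no table schema).
store.put("user:1", {"name": "Asha", "tags": ["premium"]})
store.put("video:9", {"title": "Demo", "duration_s": 42, "codec": "h264"})
```

Each lookup touches exactly one shard and needs no join, which is what lets capacity grow by adding nodes. With plain modulo placement, changing the shard count remaps most keys, so real systems often use consistent hashing or range partitioning instead.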

Data science has emerged as a sophisticated discipline that draws on statistical techniques, mathematical modelling and visualization. It encompasses:

- Data manipulation & analytic applications addressing automation, application development and testing
- Data modelling covering key areas like experimental design, graphical models and path analysis
- Statistics and machine learning through classical and spatial statistics, simulation and optimization techniques
- Text data analysis through pattern analysis, text mining and NLP by developing and integrating solutions or deploying packaged solutions
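As a small illustration of the text-mining item in the list above, the following sketch counts the most frequent terms in a handful of free-text reviews. The sample reviews and the tiny stop-word list are made up for the example; real text analysis would typically use a dedicated NLP library and a curated vocabulary.

```python
import re
from collections import Counter

# Hypothetical stop-word list; real projects usually use a larger, curated one.
STOP_WORDS = {"the", "is", "a", "and", "but", "was", "to", "of"}


def top_terms(documents, n=3):
    """Return the n most frequent non-stop-word terms across the documents."""
    counts = Counter()
    for text in documents:
        tokens = re.findall(r"[a-z']+", text.lower())
        counts.update(t for t in tokens if t not in STOP_WORDS)
    return counts.most_common(n)


reviews = [
    "The delivery was late and the packaging was damaged.",
    "Late delivery, but the support team was helpful.",
    "Support resolved the late delivery issue.",
]
terms = top_terms(reviews, n=2)
```

Even this simple frequency count turns unstructured text into a structured signal (here, that "late" and "delivery" dominate the feedback), which is the basic step that pattern analysis and NLP techniques build on.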

Data Interpretation

The most important aspect of success in Big Data Analytics is the presentation of analyzed data in a user-friendly, reusable and intelligible format. The complexity of data adds to the complexity of its presentation as well. In some cases, simple tabular representations are not sufficient and the data requires further explanation, historical incidents and so on. Sometimes the analytics tool is also expected to provide predictive or statistical analysis of the data to support decision making. In other words, the final phase, or culmination, of the entire Big Data exercise is Data Interpretation or Data Visualization.
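As one simple instance of the predictive analysis mentioned above, a straight-line trend can be fitted to historical figures and extended one period ahead. The monthly sales numbers are invented for the example, and ordinary least squares is only one of many possible techniques; it is used here because it needs nothing beyond the standard library.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical monthly sales figures (month number, units sold).
months = [1, 2, 3, 4]
sales = [10.0, 12.0, 14.0, 16.0]

slope, intercept = fit_line(months, sales)
forecast_month_5 = slope * 5 + intercept
```

A forecast like this is only as good as the assumption that the past trend continues, which is why such numbers are usually presented alongside the historical data rather than on their own.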

Visualization of data is a key component of Business Intelligence. Here’s a snapshot of the Visualization framework that assists in Business Intelligence.

Interactive Data Visualization is moving from static graphs and spreadsheets to mobile devices and real-time interaction with data. The future of data interpretation is becoming more agile and responsive.