EMCDSA - Data Science & Big Data Analytics

Chapter 1 - Introduction to Big Data Analytics

Three attributes stand out as defining Big Data characteristics :

  1. Huge volume of data : Rather than thousands or millions of rows, Big Data can be billions of rows and millions of columns.
  2. Complexity of data types and structures : Big Data reflects the variety of new data sources, formats, and structures, including digital traces being left on the web and other digital repositories for subsequent analysis.
  3. Speed of new data creation and growth : Big Data can describe high-velocity data, with rapid data ingestion and near-real-time analysis.

Due to its size or structure, Big Data cannot be efficiently analyzed using only traditional databases or methods. Big Data problems require new tools and technologies to store, manage and realize the business benefit.

As per McKinsey : Big Data is data whose scale, distribution, diversity, and/or timeliness require the use of new technical architectures and analytics to enable insights that unlock new sources of business value.

Sources of Big Data: 

  1. Mobile Sensors - IoT
  2. Social Media
  3. Video Surveillance
  4. Video Rendering
  5. Smart Grids
  6. Geophysical Exploration
  7. Medical Imaging
  8. Gene Sequencing

Data Structures

80-90% of future data growth is expected to come from non-structured data types.

  1. Structured Data : Data containing a defined data type, format, and structure (for example, transaction data, online analytical processing (OLAP) data cubes, traditional RDBMS, CSV files, and even simple spreadsheets).
  2. Semi-structured Data : Textual data files with a discernible pattern that enables parsing (such as Extensible Markup Language (XML) data files that are self-describing and defined by an XML schema).
  3. Quasi-structured Data : Textual data with erratic data formats that can be formatted with effort, tools, and time (for instance, web click-stream data that may contain inconsistencies in data values and formats).
  4. Unstructured Data : Data that has no inherent structure, which may include text documents, PDFs, images and video.
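
The four data structure types above differ mainly in how much work is needed before the data can be analyzed. The following is a minimal Python sketch of that difference, not a definitive approach. All sample values, field names, and the `extract_path` helper are hypothetical and exist only for illustration.

```python
import csv
import io
import re
import xml.etree.ElementTree as ET

# Structured: fixed columns, so a standard CSV reader is enough.
# (Sample rows are made up for illustration.)
csv_text = "order_id,amount\n1001,25.50\n1002,14.00\n"
orders = list(csv.DictReader(io.StringIO(csv_text)))

# Semi-structured: self-describing tags, handled by an XML parser.
xml_text = "<orders><order id='1001'><amount>25.50</amount></order></orders>"
root = ET.fromstring(xml_text)
xml_amounts = {o.get("id"): float(o.findtext("amount")) for o in root.iter("order")}

# Quasi-structured: click-stream lines with inconsistent formats need
# custom patterns and cleanup effort before they can be analyzed.
click_lines = [
    "2024-01-05 10:02:11 GET /products?id=7",
    "05/01/2024 10:02 get /Products?id=7&ref=mail",
]
path_pattern = re.compile(r"\b(?:GET|POST)\s+(/[^\s?]*)", re.IGNORECASE)


def extract_path(line):
    """Return the lowercased request path, or None if the line does not match."""
    match = path_pattern.search(line)
    return match.group(1).lower() if match else None


paths = [extract_path(line) for line in click_lines]

# Unstructured: free text has no fields to parse; here we only count words.
note = "Customer called about a late delivery and asked for a refund."
word_count = len(note.split())
```

The structured and semi-structured samples are read with standard-library parsers alone. The quasi-structured lines need a hand-written pattern and normalization, and the unstructured text offers nothing to parse beyond the raw words.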

Analyst Perspective on Data Repositories:

Database administrator training is not required to create spreadsheets. Spreadsheets are easy to share, and end users have control over the logic involved. However, their proliferation can result in "many versions of the truth". In other words, it can be challenging to determine if a particular user has the most relevant version of a spreadsheet, with the most current data and logic in it.

With the proliferation of data islands (or spreadmarts), the need to centralize the data is more pressing than ever.

Enterprise Data Warehouses (EDWs) are critical for reporting and BI tasks and solve many of the problems that proliferating spreadsheets introduce, such as which of multiple versions of a spreadsheet is correct. EDWs, together with a good BI strategy, provide direct data feeds from sources that are centrally managed, backed up, and secured.