Tuesday, March 1, 2016

Structured and Unstructured Data

Data volumes are growing rapidly, and this growth gave rise to big data. However, having so much data available does not guarantee that it arrives in a form suitable for analysis. Data generally comes in two forms:

Unstructured Data
This generally refers to any data that is not stored in a traditional table format and is raw and unorganized. It exists in many different forms, which makes it difficult to identify attributes and derive information from it. Some sources of unstructured data are:
  • Word Processing Files
  • PDF files
  • Emails
  • Digital Images
  • Audio
  • Video
  • Social Media Posts

These files are the main means of communicating data in most businesses. They contain data, but their content has to be broken down and categorized before it can be understood.
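
As a rough illustration of what "breaking down and categorizing" can look like, the sketch below takes one raw email and pulls a few named fields out of it using Python's standard `email` module. The sample message, the `email_to_record` helper, and the field names are all made up for this example; real unstructured sources usually need far more work than this.

```python
from email import message_from_string

# Hypothetical raw email, invented for illustration only.
raw_email = """\
From: alice@example.com
To: bob@example.com
Subject: Q1 invoice

Hi Bob, the invoice total is 1200 USD.
"""

def email_to_record(raw):
    """Turn a raw email string into a flat record with named fields."""
    msg = message_from_string(raw)
    return {
        "sender": msg["From"],
        "recipient": msg["To"],
        "subject": msg["Subject"],
        "body": msg.get_payload().strip(),
    }

record = email_to_record(raw_email)
print(record)
```

The headers map cleanly onto fields, but the body is still free text: extracting the amount or currency from it would take further parsing, which is where most of the effort with unstructured data tends to go.
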

Structured Data
Structured data is any data stored in some kind of table, be it a spreadsheet or a database. It exists in rows and columns, each column has a title, and it is easy to order and process with data mining tools. The emphasis is on which fields will be stored and how they will be stored. Structured data is like a perfectly organized filing cabinet in which everything is identified, labeled, and easy to access. Some examples of structured data:
1. Machine Generated
  • Sensor Data - GPS data, manufacturing sensors, medical devices
  • Point-of-Sale Data - Credit card information, location of sale, product information
  • Call Detail Records - Time of call, caller and recipient information
  • Web Server Logs - Page requests, other server activity

2. Human Generated
  • Input Data - Any data entered into a computer: age, zip code, gender, etc.
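
To show why rows and titled columns make data easy to order and process, here is a minimal sketch using Python's standard `csv` module. The table contents and the column names (`name`, `age`, `zip_code`) are invented for this example.

```python
import csv
import io

# Hypothetical human-generated input data, already in rows and columns.
csv_text = """name,age,zip_code
Ana,34,85281
Ben,27,85004
Cy,41,85224
"""

# Each row becomes a dict keyed by the column titles.
rows = list(csv.DictReader(io.StringIO(csv_text)))

# Ordering and filtering are one-liners because every field is labeled.
by_age = sorted(rows, key=lambda r: int(r["age"]))
names_by_age = [r["name"] for r in by_age]
over_30 = [r["name"] for r in rows if int(r["age"]) > 30]

print(names_by_age)
print(over_30)
```
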

This video explains more about the differences between structured and unstructured data:

Data in organizations:
  • A Dell survey shows that structured data represents at least 75 percent of data under management for the majority of organizations.
  • Nearly one-third of companies surveyed still don’t actively manage unstructured data.
  • 83 percent of organizations cite growth in transactional data (including e-commerce) as one of the most important sources of structured data growth, and 51 percent also cite growth in management data, such as ERP systems.
  • Although the industry is increasingly focused on the proliferation of social data, more than 50 percent of respondents identified the growth of internally generated documents as the top driver of unstructured data growth.

An example of how data can be analyzed using a data warehouse is shown below:

This primarily involves extracting data from the original sources, transforming it so that data from different sources fits a common format, and finally loading it into a separate database (ETL: extract, transform, load).
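
The three ETL steps can be sketched in a few lines of Python. This is only a toy under stated assumptions: the two source extracts, their column names, the `sales` table, and the helper functions are all hypothetical, and an in-memory SQLite database stands in for the warehouse.

```python
import csv
import io
import sqlite3

# Hypothetical extracts from two sources that use different formats.
pos_csv = "sale_id,amount_usd,store\n1,19.99,Tempe\n2,5.00,Mesa\n"
web_csv = "order,amount_cents,site\nA7,1250,web\n"

def extract(text):
    """Extract: read the raw rows from a source."""
    return list(csv.DictReader(io.StringIO(text)))

def transform_pos(row):
    """Transform: point-of-sale rows already store dollars."""
    return (f"pos-{row['sale_id']}", float(row["amount_usd"]), row["store"])

def transform_web(row):
    """Transform: web rows store cents, so convert to dollars."""
    return (f"web-{row['order']}", int(row["amount_cents"]) / 100, row["site"])

def load(conn, records):
    """Load: write the unified records into the warehouse table."""
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", records)
    conn.commit()

conn = sqlite3.connect(":memory:")  # stand-in for a separate warehouse database
conn.execute(
    "CREATE TABLE sales (sale_key TEXT PRIMARY KEY, amount_usd REAL, channel TEXT)"
)

records = [transform_pos(r) for r in extract(pos_csv)]
records += [transform_web(r) for r in extract(web_csv)]
load(conn, records)

count = conn.execute("SELECT COUNT(*) FROM sales").fetchone()[0]
total = conn.execute("SELECT ROUND(SUM(amount_usd), 2) FROM sales").fetchone()[0]
print(count, total)
```

Each source needs its own transform function, and that per-source work is part of why the transformation step is so costly in practice.
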

Limitations of data warehousing:
Some of the limitations of using a data warehouse are:
  • The transformations from individual data sources to the data warehouse usually represent 50% of the total data warehouse effort
  • Data owners lose control over their data, raising ownership (responsibility and accountability), security, and privacy issues
  • Initial implementation is time consuming and expensive
  • Updating the warehouse to add new data sources is time consuming and expensive
  • Flexibility is limited, since multiple separate data marts are required for multiple uses and types of users
  • Data is static and dated, and changes cannot be monitored
  • No data drill-down capabilities

Future of data warehousing (supernova schema):
According to Kimball, data warehousing has never been more valuable and interesting than it is now, in the age of big data. Since dimensions are the most important part of a warehouse, they need to be more powerful to support advanced queries and analytics. In a supernova schema, the attributes of a dimension can be complex objects and not just simple text. Supernova dimensions are therefore much more flexible and extensible from one analysis to the next. The following figure shows how a supernova customer dimension differs from a traditional customer dimension.
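
As a rough code-level illustration of the same contrast, the sketch below models a traditional customer dimension row with simple text attributes next to a supernova-style row whose attributes are complex objects. The class names and every attribute (`recent_purchases`, `home_location`, `preferences`) are assumptions made up for this example, not Kimball's definitions.

```python
from dataclasses import dataclass, field

@dataclass
class TraditionalCustomer:
    # Every attribute is a simple scalar value.
    customer_key: int
    name: str
    city: str
    segment: str

@dataclass
class SupernovaCustomer:
    # Attributes may be complex objects: a time series, a position, a map.
    customer_key: int
    name: str
    recent_purchases: list = field(default_factory=list)  # (date, amount) pairs
    home_location: tuple = (0.0, 0.0)                      # (latitude, longitude)
    preferences: dict = field(default_factory=dict)        # open-ended attributes

flat = TraditionalCustomer(1, "Ana", "Tempe", "retail")

rich = SupernovaCustomer(
    1,
    "Ana",
    recent_purchases=[("2016-02-27", 19.99), ("2016-02-28", 5.00)],
    home_location=(33.42, -111.94),
    preferences={"channel": "web"},
)

# A new attribute can be added for one analysis without changing the schema.
rich.preferences["newsletter"] = True
total_recent = sum(amount for _, amount in rich.recent_purchases)
print(flat)
print(rich, total_recent)
```
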


