The volume of data is growing rapidly, and this growth gave rise to big data. However, even with so much data available, there is no guarantee that it arrives in a form suited to analysis. Data currently comes in two forms:
Unstructured Data
This generally refers to any data that is not stored in a traditional table format and is raw and unorganized. It exists in many different forms, which makes it difficult to identify attributes and derive information from it. Some sources of unstructured data are:
- Word Processing Files
- PDF files
- Emails
- Digital Images
- Audio
- Video
- Social Media Posts
These files are how data is communicated in most businesses. They contain data, but their content has to be broken down and categorized before it can be understood.
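As a minimal sketch of what "breaking down and categorizing" can mean, the snippet below turns a raw email into a record with named attributes using Python's standard `email` module. The sample message, the `CATEGORIES` keyword map and the `to_record` helper are all hypothetical and only illustrate the idea; real categorization is usually far more involved.

```python
from email import message_from_string

# Hypothetical raw email, standing in for an unstructured source.
RAW_EMAIL = """\
From: alice@example.com
To: bob@example.com
Subject: Q3 invoice overdue

Hi Bob, invoice 1042 is still unpaid. Please send payment this week.
"""

# Hypothetical keyword-to-category map, purely for illustration.
CATEGORIES = {"invoice": "billing", "payment": "billing", "password": "support"}


def to_record(raw: str) -> dict:
    """Break a raw email down into named attributes plus simple categories."""
    msg = message_from_string(raw)
    body = msg.get_payload().lower()
    # Tag the message with every category whose keyword appears in the body.
    tags = sorted({cat for word, cat in CATEGORIES.items() if word in body})
    return {
        "sender": msg["From"],
        "recipient": msg["To"],
        "subject": msg["Subject"],
        "categories": tags,
    }


record = to_record(RAW_EMAIL)
```

Once the message is reduced to a record like this, it can be stored and queried in the same way as any other row of structured data.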
Structured Data
Structured data is any data stored in some kind of table, be it a spreadsheet or a database. It exists in rows and columns, each column having a title, which makes it easy to order and process with data mining tools. The emphasis is on which fields will be stored and how they will be stored. Structured data is like a perfectly organized filing cabinet in which everything is identified, labeled and easy to access. Some examples of structured data:
1. Machine Generated
- Sensor Data - GPS data, manufacturing sensors, medical devices
- Point-of-Sale Data - Credit card information, location of sale, product information
- Call Detail Records - Time of call, caller and recipient information
- Web Server Logs - Page requests, other server activity
2. Human Generated
- Input Data - Any data entered into a computer: age, zip code, gender, etc.
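To illustrate why rows and titled columns are easy to order and process, here is a small sketch that stores call detail records in an in-memory SQLite table and aggregates them with one query. The table name, column names and sample rows are assumptions made up for this example.

```python
import sqlite3

# In-memory database; the schema below is a hypothetical example.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE call_records ("
    "caller TEXT, recipient TEXT, started_at TEXT, duration_s INTEGER)"
)

# Made-up call detail records: caller, recipient, start time, duration.
rows = [
    ("555-0100", "555-0199", "2015-03-01T09:00:00", 120),
    ("555-0100", "555-0142", "2015-03-01T10:30:00", 80),
    ("555-0101", "555-0199", "2015-03-01T11:15:00", 45),
]
conn.executemany("INSERT INTO call_records VALUES (?, ?, ?, ?)", rows)

# Because every field has a name and a type, aggregation is a single query.
totals = conn.execute(
    "SELECT caller, SUM(duration_s) FROM call_records "
    "GROUP BY caller ORDER BY caller"
).fetchall()
```

Here `totals` holds one row per caller with the total call duration in seconds.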
This video explains more about the
differences between structured and unstructured data:
Data in organizations:
· Dell survey shows structured data represents at least 75 percent of data under management for the majority of organizations
· Nearly one-third of companies surveyed still don’t actively manage unstructured data
· 83 percent of organizations cite growth in transactional data (including e-commerce) as one of the most important sources of structured data growth within their organization, with 51 percent also citing growth in management data, such as ERP systems.
· Although there is an increasing industry focus on the proliferation of social data, an increase in the creation of internally generated documents was seen as the top driver of unstructured data growth, identified by more than 50 percent of respondents.
An example of how the data can be
analyzed using a data warehouse is as shown:
This primarily involves extracting data from the original sources, transforming it so that data from different sources fits a common format, and finally loading it into a separate database (ETL: extract, transform, load).
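The three ETL steps can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the CSV export, the `fact_orders` table and the cleaning rules (trim whitespace, upper-case country codes, skip rows with a missing amount) are all hypothetical, and an in-memory SQLite database stands in for the warehouse.

```python
import csv
import io
import sqlite3

# Hypothetical export from a source system, with inconsistent formatting.
SOURCE_CSV = """order_id,amount,country
1, 19.99 ,us
2,5.00,DE
3,,us
"""


def extract(text):
    """Extract: read raw rows from the source export."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: normalize values and drop rows that cannot be used."""
    cleaned = []
    for row in rows:
        amount = row["amount"].strip()
        if not amount:
            continue  # skip rows with a missing amount
        cleaned.append(
            (int(row["order_id"]), float(amount), row["country"].strip().upper())
        )
    return cleaned


def load(conn, rows):
    """Load: write the cleaned rows into a separate warehouse table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS fact_orders ("
        "order_id INTEGER PRIMARY KEY, amount REAL, country TEXT)"
    )
    conn.executemany("INSERT INTO fact_orders VALUES (?, ?, ?)", rows)
    conn.commit()


warehouse = sqlite3.connect(":memory:")
load(warehouse, transform(extract(SOURCE_CSV)))
loaded = warehouse.execute(
    "SELECT order_id, amount, country FROM fact_orders ORDER BY order_id"
).fetchall()
```

Keeping each step in its own function makes it easier to see where most of the work sits: in practice the transform step grows with every new source that has its own formatting quirks.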
Limitations of data warehousing:
Some of the limitations of using a data warehouse are:
· The transformations from individual data sources to the data warehouse usually represent 50% of the total data warehouse effort
· Data owners lose control over their data, raising ownership (responsibility and accountability), security and privacy issues
· Initial implementation is time consuming and expensive
· Updating to add new data sources is time consuming and expensive
· Limited flexibility, since it requires multiple separate data marts for multiple uses and types of users
· Data is static and dated, and changes cannot be monitored
· No data drill-down capabilities
Future of data warehousing (supernova schema):
According to Kimball, data warehousing has never been more valuable and interesting than it is now, in the age of big data. Since dimensions are the most important part of a
warehouse, they need to be more powerful to support advanced queries and
analytics. In a supernova schema, attributes of a dimension can be complex
objects and not just simple text. Supernova dimensions become much more flexible
and extensible from one analysis to the other. The following figure shows how a
supernova customer dimension is different from a traditional customer
dimension.
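One way to picture the difference in code is sketched below. This is not Kimball's own design or notation: the class names, field names and sample values are hypothetical, and the sketch only illustrates the idea that an attribute of a dimension can hold a complex object rather than a single text value.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class TraditionalCustomer:
    """Traditional dimension row: a fixed set of simple text attributes."""
    customer_key: int
    name: str
    city: str
    segment: str


@dataclass
class SupernovaCustomer:
    """Hypothetical supernova-style row: attributes may be complex objects."""
    customer_key: int
    name: str
    # Open-ended attributes; values can be lists, dicts, tuples and so on.
    attributes: dict[str, Any] = field(default_factory=dict)


customer = SupernovaCustomer(customer_key=1, name="Ada")

# New attributes can be attached per analysis without changing the schema.
customer.attributes["recent_purchases"] = [
    {"sku": "A-1", "amount_cents": 1999},
    {"sku": "B-7", "amount_cents": 500},
]
customer.attributes["home_location"] = (52.52, 13.40)

# A query can then reach inside a complex attribute.
total_spend_cents = sum(
    p["amount_cents"] for p in customer.attributes["recent_purchases"]
)
```

The traditional row needs a schema change for every new attribute, while the open-ended `attributes` mapping lets each analysis add what it needs, at the cost of weaker guarantees about what any given row contains.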