Google Search

Sunday, August 31, 2008

SCHEMAS IN DATA WAREHOUSING

12.1 STAR SCHEMA

The star schema is perhaps the simplest data warehouse schema. It is called a star schema because the entity-relationship diagram of this schema resembles a star, with points radiating from a central table. The centre of the star consists of a large fact table and the points of the star are the dimension tables.

A star schema is characterised by one or more very large fact tables that contain the primary information in the data warehouse, and a number of much smaller dimension tables (or lookup tables), each of which contains information about the entries for a particular attribute in the fact table.

A star query is a join between a fact table and a number of dimension tables. Each dimension table is joined to the fact table using a primary key to foreign key join, but the dimension tables are not joined to each other.

A typical fact table contains keys and measures. In the example given below, the fact table, sales, contain the measures quantity_sold, amount, and cost, and the keys cust_id, time_id, prod_id, channel_id, and promo_id. The dimension tables are customers, times, products, channels, and promotions. The product dimension table, contains information about each product number that appears in the fact table.

A star join is a primary key to foreign key join of the dimension tables to a fact table.

The main advantages of star schemas are that they:
• Provide a direct and intuitive mapping between the business entities being analysed by end users and the schema design.
• Provide highly optimised performance for typical star queries.
• Are widely supported by a large number of business intelligence tools, which may anticipate or even require that the data-warehouse schema contain dimension tables

Star schemas are used for both simple data marts and very large data warehouses


12.2 SNOWFLAKE SCHEMA



Snowflake schemas normalize dimensions to eliminate redundancy. That is, the dimension data has been grouped into multiple tables instead of one large table.
The snowflake model is the result of decomposing one or more of the dimensions, which sometimes have hierarchies themselves.

For example, a product dimension table in a star schema might be normalized into a products table, a product_category table, and a product_manufacturer table in a snowflake schema. While this saves space, it increases the number of dimension tables and requires more foreign key joins. The result is more complex queries and reduced query performance. The figure below presents a graphical representation of a snowflake schema.

DATA MODELLING FOR OLAP

Dimensional modelling is primarily to support OLAP and decision making .

Four types of operations are used in OLAP to analyse data. Considering granularity, the operations of drill down and roll up are performed. To browse along the dimensions, slice and dice operations are used.

7.4.1 DRILL DOWN AND ROLL UP

Drill down and roll up are the operations for moving the view down and up along the dimensional hierarchy levels. With drill-down capability, users can navigate to higher levels of detail. With roll-up capability, users can zoom out to see a summarise level of data. The
navigation path is determined by the hierarchies within dimensions.

7.4.2 SLICE AND DICE

Slice and dice are the operations for browsing the data through the visualised cube. Slicing cuts through the cube so that users can focus on some specific perspectives. Dicing rotates the cube to another perspective so that users can be more specific with the data analysis. For example suppose we are considering a company’s production of two products Cell phones and pagers. The dimensions would be location, time, product. While analysing the production report of a specific month by plant and product, you get the quarterly view of gross production by plant. You can then change the dimension from product to time, which is dicing. Now, you want to focus on the Cell Phone only, rather than gross production. To do this, you can cut off the cube only for the Cell Phone for the same dimensions, which is slicing.

DATA MODELLING TECHNIQUES


7.2.1 ER MODELLING



An ER model is represented by an ER diagram, which uses three basic graphic symbols to
conceptualise the data: entity, relationship, and attribute.

Entity- An entity is defined to be a person, place, thing, or event of interest to the business or the organisation. An entity represents a class of objects, which are things in the real world that can be observed and classified by their properties and characteristics.

Relationship-A relationship is represented with lines drawn between entities. It depicts the
structural interaction and association among the entities in a model.

The relationship between two entities can be defined in terms of the cardinality. This is the maximum number of instances of one entity that are related to a single instance in another table and vice versa. The possible cardinalities are: one-to-one (1:1), one-to-many (1:M), and many-to-many (M:M).In a detailed (normalized) ER model, any M:M relationship is not shown because it is resolved to an associative entity.

Attribute- Attributes describe the characteristics of properties of the entities. An attribute Name should be unique in an entity and should be self-explanatory.

The figure above is an example of an ER Model

7.2.2 DIMENSIONAL MODELLING

Dimensional modelling is a technique for conceptualising and visualising data models as a set of measures that are described by common aspects of the business. It is especially useful for summarising and rearranging the data and presenting views of the data to support data analysis. Dimensional modelling focuses on numeric data, such as values, counts , weights, balances, and occurrences.

Dimensional modelling has the following basic concepts:

 Facts
 Dimensions
 Measures (variables)

Fact
A fact is a collection of related data items, consisting of measures and context data. Each fact typically represents a business item, a business transaction, or an event that can be used in analysing the business or business processes.

In a data warehouse, facts are implemented in the core tables in which all of the numeric data is stored.

Dimension

A dimension is a collection of members or units of the same type of views.

In a dimensional model, every data point in the fact table is associated with one and only one member from each of the multiple dimensions. Dimensions determine the contextual background for the facts.

Dimensions are the parameters over which we want to perform Online Analytical Processing (OLAP). For example, in a database for analysing all sales of products, common dimensions could be:
 Time
 Location/region
 Customers
 Salesperson
 Scenarios such as actual, budgeted, or estimated numbers

Dimensions can usually be mapped to nonnumeric, informative entities such as branch or employee.

Dimension Members: A dimension contains many dimension members. A dimension member is a distinct name or identifier used to determine a data items position. For example, all months, quarters, and years make up a time dimension, and all cities, regions, and countries make up a geography dimension.

Dimension Hierarchies: We can arrange the members of a dimension into one or more hierarchies. Each hierarchy can also have multiple hierarchy levels. Every member of a dimension does not locate on one hierarchy structure.

Measure

A measure is a numeric attribute of a fact, representing the performance or behaviour of the business relative to the dimensions. For example, measures are the sales in money, the sales volume, the quantity supplied, the supply cost, the transaction amount, and so forth. A measure is determined by combinations of the members of the dimensions and is located on facts.

COMPONENTS OF DATA WAREHOUSING


Data warehousing components are identified in the figure above. A detailed discussion
of each of the components will be followed as we move along.


Data Mart -- A data mart is a focused subset of a data warehouse that deals with a single area of data and is organised for quick analysis.

Metadata -- Literally, "data about data." It is the descriptions of what kind of information is stored where, how it is encoded, how it is related to other information, where it comes from, and how it is related to your business.

Query -- A specific atomic request for information from a database.

Data Mining -- Data mining is the running of automated routines that search through data organised in a warehouse. They look for patterns in order to point you to areas that you should be addressing.

OLAP (On-Line Analytical Processing) -- Tools that extract data from data warehouses go by a variety of names: OLAP, ROLAP (Relational On-Line Analytical Processing), multi-dimensional analysis tools, and decision support systems being the most common ones. All provide the ability to do rapid analysis of multiple simultaneous factors, something that relational databases can't do.

Data Visualization --Techniques for turning data into information by using the high capacity of the human brain to visually recognise patterns and trends. There are many specialised techniques designed to make particular kinds of visualization easy.