Sunday, September 28, 2008
The Case for Data Warehousing
The following is a list of the basic reasons why organizations implement data warehousing. This list was put together because too much of the data warehousing literature confuses "next order" benefits with these basic reasons. For example, spend a little time reading data warehouse trade material and you will read about using a data warehouse to "convert data into business intelligence", "make management decision making based on facts not intuition", "get closer to the customers", and the seemingly ubiquitously used phrase "gain competitive advantage". In probably 99% of the data warehousing implementations, data warehousing is only one step out of many in the long road toward the ultimate goal of accomplishing these highfalutin objectives.
The basic reasons organizations implement data warehouses are:
To perform server/disk bound tasks associated with querying and reporting on servers/disks not used by transaction processing systems
Most firms want to set up transaction processing systems so there is a high probability that transactions will be completed in what is judged to be an acceptable amount of time. Reports and queries, which can require a much greater range of limited server/disk resources than transaction processing, can lower the probability that transactions complete in an acceptable amount of time when they are run on the servers/disks used by transaction processing systems. Alternatively, running queries and reports, with their variable resource requirements, on the servers/disks used by transaction processing systems can make it quite complex to manage those servers/disks so that a high enough probability of acceptable response time is achieved. Firms therefore may find that the least expensive and/or most organizationally expeditious way to obtain a high probability of acceptable transaction processing response time is to implement a data warehousing architecture that uses separate servers/disks for some querying and reporting.
To use data models and/or server technologies that speed up querying and reporting and that are not appropriate for transaction processing
There are ways of modeling data that usually speed up querying and reporting (e.g., a star schema) but may not be appropriate for transaction processing because the modeling technique will slow down and complicate transaction processing. Also, there are server technologies that may speed up query and reporting processing but slow down transaction processing (e.g., bit-mapped indexing), and server technologies that may speed up transaction processing but slow down query and report processing (e.g., technology for transaction recovery). Note that whether, and by how much, a modeling technique or server technology helps or hinders querying/reporting and transaction processing varies across vendors' products and according to the situation in which the technique or technology is used.
To provide an environment where a relatively small amount of knowledge of the technical aspects of database technology is required to write and maintain queries and reports and/or to provide a means to speed up the writing and maintaining of queries and reports by technical personnel
Often a data warehouse can be set up so that simpler queries and reports can be written by less technically knowledgeable personnel. Nevertheless, such personnel often "hit a complexity wall" and need IS help. IS, for its part, may be able to write and maintain queries and reports against data warehouse data more quickly. It should be noted, however, that much of the improved IS productivity probably comes from the lack of bureaucracy usually associated with establishing reports and queries in the data warehouse.
To provide a repository of "cleaned up" transaction processing systems data that can be reported against and that does not necessarily require fixing the transaction processing systems
Please read my essay on what data errors you may find when building a data warehouse for an explanation of the type of "errors" that need cleaning up. The data warehouse provides an opportunity to clean up the data without changing the transaction processing systems. Note, however, that some data warehousing implementations provide a means to capture corrections made to the data warehouse data and feed the corrections back into transaction processing systems. Sometimes it makes more sense to handle corrections this way than to apply changes directly to the transaction processing system.
To make it easier, on a regular basis, to query and report data from multiple transaction processing systems and/or from external data sources and/or from data that must be stored for query/report purposes only
For a long time, firms that need reports with data from multiple systems have been writing data extracts, running sort/merge logic to combine the extracted data, and then reporting against the sorted/merged result (a minimal sketch of this traditional approach appears below). In many cases this is a perfectly adequate strategy. However, if a company has large amounts of data that need to be sorted/merged frequently, if data purged from transaction processing systems needs to be reported upon, and, most importantly, if the data need to be "cleaned", data warehousing may be appropriate.
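For illustration, here is a minimal Python sketch of that traditional extract/sort/merge approach; the extract layouts, field names, and figures are invented for the example.

# A minimal sketch of the traditional extract/sort/merge approach.
# The two "extracts" stand in for files pulled from two separate
# transaction processing systems; layouts are hypothetical.
from itertools import groupby
from operator import itemgetter

billing_extract = [("C001", "balance", 250.0), ("C002", "balance", 90.0)]
orders_extract  = [("C001", "open_orders", 3),  ("C002", "open_orders", 1)]

# Sort the combined extracts on the shared customer key, then merge
# the rows for each customer into a single record for reporting.
combined = sorted(billing_extract + orders_extract, key=itemgetter(0))
for cust_id, rows in groupby(combined, key=itemgetter(0)):
    merged = {field: value for _, field, value in rows}
    print(cust_id, merged)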
To provide a repository of transaction processing system data that contains data from a longer span of time than can efficiently be held in a transaction processing system and/or to be able to generate reports "as was" as of a previous point in time
Older data are often purged from transaction processing systems so that expected response time can be better controlled. For querying and reporting, this purged data and the current data may be stored in the data warehouse, where there presumably is less need to control expected response time, or where the acceptable response time is at a much higher level. As for "as was" reporting, it is sometimes difficult, if not impossible, to generate a report based on some characteristic at a previous point in time. For example, if you want a report of the salaries of employees at grade Level 3 as of the beginning of each month in 1997, you may not be able to produce it because you only have a record of each employee's current grade level. To handle this type of reporting problem, firms may implement data warehouses that address what is called the "slowly changing dimension" issue, as sketched below.
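For illustration, here is a minimal Python/sqlite3 sketch of one common way to support "as was" reporting, keeping an effective date range on each grade-level record; the table, columns, and data are hypothetical.

# A minimal sketch of "as was" reporting against a slowly changing
# dimension: each grade-level row carries an effective date range.
# Table and column names are hypothetical.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE employee_grade_history (
    emp_id TEXT, grade INTEGER,
    valid_from TEXT, valid_to TEXT)""")
con.executemany(
    "INSERT INTO employee_grade_history VALUES (?, ?, ?, ?)",
    [("E1", 2, "1996-01-01", "1997-05-31"),   # promoted mid-1997
     ("E1", 3, "1997-06-01", "9999-12-31"),
     ("E2", 3, "1995-01-01", "9999-12-31")])

# Employees at grade Level 3 "as was" on 1997-03-01.
for row in con.execute(
        """SELECT emp_id FROM employee_grade_history
           WHERE grade = 3 AND valid_from <= ? AND valid_to >= ?""",
        ("1997-03-01", "1997-03-01")):
    print(row[0])   # only E2 was at grade 3 on that date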
To prevent persons who only need to query and report transaction processing system data from having any access whatsoever to transaction processing system databases and logic used to maintain those databases
The concern here is security. For example, data warehousing may be attractive to firms that want to allow reporting and querying only over the Internet.
Some firms implement data warehousing for all the reasons cited; some implement it for only one of them.
By the way, I am not saying that a data warehouse has no "business" objectives. (I grit my teeth when I say that because I am not one to assume that an IT objective is not a business objective. We IT people are businesspeople too.) I do believe that the achievement of a "business" objective for a data warehouse necessarily comes about through the achievement of one or more of the above objectives.
If you examine the list, you may be struck that the need for data warehousing is mainly caused by the limitations of transaction processing systems. These limitations are not, however, inherent: they will not appear in every implementation of a transaction processing system, and they vary in how crippling they are.
Finally, to repeat the point I made initially, a firm that expects to get business intelligence, better decision making, closeness to its customers, and competitive advantage simply by plopping down a data warehouse is in for a surprise. Obtaining these next order benefits requires firms to figure out, usually by trial and error, how business practices should change to best use the data warehouse, and then to actually change those practices. And that can be harder than implementing a data warehouse.
Sunday, August 31, 2008
SCHEMAS IN DATA WAREHOUSING
12.1 STAR SCHEMA
The star schema is perhaps the simplest data warehouse schema. It is called a star schema because the entity-relationship diagram of this schema resembles a star, with points radiating from a central table. The centre of the star consists of a large fact table and the points of the star are the dimension tables.
A star schema is characterised by one or more very large fact tables that contain the primary information in the data warehouse, and a number of much smaller dimension tables (or lookup tables), each of which contains information about the entries for a particular attribute in the fact table.
A star query is a join between a fact table and a number of dimension tables. Each dimension table is joined to the fact table using a primary key to foreign key join, but the dimension tables are not joined to each other.
A typical fact table contains keys and measures. In the example given below, the fact table, sales, contains the measures quantity_sold, amount, and cost, and the keys cust_id, time_id, prod_id, channel_id, and promo_id. The dimension tables are customers, times, products, channels, and promotions. The products dimension table contains information about each product number that appears in the fact table.
A star join is a primary key to foreign key join of the dimension tables to a fact table.
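For illustration, here is a minimal Python/sqlite3 sketch of a star join over a cut-down version of the schema just described (the sales fact table plus two of its five dimension tables); the data values are invented.

# A minimal sqlite3 sketch of a star join: the sales fact table joined
# to two dimension tables. Data values are invented.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE products  (prod_id INTEGER PRIMARY KEY, prod_name TEXT);
CREATE TABLE customers (cust_id INTEGER PRIMARY KEY, cust_city TEXT);
CREATE TABLE sales     (prod_id INTEGER, cust_id INTEGER,
                        quantity_sold INTEGER, amount REAL, cost REAL);
INSERT INTO products  VALUES (1, 'Cell Phone'), (2, 'Pager');
INSERT INTO customers VALUES (10, 'Delhi'), (11, 'Mumbai');
INSERT INTO sales     VALUES (1, 10, 5, 500.0, 300.0),
                             (2, 11, 2,  80.0,  50.0);
""")

# The star query: each dimension joins to the fact table on its key;
# the dimension tables are never joined to each other.
query = """
SELECT p.prod_name, c.cust_city, SUM(s.amount) AS total_amount
FROM   sales s
JOIN   products  p ON s.prod_id = p.prod_id
JOIN   customers c ON s.cust_id = c.cust_id
GROUP BY p.prod_name, c.cust_city
"""
for row in con.execute(query):
    print(row)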
The main advantages of star schemas are that they:
• Provide a direct and intuitive mapping between the business entities being analysed by end users and the schema design.
• Provide highly optimised performance for typical star queries.
• Are widely supported by a large number of business intelligence tools, which may anticipate or even require that the data-warehouse schema contain dimension tables.
Star schemas are used for both simple data marts and very large data warehouses.
12.2 SNOWFLAKE SCHEMA
Snowflake schemas normalize dimensions to eliminate redundancy. That is, the dimension data has been grouped into multiple tables instead of one large table.
The snowflake model is the result of decomposing one or more of the dimensions, which sometimes have hierarchies themselves.
For example, a product dimension table in a star schema might be normalized into a products table, a product_category table, and a product_manufacturer table in a snowflake schema. While this saves space, it increases the number of dimension tables and requires more foreign key joins. The result is more complex queries and reduced query performance. The figure below presents a graphical representation of a snowflake schema.
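To make the extra joins concrete, here is a minimal Python/sqlite3 sketch of the product dimension from the example above snowflaked into separate products and product_category tables; the data values are invented.

# A minimal sketch of a snowflaked product dimension; note the extra
# join the query now needs compared with the star version. Data values
# are invented.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE product_category (cat_id INTEGER PRIMARY KEY, cat_name TEXT);
CREATE TABLE products (prod_id INTEGER PRIMARY KEY,
                       prod_name TEXT, cat_id INTEGER);
CREATE TABLE sales (prod_id INTEGER, amount REAL);
INSERT INTO product_category VALUES (1, 'Telecom');
INSERT INTO products VALUES (1, 'Cell Phone', 1), (2, 'Pager', 1);
INSERT INTO sales VALUES (1, 500.0), (2, 80.0);
""")

# Two joins where the star schema needed one: fact -> products -> category.
for row in con.execute("""
        SELECT pc.cat_name, SUM(s.amount)
        FROM   sales s
        JOIN   products p          ON s.prod_id = p.prod_id
        JOIN   product_category pc ON p.cat_id = pc.cat_id
        GROUP BY pc.cat_name"""):
    print(row)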
DATA MODELLING FOR OLAP
Dimensional modelling exists primarily to support OLAP and decision making.
Four types of operations are used in OLAP to analyse data. To change the level of granularity, drill down and roll up are performed; to browse along the dimensions, slice and dice operations are used.
7.4.1 DRILL DOWN AND ROLL UP
Drill down and roll up are the operations for moving the view down and up along the dimensional hierarchy levels. With drill-down capability, users can navigate to higher levels of detail; with roll-up capability, users can zoom out to see a summarised view of the data. The navigation path is determined by the hierarchies within dimensions.
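As a minimal illustration, the Python sketch below rolls monthly figures up to the quarterly level of a time hierarchy; drilling down is simply the reverse navigation, from the quarterly totals back to the monthly rows. All figures are invented.

# A minimal sketch of roll up along a time hierarchy (month -> quarter).
from collections import defaultdict

monthly_sales = {("2008", "Q1", "Jan"): 120, ("2008", "Q1", "Feb"): 90,
                 ("2008", "Q1", "Mar"): 110, ("2008", "Q2", "Apr"): 130}

# Roll up: aggregate away the month level, keeping year and quarter.
quarterly = defaultdict(int)
for (year, quarter, month), value in monthly_sales.items():
    quarterly[(year, quarter)] += value

print(dict(quarterly))   # {('2008', 'Q1'): 320, ('2008', 'Q2'): 130}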
7.4.2 SLICE AND DICE
Slice and dice are the operations for browsing the data through the visualised cube. Slicing cuts through the cube so that users can focus on some specific perspective; dicing rotates the cube to another perspective so that users can be more specific in their analysis. For example, suppose we are considering a company's production of two products, cell phones and pagers, with dimensions of location, time, and product. While analysing the production report of a specific month by plant and product, you get the quarterly view of gross production by plant. You can then change the dimension from product to time, which is dicing. Now suppose you want to focus on cell phones only, rather than gross production. To do this, you can cut the cube down to just cell phones for the same dimensions, which is slicing.
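As a minimal illustration, the Python sketch below holds a small cube as a dictionary keyed by (location, time, product) and performs a slice and a dice on it; all figures are invented.

# A minimal sketch of slice and dice over a small in-memory cube.
cube = {("Plant A", "Q1", "Cell Phone"): 100,
        ("Plant A", "Q1", "Pager"):       40,
        ("Plant B", "Q1", "Cell Phone"):  70,
        ("Plant B", "Q2", "Cell Phone"):  80}

# Slice: cut the cube on one member of one dimension (cell phones only).
cell_phone_slice = {(loc, time): qty
                    for (loc, time, prod), qty in cube.items()
                    if prod == "Cell Phone"}

# Dice: re-orient the view, e.g. look at the same data by time first.
by_time = {}
for (loc, time, prod), qty in cube.items():
    by_time.setdefault(time, {})[(loc, prod)] = qty

print(cell_phone_slice)
print(by_time)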
DATA MODELLING TECHNIQUES
7.2.1 ER MODELLING
An ER model is represented by an ER diagram, which uses three basic graphic symbols to conceptualise the data: entity, relationship, and attribute.
Entity - An entity is defined to be a person, place, thing, or event of interest to the business or the organisation. An entity represents a class of objects, which are things in the real world that can be observed and classified by their properties and characteristics.
Relationship - A relationship is represented with lines drawn between entities. It depicts the structural interaction and association among the entities in a model.
The relationship between two entities can be defined in terms of cardinality: the maximum number of instances of one entity that are related to a single instance of another, and vice versa. The possible cardinalities are one-to-one (1:1), one-to-many (1:M), and many-to-many (M:M). In a detailed (normalized) ER model, an M:M relationship is not shown directly because it is resolved to an associative entity, as in the sketch below.
Attribute - Attributes describe the characteristics or properties of the entities. An attribute name should be unique within an entity and should be self-explanatory.
The figure above is an example of an ER model.
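As a rough illustration in code, the Python sketch below models two entities, a 1:M relationship, and an M:M relationship resolved through an associative entity; the entity and attribute names are invented.

# A minimal sketch of ER concepts in code: entities, attributes, a 1:M
# relationship, and an M:M relationship resolved via an associative
# entity (OrderLine). Names are invented.
from dataclasses import dataclass

@dataclass
class Customer:          # entity
    cust_id: int
    name: str            # attribute

@dataclass
class Order:             # entity; Customer-to-Order is 1:M
    order_id: int
    cust_id: int         # key back to the "one" side

@dataclass
class OrderLine:         # associative entity resolving Order M:M Product
    order_id: int
    prod_id: int
    quantity: int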
7.2.2 DIMENSIONAL MODELLING
Dimensional modelling is a technique for conceptualising and visualising data models as a set of measures that are described by common aspects of the business. It is especially useful for summarising and rearranging data and presenting views of it to support analysis. Dimensional modelling focuses on numeric data, such as values, counts, weights, balances, and occurrences.
Dimensional modelling has the following basic concepts:
Facts
Dimensions
Measures (variables)
Fact
A fact is a collection of related data items, consisting of measures and context data. Each fact typically represents a business item, a business transaction, or an event that can be used in analysing the business or business processes.
In a data warehouse, facts are implemented in the core tables in which all of the numeric data is stored.
Dimension
A dimension is a collection of members or units that provide the same type of view of the data.
In a dimensional model, every data point in the fact table is associated with one and only one member from each of the multiple dimensions. Dimensions determine the contextual background for the facts.
Dimensions are the parameters over which we want to perform Online Analytical Processing (OLAP). For example, in a database for analysing all sales of products, common dimensions could be:
Time
Location/region
Customers
Salesperson
Scenarios such as actual, budgeted, or estimated numbers
Dimensions can usually be mapped to nonnumeric, informative entities such as branch or employee.
Dimension Members: A dimension contains many dimension members. A dimension member is a distinct name or identifier used to determine a data item's position. For example, all months, quarters, and years make up a time dimension, and all cities, regions, and countries make up a geography dimension.
Dimension Hierarchies: We can arrange the members of a dimension into one or more hierarchies, and each hierarchy can have multiple levels. Note that the members of a dimension need not all lie on a single hierarchy structure.
Measure
A measure is a numeric attribute of a fact, representing the performance or behaviour of the business relative to the dimensions. Examples of measures are sales revenue, sales volume, quantity supplied, supply cost, and transaction amount. A measure is determined by a combination of the members of the dimensions and is located on the facts.
COMPONENTS OF DATA WAREHOUSING
Data warehousing components are identified in the figure above. A detailed discussion of each of the components follows.
Data Mart -- A data mart is a focused subset of a data warehouse that deals with a single area of data and is organised for quick analysis.
Metadata -- Literally, "data about data." It is the descriptions of what kind of information is stored where, how it is encoded, how it is related to other information, where it comes from, and how it is related to your business.
Query -- A specific atomic request for information from a database.
Data Mining -- Data mining is the running of automated routines that search through data organised in a warehouse. They look for patterns in order to point you to areas that you should be addressing.
OLAP (On-Line Analytical Processing) -- Tools that extract data from data warehouses go by a variety of names: OLAP, ROLAP (Relational On-Line Analytical Processing), multi-dimensional analysis tools, and decision support systems being the most common ones. All provide the ability to do rapid analysis of multiple simultaneous factors, something that relational databases can't do.
Data Visualization -- Techniques for turning data into information by using the high capacity of the human brain to visually recognise patterns and trends. There are many specialised techniques designed to make particular kinds of visualization easy.
Friday, June 27, 2008
Data Mining
What is Data Mining?
Too much data and not enough information: this is a problem facing many businesses and industries. Most businesses have an enormous amount of data, with a great deal of information hiding within it, but "hiding" is usually exactly what it is doing: so much data exists that it overwhelms traditional methods of data analysis. Data mining provides a way to get at the information buried in the data. Data mining creates models to find hidden patterns in large, complex collections of data, patterns that sometimes elude traditional statistical approaches to analysis because of the large number of attributes, the complexity of the patterns, or the difficulty of performing the analysis.
What Is Data Mining in the Database?
Data mining projects usually require a significant amount of data collection and data processing before and after model building. Data tables are created by combining many different types and sources of information. Real-world data is often dirty, that is, it includes wrong or missing values; data must often be cleaned before it can be used. Data is filtered, normalized, sampled, transformed in various ways, and eventually used as input to data mining algorithms. Up to 80% of the effort in a data mining project is often devoted to data preparation. When the data is stored as a table in a database, data preparation can be performed using database facilities.
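As a minimal illustration of such preparation, the Python sketch below filters out a dirty row and min-max normalizes one numeric column before the data would be fed to a mining algorithm; the field names and values are invented.

# A minimal sketch of typical data preparation before mining: filter
# out rows with missing values, then min-max normalize a numeric column.
rows = [{"age": 34, "income": 52000},
        {"age": None, "income": 61000},    # dirty row: missing value
        {"age": 58, "income": 87000}]

# Filter: drop rows with missing values.
clean = [r for r in rows if all(v is not None for v in r.values())]

# Transform: min-max normalize income into [0, 1] for the algorithm.
lo = min(r["income"] for r in clean)
hi = max(r["income"] for r in clean)
for r in clean:
    r["income_norm"] = (r["income"] - lo) / (hi - lo)

print(clean)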
Data mining models have to be built, tested, validated, managed, and deployed in their appropriate application domain environments. The data mining results may need to be post-processed as part of domain-specific computations (for example, calculating estimated risks, expected utilities, and response probabilities) and then stored in permanent databases or data warehouses.
Making the entire data mining process work in a reproducible and reliable way is challenging; it may involve automation and transfers across servers, data repositories, applications, and tools. For example, some data mining tools require that data be exported from the corporate database and converted to the data mining tool's format, and that data mining results be imported back into the database. Removing or reducing these obstacles can enable data mining to be used more frequently to extract more valuable information and, in many cases, to make a significant impact on the bottom line of an enterprise. Data mining in the database makes the data movement required by tools that do not operate in the database unnecessary and makes it much easier to mine up-to-date data. Also, the less data movement, the less time the entire data mining process takes.
Data movement can also make data insecure. If data never leaves the database, database security protects the data.
In summary, data mining in the database provides the following benefits:
■ Less data movement
■ More data security
■ Up-to-date data
Saturday, June 21, 2008
What is OLAP, MOLAP, ROLAP, DOLAP, HOLAP? Examples?
OLAP - On-Line Analytical Processing.
Designates a category of applications and technologies that allow the collection, storage, manipulation and reproduction of multidimensional data, with the goal of analysis.
MOLAP - Multidimensional OLAP.
This term more specifically designates a Cartesian data structure. In effect, MOLAP contrasts with ROLAP: in the former, joins between tables are already performed (precomputed), which enhances performance; in the latter, joins are computed at query time.
Targeted at groups of users because it's a shared environment. Data is stored in an exclusive server-based format. It performs more complex analysis of data.
DOLAP - Desktop OLAP.
Small OLAP products for local multidimensional analysis. There can be a mini multidimensional database (using Personal Express), or extraction of a datacube (using Business Objects).
Designed for low-end, single, departmental user. Data is stored in cubes on the desktop. It's like having your own spreadsheet. Since the data is local, end users don't have to worry about performance hits against the server.
ROLAP - Relational OLAP.
Designates one or several star schemas stored in relational databases. This technology permits multidimensional analysis with data stored in relational databases.
Used for large departments or groups because it supports large amounts of data and users.
HOLAP - Hybrid OLAP. A hybridization of OLAP approaches, which can include any of the above.
BI Tools
Types of business intelligence tools
The key general categories of business intelligence tools are:
Spreadsheets[1]
Reporting and querying software - tools that extract, sort, summarize, and present selected data
OLAP
Digital Dashboards
Data mining
Process mining
Business performance management
Except for spreadsheets, these tools are sold as standalone tools, suites of tools, components of ERP systems, or as components of software targeted to a specific industry. The tools are sometimes packaged into data warehouse appliances.
Open Source and Free Business Intelligence Products
Freereporting.com: Free Web-based BI software application by LogiXML
Eclipse BIRT Project: Eclipse-based open source reporting for web applications, especially those based on Java EE.
OpenI: simple web application that does OLAP reporting
Palo (OLAP database): Memory-based OLAP Server (MOLAP) with interface to Microsoft Excel, .NET, PHP, Java and C++
Pentaho: enterprise-class reporting, analysis, dashboard, data mining and workflow capabilities
RapidMiner (formerly YALE): open-source software for intelligent data analysis, knowledge discovery, data mining, predictive analytics, and machine learning useful for business intelligence applications.
SpagoBI: a free Business Intelligence platform that uses many FOSS tools as analytical engines, integrating them in an infrastructure that offers interoperability and a consistent view across reporting, OLAP, data mining, and dashboards over the data warehouse.
History of BI
Prior to the start of the Information Age in the late 20th century, businesses had to collect data from non-automated sources. Businesses then lacked the computing resources necessary to properly analyze the data, and as a result, companies often made business decisions primarily on the basis of intuition.
As businesses automated their systems, the amount of data increased, but its collection remained difficult due to the inability to move information between or within systems. Analysis of the collected information could inform long-term decision making, but it was slow, so short-term decisions often still relied on instinct or expertise. Business intelligence was defined in an October 1958 IBM Journal article by Hans Peter Luhn.[1] Luhn wrote,
In this paper, business is a collection of activities carried on for whatever purpose, be it science, technology, commerce, industry, law, government, defense, et cetera. The communication facility serving the conduct of a business (in the broad sense) may be referred to as an intelligence system. The notion of intelligence is also defined here, in a more general sense, as "the ability to apprehend the interrelationships of presented facts in such a way as to guide action towards a desired goal."
In modern businesses the use of standards, automation and specialized software allows large volumes of information to be warehoused, extracted, transformed and loaded to greatly increase the speed at which data becomes available, including the use of online tools.
In 1989 Howard Dresner, later a Gartner Group analyst, popularized BI as an umbrella term to describe a set of concepts and methods to improve business decision-making by using fact-based decision support systems.
What is BI?
The term business intelligence (BI) refers to technologies, applications and practices for the collection, integration, analysis, and presentation of business information and also sometimes to the information itself. The purpose of business intelligence is to support better business decision making. It dates to 1958.[1] D. J. Power explains in "A Brief History of Decision Support Systems,"[2]
BI describes a set of concepts and methods to improve business decision making by using fact-based support systems. BI is sometimes used interchangeably with briefing books, report and query tools and executive information systems. Business Intelligence systems are data-driven DSS.
BI systems provide historical, current, and predictive views of business operations, most often using data that has been gathered into a data warehouse or a data mart and occasionally working from operational data. Software elements support the use of this information by assisting in the extraction, analysis, and reporting of information. Applications tackle sales, production, financial, and many other sources of business data for purposes that include, notably, business performance management. Information may be gathered on comparable companies to produce benchmarks.