Google Search

Wednesday, April 1, 2009

Pattern Mining


"Pattern mining" is a data mining technique that involves finding existing patterns in data. In this context, patterns often mean association rules. The original motivation for searching for association rules came from the desire to analyze supermarket transaction data, that is, to examine customer behaviour in terms of the purchased products. For example, an association rule "beer => crisps (80%)" states that four out of five customers that bought beer also bought crisps.

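To make the rule notation concrete, here is a minimal Python sketch, over a handful of made-up transactions, that computes the support and confidence of the rule "beer => crisps":

# Minimal sketch: support and confidence of the rule "beer => crisps"
# over a handful of made-up supermarket transactions (illustrative data only).
transactions = [
    {"beer", "crisps", "bread"},
    {"beer", "crisps"},
    {"beer", "milk"},
    {"beer", "crisps", "nappies"},
    {"beer", "crisps"},
    {"milk", "bread"},
]

antecedent, consequent = {"beer"}, {"crisps"}

n_antecedent = sum(1 for t in transactions if antecedent <= t)
n_both = sum(1 for t in transactions if (antecedent | consequent) <= t)

support = n_both / len(transactions)   # fraction of all transactions containing both items
confidence = n_both / n_antecedent     # fraction of beer buyers who also bought crisps

print(f"support = {support:.2f}, confidence = {confidence:.0%}")   # confidence = 80%

Here the rule holds with 80% confidence because four of the five beer transactions also contain crisps.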
In the context of pattern mining as a tool to identify terrorist activity, the National Research Council provides the following definition: "Pattern-based data mining looks for patterns (including anomalous data patterns) that might be associated with terrorist activity — these patterns might be regarded as small signals in a large ocean of noise."[8][9][10] Pattern mining includes new areas such as Music Information Retrieval (MIR), where patterns found in both the temporal and non-temporal domains are imported into classical knowledge discovery search techniques.

Subject-based data mining

"Subject-based data mining" is a data mining technique involving the search for associations between individuals in data. In the context of combatting terrorism, the National Research Council provides the following definition: "Subject-based data mining uses an initiating individual or other datum that is considered, based on other information, to be of high interest, and the goal is to determine what other persons or financial transactions or movements, etc., are related to that initiating datum."[9]

Sunday, September 28, 2008

The Case for Data Warehousing

The following is a list of the basic reasons why organizations implement data warehousing. This list was put together because too much of the data warehousing literature confuses "next order" benefits with these basic reasons. For example, spend a little time reading data warehouse trade material and you will read about using a data warehouse to "convert data into business intelligence", "make management decision making based on facts not intuition", "get closer to the customers", and the seemingly ubiquitously used phrase "gain competitive advantage". In probably 99% of the data warehousing implementations, data warehousing is only one step out of many in the long road toward the ultimate goal of accomplishing these highfalutin objectives.

The basic reasons organizations implement data warehouses are:
To perform server/disk bound tasks associated with querying and reporting on servers/disks not used by transaction processing systems


Most firms want to set up transaction processing systems so there is a high probability that transactions will be completed in what is judged to be an acceptable amount of time. Running reports and queries, which can require a much greater range of limited server/disk resources than transaction processing, on the servers/disks used by transaction processing systems can lower the probability that transactions complete in an acceptable amount of time. Alternatively, running queries and reports, with their variable resource requirements, on the servers/disks used by transaction processing systems can make it quite complex to manage those servers/disks so there is a high enough probability that acceptable response time can be achieved. Firms therefore may find that the least expensive and/or most organizationally expeditious way to obtain a high probability of acceptable transaction processing response time is to implement a data warehousing architecture that uses separate servers/disks for some querying and reporting.
To use data models and/or server technologies that speed up querying and reporting and that are not appropriate for transaction processing

There are ways of modeling data that usually speed up querying and reporting (e.g., a star schema) and may not be appropriate for transaction processing because the modeling technique will slow down and complicate transaction processing. Also, there are server technologies that may speed up query and reporting processing but slow down transaction processing (e.g., bit-mapped indexing), and server technologies that may speed up transaction processing but slow down query and report processing (e.g., technology for transaction recovery). - Do note that whether and by how much a modeling technique or server technology is a help or hindrance to querying/reporting and transaction processing varies across vendors' products and according to the situation in which the technique or technology is used.
To provide an environment where a relatively small amount of knowledge of the technical aspects of database technology is required to write and maintain queries and reports and/or to provide a means to speed up the writing and maintaining of queries and reports by technical personnel

Often a data warehouse can be set up so that simpler queries and reports can be written by less technically knowledgeable personnel. Nevertheless, less technically knowledgeable personnel often "hit a complexity wall" and need IS help. IS, however, may also be able to more quickly write and maintain queries and reports written against data warehouse data. It should be noted, however, that much of the improved IS productivity probably comes from the lack of bureaucracy usually associated with establishing reports and queries in the data warehouse.
To provide a repository of "cleaned up" transaction processing systems data that can be reported against and that does not necessarily require fixing the transaction processing systems

Please read my essay on what data errors you may find when building a data warehouse for an explanation of the type of "errors" that need cleaning up. The data warehouse provides an opportunity to clean up the data without changing the transaction processing systems. Note, however, that some data warehousing implementations provide a means to capture corrections made to the data warehouse data and feed the corrections back into transaction processing systems. Sometimes it makes more sense to handle corrections this way than to apply changes directly to the transaction processing system.
To make it easier, on a regular basis, to query and report data from multiple transaction processing systems and/or from external data sources and/or from data that must be stored for query/report purposes only

For a long time firms that need reports with data from multiple systems have been writing data extracts and then running sort/merge logic to combine the extracted data and then running reports against the sort/merged data. In many cases this is a perfectly adequate strategy. However, if a company has large amounts of data that need to be sort/merged frequently, if data purged from transaction processing systems needs to be reported upon, and most importantly, if the data need to be "cleaned", data warehousing may be appropriate.
To provide a repository of transaction processing system data that contains data from a longer span of time than can efficiently be held in a transaction processing system and/or to be able to generate reports "as was" as of a previous point in time

Older data are often purged from transaction processing systems so the expected response time can be better controlled. For querying and reporting, this purged data and the current data may be stored in the data warehouse, where there presumably is less of a need to control expected response time or the expected response time is at a much higher level. - As for "as was" reporting, sometimes it is difficult, if not impossible, to generate a report based on some characteristic at a previous point in time. For example, if you want a report of the salaries of employees at grade Level 3 as of the beginning of each month in 1997, you may not be able to do this because you only have a record of current employee grade levels. To be able to handle this type of reporting problem, firms may implement data warehouses that handle what is called the "slowly changing dimension" issue.
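As a rough illustration of the slowly changing dimension idea, here is a small Python/SQLite sketch; the employee_dim table and its columns are hypothetical, invented for this example, but they show how keeping validity dates on each version of a record makes an "as was" query possible:

import sqlite3

# Hypothetical slowly changing dimension: every historical grade level is kept
# as its own row with validity dates, so past states can be reconstructed.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE employee_dim (
    emp_key     INTEGER PRIMARY KEY,   -- surrogate key for the warehouse
    emp_id      INTEGER,               -- natural key from the source system
    grade_level INTEGER,
    salary      NUMERIC,
    valid_from  TEXT,                  -- date this version became effective
    valid_to    TEXT                   -- date it was superseded ('9999-12-31' = current)
);
INSERT INTO employee_dim VALUES
 (1, 100, 2, 42000, '1996-06-01', '1997-02-28'),
 (2, 100, 3, 48000, '1997-03-01', '9999-12-31'),
 (3, 101, 3, 51000, '1996-01-01', '9999-12-31');
""")

# "As was" report: salaries of employees who were at grade level 3 on 1997-04-01.
rows = con.execute("""
    SELECT emp_id, salary
    FROM employee_dim
    WHERE grade_level = 3
      AND valid_from <= '1997-04-01'
      AND valid_to   >= '1997-04-01'
""").fetchall()
print(rows)   # -> [(100, 48000), (101, 51000)]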
To prevent persons who only need to query and report transaction processing system data from having any access whatsoever to transaction processing system databases and logic used to maintain those databases

The concern here is security. For example, data warehousing may be interesting to firms that want to allow reporting and querying only over the Internet.

Some firms implement data warehousing for all the reasons cited. Some firms implement data warehousing for only one of the reasons cited.

By the way, I am not saying that a data warehouse has no "business" objectives. (I grit my teeth when I say that because I am not one to assume that an IT objective is not a business objective. We IT people are businesspeople too.) I do believe that the achievement of a "business" objective for a data warehouse necessarily comes about because of the achievement of one or many of the above objectives.

If you examine the list, you may be struck that the need for data warehousing is mainly caused by the limitations of transaction processing systems. These limitations are not, however, inherent. That is, they will not be present in every implementation of a transaction processing system. Also, the limitations of transaction processing systems will vary in how crippling they are.

Finally, to repeat the point I made initially, a firm that expects to get business intelligence, better decision making, closeness to its customers, and competitive advantage simply by plopping down a data warehouse is in for a surprise. Obtaining these next order benefits requires firms to figure out, usually by trial and error, how to change their business practices to best use the data warehouse, and then actually to change those practices. And that can be harder than implementing a data warehouse.


Sunday, August 31, 2008

SCHEMAS IN DATA WAREHOUSING

12.1 STAR SCHEMA

The star schema is perhaps the simplest data warehouse schema. It is called a star schema because the entity-relationship diagram of this schema resembles a star, with points radiating from a central table. The centre of the star consists of a large fact table and the points of the star are the dimension tables.

A star schema is characterised by one or more very large fact tables that contain the primary information in the data warehouse, and a number of much smaller dimension tables (or lookup tables), each of which contains information about the entries for a particular attribute in the fact table.

A star query is a join between a fact table and a number of dimension tables. Each dimension table is joined to the fact table using a primary key to foreign key join, but the dimension tables are not joined to each other.

A typical fact table contains keys and measures. In the example given below, the fact table, sales, contains the measures quantity_sold, amount, and cost, and the keys cust_id, time_id, prod_id, channel_id, and promo_id. The dimension tables are customers, times, products, channels, and promotions. The product dimension table contains information about each product number that appears in the fact table.

A star join is a primary key to foreign key join of the dimension tables to a fact table.
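To make the example concrete, here is a minimal Python/SQLite sketch of a cut-down version of the schema described above (only the customers, times, and products dimensions, with abbreviated column lists), together with the shape of a typical star query:

import sqlite3

# Cut-down star schema: a sales fact table surrounded by three dimension tables.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE customers (cust_id INTEGER PRIMARY KEY, cust_name TEXT, country TEXT);
CREATE TABLE times     (time_id INTEGER PRIMARY KEY, day TEXT, month TEXT, year INTEGER);
CREATE TABLE products  (prod_id INTEGER PRIMARY KEY, prod_name TEXT, category TEXT);

-- The fact table holds the foreign keys and the measures.
CREATE TABLE sales (
    cust_id       INTEGER REFERENCES customers(cust_id),
    time_id       INTEGER REFERENCES times(time_id),
    prod_id       INTEGER REFERENCES products(prod_id),
    quantity_sold INTEGER,
    amount        NUMERIC,
    cost          NUMERIC
);
""")

# A typical star query joins the fact table to each dimension it needs
# (never dimension to dimension) and groups by the dimension attributes.
star_query = """
    SELECT p.category, t.year, SUM(s.amount) AS total_amount
    FROM sales s
    JOIN products p ON s.prod_id = p.prod_id
    JOIN times    t ON s.time_id = t.time_id
    GROUP BY p.category, t.year
"""
print(con.execute(star_query).fetchall())   # empty until rows are loaded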

The main advantages of star schemas are that they:
• Provide a direct and intuitive mapping between the business entities being analysed by end users and the schema design.
• Provide highly optimised performance for typical star queries.
• Are widely supported by a large number of business intelligence tools, which may anticipate or even require that the data-warehouse schema contain dimension tables.

Star schemas are used for both simple data marts and very large data warehouses.


12.2 SNOWFLAKE SCHEMA



Snowflake schemas normalize dimensions to eliminate redundancy. That is, the dimension data has been grouped into multiple tables instead of one large table.
The snowflake model is the result of decomposing one or more of the dimensions, which sometimes have hierarchies themselves.

For example, a product dimension table in a star schema might be normalized into a products table, a product_category table, and a product_manufacturer table in a snowflake schema. While this saves space, it increases the number of dimension tables and requires more foreign key joins. The result is more complex queries and reduced query performance. The figure below presents a graphical representation of a snowflake schema.
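A rough sketch of that normalisation, reusing the table names mentioned above (the columns are illustrative), shows why an extra join is needed to reach a category name:

import sqlite3

# Snowflaked product dimension: the single products table from the star schema
# is decomposed into products, product_category, and product_manufacturer.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE product_manufacturer (manufacturer_id INTEGER PRIMARY KEY, manufacturer_name TEXT);
CREATE TABLE product_category     (category_id INTEGER PRIMARY KEY, category_name TEXT);
CREATE TABLE products (
    prod_id         INTEGER PRIMARY KEY,
    prod_name       TEXT,
    category_id     INTEGER REFERENCES product_category(category_id),
    manufacturer_id INTEGER REFERENCES product_manufacturer(manufacturer_id)
);
""")

# Reaching the category name from the fact table now takes two joins
# (sales -> products -> product_category) instead of one; the query below
# assumes the sales fact table from the star schema sketch earlier.
snowflake_query = """
    SELECT c.category_name, SUM(s.amount) AS total_amount
    FROM sales s
    JOIN products p         ON s.prod_id = p.prod_id
    JOIN product_category c ON p.category_id = c.category_id
    GROUP BY c.category_name
"""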

DATA MODELLING FOR OLAP

Dimensional modelling is used primarily to support OLAP and decision making.

Four types of operations are used in OLAP to analyse data. To change the level of granularity, the drill down and roll up operations are performed; to browse along the dimensions, the slice and dice operations are used.

7.4.1 DRILL DOWN AND ROLL UP

Drill down and roll up are the operations for moving the view down and up along the dimensional hierarchy levels. With drill-down capability, users can navigate to higher levels of detail. With roll-up capability, users can zoom out to see a summarised level of data. The navigation path is determined by the hierarchies within dimensions.
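A small Python sketch of the idea, aggregating made-up sales amounts at two levels of a month-to-quarter hierarchy:

from collections import defaultdict

# Made-up rows of (month, quarter, amount); the hierarchy is month -> quarter.
rows = [
    ("Jan", "Q1", 120), ("Feb", "Q1", 90), ("Mar", "Q1", 150),
    ("Apr", "Q2", 200), ("May", "Q2", 170),
]

def aggregate(rows, level):
    """Sum the amounts at the chosen hierarchy level (0 = month, 1 = quarter)."""
    totals = defaultdict(int)
    for row in rows:
        totals[row[level]] += row[2]
    return dict(totals)

print(aggregate(rows, level=1))   # roll up to quarters: {'Q1': 360, 'Q2': 370}
print(aggregate(rows, level=0))   # drill down to months: {'Jan': 120, 'Feb': 90, ...}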

7.4.2 SLICE AND DICE

Slice and dice are the operations for browsing the data through the visualised cube. Slicing cuts through the cube so that users can focus on some specific perspectives. Dicing rotates the cube to another perspective so that users can be more specific with the data analysis. For example, suppose we are considering a company's production of two products, cell phones and pagers. The dimensions would be location, time, and product. While analysing the production report of a specific month by plant and product, you get the quarterly view of gross production by plant. You can then change the dimension from product to time, which is dicing. Now, suppose you want to focus on cell phones only, rather than gross production. To do this, you can cut the cube down to just the cell phone data for the same dimensions, which is slicing.
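The following Python sketch illustrates the two operations on a tiny made-up production cube keyed by (product, plant, month):

# Tiny production "cube": (product, plant, month) -> units produced (made-up numbers).
cube = {
    ("Cell Phone", "Plant A", "Jan"): 500,
    ("Cell Phone", "Plant B", "Jan"): 300,
    ("Pager",      "Plant A", "Jan"): 200,
    ("Cell Phone", "Plant A", "Feb"): 450,
    ("Pager",      "Plant B", "Feb"): 250,
}

# Slice: fix a single member of one dimension (cell phones only).
cell_phone_slice = {k: v for k, v in cube.items() if k[0] == "Cell Phone"}

# Dice: keep a sub-cube defined by chosen members of several dimensions at once.
sub_cube = {k: v for k, v in cube.items()
            if k[1] == "Plant A" and k[2] in ("Jan", "Feb")}

print(sum(cell_phone_slice.values()))   # 1250 cell phones across all plants and months
print(sub_cube)                         # Plant A figures for Jan and Feb only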

DATA MODELLING TECHNIQUES


7.2.1 ER MODELLING



An ER model is represented by an ER diagram, which uses three basic graphic symbols to conceptualise the data: entity, relationship, and attribute.

Entity - An entity is defined as a person, place, thing, or event of interest to the business or the organisation. An entity represents a class of objects, which are things in the real world that can be observed and classified by their properties and characteristics.

Relationship - A relationship is represented with lines drawn between entities. It depicts the structural interaction and association among the entities in a model.

The relationship between two entities can be defined in terms of cardinality. This is the maximum number of instances of one entity that can be related to a single instance of the other entity, and vice versa. The possible cardinalities are one-to-one (1:1), one-to-many (1:M), and many-to-many (M:M). In a detailed (normalised) ER model, an M:M relationship is not shown because it is resolved to an associative entity.

Attribute - Attributes describe the characteristics or properties of the entities. An attribute name should be unique within an entity and should be self-explanatory.

The figure above is an example of an ER Model
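Since the figure itself is not reproduced here, the following small Python sketch, with hypothetical Customer and Order entities, illustrates the three concepts: each dataclass stands for an entity, its fields are the attributes, and the orders list represents a one-to-many (1:M) relationship:

from dataclasses import dataclass, field

@dataclass
class Order:
    order_id: int            # attribute (identifier)
    order_date: str          # attribute
    total_amount: float      # attribute

@dataclass
class Customer:
    customer_id: int                                     # attribute (identifier)
    name: str                                            # attribute
    orders: list[Order] = field(default_factory=list)    # 1:M relationship to Order

# One customer instance related to one of potentially many order instances.
alice = Customer(customer_id=1, name="Alice")
alice.orders.append(Order(order_id=10, order_date="2008-08-31", total_amount=99.50))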

7.2.2 DIMENSIONAL MODELLING

Dimensional modelling is a technique for conceptualising and visualising data models as a set of measures that are described by common aspects of the business. It is especially useful for summarising and rearranging the data and presenting views of the data to support data analysis. Dimensional modelling focuses on numeric data, such as values, counts, weights, balances, and occurrences.

Dimensional modelling has the following basic concepts:

• Facts
• Dimensions
• Measures (variables)

Fact
A fact is a collection of related data items, consisting of measures and context data. Each fact typically represents a business item, a business transaction, or an event that can be used in analysing the business or business processes.

In a data warehouse, facts are implemented in the core tables in which all of the numeric data is stored.

Dimension

A dimension is a collection of members or units of the same type of view.

In a dimensional model, every data point in the fact table is associated with one and only one member from each of the multiple dimensions. Dimensions determine the contextual background for the facts.

Dimensions are the parameters over which we want to perform Online Analytical Processing (OLAP). For example, in a database for analysing all sales of products, common dimensions could be:
• Time
• Location/region
• Customers
• Salesperson
• Scenarios such as actual, budgeted, or estimated numbers

Dimensions can usually be mapped to nonnumeric, informative entities such as branch or employee.

Dimension Members: A dimension contains many dimension members. A dimension member is a distinct name or identifier used to determine a data item's position. For example, all months, quarters, and years make up a time dimension, and all cities, regions, and countries make up a geography dimension.

Dimension Hierarchies: We can arrange the members of a dimension into one or more hierarchies. Each hierarchy can also have multiple hierarchy levels. Not every member of a dimension necessarily belongs to a single hierarchy structure.
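A small sketch of a single time hierarchy (year, then quarter, then month) with made-up members; rolling a member up simply follows the hierarchy towards the top level:

# Levels of the hierarchy, from the most summarised to the most detailed.
time_hierarchy_levels = ["year", "quarter", "month"]

# Members of the time dimension arranged on that hierarchy (made-up sample).
time_members = {
    "2008": {
        "2008-Q1": ["Jan", "Feb", "Mar"],
        "2008-Q2": ["Apr", "May", "Jun"],
    },
}

def quarter_of(month: str) -> str:
    """Find the quarter-level parent of a month-level member."""
    for year, quarters in time_members.items():
        for quarter, months in quarters.items():
            if month in months:
                return quarter
    raise KeyError(month)

print(quarter_of("May"))   # -> 2008-Q2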

Measure

A measure is a numeric attribute of a fact, representing the performance or behaviour of the business relative to the dimensions. Examples of measures are the sales value in money, the sales volume, the quantity supplied, the supply cost, the transaction amount, and so forth. A measure is determined by a combination of members of the dimensions and is located on the facts.