Lecture 1 - Introduction to DW

Reading recommendations: ”An Overview of Data Warehousing and OLAP Technology” by Chaudhuri & Dayal
Keywords: DW, DSS, OLTP, OLAP, MDM, Data Mart, Data Mining

”We are drowning in information, but starving for knowledge”
- John Naisbitt

The Data Warehouse - definition
B. Inmon: ”A data warehouse is a subject-oriented, integrated, non-volatile, and time-variant collection of data in support of management’s decisions”.

Subject-oriented, because a data warehouse is organised around the objects that exist in the business (such as customer, employee, supplier) rather than around the application areas (such as sales, payroll and purchasing) around which the systems used in day-to-day operations are built. This follows from the purpose of a data warehouse, which is to support decision making, and that requires subject-oriented rather than application-oriented data.

Integrated, because it uses data from different sources (different application-oriented systems). These sources often contain inconsistent data, for example because they use different formats to represent one and the same type of data. The data from the different sources therefore has to be integrated and made consistent before it can be worked with and presented to the users.

Non-volatile, because the data warehouse is not updated online; instead it is refreshed at regular intervals by adding data from the operational systems. Existing data is not replaced by new data; new data is continually appended, and the data warehouse integrates it with the existing data.

Time-variant, because the data in the data warehouse is correct and valid only at a certain point in time or over a certain time interval. Data is also kept for considerably longer, and all data is associated with some kind of time stamp, directly or indirectly. In short, the data warehouse represents a series of snapshots of the business.

Subject-oriented
Subject-oriented means that the warehouse is organised around the major subjects of the enterprise, such as customers, products and sales, and not around the major application areas, such as invoicing, stock control and product sales. This reflects the need to store decision-support data rather than application-oriented data.
[Figure: operational systems (Sales, Payroll System, Purchasing System) contrasted with data warehouse subjects (Customer, Employee Data, Vendor Data)]

Integrated
A data warehouse is integrated, since different source systems may be used for building a DW. The data from those source systems needs to be integrated in order to present a unified view of the data to the users.
[Figure: operational systems (Marketing System, Order, Billing System) feeding a single Customer Data subject in the data warehouse]

Time variant
Time variant, since the data in a warehouse is only accurate and valid at some point in time or over some time interval. The time-variance of the data warehouse is also shown in the extended time that the data is held, the implicit or explicit association of time with all data, and the fact that the data represents a series of snapshots.
[Figure: operational systems (Order System) hold data for 60-90 days; the data warehouse (Customer Data) holds data for 5-10 years]

Non-volatile
Non-volatile, as the data is not updated in real time but is refreshed from the operational systems on a regular basis. New data is always added as a supplement to the database rather than as a replacement. The database continually absorbs this new data, incrementally integrating it with the previous data.
[Figure: operational systems (Order System) support Create, Update, Delete and Insert; the data warehouse (Customer Data) supports only Load and Access]

The Data Warehouse - definition
B. Inmon: ”A data warehouse is a subject-oriented, integrated, non-volatile, and time-variant collection of data in support of management’s decisions”.
S. Chaudhuri & U. Dayal: ”Data warehousing is a collection of decision support technologies, aimed at enabling the knowledge worker (executive, manager, analyst) to make better and faster decisions.”

Decision Support and OLAP (by Navathe)
Information technology to help the knowledge worker (executive, manager, analyst) make faster and better decisions.
- Will a 10% discount increase sales volume sufficiently?
- Which of two new medications will result in the best outcome: higher recovery rate and shorter hospital stay?
- How did the share price of computer manufacturers correlate with quarterly profits over the past 10 years?
On-Line Analytical Processing (OLAP) is an element of a decision support system (DSS).

Data Warehouse (Navathe)
A decision support database that is maintained separately from the organisation’s operational databases. A data warehouse is a subject oriented, integrated, time-varying, non-volatile collection of data that is used primarily in the organisational decision making.

OLTP vs. OLAP

OLTP systems                              | Data warehouse (OLAP)
holds current data                        | holds historic data
stores detailed data                      | stores detailed and summarised data
data is dynamic                           | data is largely static
repetitive processing                     | ad hoc, unstructured and heuristic processing
high level of transaction throughput      | medium or low level of transaction throughput
predictable pattern of usage              | unpredictable pattern of usage
transaction driven                        | analysis driven
application oriented                      | subject oriented
supports day-to-day decisions             | supports strategic decisions
serves large number of operational users  | serves relatively lower number of managerial users

A DBMS built for online transaction processing (OLTP) is generally regarded as unsuitable for data warehousing because each system is designed with a differing set of requirements in mind. For example, OLTP systems are designed to maximise the transaction processing capacity, while data warehouses are designed to support ad hoc query processing.

An organisation will normally have a number of different OLTP systems for business processes such as inventory control, customer invoicing, and point-of-sale. These systems generate operational data that is detailed, current, and subject to change. The OLTP systems are optimised for a high number of transactions that are predictable, repetitive and update intensive. The OLTP data is organised according to the requirements of the transactions associated with the business applications and supports the day-to-day decisions of a large number of concurrent operational users.

In contrast, an organisation will normally have a single data warehouse, which holds data that is historic, detailed, and summarised to various levels and rarely subject to change (other than being supplemented with new data). The data warehouse is designed to support relatively lower numbers of transactions that are unpredictable in nature and require answers to queries that are ad hoc, unstructured, and heuristic. The warehouse data is organised according to the requirements of potential queries and supports the long-term strategic decisions of a relatively lower number of managerial users.

Why separate data warehouse?
Performance
- The operational DBs are tuned to support known OLTP workloads
- Supporting OLAP requires special data organisations, access methods and implementation methods
Function
- Decision support requires data that may be missing from the operational DBs
- Decision support usually requires consolidating data from many heterogeneous sources

Although OLTP systems and data warehouses have different characteristics and are built with different purposes in mind, these systems are closely related in that the OLTP systems provide the source data for the warehouse. A major problem of this relationship is that the data held by the OLTP systems can be inconsistent, fragmented, and subject to change, containing duplicate or missing entries. As such, the operational data must be ‘cleaned up’ before it can be used in the data warehouse.

Architecture
[Figure: data sources (operational DBs, external sources) -> extract, transform, load, refresh -> data warehouse and data marts, with a metadata repository and monitoring & administration tools -> serve -> OLAP servers -> front-end tools for analysis, query/reporting and data mining]

OLAP for Decision Support (Navathe)
- Goal of OLAP is to support ad hoc querying for the business analyst
- Business analysts are familiar with spreadsheets
- Extend the spreadsheet analysis model to work with warehouse data
  - Large data set
  - Semantically enriched to understand business terms (e.g., time, geography)
  - Combined with reporting features
- Multidimensional view of data is the foundation for OLAP

“Multidimensional” view of the data
- a popular conceptual model that has influenced front-end tools, database design, and the query engine for OLAP
- numeric measures/facts (e.g. number of, sum, total sales) depend on a set of dimensions
[Figure: a two-dimensional spreadsheet (product by quarter) next to a three-dimensional data cube (office, product, quarter); sample cell values 130, 200, 2 300 and 5 024]

One usually speaks of multidimensional modelling when structuring the data in a data warehouse. The idea itself, the model, has influenced those who design data warehouses and the tools for them. Start from an ordinary Excel sheet with two dimensions: the figures give the total sales for each quarter and each product. Suppose we now want one more dimension, i.e. how much particular offices have sold of a particular product in a particular period. This can be visualised as a cube, in which we see the three dimensions. From each small mini-cube in the cube one can read off a value. The picture is shown to illustrate how the way of thinking itself arose. This is also how one models it on paper.

[Figure: data cubes built from different combinations of the dimensions quarter, office, product, customer group and promotion campaign, with the measures/facts in the cells]
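
The step from a two-dimensional spreadsheet to a three-dimensional cube can be sketched in a few lines of Python. This is only an illustration: the fact rows, the office and product names and the helper `spreadsheet` are made up for the example and do not come from the slides.

```python
from collections import defaultdict

# Hypothetical sales facts: (quarter, office, product, total sales).
facts = [
    ("Q1", "Stockholm", "Phone", 130),
    ("Q1", "Stockholm", "Modem", 200),
    ("Q1", "Uppsala", "Phone", 70),
    ("Q2", "Stockholm", "Phone", 300),
    ("Q2", "Uppsala", "Modem", 50),
]

# The cube: one cell (measure) per combination of dimension values.
cube = defaultdict(int)
for quarter, office, product, sales in facts:
    cube[(quarter, office, product)] += sales


def spreadsheet(cube):
    """Collapse the office dimension into a 2-D (quarter, product) view."""
    sheet = defaultdict(int)
    for (quarter, _office, product), sales in cube.items():
        sheet[(quarter, product)] += sales
    return dict(sheet)


sheet = spreadsheet(cube)
print(sheet[("Q1", "Phone")])  # 130 + 70 = 200
```

Each key of `cube` plays the role of one mini-cube; summing over one dimension gives back the familiar spreadsheet.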

Dimensional modelling - Star schema
Fact: Telephone calls
- sum ($)
- number of calls
Dimension: Service used
- service name
- service group
Dimension: Time
- date
- month
- quarter
- year
Dimension: Customer
- customer name
- address
- region
- income group
Dimension: Sales Dimension
- seller name
- office

Dimensional modelling - Star-join schemas
[Figure: star-join schema with four dimension tables (Service Dimension, Time Dimension, Sales Dimension, Customer Dimension) around the fact table "Transactions", which holds the measures Sum and Number of calls. Sample fact rows use keys such as C210, S1, F11 and 991011, with values such as 25:00 and 3.]

When modelling, one focuses on the important business events (transactions) in the business. These can, for example, be sales events: selling goods or services. Other types of events are covered later, but the central step in this way of modelling is to identify the events. The events, and the facts (i.e. the values) about them, are collected in an entity that we call sales facts. The sales-fact entity thus contains the value we saw earlier in the cube.

The events can then be studied from different aspects, or dimensions. These correspond to the dimensions we saw earlier. When modelling, the task is therefore to choose the dimensions: for example which services or service groups are sold, which customers or customer categories buy the services, at which times or in which periods the services are sold, and which sellers or sales offices sold them. Then one chooses the attributes of these dimensions: for service, perhaps service name and service group; for customer, customer name, postal address, region and income group. The facts end up in the middle, with the dimensions around them.

Next, convert the model into a database schema with tables. The entities become tables: a fact table and dimension tables. The attributes become columns. Populate the tables with instances (rows, tuples). Between each dimension table and the fact table there is a one-to-many relationship.
- The dimension tables contain few rows. They are not normalised, so information is stored redundantly.
- The fact table contains a very large number of rows and holds the overwhelming share of the information. It should have as few and as small columns as possible. It holds foreign keys to the various dimensions.

A query removes rows from the tables: it joins the dimensions with the fact table in order to eliminate rows in the fact table, for example all rows that are not S1.

A structure like this is easier for business managers to understand than a normalised relational database structure. One normalises for two reasons: to make storage as efficient as possible (save space) and to avoid redundancy and inconsistency. Here, however, inconsistency is handled by the transformation layer, and saving space is secondary. The OLAP tool can use this data structure efficiently. Normalising some attributes gives shorter rows, but means that joins are needed.

Dimensional modelling - Star-join schemas
[Figure: the same star-join schema and sample rows as on the previous slide]
Query: For how much did customers in Stockholm (Sthlm) use the service “Local call” in October 1999? Sum = 37:00
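
A star-join query of this kind can be sketched with Python's built-in `sqlite3` module. The table names, column names and rows below are assumptions made for the illustration (the slide's full table cannot be recovered from the transcript); the rows are chosen so that the matching amounts are 25 and 12, which add up to the 37 quoted above.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE customer (customer_id TEXT PRIMARY KEY, region TEXT);
CREATE TABLE service  (service_id  TEXT PRIMARY KEY, service_name TEXT);
CREATE TABLE time_dim (date_id     TEXT PRIMARY KEY, month INTEGER, year INTEGER);
CREATE TABLE calls (               -- fact table: one row per transaction
    customer_id     TEXT REFERENCES customer(customer_id),
    service_id      TEXT REFERENCES service(service_id),
    date_id         TEXT REFERENCES time_dim(date_id),
    amount          REAL,
    number_of_calls INTEGER
);
""")

# Hypothetical dimension rows.
con.executemany("INSERT INTO customer VALUES (?, ?)",
                [("C210", "Sthlm"), ("C212", "Gbg"), ("C213", "Sthlm")])
con.executemany("INSERT INTO service VALUES (?, ?)",
                [("S1", "Local call"), ("S2", "Long distance")])
con.executemany("INSERT INTO time_dim VALUES (?, ?, ?)",
                [("991011", 10, 1999), ("991112", 11, 1999)])

# Hypothetical fact rows.
con.executemany("INSERT INTO calls VALUES (?, ?, ?, ?, ?)", [
    ("C210", "S1", "991011", 25.0, 3),  # matches the query
    ("C213", "S1", "991011", 12.0, 1),  # matches the query
    ("C212", "S1", "991011", 89.0, 4),  # filtered out: other region
    ("C210", "S2", "991011", 5.0, 1),   # filtered out: other service
    ("C213", "S1", "991112", 8.0, 1),   # filtered out: other month
])

# Join each dimension to the fact table and restrict on dimension attributes.
total = con.execute("""
    SELECT SUM(f.amount)
    FROM calls AS f
    JOIN customer AS c ON c.customer_id = f.customer_id
    JOIN service  AS s ON s.service_id  = f.service_id
    JOIN time_dim AS t ON t.date_id     = f.date_id
    WHERE c.region = 'Sthlm'
      AND s.service_name = 'Local call'
      AND t.year = 1999 AND t.month = 10
""").fetchone()[0]
print(total)  # 25.0 + 12.0 = 37.0
```

The restrictions are all placed on the small dimension tables; the joins then pick out the matching rows of the large fact table.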

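Snow-flake schema
Fact: Telephone calls
- sum ($)
- number of calls
Dimension: Service used (service name), with Service group in a separate table
Dimension: Time (date), with Month, Quarter and Year in separate tables
Dimension: Customer (customer name, address), with Region and Income group in separate tables
Dimension: Sales Dimension (seller name), with Office in a separate table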


Back End Tools and Utilities
Extract & Transform
- data selection
- data cleaning
  - Data migration: “replace the string gender by sex”
  - Data scrubbing: based on domain-specific knowledge
  - Data auditing: a variant of data mining
- data enrichment
- data aggregation

Back End Tools and Utilities
Load
- full loading: a long batch transaction, takes a long time
- incremental loading: during refresh
Refresh
- when: periodically, e.g. daily or weekly
- how:
  - extracting the entire source: sometimes the only way when dealing with legacy data sources
  - incremental refresh: supported by replication servers
    - data shipping
    - transaction shipping
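
The difference between full loading and incremental refresh can be sketched as follows. This is a minimal illustration under simplifying assumptions: the source rows carry an increasing `id`, and the function names `full_load` and `incremental_refresh` are hypothetical, not taken from any tool.

```python
def full_load(source_rows):
    """Full loading: rebuild the warehouse table from the entire source."""
    return list(source_rows)


def incremental_refresh(warehouse, source_rows, last_loaded_id):
    """Append only the source rows that are newer than the last loaded one.

    Returns the id of the newest row now in the warehouse.
    """
    new_rows = [row for row in source_rows if row["id"] > last_loaded_id]
    warehouse.extend(new_rows)  # append only: existing rows are never overwritten
    return max((row["id"] for row in new_rows), default=last_loaded_id)


# Hypothetical operational data.
source = [{"id": 1, "amount": 25}, {"id": 2, "amount": 12}]
warehouse = full_load(source)
last_id = 2

# Later, the operational system has received a new transaction.
source.append({"id": 3, "amount": 8})
last_id = incremental_refresh(warehouse, source, last_id)
print(len(warehouse), last_id)  # 3 3
```

Note that the refresh only appends, which matches the non-volatile property of the warehouse.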

Approaches to OLAP Servers
Relational OLAP (ROLAP)
- Relational and extended relational DBMS to store and manage warehouse data
  - schema design
  - extended SQL
Multidimensional OLAP (MOLAP)
- Array-based storage structure (n-dimensional array)
- Direct access to the array data structure
- Good indexing properties
- Poor storage utilisation when the data is sparse
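
The storage trade-off can be illustrated with a small sketch. The dimension values and facts are made up; the dictionary stands in for ROLAP-style rows (only combinations that exist are stored), and the nested list stands in for a MOLAP-style n-dimensional array.

```python
quarters = ["Q1", "Q2", "Q3", "Q4"]
offices = ["Stockholm", "Uppsala", "Lund"]
products = ["Phone", "Modem"]

# ROLAP style: one row per fact that actually exists.
rows = {
    ("Q1", "Stockholm", "Phone"): 130,
    ("Q2", "Lund", "Modem"): 50,
}

# MOLAP style: a dense 3-D array with one cell per possible combination.
array = [[[0 for _ in products] for _ in offices] for _ in quarters]
for (q, o, p), value in rows.items():
    array[quarters.index(q)][offices.index(o)][products.index(p)] = value

# Direct access by position, no search or join needed.
print(array[0][0][0])  # 130

# With sparse data most cells are empty.
cells = len(quarters) * len(offices) * len(products)
utilisation = len(rows) / cells
print(cells, len(rows))  # 24 2
```

Here only 2 of the 24 array cells hold data, which is the poor storage utilisation the slide refers to.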

Front End Tools - Basic Functionality
- Pivoting
- Rollup (drill-up) and drill-down
- Slice-and-dice
- Ranking (sorting)
- Selection
- Computed attributes
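
A few of these operations can be sketched on a small list of fact rows. The data and the helper names `rollup` and `slice_rows` are assumptions for the illustration; real front-end tools offer these operations interactively.

```python
from collections import defaultdict

# Hypothetical fact rows.
rows = [
    {"quarter": "Q1", "office": "Stockholm", "product": "Phone", "sales": 130},
    {"quarter": "Q1", "office": "Uppsala", "product": "Phone", "sales": 70},
    {"quarter": "Q1", "office": "Stockholm", "product": "Modem", "sales": 200},
    {"quarter": "Q2", "office": "Stockholm", "product": "Phone", "sales": 300},
]


def rollup(rows, *dims):
    """Total sales grouped by the given dimensions (fewer dims = coarser)."""
    totals = defaultdict(int)
    for row in rows:
        totals[tuple(row[d] for d in dims)] += row["sales"]
    return dict(totals)


def slice_rows(rows, **fixed):
    """Slice: keep only the rows where the given dimensions have fixed values."""
    return [r for r in rows if all(r[k] == v for k, v in fixed.items())]


by_quarter = rollup(rows, "quarter")                    # rolled up
by_quarter_office = rollup(rows, "quarter", "office")   # drilled down
q1_only = slice_rows(rows, quarter="Q1")                # slice
ranking = sorted(rollup(rows, "product").items(),       # ranking
                 key=lambda item: item[1], reverse=True)

print(by_quarter[("Q1",)])                   # 400
print(by_quarter_office[("Q1", "Stockholm")])  # 330
print(ranking[0])                            # (('Phone',), 500)
```

Drill-down is simply a rollup over more dimensions, and rollup is the same grouping over fewer.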

Metadata
Data about data
- Administrative metadata (includes all information necessary for setting up and using a DW, e.g. information about source databases, DW schemas, dimensions, hierarchies, predefined queries, physical organisation, rules and scripts for extraction, transformation and load, back-end and front-end tools)
- Business metadata (business terms and definitions, ownership of data)
- Operational metadata (information collected during the operation of the DW, e.g. usage statistics, error reports)

What is metadata? Something used by the database management system, by the tools, and by those who designed the data warehouse, in order to set up and use the DW. Business metadata covers, for example, what is meant by “customer” and who owns the data. Operational metadata is information collected while the DW is running: if users run certain queries more often, it may be worth pre-aggregating that data.

Metadata Repository
- warehouse schema, view & derived data definitions
- predefined queries and reports
- data mart locations and contents
- data partitions
- data extraction, cleaning and transformation rules, defaults
- data refresh and purging rules
- user profiles, user groups
- security: user authorisation, access control

Problems of Data Warehousing
- Underestimation of resources for data loading
- Hidden problems with source systems
- Required data not captured
- Increased end-user demands
- Data homogenisation
- High demand for resources
- Data ownership
- High maintenance
- Long duration projects
- Complexity of integration

Underestimation of resources for data loading
Many developers underestimate the time required to extract, clean and load the data into the warehouse. This process may account for up to 80% of the total development time (Inmon 90), although better data cleansing and management tools may reduce this figure.

Hidden problems with source systems
Hidden problems associated with the source systems feeding the data warehouse will be identified, possibly after years of being undetected. The developer must decide whether to fix the problem in the data warehouse and/or fix the source systems. For example, when entering the details of a new property, certain fields may allow nulls, which may result in staff entering incomplete property data, even when the data is available and applicable.

Required data not captured
Warehouse projects often highlight a requirement for data not being captured by the existing source systems. The organisation must decide whether to modify the OLTP systems or create a system dedicated to capturing the missing information.

Increased end-user demands
After end-users receive query and reporting tools, requests for support from IS staff may increase rather than decrease. This is caused by the users' increasing awareness of the capabilities and value of the data warehouse. The problem can be partially alleviated by investing in easier-to-use, more powerful tools, or by providing better training for the users. A further reason for increasing demands on IS staff is that once a data warehouse is online, the number of users and queries often increases, together with requests for answers to more and more complex queries.

Data homogenisation
Large-scale data warehousing can become an exercise in data homogenisation that lessens the value of the data. For example, in producing a consolidated and integrated view of the organisation’s data, the various designers may be tempted to emphasise similarities rather than differences in the data used by different application areas such as property sales and property renting.

High demand for resources
The data warehouse can use up large amounts of disk space. Many relational databases used for decision support are designed around star, snowflake and starflake schemas. These approaches result in the creation of very large fact tables. If there are many dimensions to the factual data, the combination of aggregate tables and indexes to the fact tables can use up more space than the raw data.

Data ownership
Data warehousing may change the attitude of end-users to the ownership of data. Sensitive data that was originally viewed and used only by a particular department or business area, such as sales or marketing, may now be made accessible to others in the organisation.

High maintenance
DWs are high-maintenance systems. Any reorganisation of the business processes and the source systems may affect the data warehouse. To remain a valuable resource, the DW must remain consistent with the organisation that it supports.

Long duration projects
A DW represents a single information resource for the organisation. However, building a warehouse can take up to three years, which is why some organisations are building data marts instead. Data marts support only the requirements of a particular department or functional area and can therefore be built more rapidly.

Complexity of integration
The most important area for the management of a DW is its integration capabilities. An organisation must spend a significant amount of time determining how well the different DW tools can be integrated into the overall solution that is needed. This can be a very difficult task, as there are a number of tools for every operation of the DW, and they must integrate well for the warehouse to work to the organisation’s benefit.

Data Warehouse vs. Data Mart (Navathe)
Enterprise warehouse: collects all information about subjects (customers, products, sales, assets, personnel) that span the entire organisation
- Requires extensive business modelling
- May take years to design and build
Data mart: departmental subsets that focus on selected subjects
- Marketing data mart: customer, product, sales
- Faster roll-out
- Complex integration in the long term

The Data Warehouse Bus
[Figure: business process data marts (Orders, Production) connected to shared dimensions: Time, Sales Rep, Customer, Promotion, Product, Plant, Distr. Center]
Allows the parallel development of business process data marts with the ability to integrate them.

The Business Dimensional Lifecycle
- Project Planning
- Requirement Definition
- Technical Architecture Design, then Product Selection & Installation
- Dimensional Modeling, then Physical Design, then Data Staging Design & Development
- End-User Application Specification, then End-User Application Development
- Deployment
- Maintenance and Growth
- Project Management (spans the whole lifecycle)

What is data mining?
Data mining is data analysis in order to discover hidden correlations (patterns, rules) in huge data sets.
“Data Mining is the process of extracting previously unknown, valid and actionable information from large databases and then using the information to make crucial business decisions” (Cabena, Hadjinian, Stadler, Verhees, Zanasi)

Enabling factors for data mining
- Data availability
  - Increased amount of electronically stored data
- Increased processing power
- Increased data storage ability
- Increased data gathering ability (networks, extraction tools)
- Increased number of data warehouses
- Business conditions
  - Increased need to compete effectively
  - Increased awareness of the need to know customers

Data mining uses in enterprises
- Predict customer patterns of behaviour, e.g. buying patterns
- Discover market developments driven by demographic changes
- Discover shifts in consumption
- Identification of new customers
- Anticipation of demands on inventory

Data mining process
[Figure: business problem -> extraction and transformation of data (this step: 70%-80% of the total time) -> mining data using a specific function -> analysis of results -> report results; several steps are marked "data mining expert needed"]

Primary operations in data mining
A number of basic operations/functions/techniques can be used for prediction and depiction:
- Link analysis: Association discovery
- Link analysis: Sequential pattern discovery
- Database segmentation: Clustering
- Predictive modelling: Classification
- Predictive modelling: Value prediction
- Forensic analysis: Discovering anomalies

Link Analysis: Association discovery
- Finds occurrences that are linked to a single event, e.g. centred on the transaction
- For example, discovers items that are bought/visited/done together
- Often in the form: x% of all records containing items A and B also contain items D and E
- “When a customer buys orange juice, the customer also buys brandy in 60% of cases”
- The beer and diapers example
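
The "x% of all records" form is usually expressed as the confidence of a rule. A minimal sketch, with made-up transactions chosen so that the orange juice rule comes out at 60%; the function names `support` and `confidence` follow common usage but are not taken from the slides.

```python
# Hypothetical market-basket transactions.
transactions = [
    {"orange juice", "brandy"},
    {"orange juice", "brandy", "bread"},
    {"orange juice", "brandy"},
    {"orange juice", "milk"},
    {"orange juice"},
    {"bread", "milk"},
]


def support(itemset):
    """Fraction of all transactions that contain every item in itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(antecedent, consequent):
    """Of the transactions containing antecedent, the share that also contain consequent."""
    return support(antecedent | consequent) / support(antecedent)


rule_confidence = confidence({"orange juice"}, {"brandy"})
print(round(rule_confidence, 2))  # 3 of the 5 orange juice baskets: 0.6
```

Association discovery tools search for all rules whose support and confidence exceed chosen thresholds; the sketch only evaluates one given rule.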

Link Analysis: Sequential pattern discovery
- Discovers sequences that show events linked over time
- Often in the form: x% of the customers who get B will get C at a later time
- Often used on a long time series of records in order to discover trends
- “20% of customers who buy a new carpet will later buy new curtains”
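
The difference from association discovery is that order matters. A minimal sketch with made-up purchase histories, chosen so that the carpet rule comes out at 20%; the helper `later_rate` is hypothetical.

```python
# Hypothetical purchase histories, each list in time order.
history = {
    "c1": ["carpet", "lamp", "curtains"],
    "c2": ["carpet"],
    "c3": ["curtains", "carpet"],   # curtains came first, so this does not count
    "c4": ["carpet", "sofa"],
    "c5": ["carpet", "paint"],
    "c6": ["sofa"],
}


def later_rate(first, then):
    """Share of customers who bought `first` and bought `then` afterwards."""
    bought_first = [h for h in history.values() if first in h]
    followed = [h for h in bought_first if then in h[h.index(first) + 1:]]
    return len(followed) / len(bought_first)


rate = later_rate("carpet", "curtains")
print(rate)  # 1 of 5 carpet buyers: 0.2
```

Customer c3 bought both items but in the wrong order, so a plain association rule would count c3 while the sequential pattern does not.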

Database Segmentation: Clustering
- Clustering identifies previously undiscovered groupings
- A cluster is a group of objects grouped together because of their similarity or proximity, for example similar behaviour
[Figure: scatter plot of customers (X) with axes Income and Debt; one cluster is labelled "Profitable customers!"]
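
One common way to find such groupings is k-means clustering. The slides do not name an algorithm, so k-means is used here only as a familiar example; the customer points (income, debt) and the starting centroids are made up.

```python
def kmeans(points, centroids, rounds=10):
    """Very small k-means: assign points to the nearest centroid, then re-centre."""
    for _ in range(rounds):
        clusters = [[] for _ in centroids]
        for x, y in points:
            nearest = min(
                range(len(centroids)),
                key=lambda i: (x - centroids[i][0]) ** 2 + (y - centroids[i][1]) ** 2,
            )
            clusters[nearest].append((x, y))
        # New centroid = mean of the cluster; keep the old one if the cluster is empty.
        centroids = [
            (sum(p[0] for p in c) / len(c), sum(p[1] for p in c) / len(c)) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return clusters, centroids


# Hypothetical customers as (income, debt).
customers = [(10, 8), (12, 9), (11, 7), (50, 5), (52, 6), (48, 4)]
clusters, centroids = kmeans(customers, centroids=[(10, 8), (50, 5)])
print(centroids)  # [(11.0, 8.0), (50.0, 5.0)]
```

The second cluster (high income, low debt) would correspond to a group like the "profitable customers" in the figure.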

Predictive modelling: Classification
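- Classify data items into one of several predefined classes
- For example, to predict whether a person is going to stay or leave as a customer
[Figure: decision tree that first tests "Customer > 2.5 (yrs)" and then "Services < 3", with the leaves STAY (A), LEAVE (B) and STAY (C); next to it, a plot of number of services against years as customer, divided into the regions A, B and C]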
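
A decision tree of this kind is just a sequence of nested tests. The sketch below uses the two tests named on the slide; which branch of the second test leads to LEAVE cannot be recovered from the transcript, so the assignment used here (fewer than three services means LEAVE) is an assumption.

```python
def classify(years_as_customer, services):
    """Predict STAY or LEAVE with a two-level decision tree."""
    if years_as_customer > 2.5:
        return "STAY"    # leaf A: long-standing customer
    if services < 3:
        return "LEAVE"   # leaf B (assumed branch): new customer with few services
    return "STAY"        # leaf C: new customer with many services


print(classify(4, 1))  # STAY
print(classify(1, 2))  # LEAVE
print(classify(1, 3))  # STAY
```

In practice the tests and thresholds are not written by hand but learned from historical records whose class is already known.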

Predictive modelling: Value prediction
- Value prediction, or regression, is a common statistical technique for modelling the relationship between two or more variables
- Linear prediction/regression attempts to fit a straight line through the data items
- Nonlinear prediction attempts to fit a nonlinear curve through the data set
[Figure: scatter plot of data points (X) with a fitted curve]
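
Fitting a straight line is usually done by ordinary least squares. A minimal sketch in plain Python; the function name `fit_line` and the data points are made up, and the points lie exactly on y = 2x + 1 so the result is easy to check.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]
slope, intercept = fit_line(xs, ys)
print(slope, intercept)  # 2.0 1.0

# Value prediction for an unseen x.
predicted = slope * 5 + intercept
print(predicted)  # 11.0
```

With noisy real data the line no longer passes through every point, and the fitted slope and intercept are the values that minimise the sum of squared vertical distances.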

Mining in e-Commerce systems
Information in a web server's log that can be used for data mining analysis:
- cookie ID (anonymous user)
- user ID (registered user), registration information
- IP address, MAC number
- date, time
- which web pages were accessed and in what order
- products sold to whom
- “Comet_cursor”
- Double-click
The beer and diapers example

Problems in data mining
- Limited information
- Noise and missing values
- Spurious (false) associations/patterns
- Expert knowledge needed

