Lecture 1 - Introduction to DW


1 Lecture 1 - Introduction to DW
Reading Requirements
- [EN] chapter 26
- [CB] chapter 25
- [AS] paper 1: "An overview of Data Warehousing and OLAP Technology" by Chaudhuri & Dayal

Keywords: DW, DSS, OLTP, OLAP, MDM, ROLAP, MOLAP, Data Mart

2 The Data Warehouse - definition
B. Inmon: "A data warehouse is a subject oriented, integrated, non-volatile, and time-variant collection of data in support of management's decisions". That is, a data warehouse is a subject-oriented, integrated, non-volatile and time-variant collection of data intended to support decision making at the strategic level.

S. Chaudhuri & U. Dayal: "Data warehousing is a collection of decision support technologies, aimed at enabling the knowledge worker (executive, manager, analyst) to make better and faster decisions."

Subject-oriented, because a data warehouse is organised around the objects that exist in the business (such as customer, employee, vendor) rather than around the application areas (such as sales, payroll and purchasing) around which the operational systems are built. This follows from the purpose of a data warehouse: it supports decision making, which needs subject-oriented rather than application-oriented data.

Integrated, because it uses data from different sources (different application-oriented systems). These sources often contain inconsistent data, for example because they use different formats to represent the same type of data. Data from the different sources therefore has to be integrated and made consistent before it can be worked with and presented to users.

Non-volatile, because the data warehouse is not updated on-line. Instead it is refreshed regularly by adding data from the operational systems. Existing data is not replaced by new data; new data is continually appended, and the data warehouse integrates it with the data already there.

Time-variant, because the data in the warehouse is accurate and valid only at a certain point in time or over a certain time interval. Data is also kept for considerably longer, and all data is associated with some kind of timestamp (directly or indirectly). In short, the data warehouse represents a series of snapshots of the business.

3 Subject-oriented
(Diagram) Operational systems: Sales, Payroll System, Purchasing System. Informational systems: Customer Data, Employee Data, Vendor Data.

4 Integrated
(Diagram) Operational systems: Marketing System, Order System, Billing System. Informational systems: Customer Data.

5 Time variant
(Diagram) Operational systems: Order System, 60-90 days. Informational systems: Customer Data, 5-10 years.

6 Non-volatile
(Diagram) Operational systems: Order System, with the operations Create, Update, Delete and Insert. Informational systems: Customer Data, with the operations Load and Access.
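The non-volatile and time-variant properties can be illustrated with a small append-only store. This is only a sketch under assumptions: the class `SnapshotStore`, its methods and the sample customer data are hypothetical and do not come from the lecture.

```python
from datetime import date


class SnapshotStore:
    """Append-only store: rows are loaded and read, never updated or deleted."""

    def __init__(self):
        # Each row is (load_date, customer_id, city); the date makes it time-variant.
        self._rows = []

    def load(self, load_date, records):
        """Append a batch of (customer_id, city) records stamped with load_date."""
        for customer_id, city in records:
            self._rows.append((load_date, customer_id, city))

    def as_of(self, customer_id, when):
        """Return the most recent city recorded for the customer on or before `when`."""
        matches = [r for r in self._rows if r[1] == customer_id and r[0] <= when]
        if not matches:
            return None
        return max(matches, key=lambda r: r[0])[2]


store = SnapshotStore()
store.load(date(2020, 1, 1), [(1, "Lund")])
store.load(date(2021, 1, 1), [(1, "Malmö")])  # the old row is kept, not overwritten
```

Because nothing is overwritten, `store.as_of(1, date(2020, 6, 1))` still answers with the older snapshot, which an operational system that updates in place could not do.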

7 Decision Support and OLAP (by Navathe)
Information technology to help the knowledge worker (executive, manager, analyst) make faster and better decisions.
- Will a 10% discount increase sales volume sufficiently?
- Which of two new medications will result in the best outcome: higher recovery rate and shorter hospital stay?
- How did the share price of computer manufacturers correlate with quarterly profits over the past 10 years?

On-Line Analytical Processing (OLAP) is an element of a decision support system (DSS).

8 Data Warehouse (Navathe)
A decision support database that is maintained separately from the organisation's operational databases. A data warehouse is a subject oriented, integrated, time-varying, non-volatile collection of data that is used primarily in organisational decision making.

9 OLTP vs. OLAP
OLTP                                     | OLAP
holds current data                       | holds historic data
stores detailed data                     | stores detailed and summarised data
data is dynamic                          | data is largely static
repetitive processing                    | ad-hoc, unstructured and heuristic processing
high level of transaction throughput     | medium or low level of transaction throughput
predictable pattern of usage             | unpredictable pattern of usage
transaction driven                       | analysis driven
application oriented                     | subject oriented
supports day-to-day decisions            | supports strategic decisions
serves large number of operational users | serves relatively low number of managerial users
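The contrast between the two workloads can be sketched with Python's built-in `sqlite3` module. The `orders` table and its rows are hypothetical sample data, and a single SQLite database is used only for brevity; it is not how the lecture proposes to deploy either system.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, year INTEGER, amount REAL)"
)
conn.executemany(
    "INSERT INTO orders (customer, year, amount) VALUES (?, ?, ?)",
    [("acme", 2019, 100.0), ("acme", 2020, 150.0), ("bolt", 2019, 80.0), ("bolt", 2020, 60.0)],
)

# OLTP style: a short, predictable transaction that changes one detailed row.
with conn:
    conn.execute("UPDATE orders SET amount = ? WHERE id = ?", (110.0, 1))

# OLAP style: a read-only query that summarises many rows of historic data.
yearly_totals = conn.execute(
    "SELECT year, SUM(amount) FROM orders GROUP BY year ORDER BY year"
).fetchall()
```

The update touches one row by key, while the aggregate scans the whole table, which is why the two workloads are usually tuned differently.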

10 Why separate data warehouse?
Performance
- The operational DBs are tuned to support known OLTP workloads
- Supporting OLAP requires special data organisations, access methods and implementation methods

Function
- Decision support requires data that may be missing from the operational DBs
- Decision support usually requires consolidating data from many heterogeneous sources

11 Architecture
(Diagram) Data sources (operational DBs, external sources) are extracted, transformed, loaded and refreshed into the data warehouse and data marts. These serve the OLAP servers, which feed the front-end tools for analysis, query/reporting and data mining. A metadata repository and monitoring & administration tools accompany the warehouse.

12 OLAP for Decision Support (Navathe)
- Goal of OLAP is to support ad-hoc querying for the business analyst
- Business analysts are familiar with spreadsheets
- Extend spreadsheet analysis model to work with warehouse data
  - Large data set
  - Semantically enriched to understand business terms (e.g., time, geography)
  - Combined with reporting features
- Multidimensional view of data is the foundation for OLAP

13 Data Modelling for Data Warehouses
See the examples in [EN] chapter 26

14 Data Modelling for Data Warehouses?
A data cube (figure) with three axes: product (p123, p124, p125), fiscal quarter (qtr1, qtr2, qtr3) and region (reg1, reg2, reg3).
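A cube like this can be sketched as a mapping from coordinates to a measure. The axis labels are taken from the slide; the sales amounts are made-up sample values, and a dictionary is only one possible representation.

```python
from collections import defaultdict

# Hypothetical sales records: (product, fiscal quarter, region, amount).
sales = [
    ("p123", "qtr1", "reg1", 10),
    ("p123", "qtr2", "reg1", 20),
    ("p124", "qtr1", "reg2", 5),
    ("p125", "qtr3", "reg3", 7),
    ("p123", "qtr1", "reg1", 3),
]

# Each cell of the cube is addressed by one value per dimension.
cube = defaultdict(int)
for product, quarter, region, amount in sales:
    cube[(product, quarter, region)] += amount
```

Records that share all three coordinates fall into the same cell, so the two p123/qtr1/reg1 records are summed.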

15 Data Modelling for Data Warehouses?
Pivoted version of the data cube (figure): the same cube with its axes (region, product, fiscal quarter) rotated into a different orientation.
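Pivoting only changes which dimensions are shown as rows and columns; the cells stay the same. A minimal sketch, assuming the cube is stored as a dictionary of (product, quarter, region) tuples; the helper `view` and the sample values are hypothetical.

```python
PRODUCT, QUARTER, REGION = 0, 1, 2  # positions of the dimensions in each key


def view(cells, row_dim, col_dim):
    """Aggregate cube cells into a 2-D table of the form {row: {column: total}}."""
    table = {}
    for coords, value in cells.items():
        row, col = coords[row_dim], coords[col_dim]
        table.setdefault(row, {})
        table[row][col] = table[row].get(col, 0) + value
    return table


cells = {
    ("p123", "qtr1", "reg1"): 10,
    ("p123", "qtr2", "reg1"): 20,
    ("p124", "qtr1", "reg2"): 5,
}

by_region = view(cells, REGION, QUARTER)  # regions as rows, quarters as columns
pivoted = view(cells, QUARTER, REGION)    # the same data with the axes swapped
```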

16 Data Modelling for Data Warehouses
See the examples in [EN] chapter 26

17 Star-Join Schema
- A single fact table and a single table for each dimension
- Every fact points to one tuple in each of the dimensions and has additional attributes
- Does not capture hierarchies directly
- Generated keys are used for performance and maintenance reasons
- Fact constellation: multiple fact tables that share many dimension tables
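A star-join schema can be sketched with Python's built-in `sqlite3` module. The table names, columns and rows below are hypothetical; the integer `*_key` columns stand in for the generated keys mentioned above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- One table per dimension, each with a generated (surrogate) key.
CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, name TEXT, category TEXT);
CREATE TABLE dim_region  (region_key  INTEGER PRIMARY KEY, name TEXT);

-- A single fact table: one foreign key per dimension plus the measure.
CREATE TABLE fact_sales (
    product_key INTEGER REFERENCES dim_product(product_key),
    region_key  INTEGER REFERENCES dim_region(region_key),
    amount      REAL
);

INSERT INTO dim_product VALUES (1, 'p123', 'tools'), (2, 'p124', 'toys');
INSERT INTO dim_region  VALUES (1, 'reg1'), (2, 'reg2');
INSERT INTO fact_sales  VALUES (1, 1, 10.0), (1, 2, 20.0), (2, 1, 5.0);
""")

# A typical star join: the fact table joined to each dimension, then grouped.
rows = conn.execute("""
    SELECT p.category, r.name, SUM(f.amount)
    FROM fact_sales f
    JOIN dim_product p ON p.product_key = f.product_key
    JOIN dim_region  r ON r.region_key  = f.region_key
    GROUP BY p.category, r.name
    ORDER BY p.category, r.name
""").fetchall()
```

Note that `category` is stored as a plain column of `dim_product`, so the product hierarchy is not represented directly.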

18 Snowflake Schema
- Represents the dimensional hierarchy directly by normalising the dimension tables
- Saves storage
- Reduces the effectiveness of browsing
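Normalising a dimension can be sketched as follows, again with `sqlite3` and hypothetical tables: the product category moves into its own table, so each category name is stored once, at the cost of an extra join when browsing.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- The product -> category hierarchy is represented by a separate table.
CREATE TABLE dim_category (category_key INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE dim_product (
    product_key  INTEGER PRIMARY KEY,
    name         TEXT,
    category_key INTEGER REFERENCES dim_category(category_key)
);
CREATE TABLE fact_sales (
    product_key INTEGER REFERENCES dim_product(product_key),
    amount      REAL
);

INSERT INTO dim_category VALUES (1, 'tools'), (2, 'toys');
INSERT INTO dim_product  VALUES (1, 'p123', 1), (2, 'p124', 1), (3, 'p125', 2);
INSERT INTO fact_sales   VALUES (1, 10.0), (2, 20.0), (3, 5.0);
""")

# Rolling up to category level now needs two joins instead of one.
totals = conn.execute("""
    SELECT c.name, SUM(f.amount)
    FROM fact_sales f
    JOIN dim_product  p ON p.product_key  = f.product_key
    JOIN dim_category c ON c.category_key = p.category_key
    GROUP BY c.name
    ORDER BY c.name
""").fetchall()
```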

19 Approaches to OLAP Servers
Relational OLAP (ROLAP)
- Relational and extended relational DBMS to store and manage warehouse data
- Schema design
- Extended SQL

Multidimensional OLAP (MOLAP)
- Array-based storage structure (n-dimensional array)
- Direct access to array data structure
- Good indexing properties
- Poor storage utilisation when the data is sparse
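The MOLAP trade-off can be made concrete with a dense array stored as a flat list. This is a sketch, not how any particular OLAP server lays out its data; the facts are hypothetical and only the axis labels come from the earlier cube example.

```python
products = ["p123", "p124", "p125"]
quarters = ["qtr1", "qtr2", "qtr3"]
regions = ["reg1", "reg2", "reg3"]


def offset(p, q, r):
    """Row-major position of cell (p, q, r) in the flat array."""
    return (p * len(quarters) + q) * len(regions) + r


# One slot per possible cell, whether or not a fact exists for it.
dense = [0] * (len(products) * len(quarters) * len(regions))

facts = [("p123", "qtr1", "reg1", 10), ("p125", "qtr3", "reg3", 7)]
for product, quarter, region, amount in facts:
    pos = offset(products.index(product), quarters.index(quarter), regions.index(region))
    dense[pos] = amount


def lookup(product, quarter, region):
    """Direct access: the cell position is computed, not searched for."""
    return dense[offset(products.index(product), quarters.index(quarter), regions.index(region))]


# Share of slots that actually hold a fact.
utilisation = sum(1 for v in dense if v != 0) / len(dense)
```

Here 27 slots are allocated for 2 facts, which is the poor storage utilisation on sparse data that the slide refers to; a relational fact table would store only the 2 rows.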

20 Multi-dimensional OLAP (MOLAP)
(Diagram) A relational DB server and/or legacy systems load data into the MOLAP server (database & application logic layer). End-user access tools (presentation layer) send data requests to the MOLAP server and receive result sets.

21 Relational OLAP (ROLAP)
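(Diagram) End-user access tools (presentation layer) send data requests to the ROLAP server (application logic layer), which sends SQL to the DB server (database layer). Result sets flow back from the DB server to the ROLAP server and on to the end-user access tools.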

22 Managed Query Environment (MQE)
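(Diagram) End-user access tools send SQL to the relational DB server and receive a result set, or send data requests to a MOLAP server, which is loaded from the relational DB server, and receive a result set.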

23 DB2’s Integration Server Architecture
(Diagram) Components: Integration Server desktop (OLAP Model, OLAP Metaoutline), Integration Server, OLAP Command Interface, OLAP Metadata Catalog, relational data source, DB2 OLAP server and DB2 OLAP database. The connections are labelled TCP/IP and ODBC.

24 Architecture
(Diagram) The warehouse architecture of slide 11, repeated: data sources, the extract/transform/load/refresh step, the data warehouse and data marts with the metadata repository and monitoring & administration tools, OLAP servers, and front-end tools for analysis, query/reporting and data mining.

25 Back End Tools and Utilities
Extract & Transform
- Data selection
- Data cleaning
  - Data migration: "replace the string gender by sex"
  - Data scrubbing: based on domain specific knowledge
  - Data auditing: a variant of data mining
- Data enrichment
- Data aggregation
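A few of these steps can be sketched on a handful of records. The records, the validity rules and the reading of the migration rule as renaming the field `gender` to `sex` are all assumptions made for illustration.

```python
# Hypothetical rows as extracted from a source system.
raw = [
    {"id": 1, "gender": "F", "city": " lund ", "amount": "10.5"},
    {"id": 2, "gender": "m", "city": "Malmö", "amount": "4"},
    {"id": 3, "gender": "?", "city": "Lund", "amount": "n/a"},
]


def transform(record):
    """Return a cleaned record, or None if the record cannot be repaired."""
    try:
        amount = float(record["amount"])  # cleaning: amounts must be numeric
    except ValueError:
        return None
    sex = record["gender"].strip().upper()  # migration: field 'gender' becomes 'sex'
    if sex not in {"F", "M"}:               # scrubbing: simple domain rule
        return None
    return {
        "id": record["id"],
        "sex": sex,
        "city": record["city"].strip().title(),  # one spelling per city
        "amount": amount,
    }


clean = [row for row in map(transform, raw) if row is not None]

# Aggregation: total amount per city.
totals = {}
for row in clean:
    totals[row["city"]] = totals.get(row["city"], 0.0) + row["amount"]
```

In this sketch a record that fails a rule is simply dropped; a real cleaning tool would more likely route it to a reject file for inspection.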

26 Back End Tools and Utilities
Load
- Full loading: a long batch transaction, takes a long time
- Incremental loading: during refresh

Refresh
- When: periodically, e.g., daily or weekly
- How:
  - Extracting the entire source: sometimes the only way when dealing with legacy data sources
  - Incremental refresh: supported by replication servers
    - Data shipping
    - Transaction shipping
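The difference between extracting the entire source and refreshing incrementally can be sketched as below. The high-water-mark technique used here (remember the largest row id already loaded) is a swapped-in simplification for illustration; it is not the data shipping or transaction shipping done by replication servers, and all names are hypothetical.

```python
def full_refresh(source_rows):
    """Rebuild the warehouse copy from the entire source."""
    return dict(source_rows)


def incremental_refresh(warehouse, source_rows, last_loaded_id):
    """Copy only rows with an id above last_loaded_id; return the new high-water mark."""
    new_mark = last_loaded_id
    for row_id, payload in source_rows.items():
        if row_id > last_loaded_id:
            warehouse[row_id] = payload
            new_mark = max(new_mark, row_id)
    return new_mark


source = {1: "order A", 2: "order B"}
warehouse = {}
mark = incremental_refresh(warehouse, source, 0)     # first run copies everything

source[3] = "order C"                                # new activity in the source
mark = incremental_refresh(warehouse, source, mark)  # second run copies only row 3
```

This sketch only picks up inserted rows. Updates and deletes in the source would need a change log or timestamps, which is one reason a full extract is sometimes the only option for legacy sources.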

27 Front End Tools - Basic Functionality
- Pivoting
- Rollup (drill-up) and drill-down
- Slice-and-dice
- Ranking (sorting)
- Selection
- Computed attributes
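Several of these operations can be sketched on a cube held as a dictionary of (product, quarter, region) cells. The function names and sample values are hypothetical and are not taken from any OLAP product.

```python
from collections import defaultdict

PRODUCT, QUARTER, REGION = 0, 1, 2

cells = {
    ("p123", "qtr1", "reg1"): 10,
    ("p123", "qtr2", "reg1"): 20,
    ("p124", "qtr1", "reg2"): 5,
    ("p125", "qtr1", "reg1"): 7,
}


def rollup(cells, keep):
    """Roll up: sum away every dimension not listed in `keep`."""
    out = defaultdict(int)
    for coords, value in cells.items():
        out[tuple(coords[d] for d in keep)] += value
    return dict(out)


def slice_(cells, dim, value):
    """Slice: fix one dimension to a single value."""
    return {c: v for c, v in cells.items() if c[dim] == value}


def dice(cells, allowed):
    """Dice: keep cells whose coordinates fall in the allowed sets per dimension."""
    return {c: v for c, v in cells.items()
            if all(c[d] in values for d, values in allowed.items())}


def rank(totals):
    """Ranking: sort aggregated cells by value, largest first."""
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


by_product = rollup(cells, keep=(PRODUCT,))                      # coarse view
by_product_quarter = rollup(cells, keep=(PRODUCT, QUARTER))      # drill-down adds a dimension
first_quarter = slice_(cells, QUARTER, "qtr1")
subcube = dice(cells, {PRODUCT: {"p123", "p125"}, REGION: {"reg1"}})
ranking = rank(by_product)
```

Drill-down appears here as the inverse of rollup: the same function is called with more dimensions kept.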

28 Metadata Repository
- Warehouse schema, view & derived data definitions
- Predefined queries and reports
- Data mart locations and contents
- Data partitions
- Data extraction, cleaning, transformation rules, defaults
- Data refresh and purging rules
- User profiles, user groups
- Security: user authorisation, access control

29 Problems of Data Warehousing
- Underestimation of resources for data loading
- Hidden problems with source systems
- Required data not captured
- Increased end-user demands
- Data homogenisation
- High demand for resources
- Data ownership
- High maintenance
- Long duration projects
- Complexity of integration

30 Data Warehouse vs. Data Mart (Navathe)
Enterprise warehouse: collects all information about subjects (customers, products, sales, assets, personnel) that span the entire organisation
- Requires extensive business modelling
- May take years to design and build

Data mart: departmental subsets that focus on selected subjects
- Marketing data mart: customer, product, sales
- Faster roll-out
- Complex integration in the long term

