type
Post
Created date
Nov 3, 2022 11:52 AM
category
Data Science
tags
Data Warehousing
status
Published
Language
English
From
School
summary
slug
password
Author
Jason Ching Yuen Siu
Priority
Featured
Featured
Cover
Origin
Type
URL
Youtube
Youtube
icon

Summary

In principal, a Bridge Table is used under three scenarios:

  • Indirect relationship with Fact table. When it is impossible to have a dimension connected directed to the Fact table, because simply there is no relationship between this dimension and the Fact table (e.g. in the Product Sales case study, it is impossible to have a direct link from SupplierDim to ProductSalesFact)
  • Many-to-Many relationship. When an entity (which will become a dimension) has a many-many relationship with another entity (dimension) in the E/R schema of the operational database (e.g. Supplier and Stock has a many-many relationship).
  • Temporal dim. When temporality aspect (data history) is maintained in the operational database and the bridge table can be used to accommodate the dimension that has temporal attributes (e.g. product supply history is maintained in the second snowflake schema example)

And when it is used, we will implement:

1) Weight Factor and 2) List aggregate

Why a dimension cannot be connected directly to the FACT?

a) The Fact table has a fact measure, and the dimension has a key identity. In order to connect a dimension to the Fact, the dimension’s key identity must contribute DIRECTLY to the calculation of the fact measure.
  • Unfortunately, this cannot happen if the operational database does not have this data, so we need to use Bridge Table.
b) The op. database does not have this data if the relationship between two entities in the operational database that hold the information about dimension’s key identity and the intended fact measure is a MANY-to-MANY relationship.

What is weight factor?

  • To estimate the contribution of a dimension in the calculation of the fact measure
  • Because this is only an estimate, a weight factor is optional.
create table TripDim2 as select T.TripID, T.TripDate, T.TotalKm, 1.0/count(*) as WeightFactor from Trip T, Destination D where T.TripID = D.TripID group by T.tripid, T.tripdate, T.totalkm;

What is list aggregate?

Concatenate the list
create table TripDim3 as select T.TripID, T.TripDate, T.TotalKm, 1.0/count(D.StoreID) as WeightFactor, listagg (D.StoreID, ’_’) within group (order by D.StoreID) as StoreGroupList from Trip T, Destination D where T.TripID = D.TripID group by T.TripID, T.TripDate, T.TotalKm;

Case study 1: A Product Sales Case Study

Context:

notion image

Solution

notion image
notion image
notion image
notion image
notion image
notion image
notion image
notion image

Sales

 
notion image
notion image
notion image

Why isn't this report correct?

  • This report seems correct but it is impossible to be produced.
  • Because in the E/R diagram, it shows that one supplier may supply many products (Many2Many) -- so the sales do not record which supplier supplied a product.
  • Therefore, we can conclud that 1) there is no direct relationship between suppliers and total sales. 2) supplier information is not available in the sales of a particular product.

If the report is not correct, what should we do?

  • We move the Supplier Dimension from being connected to the Fact and to create a relationship with the Product Dimension — that means we added a Bridge Table between Product Dimension and Supplier Dimension.
  • The bridge table has three information in which we can “drill down” for information on products: location, date and quantity of supplies.

Why are there three kinds of representations? What is the difference?

notion image
notion image
notion image
notion image
Figure 5.7 for further details. 5.7 is where you — as the manager —want to see the detail of the dim that cannot be directly connected (i.e., Supplier). For example, we know that product A is performing particularly well, and we want to further information about details of the supply history and the suppliers.
Figure 5.8 for temporal dim. 5.8 is where the operational database does not maintain a master list of all suppliers (assume that each supplier may supply any product in an ad hoc manner), we will not have the Supplier Dimension in the schema in the figure 5.8. Instead, we will only have the Bridge Table; however, for each Product-Supplier pair, we will have Location and Date implemented as a weak entity, because for each productsupplier, there are multiple supplies.
Therefore, the Bridge Table here acts as a temporal dimension. This means, for each product, there is a list of the history of supplies.
Figure 5.9 for non-temporal warehouse. 5.9 is where the history is not needed (in a non-temporal data warehouse). You can see that the Product Supplier Bridge Table will only have ProductNo and SupplierName attributes, without the history of supplies.

Case study 2: A Truck Delivery Case Study

Context

notion image

Solution

notion image
notion image
notion image
notion image
 

A Truck Delivery Case Study

A trucking company is responsible for picking up goods from the warehouses of a retail chain company and delivering the goods to individual retail stores. A truck carries goods on a single trip, which is identified by TripID and delivers these goods to multiple stores. Trucks have different capacities for both the volumes they can hold and the weights they can carry.
Currently, a truck makes several trips each week. An operational database is being used to keep track of the deliveries, including the scheduling of trucks to provide timely deliveries to stores.
notion image
(i) a trip may pick up goods from many warehouses (i.e. a many-many relationship between Warehouse and Trip);
(ii) a trip uses one truck only, and obviously a truck may have many trips in the history (i.e. a many1 relationship between Trip and Truck);
(iii) a trip delivers goods (e.g. TVs, fridges, etc.) potentially to several stores (a many-many relationship between Trip and Store, which is represented by the Destination table)
notion image

Why isn't this report correct?

If you look at the Tables 5.9, 5.10, 5.11, 5.12, 5.13, and 5.14, you will see that there is no direct relationship between Store and Total Delivery Cost in the Fact Table.
This is due to the many-many relationship between the Trip entity and Store entity in the E/R diagram of the operational database.

If the report is not correct, what should we do?

Solution 1: Using a Bridge Table

  • To solve this problem, a Bridge Table can be used, as shown in Fig. 5.12.
notion image
  • For this solution, we can only see the cost for Trip1 is $x and Trip1 delivered to y number of stores. For example, it is found that the length of trip1 is 370 km on 14 April using Truck1, and the cost per kilometre for Truck1 is $1.20. Trip1 delivered goods to five stores at a cost of $444 (370 * 1.2).

Solution 2: Add a Weight Factor Attribute

Problem is that it is impossible to calculate the delivery cost for each of these five stores for Trip1. So, we can estimate the total delivery cost per store, if we want to. This can be estimated through the “Weight Factor” (see the Weight Factor attribute in the Trip Dimension in Fig. 5.13).
notion image
notion image
A weight factor is a proportion of the trip that goes to each store for that particular trip. For example, if Trip1 went to five stores, then the Weight Factor is 0.2 (or 20%).
  • This implies that each store “contributes” 20% of the total delivery cost for that trip. This is certainly inaccurate, but this is the only estimate that we can make, based on the data that we have.
The 0.2 weight factor for Trip1 does not mean that Trip1 has 20% of the total delivery cost; rather, Trip1 delivers to five stores, and each store contributes 20% (or 0.20) to the total delivery cost.

Solution 3: A List Aggregate Version

Instead of having M1, M2, M3… per trip for each row, we can concatenate all the stores passed during the trip (e.g., M1_M2_M3_M4_M8 ), and store it as an attribute.
To create the Trip Dimension table with a StoreGroupList attribute, we can use the listagg function in SQL. The listagg function has the following format:
listagg (Attr1, ’_’) within group (order by Attr1) as ColumnName
where Attr1 is the StoreID and the “” indicates that the StoreIDs are concatenated with the “” symbol (e.g. M1_M2_M3_M4_M8). If we want to list the stores listed in descending order (e.g. M8_M4_M3_M2_M1), then we use within group (order by Attr1 Desc). The SQL to create the Trip Dimension table becomes:
create table TripDim3 as select T.TripID, T.TripDate, T.TotalKm, 1.0/count(D.StoreID) as WeightFactor, listagg (D.StoreID, ’_’) within group (order by D.StoreID) as StoreGroupList from Trip T, Destination D where T.TripID = D.TripID group by T.TripID, T.TripDate, T.TotalKm;
Pros:
  • Visually, the List Aggregate attribute (e.g. the StoreGroupList attribute in the Trip Dimension) is appealing because it is very easy to see the complete list of stores for each trip.
  • A preferred option in the industry, because a group list is physically listed in the parent dimension, which may visually help the decision-makers to understand the completeness of the group list (e.g. StoreID for each Trip).
Cons:
  • This information is rather redundant because we can get the same information from the Bridge Table
  • A non-List Aggregate version is simpler and cleaner.
  • The implementation is simpler and more straightforward.
  • There is no redundant information on the stores within each trip
 

Case study (Lecture Activity 7): Publication

notion image
You are required to build a data warehouse to analyse number of publications for each year, research field, publication type, and academic/author. Note the history of each academic job rank in the university is maintained in the database.
You are required to build a data warehouse to analyse number of publications for each year, research field, publication type, and academic/author. Note the history of each academic job rank in the university is maintained in the database.
notion image
 
 
FIT3003 - W3 - Average in the FactFIT3158 Note - W11 Monte Carlo Simulation in applying the queuing problem