641 Week 7 Outline


		Multimedia Test System

B U L L E T I N B O A R D

641 Week 7 Outline

(Subject: Database/Authored by: Liping Liu on 10/12/2024 4:00:00 AM)/Views: 3949

Blog News Post

Warm-Up Queries:

Find the employment year of each employee
Find the departments in Dallas
Find the salesmen who were hired after 1980

--Find menagers whose name ends with M

select ename
from emp
where job = 'MANAGER' and ename like '%M';

--Give salesman 100 commission

update emp
set comm = comm + 100
where job = 'SALESMAN';

--handle missing values

upate emp
set comm = 0
where comm is null;

select 12*sal + comm
from emp
group by empno;

--use nvl(comm, 0)

update emp
set comm = nvl(comm, 0) + 100
where job = 'SALESMAN';

Lecture 1: Intermediate SQL Programming: Join

implicit join (outer and inner), Explicit Join: inner join, left join, right join, full outer join (or cross join)

Find all the employees located in Dallas
Find all employees and their associated departments
Find all departments and their associated employees

Explicit Joins

Just like + operation in arithmetics to combine two tables, may be used in parentheses for nested combination of more than two tables
Syntax: TableA join TableB on (criterion to join)
three explicit Join types: left, right, inner
Equivalent to implicit joins (e.g., where dept.deptno = emp.deptno) but may be more efficient than implicit joins
Inner join -- find records from both tables that are matching
left join -- find records from the left table regardless whether they have matching records from the right table or not (may be also don using the right table key is null)
right join -- find the records from the right table regardless whether they have matching records from the left table or not (may be done using left table key is null)

Examples:

Example 1: Find employees and their matching department names

Example 2: Find all employee names and their department names regardless whether the employee has a department or not

Example 3: Find all department names and their employee names regardless whether the department has employees or not

Find employees in Chicago who were hired before 1983
Fine employees in SALES department who makes more than 500.

Example 4: create a directory of employees' dept name, ename, job

select dept.deptno, dname, ename, job, emp.deptno
from emp, dept;

--outer join or cross join = Cartesian product of records from both tables

select dept.deptno, dname, ename, job, emp.deptno
from dept, emp
where dept.deptno = emp.deptno;

--implict inner join, not very efficient

--explicit join --create temperory table from multiple tables using joins: inner join, left join and right right

Example 5: Find employees and their matching department names

select dept.deptno, dname, ename, emp.deptno
from dept inner join emp on (dept.deptno = emp.deptno);

Example 6: Find employees located in Dallas

select ename, loc
from dept inner join emp on (dept.deptno = emp.deptno)
where loc = 'DALLAS';

Example 7: Find all employee names and their department names regardless whether the employee has a department or not

select ename, dname
from emp left join dept on (emp.deptno = dept.deptno);

Example 8: Find all department names and their employee names regardless whether the department has employees or not

select dname, ename
from emp right join dept on (dept.deptno = emp.deptno);

Lecture 2: Intermediate SQL Programming: Statistic or Group Functions

Descriptive Statistic Functions: COUNT, SUM, AVG, MAX, MIN, STDEV, Variance, Mode, Median, quantiles, plus x sigma, minus x sigma, top n outliers, bottom n outliers. and listagg

Advanced Statistics Functions:

Hypothesis testing (t-test, F-test, Kolmogorov-Smirnov test, Mann Whitney test, Wilcoxon Signed Ranks test
Correlations analysis (parametric and nonparametric e.g. Pearson’s test for correlation, Spearman's rho coefficient, Kendall's tau-b correlation coefficient)
Ranking functions
Cross Tabulations with Chi-square statistics
Linear regression
ANOVA (Analysis of variance)
Test Distribution fit (e.g. Normal distribution test, Binomial test, Weibull test, Uniform test, Exponential test, Poisson test, etc.)

SELECT group function, column_name_(ones after_group_by)

FROM table_names

WHERE conditions (apply to individual rows)

GROUP BY column_name

HAVING conditions (apply to groups)

Example 1: find the total number of employees and the total amount of salary

Example 2: Find the average salary of each department

Example 3: Find those departments whose minimal salary is less than 500

Example 4: Find those job categories whose average salary is more than 2000

Example 5: Find the most recent employment among those hired before 1980 for each department

Example 6: Find the average salary of those employees whose name starts with J in each department

Example 7: Find those job categories whose maximum salary among the people hired after 1980 is more than 2500

No Case Selection and Division

--find # of employees

select count(*)
from emp;

--find total salary of all employees

select sum(sal)
from emp;

--min salary

select min(sal)
from emp;

select max(sal)
from emp;

--all descriptive stat of salaries for all employees

select min(sal), max(sal), avg(sal), stddev(sal)
from emp;

--fine the most recent hiredate
--this is wrong because of mixing group characteristics with individual records

select max(hiredate), ename
from emp;

Case Selection

--# of employees hired after 1981

select count(*)
from emp
where hiredate > to_date('1981', 'YYYY');

--average sal for salemen

select avg(sal)
from emp
where upper(job) = upper('salesman');

--upper() and lower() for case conversions

--max and min salaries for employees in Dallas

select max(sal), min(sal)
from dept inner join emp on (dept.deptno = emp.deptno)
where loc = 'DALLAS';

Case Division --Factor Analysis

--Find the average salary of each department
--no columns are allowed with stat functions except for ones after group by

select avg(sal), deptno
from emp
group by deptno;

select avg(sal), dname
from dept inner join emp on (dept.deptno = emp.deptno)
group by dname;

--find max salary for each job

select max(sal), job
from emp
group by job;

--find total number of employees hired each year

select count(*), to_char(hiredate, 'YYYY')
from emp
group by to_char(hiredate, 'YYYY');

--find total number of employees hired each month

select count(*), to_char(hiredate, 'Mon')
from emp
group by to_char(hiredate, 'Mon');

--which month the company hired most employees

--find standard deviation of salaries for each location

select stddev(sal), loc
from dept inner join emp on (dept.deptno = emp.deptno)
group by loc;

--list all the names of employees as one string in each department.

select deptno, listagg(ename, ',') as Employees
from emp
group by deptno;

-- order the list within each group

select deptno, listagg(ename, ',') within group (order by ename desc) as Employees

from emp

group by deptno;

Factor Analysis with Case Selection

--find # of employees hired after 1982 for each dept

select count(*), deptno
from emp
where hiredate > '31-Dec-1982'
group by deptno;

--find average salaries for eacg job-dept pair among employees hired before 1984

select avg(sal), deptno, job
from emp
where hiredate < '01-Jan-1984'
group by deptno, job;

Factor Selection (Selecting Groups)

-- Find those departments whose minimal salary is less than 900

select deptno
from emp
group by deptno
having min(sal) < 900;

--Find departmehts that has less than 3 employees

select dname
from dept inner join emp on (dept.deptno = emp.deptno)
group by dname
having count(*) < 3;

--find location that has max salary over 2000

select loc
from dept inner join emp on (dept.deptno = emp.deptno)
group by loc
having max(sal) > 2000;

--find the year the company hired two or more employees

select to_char(hiredate, 'YYYY')
from emp
group by to_char(hiredate, 'YYYY')
having count(*) >= 2;

--Find those job categories whose average salary is more than 2000

select job
from emp
group by job
having avg(sal) > 2000;

--Find the most recent employment among those hired before 1980 for each department

select max(hiredate)
from emp
where hiredate < '01-Jan-1980'
group by deptno;

Finding Duplicate Data

In transactional databases, it is rear to have duplicate records because the PK values must be unique. However, in data sets for analytics, duplication happens very often. Group by clause can be used to detect duplications.

For example, for the following table with duplicates,

drop table duplicates;

create table duplicates (
x number(2),
y char(1),
z date
);

insert into duplicates
Values (12, 'Y', '12-Jan-1998');

insert into duplicates
Values (10, 'N', '12-Jan-2008');

insert into duplicates
Values (12, 'Y', '12-Jan-1998');

We can just write the following statement:

select *

from duplicates

group by x, y, z

having count(*) > 1;

The tedious part of the query is to list all columns in the table. Unfortunately, no one can write an SQL query like the above without a complete list of all columns. However, I wrote the following SQL statement that can dynamically generate the above statement for any table with any columns (I am sure you will come back here to copy the following statement for your work in the future because I am only the person wrote it):

select 'select * from ' || table_name || ' group by ' || listagg(column_name, ', ') || ' having count(*) > 1;'
from user_tab_columns
where table_name = 'DUPLICATES'
group by table_name;

You just need to replace 'DUPLICATES' by your own table name. After running the statement, you will get the following as output:

select * from DUPLICATES group by X, Y, Z having count(*) > 1;

which you can use to find duplicates in your own table.

Advanced Statistic Analysis

This topic is beyond the scope of Database Management. Here I merely use a few examples to demonstrate the capabilities of OLAP using advanced stat functions.

ANOVA

STATS_ONE_WAY_ANOVA Example: The following example determines the significance of the differences in mean salaries within a job and differences in mean salaries between jobs. The results, p_values close to zero, indicate that, for different jobs, the difference in salaries across different locations is significant.

select stats_one_way_anova(job, sal, 'F_RATIO') as F, stats_one_way_anova(job, sal, 'SIG') as p_value
from emp;

The following example checks if people in different locations have significant differences in salaries. The F ratio is not significantly different from 1, and the p-value is larger than a typical threshold of 0.1, 0.05, and 0.01. Thus, we do not have sufficient evidence to reject the null hypothesis: the variation in salary between locations is not different from the variation in salary within a location.

select stats_one_way_anova(loc, sal, 'F_RATIO') as F, stats_one_way_anova(loc, sal, 'SIG') as p_value
from emp inner join dept on (dept.deptno = emp.deptno);

The following statement conducts ANOVA test of the impact of job on salary at each location:

  SELECT loc,
       STATS_ONE_WAY_ANOVA(job, sal, 'F_RATIO') f_ratio,
       STATS_ONE_WAY_ANOVA(job, sal, 'SIG') p_value
  FROM dept, emp
  WHERE dept.deptno = emp.deptno
  GROUP BY loc;

Chi-Square Test (Contingency Table)

Chi-Square tests are used to check if two distributions are the same (null hypothesis). As a special case, it can test whether two categorical variables are independent (null hypothesis).

Oracle STATS_CROSSTAB Example: The following example determines the independence between location and job:

 SELECT STATS_CROSSTAB(JOB, LOC, 'CHISQ_OBS') chi_squared,
 STATS_CROSSTAB(JOB, LOC, 'CHISQ_SIG') p_value,
 STATS_CROSSTAB(JOB, LOC, 'PHI_COEFFICIENT') phi_coefficient,
 STATS_CROSSTAB(JOB, LOC, 'CRAMERS_V') Cramers_v
 FROM dept, emp where dept.deptno = emp.deptno;

Note: PHI_Coefficient is a measure of association independent of the sample size. Phi ranges between 0 (no relationship) and 1 (perfect relationship). Phi was designed for 2 × 2 tables only. For larger tables, it has no upper limit and Cramer’s V should be used instead.

SQL Programming Rules:
--1) Result does not count, program matters
--2) One query, one statement
--3) Statement msut be based on the given data, do not assume you know data beyond what is given.

--find average salary for sales dept

select * from dept;

select avg(sal)
from emp
where deptno = 30;

select avg(sal)
from emp inner join dept on (dept.deptno = emp.deptno)
where dname = 'SALES';

-- Find the average salary of those employees whose name starts with J in each department

select avg(sal)
from emp
where ename like 'J%'
group by deptno;

-- Find those job categories whose maximum salary among the people hired after 1980 is more than 2500

select job
from emp
where hiredate > to_date('1980', 'YYYY')
group by job
having max(sal) > 2500
order by job desc;

Additional Exercises:

Find the name of those departments whose employees earn more than 2100 dollars on the average
Find the number of employees whose name starts with J in each department
Find the average salary of each department from EMP
Find the total salary for each job
Give an employee 8% salary increases if he or she is hired before 1981 and has salary less than 1500
Find the number of employees in each department
Find the total annual pay (salary *12 + commission) for each employee
Find the employment age of each employee?
Find those departments whose minimum salary is less than 1800:
Find those job categories whose average salary is more than 1200
List the name of each department, and its total number of employees, and the total amount of salaries
Find the average salary of those employees whose name starts with J in each department
Find the most recent employment among those hired before 1980 for each department
Find the minimum salary of each department among the people who was hired before 1980?
Find the gap of salaries of each department in descending order
Find those departments whose minimum salary is more than 2000
Find the total number of employees hired each year
Find the number of employees hired in January of all years

Homework:

Reading: SQL Tutorial

Design:

Correctness Questions: Online at ecourse.org

Closeness Questions: Write SQL statements for the following problems:

Find the total number of employees who have never received commissions
Find the department that has less than 5 employees
Find the years that company hired more than 5 employees
Find the location that has more than 3 employees
Find all the employees located in Dallas
Find those job categories whose maximum salary among the people hired after 1980 is more than 2500
For all the employees working in Dallas, list the name of each one and the number of years he has been employed by the company.
Find the total salary of employees at each location
Find the location that have more than two employees
Find the year in which employees hired with minimum salary more than 1000


Add Comments/评论

Register

Blog News Post

Blog Posts News Digest Contact Us About Developer Privacy Policy