Warm-Up Queries:
- Find the employment year of each employee
- Find the departments in Dallas
- Find the salesmen who were hired after 1980
--Find menagers whose name ends with M
select ename from emp where job = 'MANAGER' and ename like '%M';
--Give salesman 100 commission
update emp set comm = comm + 100 where job = 'SALESMAN';
--handle missing values
upate emp set comm = 0 where comm is null;
select 12*sal + comm from emp group by empno;
--use nvl(comm, 0)
update emp set comm = nvl(comm, 0) + 100 where job = 'SALESMAN';
Lecture 1: Intermediate SQL Programming: Join
implicit join (outer and inner), Explicit Join: inner join, left join, right join, full outer join (or cross join)
- Find all the employees located in Dallas
- Find all employees and their associated departments
- Find all departments and their associated employees
Explicit Joins
- Just like + operation in arithmetics to combine two tables, may be used in parentheses for nested combination of more than two tables
- Syntax: TableA join TableB on (criterion to join)
- three explicit Join types: left, right, inner
- Equivalent to implicit joins (e.g., where dept.deptno = emp.deptno) but may be more efficient than implicit joins
- Inner join -- find records from both tables that are matching
- left join -- find records from the left table regardless whether they have matching records from the right table or not (may be also don using the right table key is null)
- right join -- find the records from the right table regardless whether they have matching records from the left table or not (may be done using left table key is null)
Examples:
Example 1: Find employees and their matching department names
Example 2: Find all employee names and their department names regardless whether the employee has a department or not
Example 3: Find all department names and their employee names regardless whether the department has employees or not
- Find employees in Chicago who were hired before 1983
- Fine employees in SALES department who makes more than 500.
Example 4: create a directory of employees' dept name, ename, job
select dept.deptno, dname, ename, job, emp.deptno from emp, dept;
--outer join or cross join = Cartesian product of records from both tables
select dept.deptno, dname, ename, job, emp.deptno from dept, emp where dept.deptno = emp.deptno;
--implict inner join, not very efficient
--explicit join --create temperory table from multiple tables using joins: inner join, left join and right right
Example 5: Find employees and their matching department names
select dept.deptno, dname, ename, emp.deptno from dept inner join emp on (dept.deptno = emp.deptno);
Example 6: Find employees located in Dallas
select ename, loc from dept inner join emp on (dept.deptno = emp.deptno) where loc = 'DALLAS';
Example 7: Find all employee names and their department names regardless whether the employee has a department or not
select ename, dname from emp left join dept on (emp.deptno = dept.deptno);
Example 8: Find all department names and their employee names regardless whether the department has employees or not
select dname, ename from emp right join dept on (dept.deptno = emp.deptno);
Lecture 2: Intermediate SQL Programming: Statistic or Group Functions
Descriptive Statistic Functions: COUNT, SUM, AVG, MAX, MIN, STDEV, Variance, Mode, Median, quantiles, plus x sigma, minus x sigma, top n outliers, bottom n outliers. and listagg
Advanced Statistics Functions:
- Hypothesis testing (t-test, F-test, Kolmogorov-Smirnov test, Mann Whitney test, Wilcoxon Signed Ranks test
- Correlations analysis (parametric and nonparametric e.g. Pearson’s test for correlation, Spearman's rho coefficient, Kendall's tau-b correlation coefficient)
- Ranking functions
- Cross Tabulations with Chi-square statistics
- Linear regression
- ANOVA (Analysis of variance)
- Test Distribution fit (e.g. Normal distribution test, Binomial test, Weibull test, Uniform test, Exponential test, Poisson test, etc.)
SELECT group function, column_name_(ones after_group_by)
FROM table_names
WHERE conditions (apply to individual rows)
GROUP BY column_name
HAVING conditions (apply to groups)
Example 1: find the total number of employees and the total amount of salary
Example 2: Find the average salary of each department
Example 3: Find those departments whose minimal salary is less than 500
Example 4: Find those job categories whose average salary is more than 2000
Example 5: Find the most recent employment among those hired before 1980 for each department
Example 6: Find the average salary of those employees whose name starts with J in each department
Example 7: Find those job categories whose maximum salary among the people hired after 1980 is more than 2500
No Case Selection and Division
--find # of employees
select count(*) from emp;
--find total salary of all employees
select sum(sal) from emp;
--min salary
select min(sal) from emp;
select max(sal) from emp;
--all descriptive stat of salaries for all employees
select min(sal), max(sal), avg(sal), stddev(sal) from emp;
--fine the most recent hiredate --this is wrong because of mixing group characteristics with individual records
select max(hiredate), ename from emp;
Case Selection
--# of employees hired after 1981
select count(*) from emp where hiredate > to_date('1981', 'YYYY');
--average sal for salemen
select avg(sal) from emp where upper(job) = upper('salesman');
--upper() and lower() for case conversions
--max and min salaries for employees in Dallas
select max(sal), min(sal) from dept inner join emp on (dept.deptno = emp.deptno) where loc = 'DALLAS';
Case Division --Factor Analysis
--Find the average salary of each department --no columns are allowed with stat functions except for ones after group by
select avg(sal), deptno from emp group by deptno;
select avg(sal), dname from dept inner join emp on (dept.deptno = emp.deptno) group by dname;
--find max salary for each job
select max(sal), job from emp group by job;
--find total number of employees hired each year
select count(*), to_char(hiredate, 'YYYY') from emp group by to_char(hiredate, 'YYYY');
--find total number of employees hired each month
select count(*), to_char(hiredate, 'Mon') from emp group by to_char(hiredate, 'Mon');
--which month the company hired most employees
--find standard deviation of salaries for each location
select stddev(sal), loc from dept inner join emp on (dept.deptno = emp.deptno) group by loc;
--list all the names of employees as one string in each department.
select deptno, listagg(ename, ',') as Employees from emp group by deptno;
-- order the list within each group
select deptno, listagg(ename, ',') within group (order by ename desc) as Employees
from emp
group by deptno;
Factor Analysis with Case Selection
--find # of employees hired after 1982 for each dept
select count(*), deptno from emp where hiredate > '31-Dec-1982' group by deptno;
--find average salaries for eacg job-dept pair among employees hired before 1984
select avg(sal), deptno, job from emp where hiredate < '01-Jan-1984' group by deptno, job;
Factor Selection (Selecting Groups)
-- Find those departments whose minimal salary is less than 900
select deptno from emp group by deptno having min(sal) < 900;
--Find departmehts that has less than 3 employees
select dname from dept inner join emp on (dept.deptno = emp.deptno) group by dname having count(*) < 3;
--find location that has max salary over 2000
select loc from dept inner join emp on (dept.deptno = emp.deptno) group by loc having max(sal) > 2000;
--find the year the company hired two or more employees
select to_char(hiredate, 'YYYY') from emp group by to_char(hiredate, 'YYYY') having count(*) >= 2;
--Find those job categories whose average salary is more than 2000
select job from emp group by job having avg(sal) > 2000;
--Find the most recent employment among those hired before 1980 for each department
select max(hiredate) from emp where hiredate < '01-Jan-1980' group by deptno;
Finding Duplicate Data
In transactional databases, it is rear to have duplicate records because the PK values must be unique. However, in data sets for analytics, duplication happens very often. Group by clause can be used to detect duplications.
For example, for the following table with duplicates,
drop table duplicates;
create table duplicates ( x number(2), y char(1), z date );
insert into duplicates Values (12, 'Y', '12-Jan-1998');
insert into duplicates Values (12, 'Y', '12-Jan-1998');
insert into duplicates Values (10, 'N', '12-Jan-2008');
insert into duplicates Values (12, 'Y', '12-Jan-1998');
We can just write the following statement:
select *
from duplicates
group by x, y, z
having count(*) > 1;
The tedious part of the query is to list all columns in the table. Unfortunately, no one can write an SQL query like the above without a complete list of all columns. However, I wrote the following SQL statement that can dynamically generate the above statement for any table with any columns (I am sure you will come back here to copy the following statement for your work in the future because I am only the person wrote it):
select 'select * from ' || table_name || ' group by ' || listagg(column_name, ', ') || ' having count(*) > 1;' from user_tab_columns where table_name = 'DUPLICATES' group by table_name;
You just need to replace 'DUPLICATES' by your own table name. After running the statement, you will get the following as output:
select * from DUPLICATES group by X, Y, Z having count(*) > 1;
which you can use to find duplicates in your own table.
Advanced Statistic Analysis
This topic is beyond the scope of Database Management. Here I merely use a few examples to demonstrate the capabilities of OLAP using advanced stat functions.
ANOVA
STATS_ONE_WAY_ANOVA Example: The following example determines the significance of the differences in mean salaries within a job and differences in mean salaries between jobs. The results, p_values close to zero, indicate that, for different jobs, the difference in salaries across different locations is significant.
select stats_one_way_anova(job, sal, 'F_RATIO') as F, stats_one_way_anova(job, sal, 'SIG') as p_value from emp;
The following example checks if people in different locations have significant differences in salaries. The F ratio is not significantly different from 1, and the p-value is larger than a typical threshold of 0.1, 0.05, and 0.01. Thus, we do not have sufficient evidence to reject the null hypothesis: the variation in salary between locations is not different from the variation in salary within a location.
select stats_one_way_anova(loc, sal, 'F_RATIO') as F, stats_one_way_anova(loc, sal, 'SIG') as p_value from emp inner join dept on (dept.deptno = emp.deptno);
The following statement conducts ANOVA test of the impact of job on salary at each location:
SELECT loc,
STATS_ONE_WAY_ANOVA(job, sal, 'F_RATIO') f_ratio,
STATS_ONE_WAY_ANOVA(job, sal, 'SIG') p_value
FROM dept, emp
WHERE dept.deptno = emp.deptno
GROUP BY loc;
Chi-Square Test (Contingency Table)
Chi-Square tests are used to check if two distributions are the same (null hypothesis). As a special case, it can test whether two categorical variables are independent (null hypothesis).
Oracle STATS_CROSSTAB Example: The following example determines the independence between location and job:
SELECT STATS_CROSSTAB(JOB, LOC, 'CHISQ_OBS') chi_squared, STATS_CROSSTAB(JOB, LOC, 'CHISQ_SIG') p_value, STATS_CROSSTAB(JOB, LOC, 'PHI_COEFFICIENT') phi_coefficient, STATS_CROSSTAB(JOB, LOC, 'CRAMERS_V') Cramers_v FROM dept, emp where dept.deptno = emp.deptno;
Note: PHI_Coefficient is a measure of association independent of the sample size. Phi ranges between 0 (no relationship) and 1 (perfect relationship). Phi was designed for 2 × 2 tables only. For larger tables, it has no upper limit and Cramer’s V should be used instead.
SQL Programming Rules: --1) Result does not count, program matters --2) One query, one statement --3) Statement msut be based on the given data, do not assume you know data beyond what is given.
--find average salary for sales dept
select * from dept;
select avg(sal) from emp where deptno = 30;
select avg(sal) from emp inner join dept on (dept.deptno = emp.deptno) where dname = 'SALES';
-- Find the average salary of those employees whose name starts with J in each department
select avg(sal) from emp where ename like 'J%' group by deptno;
-- Find those job categories whose maximum salary among the people hired after 1980 is more than 2500
select job from emp where hiredate > to_date('1980', 'YYYY') group by job having max(sal) > 2500 order by job desc;
Additional Exercises:
- Find the name of those departments whose employees earn more than 2100 dollars on the average
- Find the number of employees whose name starts with J in each department
- Find the average salary of each department from EMP
- Find the total salary for each job
- Give an employee 8% salary increases if he or she is hired before 1981 and has salary less than 1500
- Find the number of employees in each department
- Find the total annual pay (salary *12 + commission) for each employee
- Find the employment age of each employee?
- Find those departments whose minimum salary is less than 1800:
- Find those job categories whose average salary is more than 1200
- List the name of each department, and its total number of employees, and the total amount of salaries
- Find the average salary of those employees whose name starts with J in each department
- Find the most recent employment among those hired before 1980 for each department
- Find the minimum salary of each department among the people who was hired before 1980?
- Find the gap of salaries of each department in descending order
- Find those departments whose minimum salary is more than 2000
- Find the total number of employees hired each year
- Find the number of employees hired in January of all years
Homework:
Reading: SQL Tutorial
Design:
Correctness Questions: Online at ecourse.org
Closeness Questions: Write SQL statements for the following problems:
- Find the total number of employees who have never received commissions
- Find the department that has less than 5 employees
- Find the years that company hired more than 5 employees
- Find the location that has more than 3 employees
- Find all the employees located in Dallas
- Find those job categories whose maximum salary among the people hired after 1980 is more than 2500
- For all the employees working in Dallas, list the name of each one and the number of years he has been employed by the company.
- Find the total salary of employees at each location
- Find the location that have more than two employees
- Find the year in which employees hired with minimum salary more than 1000
|