INFO 302: SQL & Process Optimization
This course introduces common techniques for relational data management, including conceptual modeling
and Structured Query Language (SQL). Students will learn how to construct a relational model to link data that are extracted from
multiple systems. Students will learn the pros and cons to using a database management system to implement these relationships instead
of doing this in your analytics script in Python, R, or SAS. |
Course Learning Objectives
|
Who am I?
Before becoming an academic, I held different technical consulting and analyst positions at Information Resources, Deloitte, and the Private Bank at Citigroup. I earned my PhD in International Management from the University of Hawai&i at Manoa. I currently research a variety of information systems related topics. I teach several different data analytics courses at the University of Richmond. My technical skills include Visual Basic, SQL, MySQL, Python, MongoDB, and others! I enjoy combining my technical skills with my business knowledge to create useful information systems! |
Required Software
The main software will be the MySQL Workbench version (6.3.10), which is NOT the latest version. Do not install the latest version of the MySQL Workbench because it will not work with our RSBDB server.
If you also want to store your data locally on your laptop in addition to storing your data on our RSBDB server, then you can download the community edition of the MySQL server. Note that this installation wizard will attempt to install the latest version of the workbench (do NOT check this option or uninstall it after it gets installed and re-install the 6.3.10 version of the Workbench).
You will also need the University edition of SAS to run your statistical models. Please note that the statistical package may change depending on the configuration of the computer lab. | ||
If you have a MAC, please click here for instructions on how to run a parallel Windows environment on your machine. You can also access all computing resources by remote desktoping into one of the computer labs on campus. |
High Level Schedule
I am not providing a day-by-day schedule because each class learns the topics at different speeds. Certain semesters students will (on average) learn the topics faster or slower. If I provide a detailed day-by-day schedule, then we will spend too much time adjusting it! As such, I will provide detailed tasks for each week via email at the start of each week or during the first class session of the week. The course syllabus, which contains additional course details including policies & procedures, may be downloaded here. |
|
|
Culture of Honor
Getting a grade or an accolade that you did not earn is never worth it. You are not a failure if your best effort results in grade less than an "A". Perfection in any domain is an unrealistic expectation. If you ever find yourself in a situation where you feel like you have to take a short cut, stop and email me! It is always better to email me than to commit an act of academic dishonesty. Committing an act of academic dishonesty can get you kicked out of school and will result in an "F" grade in my course. Emailing me will not result in either of these! | ||
Would you want your Doctor to be the person who cheated their way through Medical School? Would you want the architect of the bridge you drive over every day to be the person who cheated on their exams? Would you want to invest your money with a CEO who cheated their way through their prestigious MBA program because they could not figure out how to solve business problems in school? | ||
Your character has nothing to do with your GPA. How do you act when given hard problems to solve? How do you act when nobody is watching? Do not be the person who answers these questions by saying "I take short cuts." If one of your friends asks you to help them take a short cut in my class, say "no" and have them contact me. You, your friend, and the UR community will benefit more from that response than from helping them take a short cut! The long-term strength of the Robins brand is counting on you to act with integrity now and as alums in the future. | ||
Good Luck & Have Fun!
Learning is supposed to be fun! However, I know navigating an online or hybrid course can be challenging. It is for me as well! Recognize that I am here to help! If you are ever feeling lost, overwhelmed, alone, or confused in any way, reach out to me. I am ALWAYS available (regardless of time zone) to help you. You are not in this endeavor alone (even if we are thousands of miles apart!). |
You may download my video from here. |
Course Introduction
This course is about practicing evidence-based management and performing constrained optimization analyses to help managers make more informed (hopefully better) decisions in all disciplines!
In this class, you will learn how to link data extracted from different systems using relational models, use SQL to work with the data in those models, run statistical models using data in your relational databases, and optimize those models using a series of organizational constraints.
The below video discusses a few of the details about my course. |
You may download this video from here and the associated PowerPoint presentation from here. | ||
Evidence-based Management
Evidence-based management is the incorporation of data and evidence (coupled with critical analysis) in the decision making processes that managers follow. As explained in this Pfeffer & Sutton article, evidence-based management is the judicious and conscientious use of data and evidence when making important choices in organizations. They argue that evidence-based management is "a way of seeing the world....from the premise that using better, deeper logic and employing facts, to the extent possible, permits leaders to do their jobs more effectively." There are many reasons why organizations do not effectively practive evidence-based management but one of the most important is related to organization culture (i.e, the norms and routines in an organization). It is necessary to develop an organizational culture that values and rewards managers who incorporate evidence in their decision making processes. In order to contextualize evidence-based management and to build useful analytical tools (or present analyses that will be acted upon), it is important to understand how managers make decisions as individuals and within the constraints of their organizations. |
You may download my video from here and PowerPoint file from here. | You may download my video from here and PowerPoint file from here. |
Big Data
Big data is a key component of the fourth industrial revolution. Organizations have always had more data than computing power to process it (just the number of bytes have changed since the 1950s). What is different today is the variety and velocity of unstructured data that analysts are asked to orangize, structure, and analyze. The growth of cloud computing (i.e., using a network of remote servers hosted on the internet to store and process data) to allow organizations to analyze more data without needing the technical infrastructure in-house.
Harnessing the power of big data requires organizational culture change but it does not eliminate the need for human insights (see this article).
When working with big data, we often get mesmerized by the size of the data and fall victim to the flawed logic that "more is always better than less." Having representative data is more valuable than having a big data set. |
|
You may download my video from here and zip file from here. |
Conceptual Framework
Starting an analytics proect may be intimidating! What is the objective of your analytics endeavor? Is it exploratory or designed to test specific hypotheses? Your organization might already have lots of data and it just wants you to analyze it. Your organization might be on the opposite end of the spectrum and not have any data stored in a usable manner. Which organizational questions or problems should you answer with an analytics project? The following two videos present a process flow chart and an application of this flow chart for engaging in analytics projects. Please keep the following in mind as you are watching these videos: |
|
You may download my video from here and PowerPoint file from here. |
You may download my video from here and zip file from here. |
Relational Models
The core of data analytics is clean, accurate, and consistent data. The old adage of garbage in, garbage out applies to ALL analytics projects. In this course, we will use relational databases to ensure data quality when linking multiple data files. Relational databases are rather rigid structures, but that rigidity helps ensure that our data are accurate and consistent. There is certainly a trend toward non-relational database solutions especially for unstructured data, but relational models still work very well for structured and semi-structured data. In this module, you will learn the foundations of relational modelling. You will learn the three types of relationships (one-to-one, one-to-many, and many-to-many) and how to model business scenarios relationally. You will then implement these using MySQL. Once you learn MySQL, you should be able to learn other common relational database engines |
Relational Database Fundamentals
Relational databases are still the most common way to store structured and semi-structured data. They are rigid structures but that rigidity also maximizes the likelihood that our data will be consistent and accurate. These relational structures generally require an up-front time investment to design a normalized data model and a database administrator to tune your server to maximize performance.
Making changes to a relational database can also be challenging, especially after the tables have been populated with data. Having said all of this, data analysts will not be able to avoid relational databases to extract data for their analyses. |
The first video may be downloaded from here. The second video may be downloaded from here. The PowerPoint file that is referenced in each of these videos may be download from here. |
Modeling in MySQL
Here, you will construct your logical relational model in MySQL using the built-in point-and-click modeling utility. All we are doing here is graphically constructing a picture of the model that we want to implement. With this .mwb file (i.e., our logical relational model) built, we have not, however, actually built the physical tables and relationships yet. We will (eventually) forward engineer our .mwb file into a SQL script, which we will execute to actually construct our tables and relationships. We will execute this SQL script in our Workbench either connected to a localhost (i.e., your laptop) or on our Robins School of Business MySQL server. Lastly, this example deliberately attempts to use many different types of relationships, which may have over-complicated this Hospital's relational model for the purposes of demonstrating features and functionality! |
This video may be downloaded from here and the PowerPoint file that is referenced in this video may be download from here. |
Car Insurance Logical Model MySQL Example
The objective of this example is to construct a logical model to allow us to analyze data related to car insurance policies and claims. This example requires you to relate data coming from multiple systems, particularly the human resources information system and the organization's proprietary claims processing information system. Once constructed, your relational database should allow you to perform a variety of relevant analytics related to this business. |
You may download my video from here and associated PowerPoint file containing the instructions & other details from here. |
You may download my video from here. |
Subway Logical Model Example
The purpose of this exercise is to have you construct a logical model while just starting from a problem statement. In this example, I am not giving you the list of tables, attributes, and expected relationships. Your job is to try to figure out a logical relational model that fits this business problem and the types of analytics that you need to perform.
|
This video may be downloaded from here and the PowerPoint file that is referenced in this video may be download from here. |
Retail Department Store Logical Model MySQL Example
The objective of this example is to construct a logical model to allow us to analyze data related to a department store example. As you construct this example, I obviously want you to learn how to use the MySQL modeling utility but also think about the different tables, attributes, and relationships. Why did we decide to build these specific tables and relate them in the way that we did?
|
You may download my video from here and associated PowerPoint file containing the instructions & other details from here. |
You may download my video from here. |
Forecasting Logical Model Example
The purpose of this exercise is to have you construct a logical model while just starting from a problem statement (similar to the Subway Example). Your job is to try to figure out a logical relational model that fits this business scenario (problem statement) and the types of analytics that you need to perform.
My suggestion is to find the transaction for this business scenario and figure out what needs to get linked to that transaction. Those linkages will probably result in additional tables that will relate back to the transaction.
|
This video may be downloaded from here and the zip file containing the PowerPoint & MySQL model file from here. |
Brief Conceptual Review
In this section of the course, we will forward engineer our logical models into physical tables on our rsbdb server. We will then load our data into our physically constructed logical model. In doing so, we will work with INSERT, UPDATE, and DELETE SQL statements along with the LOAD DATA statement.
However, before loading data it is imperative that you have a strong understanding of the core relational database concepts (types of relationships, primary, and foreign keys). Without this foundation, it is going to be difficult to load your data.
When loading your data, you will have to ask yourself three primary questions: 1) Do these data fit the requirements for the relationship type (i.e., one-to-one or one-to-many), 2) Are there any foreign key violations in these data, and 3) Are there any primary key violations in these data.
The below video reviews these concepts for a final time.
|
This video may be downloaded from here and the referenced word document from here. |
Update, Insert, Delete and Alter Table
In this section of the course, you will work with a few common commands that will modify the data in your tables (UPDATE, INSERT, and DELETE)
and commands that will modify the structure of your tables.
|
You may download my video from here and zip file from here. |
You may download my video from here and zip file from here. |
You may download my video from here and zip file from here. |
Loading our Constructed Relational Model
In this example, you will forward engineer a two table relational model and construct the physical tables & relationships in MySQL and load those two tables with data from two csv files.
As you work through this example, think about why we have to load the countries table before we can load the employees table. Loading data into our constructed model is like putting together a puzzle.
Often, when putting together a puzzle, the hardest part is figuring out where to start. In our relational model, we have to start with a table that does not contain any foreign keys.
|
This video may be downloaded from here and the referenced zip file from here. |
Loading the Car Insurance Database
In this example, you will load the car insurance database. There are several keys to successfully completing this activity. First, pick the correct order to load the tables. Load the parent tables before the child tables.
Second, make sure that you don't have any primary key or foreign key violations in your data. A relational database is a rigid structure that requires data to conform to certain rules to load all of the tables properly.
Third, ensure that the relationship rules are met (i.e., no duplicates in a one-to-one relationship).
Finally, don't get frustrated! If you find yourself getting frustrated. Stop, take a break, and come back with a clear head later!
NOTE: If you have an invalid Claim_Id in the VehicleClaims raw data file, assign those claims to Claim number 300.
|
This video may be downloaded from here and the referenced zip file from here. |
Loading the Retail Database
In this example, you will load the retail database. This database has a few tricky self one-to-many relationships, which make loading data in those tables particularly challenging. In these situations, you may be attempting to load a record that references a parent record that has not been loaded yet (e.g., employee #1 is supervised by employee #100 but #100 has not been loaded yet thereby creating a foreign key violation).
This example also has a few more tables and the volume of data are more than the previous examples. Carefully think about the order that you load the tables. Trust the process, take your time and do not get frustrated.
|
This video may be downloaded from here and the referenced zip file from here. |
Loading the Hospitals Database
In this example, you will load the hospitals database. This example will require you to modify the structure of a few fields using the ALTER TABLE data definition statement and cast the data in a few columns to move the data from your temp table to the destination table(s).
NOTES: 1) Convert all of the Hospital_Id fields to VARCHAR(255)s from the original double data types, 2) In the administrators raw file, remove the B- and A- in the admin_id and the Hospital_Id fields in the TEMP table,
3) If you don't have an email field in the Administrators table, you should add it as a VARCHAR(255) data type, 4) Make sure you allow nulls in the Supervisor id fields in the Administrators and Doctors tables, &
5) In the doctors raw file, remove the B- and A- in the doctor_id and the Hospital_Id fields in the TEMP table.
|
This video may be downloaded from here and the referenced zip file from here. |
Clubs Review Activity
In this example, you will practice constructing your logical model, forward engineering that logical model, loading data into your forward engineered model,
and modifying the structure of your model using ALTER TABLE statements. At this point, you should have a solid library of code that you can use to perform a
variety of tasks quickly and efficiently. You should also be comfortable with the different warnings (i.e., truncation warnings) and error meesages that happen
when you are loading data, updating data, and modifying the structure of your database.
|
This video may be downloaded from here and the referenced zip file from here. |
Select Queries Introduction
In this section of the course, you will write a series of select queries to descriptively analyze your data and to calculate the variables necessary to perform more complex analytics in other tools.
Particularly when working with structured finance, operations, transactional, and accounting data, you will have to write select queries to extract the sub-set of data that is relevant for your analyses.
To do this effectively, you need to understand your model (i.e., your tables, relationships, and attributes) and how to join your data across multiple tables. |
You may download my video from here and PowerPoint file from here. |
The w3schools website is here. You may download my video from here. |
SQL Zoo
|
Hospitals Select Query Examples
You will work with the Hospitals database in these problem sets. The Hospitals database is more complex than the SqlZoo
examples because the hospitals database contains many more tables and different types of relationships (i.e., self one-to-many, one-to-one, and so on). This added complexity provides an opportunity to
investigate different types of queries, sub-queries, and all of the different types of JOINs. |
You may download my video from here and the.sql file from here. |
You may download my video from here and the.sql file from here. |
Neighborhood Analysis
In these examples, you will apply your select querying skills to perform a neigborhood analysis.
The process is relatively straightforward: 1) Find the "thing" (product, service, market, and so on) that you want to compare everything else to (i.e., the prototypical thing that you want to find nearest neighbors),
2) Either qualitatively or quantitatively determine the attributes, features, and dimensions that make that prototypical "thing" attractive,
3) Determine the sample of others that you want to compare against, &
4) Compare your sample identified in 3 with the prototypical thing identified in 1 using the attributes identified in 2. We do this by calculating the euclidean distance between them along all of the dimensions.
This type of analysis may sound like a k-means analysis but it is not! A k-means analysis is clustering all "things" based on centroids. We are not clustering when performing a neighborhood analysis. Instead, we are simply comparing all things to a prototypical example. |
You may download my video from here and PowerPoint file from here. |
You may download my video from here and zip file from here. |
Retail Select Query Examples
You will work with the Retail database in these problem sets. Some of these queries will be review and some of these will be new. In these problem sets, you will create & use custom user-defined functions, create & execute custom user-defined procedures, use date & time functions in more detail, work with self-joins to perform lead & and lag analyses, define & use session variables, analyze the execution steps in your queries, and set-up indexes to speed up execution of your queries. |
You may download my video from here and the zip file from here. |
You may download my video from here and the zip file from here. |
Constrained Optimization
Multiple Regression Review
In this section of the course, you will review multiple regression. Multiple regression has been around since the 1800s and is a key tool in the data analysts tool kit.
Often, analysts run a straight forward regression model before runing a more sophisticated analysis.
You may read the following multiple regression refresher, regression notes (but don't get lost in the formulas),
and examples for a discussion of the pros and cons of regressions.
|
You may download my video from here and zip file from here. |
You may download my video from here. |
You may download my video from here. |
Logistic Regression
In this section of the course, you will work with logistic regression. Logistic regression enables us to analyze correlations between a set of independent variables and a binary (0 or 1) dependent variable.
Many management problems do not have a continuous dependent variable because we don't have the ability to measure certain things on a continuous scale along a number line or it does not make sense to do so. For example, did an employee quit the company or were we successful in a market? These are answered with either a yes or a no, which serves as our dependent variable.
The binary nature of the dependent variable renders it impossible to run a multiple regression because the core requirements and assumptions associated with a least squares regression are not met or violated.
|
You may download my video from here and referenced zip file from here. |
You may download my video from here. |
You may download my video from here and zip file from here. |
Constrained Optimization
In this section of the course, you will work with constrained optimization problems.
You will optimize the use of your regression and logistic regression models (and other models) given a set of organizational constraints. You will use the Solver add-in to determine your constrained optimal solutions.
In these types of problems, you will attempt to determine the maximum or the minimum of some objective function (see this article for a great (largely) non-mathematical explanation).
We will not be covering the calculus behind how to maximize or minimize an objective function given a set of design variables and constraints in this class because the technical tool will handle that calculus for us (as business analysts).
However, if you are interested in the math behind these problems, consult this great Khan Academy lecture and this pdf file.
|
You may download my video from here and the zip file from here. |
You may download my video from here. |
Group Project
This project requires you to apply all of the skills you have learned in this class!
You will forward engineer a logical model, load your forward engineered database with data, query your database to calculate your independent & dependent variables,
use those views/sql statements to execute a series of regression models, and then you will use solver and the results of your regression models to optimize the inputs given a set of constraints.
|
You may download this video from here and the associated zip file from here. | ||