6.2 Visualizing the Relationship
Scatter Plots
Bivariate statistics refers to the analyses we perform that measure the relationship between two variables. These bivariate relationships are relatively simple to visualize and analyze using scatter plots.
The first step in analyzing this data is to visualize it. Two-dimensional visualizations, the most common and the easiest to interpret, show the relationship between two variables. If both variables are continuous, then a scatter plot is the preferred visualization. Create a scatter plot to show the relationship between age and income. If needed, follow along with the video below:
Next, add a trendline to the scatter plot. You should notice several things. First, before adding the trendline, is there a clear pattern to the points? Yes and no. You may have been able to decipher that there is a positive relationship between the two variables, meaning that the dots seem to "scatter" from the lower left to the upper right. That means that as one variable increases, so does the other one, to a degree. In other words, as people get older, they tend to make more money. However, you may not have been able to tell that without adding the trendline, which clearly slopes upward. Second, the scatter plot appears to be "grouped" into chunks. Why is that? It's because our data for income has been rounded to the nearest 10,000. As a result, that variable is not truly continuous; it is effectively ordinal. That's okay though.
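To make the trendline less of a black box, here is a minimal Python sketch of the ordinary least-squares line that a linear trendline typically represents. The ages and incomes are made-up values for illustration (they are not the chapter's dataset), and the function name fit_trendline is our own.

```python
# Hypothetical (age, income) data for illustration only.
# Incomes are rounded to the nearest 10,000, as in the text.
ages = [22, 25, 31, 38, 44, 52, 60]
incomes = [30000, 40000, 40000, 60000, 70000, 70000, 90000]


def fit_trendline(x, y):
    """Return (slope, intercept) of the least-squares line y = slope * x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-deviations (x with y) and of squared deviations of x
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


slope, intercept = fit_trendline(ages, incomes)
print(f"income = {slope:.2f} * age + {intercept:.2f}")
```

A positive slope here corresponds to a trendline that rises from the lower left to the upper right.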
Lastly, does this chart tell us that age causes income, or that income causes age? Although it seems obvious that age leads to income (and not the other way around), the chart itself tells us neither. This is referred to as causal ambiguity. The visualizations and statistics don't tell us which factor causes which. Our theoretical explanation tells us causality, not the statistics.
Before moving on, let's make sure we understand the concept of slope, which is represented by the trendline in the Excel scatter plot.
Slope
The slope is a measure of how much the y-value changes, on average, for each unit of increase on the x-axis. Slope is calculated as the rise over the run.
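As a quick illustration of "rise over run," the sketch below computes the slope between two points. The function name slope_between and the example points are our own, chosen only for illustration.

```python
def slope_between(p1, p2):
    """Slope of the line through two (x, y) points: rise divided by run."""
    (x1, y1), (x2, y2) = p1, p2
    rise = y2 - y1
    run = x2 - x1
    if run == 0:
        # Both points share the same x-value, so rise / run cannot be computed
        raise ValueError("run is zero: the slope is undefined")
    return rise / run


# From (1, 2) to (3, 8): the rise is 6 and the run is 2
print(slope_between((1, 2), (3, 8)))  # 3.0
```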
The slope of a line is a number that describes both the direction and the steepness of the line. Slope is often denoted by the letter m, such as in the equation y = mx + b. As shown below, the direction of a line is either increasing (positive slope), decreasing (negative slope), horizontal (zero slope), or vertical (undefined slope).
The line of best fit is displayed on the scatter plot below. According to this equation, as height increases by one inch, self-esteem increases, on average, by 0.0707.
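To see what that slope implies for larger differences in height, the short sketch below scales the 0.0707 figure from the text. Only the slope is taken from the text; the intercept is not given here, so the code computes expected changes in self-esteem rather than predicted scores. The function name expected_change is our own.

```python
SLOPE = 0.0707  # average change in self-esteem per inch of height (from the text)


def expected_change(delta_height_inches, slope=SLOPE):
    """Average change in y implied by a linear slope for a given change in x."""
    return slope * delta_height_inches


print(round(expected_change(1), 4))   # 0.0707
print(round(expected_change(10), 4))  # 0.707
```

Because the relationship is modeled as a straight line, a 10-inch difference implies exactly ten times the change of a 1-inch difference.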
The relationship in the image above is "linear" because it's represented by a straight line. But, clearly, not all relationships are linear. Take, for example, the relationship between age and income. The assumption is that this relationship is positive, meaning that as people get older, they make more money. However, do 1-year-olds make any money at all? Of course not. In fact, few people make any money before turning 16. Then, once we graduate from high school, college, and graduate programs, our salaries usually take larger jumps. In other words, turning one year older doesn't result in the exact same increase in pay every year. As a result, even though the relationship between age and income is positive, it's not necessarily linear.
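One way to see "positive but not linear" is to compute the slope separately over each age interval. The sketch below uses made-up average incomes by age (purely hypothetical numbers, not real data) and a helper we call interval_slopes.

```python
# Hypothetical average income at selected ages, for illustration only
ages = [10, 16, 18, 22, 30, 40]
incomes = [0, 5000, 15000, 40000, 60000, 70000]


def interval_slopes(x, y):
    """Rise over run between each pair of consecutive points."""
    return [(y[i + 1] - y[i]) / (x[i + 1] - x[i]) for i in range(len(x) - 1)]


slopes = interval_slopes(ages, incomes)
for start, end, s in zip(ages, ages[1:], slopes):
    print(f"ages {start}-{end}: about {s:.0f} more income per year of age")
```

Every interval's slope is positive, but the slopes differ from one interval to the next, which is what a straight line cannot capture.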
There are a variety of mathematical options for modeling relationships that deviate from a straight line, such as logarithmic, exponential, and polynomial relationships. We can model these relationships by using mathematical expressions. However, before we can teach you about these modifications, you need to learn a few more things. But keep an eye out toward the end of the chapter for the video that covers how to calculate non-linear trendlines.