Final Project

Problem Description

The mtcars dataset contains various car attributes such as miles per gallon (mpg), number of cylinders (cyl), horsepower (hp), and weight (wt), among others. Each row in the dataset represents a different car model.

The problem we’re addressing here is to understand how these attributes relate to each other. Specifically, we’re interested in the relationship between the number of cylinders a car has (cyl) and its fuel efficiency, measured in miles per gallon (mpg).

The number of cylinders in a car is a key factor that can influence its performance characteristics, including its fuel efficiency. Cars with more cylinders tend to have more power, which can result in lower fuel efficiency. However, this is not always the case as other factors such as the car’s weight, aerodynamics, and engine technology can also play a role.

By visualizing and analyzing the data, we aim to gain insights into these relationships. This could help car manufacturers design more fuel-efficient cars or help consumers make more informed decisions when purchasing a car.

In this project, we use two visual analytics techniques, part-to-whole analysis and deviation analysis, to explore these relationships. The part-to-whole analysis shows the distribution of car models by number of cylinders, while the deviation analysis shows how mpg varies across cars with different numbers of cylinders.


Related Work

In the field of data visualization and analysis, there are numerous studies and projects that have explored the relationships between different variables in a dataset. These projects often involve similar methods and techniques that we are using in our current project with the mtcars dataset.

For instance, one common method used in these projects is correlation analysis, which measures the statistical relationship between two variables. This method can be used to understand if and how two variables in a dataset, such as the number of cylinders and miles per gallon in our case, are related.
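
As a small illustration of correlation analysis on our data, the sketch below computes the Pearson correlation between cyl and mpg with base R's cor function. It assumes that treating cyl as a numeric variable is acceptable, even though it only takes three distinct values in mtcars. The variable name cyl_mpg_cor is ours.

```r
# mtcars ships with base R (datasets package), so nothing needs to be installed.
data(mtcars)

# Pearson correlation between number of cylinders and miles per gallon.
# Assumption: cyl is treated as numeric here, although it only takes a few values.
cyl_mpg_cor <- cor(mtcars$cyl, mtcars$mpg)

print(cyl_mpg_cor)
```

A value close to -1 indicates a strong negative linear association, and a value near 0 indicates little linear association.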

Another related method is regression analysis, a statistical technique for modeling the relationship between a dependent variable (e.g., mpg) and one or more independent variables (e.g., cyl). A fitted model can then be used to predict the mpg of a car from its number of cylinders.
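
A minimal sketch of such a regression, assuming a simple linear model with cyl as the only predictor, could look like the following. The object names fit and predicted_mpg are ours, and the model is for illustration only; it is not part of the visual analysis in the Solution section.

```r
data(mtcars)

# Simple linear regression: mpg as a function of the number of cylinders.
# Assumption: a straight-line relationship with cyl treated as numeric.
fit <- lm(mpg ~ cyl, data = mtcars)

# Intercept and slope of the fitted line.
print(coef(fit))

# Predicted mpg for cars with 4, 6, and 8 cylinders.
predicted_mpg <- predict(fit, newdata = data.frame(cyl = c(4, 6, 8)))
print(predicted_mpg)
```

A negative slope means that the predicted mpg decreases as the number of cylinders increases.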

In terms of visual analytics, there are many examples of existing visualizations that have inspired our approach. For example, scatter plots and box plots are commonly used to visualize the relationship between two variables. These types of plots can provide a clear visual representation of the data, making it easier to identify trends and patterns.

Moreover, the use of part-to-whole and deviation analysis techniques in visual analytics is also prevalent in related works. Part-to-whole analysis helps in understanding the distribution of a categorical variable (like the number of cylinders in our case), while deviation analysis helps in understanding how a continuous variable (like mpg) varies across different categories.

In summary, our approach to solving the problem is grounded in well-established methods and techniques in data visualization and analysis. By learning from related works, we can apply best practices and avoid common pitfalls in our project.


Solution

The solution to our problem involves using the ggplot2 package in R to create visualizations that will help us understand the relationships between the variables in the mtcars dataset.

Step 1: Data Preparation
We first load the mtcars dataset and create a new dataframe for the pie chart. This dataframe groups the cars by the number of cylinders and counts the number of cars in each group. This is done using the group_by and summarise functions from the dplyr package.
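
A minimal sketch of this step is shown below. It assumes the dplyr package is installed. The names cyl_counts and count are our own choices for the summary dataframe and its count column.

```r
library(dplyr)

# mtcars ships with base R (datasets package).
data(mtcars)

# One row per cylinder count, with the number of car models in each group.
cyl_counts <- mtcars %>%
  group_by(cyl) %>%
  summarise(count = n())

print(cyl_counts)
```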



Step 2: Part-to-Whole Analysis
We then create a pie chart to show the part-to-whole relationship of car counts by the number of cylinders. This is done by using the geom_bar function to create a single stacked bar and then converting it to a pie chart with the coord_polar function. The geom_text function adds the text labels, and aes(label = count) specifies that each label displays the count of cars. Passing position_stack(vjust = 0.5) as the position argument of geom_text places each label in the middle of its pie slice. The theme_void function removes the background and axes for a cleaner look.
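
The pie chart described above can be sketched as follows, assuming the dplyr and ggplot2 packages are installed. The summary dataframe is rebuilt here so the example runs on its own. The object names (cyl_counts, pie_chart), the legend label, and the title are our choices.

```r
library(dplyr)
library(ggplot2)

# Count of car models per cylinder group (same summary as in Step 1).
cyl_counts <- mtcars %>%
  group_by(cyl) %>%
  summarise(count = n())

pie_chart <- ggplot(cyl_counts, aes(x = "", y = count, fill = factor(cyl))) +
  # A single stacked bar whose segments are the cylinder groups.
  geom_bar(stat = "identity", width = 1) +
  # Mapping the y axis to the angle turns the stacked bar into a pie.
  coord_polar(theta = "y") +
  # Place each count label in the middle of its slice.
  geom_text(aes(label = count), position = position_stack(vjust = 0.5)) +
  labs(fill = "Cylinders", title = "Car models by number of cylinders") +
  # Remove background, grid, and axes.
  theme_void()

print(pie_chart)
```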



Step 3: Deviation Analysis
Next, we create a box plot to show the deviation of miles per gallon (mpg) by the number of cylinders. This is done using the geom_boxplot function. The box plot shows the median, quartiles, and potential outliers for mpg within each group of cars, allowing us to see how mpg deviates for cars with different numbers of cylinders.
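
A sketch of the box plot is given below, assuming the ggplot2 package is installed. Converting cyl to a factor is our choice so that each cylinder count gets its own box. The axis labels, fill colors, and title are also ours.

```r
library(ggplot2)

# factor(cyl) makes ggplot2 treat the cylinder count as a category,
# so one box is drawn per group.
box_plot <- ggplot(mtcars, aes(x = factor(cyl), y = mpg, fill = factor(cyl))) +
  geom_boxplot() +
  labs(
    x = "Number of cylinders",
    y = "Miles per gallon (mpg)",
    fill = "Cylinders",
    title = "Fuel efficiency by number of cylinders"
  )

print(box_plot)
```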




Step 4: Interpretation 

Finally, we interpret the results of our visualizations. The pie chart shows the part-to-whole relationship of car counts by number of cylinders: each slice corresponds to one cylinder count, and the size of the slice represents the proportion of cars in the dataset with that count. A slice that is much larger than the others indicates that a large share of the cars have that number of cylinders. The largest slice in our chart is the 8-cylinder slice, with 14 of the 32 cars in the dataset. The 6-cylinder slice is the smallest.

The box plot shows the variation in miles per gallon (mpg) for each group of cars, grouped by number of cylinders. Each box spans the interquartile range (IQR), from the first quartile (25th percentile) to the third quartile (75th percentile), and the line inside the box marks the median mpg of the group. The whiskers extend to the most extreme mpg values that lie within 1.5 times the IQR of the first and third quartiles. Any points beyond the whiskers are considered potential outliers and are drawn as individual points. A lower median line indicates that cars with that number of cylinders tend to have lower mpg, and a longer box indicates greater variability in the middle half of the mpg values. In our box plot, 4-cylinder cars have the highest median mpg of the three groups. The 6-cylinder cars, on the other hand, have the lowest variance in mpg of the three groups.
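
To check these readings against numbers, the per-group statistics can be computed directly. The sketch below assumes the dplyr package is installed. The name mpg_summary and its column names are ours. It reports both the IQR and the variance because the two measure spread differently and need not rank the groups in the same order.

```r
library(dplyr)

# Per-group statistics for mpg, one row per cylinder count.
mpg_summary <- mtcars %>%
  group_by(cyl) %>%
  summarise(
    median_mpg = median(mpg),
    q1         = quantile(mpg, 0.25),  # first quartile
    q3         = quantile(mpg, 0.75),  # third quartile
    iqr        = IQR(mpg),             # spread of the middle half
    variance   = var(mpg)              # sample variance
  )

print(mpg_summary)
```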

By interpreting these visualizations, we can gain insights into the relationships between the variables in the mtcars dataset. For instance, we found that cars with more cylinders tend to have lower fuel efficiency (mpg), indicating a negative relationship between these two variables.



Comments