Exploratory Data Analysis with R


The ability to take data, to be able to understand it, to process it, to extract value from it, to visualize it, to communicate it: that is going to be hugely important. This book covers some of the basics of visualizing data in R.



This book covers the essential exploratory techniques for summarizing data with R. These techniques are typically applied before formal modeling commences and can help inform the development of more complex statistical models. Exploratory techniques are also important for eliminating or sharpening potential hypotheses about the world that can be addressed by the data you have. We will cover in detail the plotting systems in R as well as some of the basic principles of constructing informative data graphics. We will also cover some of the common multivariate statistical techniques used to visualize high-dimensional data. Thanks for downloading this book. For those of you who purchased a printed copy of this book, I encourage you to go to the Leanpub web site and obtain the e-book version, which is available for free. The reason is that I will occasionally update the book with new material, and readers who download the e-book version are entitled to free updates (this is unfortunately not yet possible with printed books).

These will be our two landmarks for comparison. We can look at the Time.Local variable to see what time measurements are recorded as being taken. What if we just pulled all of the measurements taken at this monitor on this date? Since EPA monitors pollution across the country, we can take a look at the unique elements of the State.Name variable.

We can get a bit more detail on the distribution by looking at deciles of the data. Perhaps we should see exactly how many states are represented in this dataset. This last bit of analysis made use of something we will discuss in the next section.
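The two checks mentioned above (counting the states and looking at deciles) might be sketched as follows. The data here are simulated, and the column names `State.Name` and `Sample.Measurement` are assumptions made for illustration rather than a confirmed layout of the EPA file.

```r
# Simulated stand-in for the hourly ozone data (not real EPA values).
# Column names State.Name and Sample.Measurement are assumed for this sketch.
set.seed(1)
ozone <- data.frame(
  State.Name = sample(c(state.name, "District Of Columbia", "Puerto Rico"),
                      1000, replace = TRUE),
  Sample.Measurement = round(rbeta(1000, 2, 40), 3)
)

# How many distinct "states" appear? More than 50 deserves a closer look.
n_states <- length(unique(ozone$State.Name))
print(n_states)

# Deciles give more detail on the distribution than a basic summary.
deciles <- quantile(ozone$Sample.Measurement, probs = seq(0, 1, 0.1))
print(deciles)
```

If the count comes out above 50, listing `unique(ozone$State.Name)` shows which extra entries are responsible.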

Validate with at least one external data source

Making sure your data matches something outside of the dataset is very important. It allows you to ensure that the measurements are roughly in line with what they should be and it serves as a check on what other things might be wrong in your dataset. External validation can often be as simple as checking your data against a single number. We knew that there are only 50 states in the U.S. The exact details of how to calculate this are not important for this analysis.

Try the easy solution first

Recall that our original question was: Which counties in the United States have the highest levels of ambient ozone pollution? Because we want to know which counties have the highest levels, we will identify each county using a combination of the State.Name and the County.Name variables. You may refute that evidence later with deeper analysis.

Does that number of observations sound right? Taking Mariposa County as an example, we can take a look at how ozone varies through the year in this county by looking at monthly averages. Then we will split the data by month to look at the average hourly levels. For comparison we can look at the 10 lowest counties too (for example, Caddo County). We can check the data to see if anything funny is going on.
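Splitting one county's measurements by month and averaging them could look like the sketch below. The data are simulated for a single hypothetical county, and the names `Date.Local` and `Sample.Measurement` are assumptions. The last three months are left out on purpose to show how missing months surface.

```r
# Simulated daily values for one hypothetical county, January to September.
# Date.Local and Sample.Measurement are assumed column names.
set.seed(2)
dates <- seq(as.Date("2014-01-01"), as.Date("2014-09-30"), by = "day")
county <- data.frame(
  Date.Local = dates,
  Sample.Measurement = rnorm(length(dates), mean = 0.04, sd = 0.01)
)

# Does the number of observations sound right?
print(nrow(county))

# Build a month factor with all twelve levels so empty months stay visible.
month_num <- as.integer(format(county$Date.Local, "%m"))
county$month <- factor(month.name[month_num], levels = month.name)

# Average by month; months with no data come back as NA.
monthly <- tapply(county$Sample.Measurement, county$month, mean)
print(round(monthly, 4))
```

Keeping all twelve factor levels is a deliberate choice: a month with no measurements then shows up as `NA` instead of silently disappearing from the table.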

In fact some of the monthly averages are below the typical method detection limit of the measurement technology. So the shuffling process could approximate the data changing from one year to the next. Is this a problem? Would it affect our ranking of counties if we had those measurements?

Here we can see that the levels of ozone are much lower in this county and that three months are missing, including October, which matters given the seasonal nature of ozone.

You should always be thinking of ways to challenge the results. First we set our random number generator and resample the indices of the rows of the data frame with replacement. The statistical jargon for this approach is a bootstrap sample. We use the resampled indices to create a new dataset.
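A minimal sketch of that resampling step is shown below. The four counties and their values are simulated, and the seed and column names are arbitrary choices for this example.

```r
# Hypothetical county-level measurements (simulated, not EPA data).
set.seed(10234)
ozone <- data.frame(
  county = rep(c("A", "B", "C", "D"), each = 50),
  value  = c(rnorm(50, 0.05, 0.01), rnorm(50, 0.04, 0.01),
             rnorm(50, 0.03, 0.01), rnorm(50, 0.02, 0.01))
)

# Resample the row indices with replacement (a bootstrap sample).
N <- nrow(ozone)
idx <- sample(N, N, replace = TRUE)

# Use the resampled indices to create a new dataset.
ozone2 <- ozone[idx, ]

# Rank counties by mean level in the original and the resampled data.
rank_orig <- sort(tapply(ozone$value, ozone$county, mean), decreasing = TRUE)
rank_boot <- sort(tapply(ozone2$value, ozone2$county, mean), decreasing = TRUE)
print(cbind(original = names(rank_orig), resampled = names(rank_boot)))
```

A single resample is only a rough check. Repeating the resampling many times and tallying how often each county changes rank would give a fuller picture of stability.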

Challenge your solution

The easy solution is nice because it is easy. We addressed this by resampling the data once to see if the rankings changed. We can see that the rankings based on the resampled data (columns 4 to 6 on the right) are very close to the original. The State.Name column of the ranking lists California, California, Wyoming, Arizona, Arizona, California, Utah, California, Arizona, and Nevada, with Mariposa County, California, ranked first. We can also look at the bottom of the list to see if there were any major changes. This might suggest that the original rankings are somewhat stable.

The goal of exploratory data analysis is to get you thinking about your data and reasoning about your question. Do you have the right question?

Do you have the right data? Do you need other data? One sub-question we tried to address was whether the county rankings were stable across years.

It also gave us a number of things to follow up on in case we continue to be interested in this question. The example analysis conducted in this chapter was far from perfect.

Evidence for a hypothesis is always relative to another competing hypothesis. Data graphics should generally follow this same principle.

Principles of Analytic Graphics

Edward Tufte discusses how to make informative and useful data graphics and lays out six principles that are important to achieving that goal.

You should always be comparing at least two things. Each child was assessed at baseline and then 6 months later at a second visit.

Show comparisons

Showing comparisons is really the basis of all good scientific investigation. This study was conducted at the Johns Hopkins University School of Medicine, in homes where a smoker was living for at least 4 days a week. There were 47 children who received the air cleaner.

[Figure: Change in symptom-free days with air cleaner]

In the plot below, I show the change in symptom-free days for a group of children who received an air cleaner and a group of children who received no intervention. Such a display may suggest hypotheses or refute them.

[Figure: Change in symptom-free days by treatment group]

The hypothesis behind air cleaners improving asthma morbidity in children is that the air cleaners remove airborne particles from the air, which is relevant given that the homes in this study all had smokers living in them.

This pattern shown in the plot above is consistent with the idea that air cleaners improve health by reducing airborne particles. In this case we are tracking fine particulate matter.

[Figure: Change in symptom-free days and change in PM2.5]

Show multivariate data

The real world is multivariate. There are a variety of ways that you can show multivariate data. Seasonal variation in mortality, for example, can be easily shown in a plot of mortality and date.

Here is just a quick example. Each point on the plot represents the average PM10 level for that day (measured in micrograms per cubic meter) and the number of deaths on that day. The PM10 data come from the U.S. Environmental Protection Agency and the mortality data come from the U.S. National Center for Health Statistics. Note that the PM10 data have been centered (the overall mean has been subtracted from them), which is why there are both positive and negative values.

From the plot it seems that there is a slight negative relationship between the two variables. Other factors may be at work, however. One example is the season: PM10 levels tend to be high in the summer and low in the winter.

What happens if we plot the relationship between mortality and PM10 by season? That plot is below. One should never let the tools available drive the analysis.

This set of plots illustrates the effect of confounding by season.

Integrate evidence

Just because you may be making data graphics.

Describe and document the evidence

Data graphics should be appropriately documented with labels. A general rule for me is that a data graphic should tell a complete story all by itself. You should not have to refer to extra text or descriptions when interpreting a plot.

This example illustrates just one of many reasons why it can be useful to plot multivariate data and to show as many features as intelligently possible.

Imagine if you were writing a paper or a report. I tend to err on the side of more information rather than less. You can also include printed numbers.

[Figure: Labelling and annotation of data graphics]

The plot on the left is a default plot generated by the plot function in R. The plot on the right uses the same plot function but adds annotations like a title.
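The left/right contrast described above can be reproduced roughly as follows. The data are simulated and the title and axis labels are invented for this sketch, so the figure only mimics the layout of the original.

```r
# Simulated stand-in for centered PM10 and a mortality-like response.
set.seed(3)
x <- rnorm(100)
y <- -0.2 * x + rnorm(100)

# Two panels side by side; par() returns the old settings so we can restore them.
op <- par(mfrow = c(1, 2))

# Left: default plot, no annotation beyond what plot() supplies itself.
plot(x, y)

# Right: same call with a title, axis labels, and a solid plotting symbol.
plot(x, y,
     main = "Simulated PM10 and mortality",
     xlab = "PM10 (centered)",
     ylab = "Mortality (simulated units)",
     pch = 20)

par(op)
```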

I plot the same data twice (this is the PM10 data from the previous section of this chapter). Is there enough information on that graphic for the person to get the story?

Content

Analytical presentations ultimately stand or fall depending on the quality of their content. This includes the question being asked and the evidence presented in favor of certain hypotheses. Starting with a good question matters, because no amount of visualization magic or bells and whistles can make poor data useful.

Key information included is where the data were collected (New York). Often color and plot symbol size are used to convey various dimensions of information. I encourage you to take a look at Edward Tufte's books, including Beautiful Evidence (Graphics Press LLC).

Exploratory Graphs

There are many reasons to use graphics or plots in exploratory data analysis. Visualizing the data via graphics can be important at the beginning stages of data analysis to understand basic properties of the data.

Characteristics of exploratory graphs

Exploratory graphs are usually made very quickly and a lot of them are made in the process of checking out the data. The goal of making exploratory graphs is usually developing a personal understanding of the data and to prioritize tasks for follow up. Details like axis orientation or legends are secondary at this stage. This distinction is not a very formal one.

Air Pollution in the United States

The data we will be using come from the U.S. Environmental Protection Agency (EPA). One key question we are interested in is: Are there any counties in the U.S. that exceed the national ambient air quality standards? This question has important consequences because counties that are found to be in violation of the national standards can face serious legal consequences.

Getting the Data

Data on daily PM2.5 are available from the EPA web site.

This dataset contains the annual mean PM2.5 levels. Back to the question: How can we see if any counties exceed the standard of 12 micrograms per cubic meter?

One Dimension

For one-dimensional summaries, several simple tools are available. A five-number summary can be computed with the fivenum function. Boxplots are a visual representation of the five-number summary plus a bit more information. Barplots are useful for visualizing categorical data, and a barplot can be made with the barplot function. Histograms show the complete empirical distribution of the data, and the hist function makes a histogram. The density function computes a non-parametric estimate of the distribution of a variable.
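As a rough illustration of these one-dimensional tools, the sketch below applies them to simulated values standing in for county-level PM2.5. The `pollution` data frame and its `pm25` column are assumptions for this example.

```r
# Simulated annual mean PM2.5 values for 500 hypothetical counties.
set.seed(4)
pollution <- data.frame(pm25 = rgamma(500, shape = 16, rate = 1.6))

# Five-number summary: minimum, lower hinge, median, upper hinge, maximum.
five <- fivenum(pollution$pm25)
print(five)

# summary() adds the mean to a similar set of numbers.
print(summary(pollution$pm25))

# Boxplot, histogram, and density estimate of the same variable.
boxplot(pollution$pm25, col = "blue")
hist(pollution$pm25, col = "green")
plot(density(pollution$pm25))

# Count how many simulated counties exceed the 12 microgram standard.
n_over <- sum(pollution$pm25 > 12)
print(n_over)
```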

These points might be worth looking at individually, since these counties appear to have very high levels. Note that the current national ambient air quality standard is 12 micrograms per cubic meter.

Histogram

A histogram is useful to look at when we want to see more detail on the full distribution of the data.

The boxplot is quick and handy. We can get a little more detail if we use the rug function to show us the actual data points. The hist function has a default algorithm for determining the number of bars to use in the histogram based on the density of the data (see the help page for hist).
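A histogram with a rug and reference lines might be drawn as in this sketch. The values are simulated, and the choice of `breaks = 100` is just one way to override the default number of bars.

```r
# Simulated PM2.5 values (not real county data).
set.seed(4)
pm25 <- rgamma(500, shape = 16, rate = 1.6)

# Histogram with more bars than the default; breaks is only a suggestion to hist().
h <- hist(pm25, col = "green", breaks = 100)

# The rug shows each individual data point along the x-axis.
rug(pm25)

# Overlay the 12 microgram standard and the median as vertical lines.
abline(v = 12, lwd = 2)
abline(v = median(pm25), col = "magenta", lwd = 4)
```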

Overlaying Features

Once we start seeing interesting features in our data, it can be useful to overlay them on the plot, for example on our boxplot above.

Barplot

The barplot is useful for summarizing categorical data. Here we have one categorical variable, the region. We use the table function to do the actual tabulation of how many counties there are in each region. We can see how many western and eastern counties there are with barplot.
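Tabulating a categorical variable and passing the result to barplot can be sketched like this. The region counts are made up for illustration.

```r
# Hypothetical region labels for 42 counties (counts are invented).
region <- factor(c(rep("east", 30), rep("west", 12)))

# table() does the tabulation; barplot() draws one bar per category.
counts <- table(region)
print(counts)
barplot(counts, col = "wheat")
```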


For investigating data in two dimensions and beyond, several kinds of plots are useful. Using multiple boxplots or multiple histograms can be useful for seeing the relationship between two variables. Scatterplots are the natural tool for visualizing two continuous variables, and transformations of the variables may help. A related display is similar in concept to a scatterplot but plots a 2-D histogram of the data; it can be useful for scatterplots that may contain many, many data points.

For visualizing data in more than 2 dimensions, plotting points with different colors or shapes is useful for indicating a third dimension. Plotting symbols with different sizes can also achieve the same effect when the third dimension is continuous. Spinning plots can be used to simulate 3-D plots by allowing the user to quickly cycle through many different 2-D projections so that the plot feels 3-D. These are sometimes helpful to capture unusual structure in the data. Conditioning plots and actual 3-D plots are further options.

Multiple Boxplots

One of the simplest ways to show the relationship between two variables is with side-by-side boxplots. Side-by-side boxplots are useful because you can often fit many on a page to get a rich sense of any trends or changes in a variable. Their compact format allows you to visualize a lot of data in a small space.

Multiple Histograms

It can sometimes be useful to plot multiple histograms. Since the region variable only has two categories, two histograms are enough to show the distribution of PM2.5 in each region, using the pollution data described above.
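Both displays can be sketched with simulated data as follows. The `pollution` data frame, its `pm25` and `region` columns, and the distributions used to generate them are assumptions for this example.

```r
# Simulated PM2.5 by region: 300 eastern and 100 western hypothetical counties.
set.seed(5)
pollution <- data.frame(
  region = factor(rep(c("east", "west"), times = c(300, 100)))
)
pollution$pm25 <- ifelse(pollution$region == "east",
                         rgamma(400, shape = 16, rate = 1.6),
                         rgamma(400, shape = 6, rate = 0.75))

# Side-by-side boxplots: one box per region via the formula interface.
boxplot(pm25 ~ region, data = pollution, col = "red")

# Two stacked histograms, one per region, on the same page.
op <- par(mfrow = c(2, 1), mar = c(4, 4, 2, 1))
hist(subset(pollution, region == "east")$pm25, col = "green",
     main = "east", xlab = "pm25")
hist(subset(pollution, region == "west")$pm25, col = "green",
     main = "west", xlab = "pm25")
par(op)
```

Stacking the histograms vertically keeps the two x-axes roughly aligned by eye. Passing a shared `xlim` to both calls would make them exactly comparable.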


Scatterplots

For continuous variables, scatterplots are the natural tool. Here is a scatterplot of latitude and PM2.5.

Using Color

If we wanted to add a third dimension to the scatterplot above, we could use color. Here we color the circles in the plot to indicate east (black) or west (red). To find out which color is mapped to which region, we can look directly at the levels of the region variable.

Multiple Scatterplots

Using multiple scatterplots can be necessary when overlaying points with different colors or shapes is confusing (sometimes because of the volume of data). Separating the plots out can sometimes make visualization easier.

Exploratory plots are also useful for exploring basic questions about the data and for judging the evidence for or against certain hypotheses.
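The colored scatterplot and its split into two panels might look like the sketch below. The data are simulated, and the `latitude`, `pm25`, and `region` column names are assumptions.

```r
# Simulated county data: latitude, PM2.5, and an east/west region label.
set.seed(6)
n <- 300
pollution <- data.frame(
  latitude = runif(n, 25, 49),
  pm25     = rgamma(n, shape = 16, rate = 1.6),
  region   = factor(rep(c("east", "west"), length.out = n))
)

# A factor passed to col is used via its integer codes: 1 = black, 2 = red.
with(pollution, plot(latitude, pm25, col = region))
abline(h = 12, lwd = 2, lty = 2)

# The order of the levels tells us which color goes with which region.
print(levels(pollution$region))

# Separate panels for west and east instead of overlaying colors.
op <- par(mfrow = c(1, 2), mar = c(5, 4, 2, 1))
with(subset(pollution, region == "west"), plot(latitude, pm25, main = "West"))
with(subset(pollution, region == "east"), plot(latitude, pm25, main = "East"))
par(op)
```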

Plotting Systems

There are three different plotting systems in R and they each have different characteristics and modes of operation. The three systems are the base plotting system, the lattice system, and the ggplot2 system. This chapter and this book will focus primarily on the base plotting system. The idea is that you start with a blank canvas and build up from there.
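The "start with a canvas and build up" idea can be illustrated with two small base plots using datasets that ship with R. The title text is a guess based on the caption fragment that survives in the text.

```r
# First plot: draw the points, then add a main title as a separate step.
data(cars)
with(cars, plot(speed, dist))
title("Speed vs. Stopping distance")

# Second plot: draw the points, then add a smooth curve with lines().
data(airquality)
sm <- with(airquality, loess.smooth(Temp, Ozone))  # handles missing Ozone values
with(airquality, plot(Temp, Ozone))
lines(sm)
```

Each call after `plot` adds to the existing picture, which is what makes the base system feel like drawing on a canvas.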

Another typical base plot is constructed by using the plot function to draw the points on the scatterplot and then the title function to add a main title to the plot. The lines function is used to annotate or add to the plot. We will go into more detail on what these functions do in later chapters.

[Figure: Base plot with title]

The base plotting system is nice in that it gives you the flexibility to specify these kinds of details to painstaking accuracy, but having to specify them can be a burden. This is one problem that the ggplot2 package attempts to address.

The Lattice System

The lattice plotting system is implemented in the lattice package, which comes with every installation of R (although it is not loaded by default). To use the lattice plotting functions you must first load the lattice package with the library function. There is no real distinction between functions that create or initiate plots and functions that annotate plots because it all happens at once. These types of plots are useful for looking at multidimensional data and often allow you to squeeze a lot of information into a single window or page.
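Loading lattice and drawing a conditioned scatterplot in a single call could look like this. The use of the built-in `state.x77` and `state.region` data is an assumption that matches a life expectancy and income example, not necessarily the exact code behind the original figure.

```r
# lattice ships with R but has to be loaded explicitly.
library(lattice)

# Combine the built-in state statistics with each state's region.
# data.frame() turns the column name "Life Exp" into Life.Exp.
state <- data.frame(state.x77, region = state.region)

# One call specifies the whole plot: a scatterplot per region, four panels in a row.
p <- xyplot(Life.Exp ~ Income | region, data = state, layout = c(4, 1))

# At the console the plot prints automatically; in a script, print() it.
print(p)
```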

Lattice plots tend to be most useful for conditioning types of plots. The entire plot is specified at once via a single function call. One downside with the lattice system is that it can sometimes be very awkward to specify an entire plot in a single function call (you end up with functions with many, many arguments).

An example of a lattice plot looks at the relationship between life expectancy and income and how that relationship varies by region in the United States. The plot itself contains four panels, one for each region, and within each panel is a scatterplot of life expectancy and income. The notion of panels comes up a lot with lattice plots because you typically have many panels in a lattice plot (each panel typically represents a condition).

The ggplot2 System

The ggplot2 plotting system attempts to split the difference between base and lattice in a number of ways, taking cues from lattice. In a typical plot made with the ggplot2 package, the defaults make many choices for you. There are additional functions in ggplot2 that allow you to make arbitrarily sophisticated plots.
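A small ggplot2 plot is sketched below. ggplot2 is a contributed package, so the code checks whether it is installed before using it. The `mpg` dataset and the choice of `displ` and `hwy` are assumptions for illustration.

```r
# ggplot2 is not part of base R, so only run the example if it is installed.
p <- NULL
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  # Map engine displacement and highway mileage to x and y, then add points.
  p <- ggplot(mpg, aes(x = displ, y = hwy)) + geom_point()
  print(p)
} else {
  message("ggplot2 is not installed; skipping the example")
}
```

Axis labels, grid lines, and spacing all come from ggplot2's defaults here, which is the kind of choice the system makes on your behalf.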

Graphics Devices

A graphics device is something where you can make a plot appear. On a Mac the screen device is launched with the quartz function. The code for most of the key graphics devices is implemented in the grDevices package. There are also graphics devices that have been created by users and these are available through packages on CRAN. The list of devices supported by your installation of R is found in ?Devices.

For quick visualizations and exploratory analysis, the screen device is the usual choice. For plots that may be printed out or be incorporated into a document, a file device is more appropriate.

The Process of Making a Plot

When making a plot one must first make a few considerations (not necessarily in this order): Will the plot appear on the screen or in a file? Is there a lot of data going into the plot, or is it just a few points? Which graphics system will be used? Base graphics are usually constructed piecemeal. Lattice graphics are usually created in a single function call. The ggplot2 system combines concepts from both base and lattice graphics but uses an independent implementation. These generally cannot be mixed.

There are two basic approaches to plotting. The first is most common. This involves calling a plotting function like plot and then annotating the plot if necessary.

The second approach involves:

1. Explicitly launch a graphics device
2. Call a plotting function to make a plot
3. Annotate the plot if necessary
4. Explicitly close the graphics device with dev.off

For example, a plot can be saved in a PDF file this way. File devices are useful for creating plots that can be included in other documents or sent to other people. Available file devices include XML-based scalable vector graphics and bitmap files in the TIFF format.

Plotting can only occur on one graphics device at a time. The default graphics device is almost always the screen device. The currently active graphics device can be found by calling dev.cur. You can change the active graphics device with dev.set.

For example you might copy a plot from the screen device to a file device. Note that copying a plot is not an exact operation.
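The four-step file-device approach, and switching between two open devices, can be sketched as follows. Temporary files are used so the example leaves nothing behind in the working directory. The `faithful` dataset and the title are arbitrary choices.

```r
# Step 1: explicitly launch a file device (nothing appears on screen).
out <- tempfile(fileext = ".pdf")
pdf(file = out)

# Step 2: call a plotting function; step 3: annotate.
with(faithful, plot(eruptions, waiting))
title(main = "Old Faithful Geyser data")

# Step 4: close the device, otherwise the file is incomplete.
dev.off()
print(file.exists(out))

# Two devices open at once: plotting goes to whichever one is active.
f1 <- tempfile(fileext = ".pdf")
f2 <- tempfile(fileext = ".pdf")
pdf(f1); d1 <- dev.cur()   # remember the first device's number
pdf(f2); d2 <- dev.cur()   # the newest device becomes the active one
dev.set(d1)                # switch back to the first device
plot(1:10)                 # this plot is written to f1, not f2
dev.off(d1)
dev.off(d2)
```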

The Base Plotting System

The core plotting and graphics engine in R is encapsulated in the following packages: graphics and grDevices. The graphics package contains the code for actually constructing and annotating plots. The grDevices package was discussed in the previous chapter and it contains the functionality for sending plots to various output devices.

There are two phases to creating a base plot: initializing a new plot, and annotating (adding to) an existing plot. The base graphics system has many global parameters that can be set and tweaked. These parameters are documented in ?par.
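Querying and temporarily changing a few global parameters might look like this. The particular parameters shown are a selection chosen for this sketch, not a list taken from the text.

```r
# Query the current values of some global graphics parameters.
print(par("lty"))    # line type
print(par("col"))    # plotting color
print(par("pch"))    # plotting symbol
print(par("mar"))    # margin sizes: bottom, left, top, right
print(par("mfrow"))  # plots per page: rows, columns

# Set parameters globally; par() returns the previous values.
op <- par(mar = c(4, 4, 2, 1), las = 1)

# Arguments given to a plotting function override the globals for that call only.
plot(1:10, pch = 20, col = "blue")

# Restore the earlier settings.
par(op)
```

Saving the return value of `par` and restoring it afterwards keeps a change from leaking into later plots.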

Base Graphics

Base graphics are used most commonly and are a very powerful system for creating data graphics.

In this case the monthly boxplots show some interesting features. The pattern seen there is common with environmental data, where the mean and the variance are often related to each other.

Scatterplot

Here is a simple scatterplot made with the plot function. One thing to note here is that although we did not provide labels for the x- and y-axes, labels were created automatically. This can be useful when you are making plots quickly.
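The histogram, monthly boxplots, and scatterplot discussed here can be sketched with the `airquality` dataset that ships with R. That this is the dataset behind the original figures is an assumption.

```r
library(datasets)

# Histogram of ozone; calling hist() opens a device if none is open yet.
hist(airquality$Ozone)

# Monthly boxplots: convert Month to a factor so each month gets its own box.
aq <- transform(airquality, Month = factor(Month))
boxplot(Ozone ~ Month, aq, xlab = "Month", ylab = "Ozone (ppb)")

# Scatterplot; axis labels default to the variable names Wind and Ozone.
with(airquality, plot(Wind, Ozone))
```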

Some Important Base Graphics Parameters

Many base plotting functions share a set of global parameters. These parameters can be overridden when they are specified as arguments to specific plotting functions.

Base Plotting Functions

The most basic base plotting function is plot. The plot function makes a scatterplot. Calling plot will draw a plot on the screen device and open the screen device if not already open. The remainder of this chapter will focus on the default behavior of the plot function. First we make the plot with the plot function and then add a title to the top of the plot with the title function.


I start with the same plot as above (although I add the title right away using the main argument to plot) and then annotate it by coloring blue the data points corresponding to the month of May. Notice that the initial plot can be constructed without its data points. This is a common paradigm, as plot will draw everything in the plot except for the data points inside the plot window. Then you can use annotation functions like points to add data points.
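That paradigm might be sketched as follows with the built-in `airquality` data. The use of `type = "n"` to suppress the points, the colors, and the legend are assumptions about how such a plot would typically be built.

```r
library(datasets)

# Set up the plot region, axes, and title, but draw no points (type = "n").
with(airquality, plot(Wind, Ozone,
                      main = "Ozone and Wind in New York City",
                      type = "n"))

# Add the May observations in blue and all other months in red.
may <- subset(airquality, Month == 5)
other <- subset(airquality, Month != 5)
with(may, points(Wind, Ozone, col = "blue"))
with(other, points(Wind, Ozone, col = "red"))

# A legend documents what the colors mean.
legend("topright", pch = 1, col = c("blue", "red"),
       legend = c("May", "Other Months"))
```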
