5 Keys to Exploring Your Data
Describing and exploring your data is a crucial first step when working with a new dataset. Data exploration is important because it establishes the foundation for the rest of your analysis. It helps you form expectations about how models will perform, and it can influence dashboard and visualization design. Data exploration also provides an opportunity to detect data integrity issues before they influence your results, and it may even prevent you from sharing erroneous results.
You should always begin your analysis by calculating descriptive statistics and plotting your data.
There are 5 key points to describe during data exploration:
- Central Tendencies
- Spread / Variation
- Overall Shape
- Anomalous Observations
- Trends / Correlations
1. Central Tendencies
The mean is the statistic most often reported when discussing the average of data. It is simply the sum of all observations divided by the count of all observations. While this is usually an effective way to summarize the center of data, it can be heavily influenced by outliers or skewness.
The median is a more robust statistic for the center of the data. To calculate the median, list your data in order and eliminate the highest and lowest data points until you arrive at the middle value. The median is robust because outliers carry little weight in the calculation; they are eliminated before the middle data point is reached.
The mode tells us the most frequently occurring data point. While this statistic is not reported very often, it can be helpful in understanding your dataset.
The quartiles are useful to know because they provide detail on the skewness of the data. The quartiles split your ordered data into four quarters. The median is the middle quartile, or 50th percentile, of the data. The other two quartiles are the 25th and 75th percentiles. If the spread between the lower quartiles is smaller than the spread between the upper quartiles, you can infer that the data is skewed to the right; in other words, most data points sit on the lower side of your dataset, with a long tail stretching toward higher values.
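The four measures above can be computed directly with Python's standard `statistics` module. The order counts below are a hypothetical sample, chosen so the outlier's pull on the mean is visible:

```python
import statistics

# Hypothetical sample: daily order counts, with 95 as an obvious outlier
orders = [12, 15, 15, 18, 20, 22, 95]

mean = statistics.mean(orders)      # 28.14... — pulled upward by the 95
median = statistics.median(orders)  # 18 — unaffected by the outlier
mode = statistics.mode(orders)      # 15 — the most frequent value
q1, q2, q3 = statistics.quantiles(orders, n=4)  # quartile cut points: 15, 18, 22
```

The gap between the mean and the median is already a hint that the data is skewed.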
2. Spread / Variation
The range of your data is the spread between the minimum and maximum. Calculating the range will enable you to tell if data falls outside of logical limits, and it will help you understand how far apart data points can be.
The interquartile range, or IQR, is the spread between the 25th and 75th percentiles, or the 1st and 3rd quartiles. This tells you where the middle 50% of the data falls.
The variance is used to determine the spread of the data around the mean. To calculate the variance, you first subtract the mean from each data point. This is the distance between the data point and mean. Then you square all of these distances, which makes everything positive and emphasizes data points that are further away from the mean. Lastly, you divide by the number of observations in the data minus one, which gives you an average of these squared distances.
The standard deviation is the square root of the variance, which converts the calculation back to the units of the data.
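Continuing with the same hypothetical order counts, the spread measures also fall out of the standard library; note that `statistics.variance` uses the n − 1 divisor described above:

```python
import statistics

orders = [12, 15, 15, 18, 20, 22, 95]

data_range = max(orders) - min(orders)        # 83: full spread, inflated by the outlier
q1, _, q3 = statistics.quantiles(orders, n=4)
iqr = q3 - q1                                 # 7: where the middle 50% of the data sits
var = statistics.variance(orders)             # sample variance, n - 1 divisor
std = statistics.stdev(orders)                # square root of variance, in the data's units
```

A tiny IQR next to a huge range is another fingerprint of an outlier.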
3. Overall Shape
The most commonly occurring distribution is the normal distribution. This is the famous “bell shaped curve,” where most of the data falls around the mean and the remaining data tapers off equally toward the minimum and maximum. There are many other standard distributions that occur often in different circumstances; some examples are the beta, gamma, uniform, and chi-square distributions. Plotting your data is a great visual method to determine which distribution it follows.
The skewness of a dataset refers to whether the data is heavily weighted toward the bottom or the top of its range. If data is skewed, it lacks symmetry.
Kurtosis is the thickness of the tails relative to the normal distribution. There are two types of kurtosis that may occur: platykurtic distributions contain less data in the tails than the normal distribution, and leptokurtic distributions contain more data in the tails.
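Skewness and excess kurtosis can be sketched from the standardized third and fourth moments. This is a minimal hand-rolled version (libraries such as SciPy provide tested equivalents); it reports excess kurtosis, so a normal distribution scores 0 and negative values indicate a platykurtic shape:

```python
import statistics

def shape(data):
    """Sample skewness and excess kurtosis from standardized moments."""
    n = len(data)
    mu = statistics.fmean(data)
    sd = statistics.pstdev(data)  # population sd, as the moment formulas expect
    skew = sum((x - mu) ** 3 for x in data) / (n * sd ** 3)
    kurt = sum((x - mu) ** 4 for x in data) / (n * sd ** 4) - 3  # excess kurtosis
    return skew, kurt

shape([1, 2, 3, 4, 5])   # symmetric: skewness 0, negative (platykurtic) kurtosis
shape([1, 2, 2, 3, 15])  # long upper tail: positive skewness
```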
4. Anomalous Observations
Outliers are data points that deviate heavily from the normal trend. They can be a major problem for many statistical models and calculations. You can account for outliers by identifying them and removing them from the dataset, or by reducing the weight you give to those data points relative to the others.
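One common, simple way to flag candidate outliers is Tukey's fences built from the IQR described earlier. This sketch uses the conventional k = 1.5 multiplier; anything it flags should still be inspected rather than dropped automatically:

```python
import statistics

def iqr_outliers(data, k=1.5):
    """Return points outside the fences [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(data, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [x for x in data if x < low or x > high]

iqr_outliers([12, 15, 15, 18, 20, 22, 95])  # flags the 95
```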
An influential point is an observation that heavily influences the results of an analysis, even though it does not have to deviate from the overall data trend. For example, if you are building a regression model that has an influential point, the slope of the line will change significantly based on the inclusion or exclusion of that single data point.
An erroneous observation is a data point that was recorded incorrectly, for example through a data collection mistake. These observations will need to be corrected or removed where possible.
Other Common Data Integrity Issues
There are many other common data integrity issues that may arise. In some instances, a person inputs data into the system by hand, and separate people may record the same information differently. If these types of issues are apparent, they will also need to be addressed.
5. Trends / Correlations
Time Series Data
Time series data are collected sequentially, typically ordered by time. Recognizing that your dataset is time-based is an important step in determining which types of analysis will work, and it lets you observe trends over time before models are built.
Most datasets contain variables that are interrelated. It is important to identify how variables are related and to note when they summarize similar information, because many machine learning techniques suffer when highly correlated features are included. It is also important to understand how different data tables are related through keys and indices.
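One quick way to spot variables that summarize similar information is to scan pairwise correlations. This is a small self-contained sketch: the feature values are hypothetical, and the 0.9 threshold is an arbitrary cutoff you would tune to your problem:

```python
from itertools import combinations
from statistics import fmean

def corr(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    mx, my = fmean(xs), fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

def redundant_pairs(features, threshold=0.9):
    """Flag feature pairs whose |correlation| exceeds the threshold."""
    return [(a, b) for a, b in combinations(features, 2)
            if abs(corr(features[a], features[b])) > threshold]

# height_cm is just height_in rescaled, so the pair is redundant
features = {
    "height_in": [60, 62, 65, 70, 72],
    "height_cm": [152, 157, 165, 178, 183],
    "age": [25, 41, 33, 52, 19],
}
redundant_pairs(features)  # flags only the height pair
```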
Another data trend to explore is lagged data. Two variables may not appear related, but a lagged version of one variable may actually relate to the other. For example, when forecasting financial data, macroeconomic variables may not have a direct relationship with the target variable: if the unemployment rate rises 2% today, there may be no immediate effect until several weeks later, when people start running out of savings.
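The lag idea can be sketched by correlating one series against a shifted copy of the other. The series below are made up so that the target simply echoes the driver two periods later, and the helper names are my own:

```python
from statistics import fmean

def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    mx, my = fmean(xs), fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

def lagged_corr(driver, target, lag):
    """Correlate the driver against the target shifted back by `lag` periods."""
    if lag == 0:
        return pearson(driver, target)
    return pearson(driver[:-lag], target[lag:])

# Made-up series: the target echoes the driver two periods later
driver = [1, 3, 2, 5, 4, 6, 5, 8]
target = [0, 0, 1, 3, 2, 5, 4, 6]
lagged_corr(driver, target, 0)  # weaker contemporaneous relationship
lagged_corr(driver, target, 2)  # near-perfect at the true lag
```

Scanning several lags like this, before modeling, can reveal delayed relationships that a plain correlation would miss.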
These five steps will enable you to gain an understanding of your datasets. They will allow you to determine the best technique to answer your business question, and they will give you an idea of when results are unexpected. If you enjoyed learning about the 5 keys to explore your data, please let me know in the comments section below!