My first adventure with data exploration

The open source program R is widely used by statisticians because it is a very powerful tool for dealing with large data sets. With the help of Leonid Pekelis, I am exploring ways to make it better known and easier to use among journalists.

Here is a preview of the results of our first experiment. We plan to include a step-by-step explanation soon. We used this data set about US economic and military aid from Data.gov, a US government initiative that aims to increase public access to datasets produced by the Executive Branch of the Federal Government. http://explore.data.gov/catalog/raw/

These data cover U.S. economic and military assistance by country from 1946 to 2009. This is the authoritative data set of U.S. foreign assistance. It is used to report U.S. foreign assistance to Congress as required by the Foreign Assistance Act, Section 634.

Our first steps:

1. We got the data from the US government website
2. We looked at the data in Excel and made some basic observations about the data set
3. We plotted (visualized) the data over time
4. We clustered the countries into ten groups
5. We clustered the countries a second time, with the amounts adjusted for inflation
6. We looked at the per-capita aid and mined association rules from it. Retail companies have done that for years to find patterns in consumer behavior. That’s how a drug store company found out that on Thursdays male customers who bought diapers between 5:00 and 7:00 pm also bought beer.

This is an overview of the entire dataset. The data here is not adjusted for inflation. Every line corresponds to a country and tracks how much aid the US government gave to that country over the last 60 years. This allowed us to quickly visualize where most of the US government aid went. Countries are grouped into 10 clusters, each shown in its own color, based on the amount of money they received over time. Cluster number 1 included 158 countries, so we decided to make the picture cleaner by drawing just one line for that group. The line represents the average amount received. We did the same with Cluster number 2, which includes 27 countries.
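To make the clustering step more concrete, here is a minimal sketch in Python (not the R code we ran). It assumes a k-means style method, which is one common way to group time series and may not be the exact algorithm behind the picture. The country names, the amounts and the function name kmeans are all made up for illustration. The last step shows how a cluster's center doubles as the "average line" drawn for a crowded group.

```python
from math import dist

def kmeans(series, k, iterations=20):
    """Toy k-means over equal-length yearly aid series (lists of floats).

    Assumption: plain k-means with Euclidean distance. This is an
    illustrative stand-in, not necessarily the method used in the post.
    """
    # Deterministic start: use the first k series as the initial centers.
    centers = [list(s) for s in series[:k]]
    labels = [0] * len(series)
    for _ in range(iterations):
        # Assign each country to its nearest center.
        labels = [min(range(k), key=lambda c: dist(s, centers[c])) for s in series]
        # Move each center to the year-by-year mean of its members.
        for c in range(k):
            members = [s for s, lab in zip(series, labels) if lab == c]
            if members:
                centers[c] = [sum(vals) / len(members) for vals in zip(*members)]
    return labels, centers

# Made-up yearly aid amounts for four hypothetical countries.
aid = {
    "A": [1.0, 1.0, 2.0],
    "C": [90.0, 80.0, 95.0],
    "B": [1.0, 2.0, 1.0],
    "D": [2.0, 1.0, 1.0],
}
names = list(aid)
labels, centers = kmeans([aid[n] for n in names], k=2)

# centers[0] is the average line for the cluster of small recipients.
for name, label in zip(names, labels):
    print(name, "-> cluster", label)
```

In this toy data the three small recipients end up in one cluster and the large one gets a cluster of its own, which mirrors the "interesting versus uninteresting" split described in this post.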

•••

This shows some of the 158 countries in Cluster number 1.

•••

This shows the countries in clusters two to ten. One thing clustering allowed us to do was to split the countries into two categories: interesting and uninteresting. Countries with their own group, like the United Kingdom or Iraq, have aid patterns very different from those of any other country. Egypt and Israel, in cluster 7, have similar aid over time, but different from everyone else's.

•••

We also looked at the aid adjusted for inflation. The clusters are still basically the same. The one change is that Great Britain after World War II now stands out as the recipient of the single biggest amount of US aid since the 1940s.
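Adjusting for inflation means converting each year's amount into dollars of one common base year. A minimal Python sketch of that conversion follows. It assumes a price index is available for each year. The index values and the function name to_constant_dollars are made up for illustration, and this post does not say which deflator was used.

```python
def to_constant_dollars(nominal, year, price_index, base_year):
    """Convert a nominal amount from `year` into `base_year` dollars."""
    return nominal * price_index[base_year] / price_index[year]

# Hypothetical price index values, NOT real figures.
price_index = {1950: 25.0, 2009: 200.0}

# 100 dollars of aid given in 1950, expressed in 2009 dollars.
real = to_constant_dollars(100.0, 1950, price_index, 2009)
print(real)  # 800.0 with the made-up index above
```

Because prices were much lower in the 1940s and 1950s, early aid grows the most under this conversion, which is why a large post-war recipient can stand out only after the adjustment.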

•••

Names of countries from clusters 1, 2 and part of 3

•••

And the same for clusters 4 to 10.

•••

These are some association rules that we found. The data is aid per capita. We asked R to look for rules describing countries in the top 10%, meaning the countries that got the most money. The probability of the left hand side (lhs) and the right hand side (rhs) happening together is very high in this scenario, so we call them “high support rules.” For example, look at the first line: in 73% of the years, both Israel and Jordan were in the top 10% of military aid. This is what “support” means. We have a “confidence” of 92% that Israel is in the top 10% of military aid in a year when Jordan is also there. The “lift” tells us how much more likely the right side becomes once we know the left side of the rule holds. In this case Israel is 17% more likely to be at the top of military aid if Jordan is there too.
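For readers who want to see how the three numbers relate, here is a minimal Python sketch (again, not the R code we ran). Each "transaction" is one year, holding the set of countries in the top 10% that year. The yearly sets and the function name rule_metrics are made up for illustration, so the results do not match the figures in the table.

```python
def rule_metrics(transactions, lhs, rhs):
    """Return (support, confidence, lift) for the rule lhs => rhs.

    Assumes lhs and rhs each appear in at least one transaction.
    """
    n = len(transactions)
    both = sum(1 for t in transactions if lhs in t and rhs in t)
    lhs_count = sum(1 for t in transactions if lhs in t)
    rhs_count = sum(1 for t in transactions if rhs in t)
    support = both / n                  # share of years with lhs and rhs together
    confidence = both / lhs_count       # share of lhs years that also have rhs
    lift = confidence / (rhs_count / n) # confidence relative to rhs's base rate
    return support, confidence, lift

# Made-up data: one set of top-10% countries per year.
years = [
    {"Jordan", "Israel"},
    {"Jordan", "Israel"},
    {"Jordan", "Israel"},
    {"Jordan"},
    {"Israel"},
    {"Egypt"},
    set(),
    set(),
]

support, confidence, lift = rule_metrics(years, "Jordan", "Israel")
print(support, confidence, lift)
```

A lift above 1 means that knowing the left side makes the right side more likely than its base rate; a lift of exactly 1 means the left side tells us nothing.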

•••

This shows the rules with the highest “lift”. Lift tells us how much the probability of something happening increases once you know the left side of the rule holds. These rules are attractive because they would really give you an edge in a bet on which country receives which aid. Consequently, we believe they better represent the tagline “if the US gives aid here, it will probably give aid there as well.” Look at the first line. Iraq is about eight times as likely to be in the top 10% of economic aid if Afghanistan is in the top 10% of military aid.

 
