Datafest, Stanford University, May 19-20, 2012
Welcome to the Datafest wiki! This wiki is part of the event. We'll be posting more information on logistics soon, so please check back often. Everyone is free to use the wiki to prepare for the contest, contribute ideas, find teammates, and so on. During the Datafest, we'll use the wiki to post projects for the jury members to evaluate and the general public to see.
Jen-Hsun Huang Engineering Center, Lower Level 475 Via Ortega. Stanford University S Service Rd, CA 94305
Please plan to arrive around 9:00 am on Saturday for breakfast and registration.
Stanford's Internet Policy:
The event organizers have created guest accounts for participants to access Stanford wi-fi. We will be held responsible if you download anything illegally from the web, such as pirated music. Please don't get us in trouble.
Parking is free on the weekend everywhere at Stanford.
Parking Structure 2 at 285 Panama St, Stanford, CA 94305 is very close to the Huang Engineering Center:
The Oval is also convenient:
More info about parking here:
Visitor parking map:
Please add information about yourself (URLs, Twitter name, projects, affiliation, etc.). [aside: The Twitter hashtag for this event is #datafest.]
- Dr. KRS Murthy, CEO, I Cubed, www.bigdataexpert.blogspot.com I organize numerous events every year, including on Big Data
- Tim Stutt, Student at UC Berkeley School of Information, I'm organizing a data event this October 
- John Kern, software engineer on a self funded sabbatical, (twitter, linkedin)
- Mohammad Almalkawi, software engineer, (twitter, linkedin)
- Tim Chen, blogger, NerdWallet, nerdwallet, linkedin, btw, we're hiring
- Danny Willis, database producer/data analyst, Bay Area News Group (Contra Costa Times/Oakland Tribune/San Jose Mercury News) (@DannyJWillis, email)
- Dave Gilson, senior editor, Mother Jones (MoJo campaign finance coverage, twitter)
- Rohan Mittal, Research Scholar, Lawrence Berkeley National Laboratory, (twitter, linkedin)
- Vamshidhar Reddy Boda, software engineer, (GitHub, linkedin)
- Corinne Horn, Stanford Electrical Engineering graduate student, EE/CS/math undergrad
- Rosie Cima, Stanford Journalism MA, Symbolic Systems HCI undergrad
- Jorge Imbaquingo, Knight Fellow at Stanford, Managing Editor, HOY Newspaper-Ecuador
- Joey Baker, Frontend Developer/Interaction Designer github, site, @joeybaker
- Mike Tahani, scraping/parsing, data analysis & visualization, (@mtahani, github, I ♥ MAPS)
- Karthik Manimaran, Entrepreneur, Data Hoarder & Chief Hacker - WeLink, (linkedin), Wants to join a team
- Ted Louie, data modeler/analyst, (@tedlouie, linkedin)
- Spencer MacColl, political accountant, former reporter/researcher at OpenSecrets.org, (twitter, linkedin)
- Jeffrey Goodman, editor and designer, Social Agency Lab - New Orleans, LA.
- Kathryn Hurley, Developer Programs Engineer, Google. Google APIs guru (Fusion Tables, Maps, Geocoding, and BigQuery)
- Ludi Rehak, Stanford Statistics student, Data Miner
- jose d lopez, coder (backend) (@TUMIS, AbelForAssembly), Volunteer to apply your skills to a current campaign, CA AD18
- Jeff Ubois, archivist, twitter: jeffubois
- Sheba Najmi, Code for America Fellow, (twitter, linkedin). We really want data scientists to apply to be CfA fellows for 2013!
- Patricia Carbajales, Geospatial Manager, Stanford University (bit.ly/geotraining)
- Michael Dale, Post-doctoral Scholar, Global Climate Energy Project, Stanford University
- Cinna Julie Wu, Applied Math PhD Candidate, UC Berkeley, (linkedin)
- Kevin Lin, grad student, UC Berkeley
- Prashant Mehta, Analytics Professional
- Namkyu Ryoo, Interaction Designer at Morningstar (Finance Research Company), (linkedin)
- Andreas Paepcke, Computer Science researcher, Stanford University, (Stanford University)
- 9:00 am - Breakfast and registration
- 10:00 am - Opening remarks:
- Ann Grimes, Director, Graduate Program in Journalism, Stanford University. Bio
- Adam Bonica, Assistant Professor of Political Science, Stanford University. Bio
- Ethan Phelps-Goodman, Senior Software Developer, The Sunlight Foundation. Bio
- 10:30 am - Team formation
- 11:30 am - Participants start working
- 12:30 pm - Lunch
- 6:00 pm - Dinner
- 11:00 pm - Doors close. But you can keep working inside the building if you wish.
- 8:00 am - 10:00 am - Breakfast. You can start working any time after 8:00 am
- 12:00 pm - Lunch
- 3:00 pm - Prepare for presentations
- 3:30 pm - Project demonstrations
- 4:30 pm - Knight Fellowship Director Jim Bettinger will announce the winners. Bio
- 5:00 pm - Dinner
- Best Overall Project
- Best in Apps and Widgets
- Best Insight
- Best Visualization
We will award a 46-inch Sony TV and a Blu-ray player, as well as gift cards.
Priorities, Prize Criteria & Rules
Our single highest priority is this: you are giving up your leisure time to spend the weekend working hard and contributing your skills and expertise, and we would like to see the value of your work extend beyond this weekend. This can be accomplished if the resulting projects offer the following:
- Interactivity. This would allow users to explore the data as they would like instead of presenting them with static graphics.
- Easy updates. This year’s election season will generate more data as it moves toward November 6. Next year, and in the longer term, still more data will become available. Your work should be able to update the underlying dataset automatically as new information becomes available.
- Simple manipulation. Widgets and apps that allow users to manipulate and mash data.
It is wonderful to share your data, and we would love it if you open-source it.
- Innovation: How innovative is the product you have developed?
- Potential: Even if your prototype is not finished, what is its potential moving forward?
- Design: We will look at the interactivity, easy updates and simple manipulation described above.
- For the Best Insight Category we will look at:
- Impact of your discovery
- Complexity of the analysis
- Scope of your discovery
Along with your work (an analysis result, visualization, widget or app), you are expected to submit a brief summary of the tools and methods used. Please also read the Project Submission Guidelines section below.
We have some wonderful news. Google is donating unlimited access to its App Engine service. It has also effectively lifted the per-user and file-size limits on Fusion Tables specially for Datafest participants. (Please note that query performance for tables above 100 MB does start to deteriorate, so it is advisable to stay below that level. However, there are tables in regular use that are much larger than that.) Please see Teresa Bouza if you need these extra resources. Also, Kathryn Hurley from Google's Developer Relations team is attending the event. A final note: the Google Fusion Tables team has just launched a new UI ("View experimental") that has a bunch of nice data exploration features (facets + multiple tabs). The link will only appear with tables.
R users may run into a memory limit. R does not natively handle datasets larger than main memory. One solution is Revolution R Enterprise from Revolution Analytics, which is free to academic users. Its RevoScaleR component is designed to remove the memory barrier, and Revolution R Enterprise as a whole is described as capable of handling terabyte-class data sets.
But, generally, everyone is welcome to use any tools and methods: from basic functions in a spreadsheet to the latest machine learning algorithms.
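For participants not using R, the same memory barrier can often be sidestepped by reading a file one row at a time and keeping only running totals. The following is a minimal Python sketch of that idea, not a recommended tool. The helper name `total_by_key` and the column names `recipient` and `amount` are made up for illustration and will not match any real dataset's headers.

```python
import csv
import io
from collections import defaultdict


def total_by_key(rows, key_field, amount_field):
    """Sum amount_field per key_field, holding only one row in memory at a time."""
    totals = defaultdict(float)
    for row in rows:
        try:
            amount = float(row[amount_field])
        except (KeyError, ValueError, TypeError):
            continue  # skip rows with a missing or malformed amount
        totals[row.get(key_field, "")] += amount
    return dict(totals)


# Tiny in-memory stand-in for a large CSV file (hypothetical columns).
sample = io.StringIO(
    "recipient,amount\n"
    "Candidate A,250\n"
    "Candidate B,1000\n"
    "Candidate A,50.5\n"
    "Candidate B,not-a-number\n"
)
totals = total_by_key(csv.DictReader(sample), "recipient", "amount")
print(totals)
```

With a real file, `csv.DictReader(open(path, newline=""))` can be passed in place of the in-memory sample. Memory use then grows with the number of distinct keys rather than the number of rows.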
If possible, please download the data you need onto your laptop before the actual event. We'll make an effort to have data available on USB thumb drives and CDs to meet contingencies. Still, the more people pre-download it, the better prepared we all are in case users aren't able to download it during the event, for whatever reason (network congestion, server failure, user error, etc).
Feel free to add more links to the list of relevant data sources below.
- Campaign finance data is available for the last 33 years, going back to 1979. The files and their descriptions are available on the FEC's website. (Note, however, that the raw data is quite difficult to parse. See Influence Explorer, below, for the same data in more accessible formats.)
- Influence Explorer contains a wide variety of influence-related datasets, including federal and state campaign finance, lobbying, federal spending, various enforcement datasets and more. Raw data can be searched and downloaded in Excel format from data.influenceexplorer.com or downloaded in bulk CSVs from data.influenceexplorer.com/bulk/.
- Fech, by the New York Times, is an API for parsing electronic candidate, PAC and party campaign filings from the Federal Election Commission.
- Redistricting data is available from the 2010 Census Redistricting Data Program
- Census Bureau
- Census Historical Files
- Follow the Money.org National Institute on Money in State Politics
- Bureau of Labor Statistics
- Data for project on state government in Louisiana
- Contributions to Super PACs
- Spending by Super PACs
- StateIntegrity.org In particular, see the corruption index. State Integrity has also given us the raw data behind their investigation. Please ask a staff member for more details.
Work on Data Done by Others Previously
Feel free to add more links to the list below.
- The New York Times Developer Network
A Campaign Finance API, with which you can retrieve data from FEC filings. To use the Campaign Finance API, you must sign up for an API key. Usage is limited to 5000 requests per day (rate limits are subject to change). Also see their FEC parser (ruby) fech and Ruby API wrapper Campaign Cash.
- The Computational Legal Studies Blog
Run by three scholars from the U. of Michigan and Princeton, the blog has several data visualizations like this: The Senate Campaign Contribution Network: A Visualization
This site has several examples of data analysis and visualization, some of them using R (in the blog section of the site). There is also a list of several open-source tools that can be used with campaign finance data (in the projects section); one example is Fechell, a ruby library for parsing FEC reports.
A free, 60-minute webcast titled “Using Social Network Analysis to Understand Campaign Finance”
- Sunlight Labs
Provides a data API for FEC data (and more). TransparencyData (now part of Influence Explorer) makes it easier to access FEC data. Super PAC and independent expenditure data is also available via the Follow the Unlimited Money file downloads.
Campaign finance data stories
The Origins of SuperPAC Money visualizations.
- The New York Times
- The Oklahoman
The DataWatch blog of Oklahoma’s largest newspaper has some analysis of the 2010 data
Some data visualizations
- The Wall Street Journal
- The National Institute on Money in State Politics
Research and a database focused on influence of campaign money on state-level elections and public policy in all 50 states.
- Forum One Communications
The Alexandria, Va-based web strategy and development firm's Datamasher.org allows you to create mash ups and visualizations of government data. The datasets include campaign finance data.
Feel free to suggest ideas for the participants to work on.
- Geocoding campaign contributions
- Associating campaign contributions by the district they are from (address) and the district they are going to (candidate)
- Comparing contributions pre- and post-redistricting
- Comparing socio-economic data (census, BLS data) with campaign contributions (How much of the funds comes from the poorest neighborhoods?)
- Analyzing campaign contributions by high-tech companies or their board members. Comparing this with 2008, and perhaps contrasting the apparent rise in funding from high-tech contributors with that from Wall Street contributors
- Looking at campaign finance figures mapped over voter turnout and results, and seeing how predictive they can be.
- Analyzing Louisiana data to show a political machine in action, Cajun style. Jeffrey Goodman, Editor at Social Agency Lab in New Orleans, has promised special prizes.
- Looking at super PAC donors by location, industry, past giving records
- Looking at major individual donors' giving over time: Have super PACs created a new group of megadonors and/or simply empowered existing donors?
- Are candidates with more super PAC support more likely to win elections/primaries? (Conversely, are candidates targeted by super PACs more likely to lose?)
- Compare money spent vs. polling data. It would be great to add in a counter of potential money left to spend (combo of campaign funds, like PAC funds, etc…). It could also be fun to add in a division of how money was spent (campaign staff vs. tv ad spend vs. mailers, etc…)
- Try to ascertain the flow of contributions from public donations to PACs over the timeline of available data (adjusted for inflation, demographics, purchasing power and excluding outliers)
- Ratio of public to PAC donations over the timeline of the data (adjusted for inflation, demographics, purchasing power and excluding outliers)
- Bundling is very interesting (bundlers have outsized influence), but in general not disclosed. Two exceptions are the required disclosure when registered lobbyists bundle and the voluntary disclosure by many presidential candidates. Could this data be used to identify recognizable patterns of bundling, and then detect other non-disclosed bundlers?
- Related to the above idea, could we automatically detect groups of aligned donors, based on things like location, employer, timing of donation and recipient? Could we make guesses that a group of people attended a fundraiser together? Or identify the social network of large donors?
- Two big weaknesses of the FEC data are that donors are identified by name only--not a unique ID--and that employer companies aren't reported in a standardized way. The Center for Responsive Politics addresses these with manual standardization of the data, but that process is slow and labor intensive. Could automated methods tell when two donor records with similar names are actually the same person? Or when two employer names written slightly differently are actually the same company?
- Map visualizations: there are tons of different visualizations that could be made if someone has the technical skills to geocode and map data.
- Campaign finance can tell you what politicians a company is interested in. Lobbying reports can tell you what issues and bills they're interested in. Combine the two, and you may be able to see *why* a company gave to a particular politician.
- How is state-level giving different from national-level giving? When do big national orgs give at the local level? Are there orgs that give locally but not nationally? Are there interesting instances of local giving through subsidiaries that aren't obviously related to the parent company? Does the party breakdown change by locality--that is, will companies give heavily to Democrats in one locality while giving heavily to Republicans in another?
- Charities and advocacy (not specifically campaign finance): can we correlate charitable giving by companies with advocacy by the charity that aligns with the companies' positions? Inspired by AT&T and net neutrality debate: AT&T gave to various grassroots groups, who then lobbied against net neutrality rules.
- Tea party: how does tea-party candidates' fundraising differ from that of other Republicans? Does it change once they're incumbents?
- Investigate SuperPACs: Who's the treasurer and staff? Where else do they work? What's their giving? Do other orgs/PACs share an address? A building? Who are the donors? Can we trace the anonymous-sounding corporate donors to actual people?
- Independent Expenditures: What was the effect on the 2010 races? When did outside spending sway the race? Under what conditions is there heavy outside spending? Are there typical strategies as far as timing? What about independent expenditures during primary races? Primaries are interesting because the spending is against members of one's own party.
- Geocoding data for all 33 years onto a single interactive map.
- Text Mining: Politicians' voting records, speeches and their offices' press releases are online. The full text of news reports and transcripts of TV talk shows are available online and to anyone with access to Lexis-Nexis. Could analyzing these texts shed more light on the direction, size, influence and effect of money flows in campaign finance?
- Can software developers produce widgets or apps that make it easy to update the data visualizations produced at the Datafest? This would help extend the value of the event beyond the May 19 weekend and into the rest of the election season.
- What data is available about financing of city or county elections? Some cities (try NY, LA and Chicago) collect and release political contribution information. A survey of what local governments have this information would be very interesting. Or pick one local government and see what you can find in the data.
- Campaign spending is often overlooked. It would be great to have a method for comparing spending across committee types (PAC, House, Senate), identifying the outliers and figuring out why their spending is so different or how it changed over time. The FEC electronic filings are available online and have expenditure information. Tools like FEC-Scraper and FECH can harvest the data.
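One of the ideas above asks whether automated methods could tell when two donor or employer records with slightly different spellings refer to the same entity. As a starting point only, here is a minimal Python sketch using the standard library's `difflib`. The `normalize`, `similarity` and `likely_same` helpers and the 0.85 threshold are illustrative assumptions, not anyone's established method, and real FEC records would need much more careful handling (middle initials, suffixes, shared addresses and so on).

```python
import re
from difflib import SequenceMatcher


def normalize(name):
    """Lowercase, replace punctuation with spaces, and sort the tokens,
    so that 'SMITH, JOHN' and 'John Smith' normalize to the same string."""
    tokens = re.sub(r"[^a-z0-9 ]", " ", name.lower()).split()
    return " ".join(sorted(tokens))


def similarity(a, b):
    """Similarity ratio in [0, 1] between two normalized names."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def likely_same(a, b, threshold=0.85):
    """Guess whether two records refer to the same entity.
    The threshold is an arbitrary starting point and should be tuned."""
    return similarity(a, b) >= threshold


print(likely_same("SMITH, JOHN", "John Smith"))    # identical after normalization
print(likely_same("Google Inc.", "Google, Inc"))   # identical after normalization
print(likely_same("John Smith", "Maria Lopez"))    # clearly different names
```

Comparing every record against every other record does not scale to millions of rows, so a practical version would first group candidates by something cheap (zip code, first letter of the last name) and only compare within groups.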
- Team Awesome
- Team Gophers
- Team Z
- Team KeyStoners
- Team Donkey Versus Elephant$
- Team G8tor
- Team Black Cobra
- Team Most Excellent
- Team Frienemies
- Team Eagle
- Team Oatmeal: no presentation.
Project Submission Guidelines
By midnight Saturday, May 19:
Please create a page for your project or team in the "Projects & Teams" section of this page. Include the following information:
- Your project or team name.
- A brief description of what you are working on.
- A list of people on the team.
- What prize category you think your project fits in.
- What you are starting with: existing codebase, libraries, etc.
By 3:00 pm Sunday, May 20:
Please add to your page the following information:
- A brief summary of your work.
- A step-by-step explanation of how you reached the result(s). Screenshots will be much appreciated.
- Any links such as to a code repository, Google docs, etc.
Comments & Questions
- I think a lot of the questions above can be (broadly) answered with good maps. Unfortunately, that requires a good deal of geocoding. Does anyone have access to a free, quality geocoder w/out rate limits for use this weekend? -- @mtahani
- Re: geocoding limits. We have great news. Google is being very generous. They have lifted Fusion Tables limits specially for the Datafest. Details are in the "Tools" section above.