You already know that data is the bread and butter of reports and presentations. Data makes your presentation solid. It backs up the ideas you are selling. It gives people reasons to listen to you.

However, digging for data is a struggle. Finding reputable, legitimate sources is hard, especially in this digital age.

To make your life easier, we have put together a list of useful databases that you can bookmark. Here are eight useful databases for you to dig for data (and a couple hundred more).

1. Freebase

Freebase is an open platform for data sharing. It covers a wide range of topics, from fictional characters to Modest Mouse. You can even curate your data with its data plotting feature, which lets you plot your datasets on a timeline or a map.

2. UN Data

This database contains large datasets comprising virtually all the public data collected by the United Nations. To access the API you have to sign up (it only takes a couple of minutes).

3. WorldBank

Where else to look for financial data of the world but the WorldBank? You can get virtually any country’s financial and economic standing here. Some other topics included are:

  • Agriculture & Rural Development
  • Aid Effectiveness
  • Economic Policy and External Debt
  • Education
  • Energy & Mining
  • Environment
  • Financial Sector
  • Health
  • Infrastructure
  • Labor & Social Protection
  • Poverty
  • Private Sector
  • Public Sector
  • Science & Technology
  • Social Development
  • Urban Development

4. Data.gov

Data.gov is leading the way in democratizing public sector data and driving innovation. This movement has spread throughout cities, states, and countries. Here are 5 of its 50+ categories:

  • Agriculture
  • Arts, Recreation, and Travel
  • Banking, Finance, and Insurance
  • Births, Deaths, Marriages, and Divorces
  • Business

5. Infochimps

Infochimps contains paid and free datasets on just about anything. What’s cool about Infochimps is that you can download datasets in CSV format. What’s more, you can fiddle with the API to extract the data specific to your needs. Try Twitter as your search metric and you will see what I mean.
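
For a sense of what working with a downloaded CSV looks like, here is a minimal Python sketch using the standard `csv` module. The column names and rows are made-up placeholders, not real Infochimps data, and the in-memory string stands in for a downloaded file.

```python
import csv
import io

# Placeholder CSV text standing in for a downloaded file (hypothetical data).
sample_csv = """screen_name,followers
example_user_a,120
example_user_b,45
"""

def load_rows(csv_text):
    """Parse CSV text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(csv_text)))

rows = load_rows(sample_csv)
# Values arrive as strings, so convert before doing arithmetic.
total_followers = sum(int(row["followers"]) for row in rows)
print(len(rows), total_followers)
```

With a real download you would pass an open file (for example `open(path, newline="")`) to `csv.DictReader` instead of the `io.StringIO` wrapper.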

6. Google Public Data

The Google Public Data Explorer makes large datasets easy to explore, visualize and communicate.

7. Google Scholar

Google Scholar is a free search engine that indexes all kinds of academic literature. Citing journal publications, university research papers, and other scholarly materials makes your content look not just smarter but also more trustworthy.

8. Data Market

Data Market contains in-house and third-party datasets. It’s a good place to explore data related to economics, healthcare, food and agriculture, and the automotive industry.

And here’s a random collection of datasets.

  • Torrent downloads and uploads on Pirate Bay
  • Social media & networks – from Stanford University
  • Human Emotions by We Feel Fine: shared so that other artists can more easily make pieces that explore these human emotions
  • LittleSis profiles who’s who in the biggest organisations in the world
  • NY Times bestseller
  • Trending Topics: Trending Topics serves Hot Wikipedia Topics daily. It gets you the top hits on Wikipedia by search query.
  • Google Flu Trends
  • NY Times People: User data for com, including the user profiles, activities, news feeds, and networks.
  • CrunchBase: Plenty of information about startups and large tech companies
  • Google Analytics
  • Social networks: Facebook/ Twitter/ Pinterest/ LinkedIn
  • Project management tools: Basecamp
  • Sales management tools: Salesforce
  • Survey tools: SurveyMonkey
  • Photo sharing tools: Flickr
  • Email marketing: MailChimp

You can also find a huge number of datasets and related resources at Datamob.

DataWrangling has a large volume of datasets from a wide range of fields. To make it easier for you, we have reproduced the list below. Do note, however, that the list may not be up to date, as it was last updated in 2009. Even so, it’s still a good place to start digging for data.

Tips on using this list: each link comes with tags. You can search by keyword to find the appropriate database for your needs.
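
If you would rather search the tags with a script than by eye, the rough Python sketch below shows one way to do it. It assumes each entry is a line of the form `Title (tags: a, b, c)`, as in the list further down; the function names `parse_entry` and `find_by_tag` are only illustrative, and the three sample entries are copied from that list.

```python
import re

# Matches "Title (tags: a, b, c)" where the tag group closes the line.
TAGS_RE = re.compile(r"^(?P<title>.*?)\s*\(tags:\s*(?P<tags>[^)]*)\)\s*$")

def parse_entry(line):
    """Split one list entry into (title, set of tags); None if no tags found."""
    text = line.strip().lstrip("•").strip()
    match = TAGS_RE.match(text)
    if match is None:
        return None
    tags = {t.strip().lower() for t in match.group("tags").split(",") if t.strip()}
    return match.group("title"), tags

def find_by_tag(lines, keyword):
    """Return titles of entries whose tags include the keyword (case-insensitive)."""
    keyword = keyword.lower()
    results = []
    for line in lines:
        entry = parse_entry(line)
        if entry and keyword in entry[1]:
            results.append(entry[0])
    return results

entries = [
    "• UN General Assembly Voting Data (tags: un, voting, statistics, government)",
    "• Enron Email Dataset (tags: enron, corpus, email, text, social, network)",
    "• Subsidyscope.com (tags: government, banking, csv, tarp, bailout)",
]

print(find_by_tag(entries, "government"))
```

Entries without a well-formed `(tags: ...)` group are simply skipped, which keeps the sketch tolerant of the odd broken line.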

Happy data digging, people!

    • Announcing the Article Search API – Open Blog – NYTimes.com (tags: article, api, nytimes, text, corpus, newspaper)
    • Twitter API Wiki / REST API Documentation: Social Graph Methods (tags: graph, network, api, social, twitter)
    • Information Extraction: The RISE Repository of Information Sources (tags: information, textmining, extraction, reviews, jobs)
    • Using the Wikipedia link dataset — Henry Haselgrove (tags: graph, network, link, wikipedia, pagerank)
    • Visualizing the Growth of Target, 1962-2008 | FlowingData (tags: visualization, retail, finance, gis, map, location, store, via:magnetbox, target)
    • The Economy According To Mint (tags: finance, commercial, consumer, mint, spending)
    • Repositories (tags: links, textmining, books, rdf, ocr, documents)
    • Subsidyscope.com (tags: government, banking, csv, tarp, bailout)
    • Best Buy Remix – Welcome to the Best Buy Remix Developer Network (tags: retail, data, api, product, bestbuy)
    • twibs : find the businesses on twitter (tags: directory, businesses, twitter, companies)
    • True Marble Imagery – Free Download (tags: gis, geo, map, mapping, images, satellite)
    • Massive Scrape of Twitter’s Friend Graph « blog.infochimps.org – Organizing Huge Information Sources (tags: textmining, twitter, network, socialnetwork, pagerank, graph, queryminer)
    • Twitter Scrape (rough draft) – get.theinfo | Google Groups (tags: twitter, socialnetwork, graph)
    • API Documentation — BackType (tags: api, blog, comments, textmining, stream, trends, backtype, queryminer)
    • dbpedia.org : Downloads 32 (tags: wikipedia, named_entity, rdf, ontology)
    • CinC Challenge 2000 datasets (tags: timeseries, machinelearning, ecg, health, medical, sleep, apnea)
    • Free book usage data from the University of Huddersfield » “Self-plagiarism is style” (tags: books, library, borrowing, recommender, isbn, recommendation, collaborative, filtering, opendata)
    • UC Berkeley. Sheldon Margen Public Health Library. Statistical/Data Resources (tags: health, links, resources, publichealth, berkeley)
    • ICWSM 2009 – International AAAI Conference on Weblogs and Social Media (tags: blog, crawl, corpus, network, web, link)
    • BART – For Developers (tags: urban, transportation, feeds, public, sanfrancisco, bart, api)
    • Tim Davis: UF Sparse Matrix Collection : sparse matrices from a wide range of applications (tags: sparse, matrix)
    • Others Online – Behavioral Targeting, Analytics and Advertising Service for Publishers, Ad Networks, Widgets, WiFi Networks (tags: analytics, audience, segmentation, toolbar, commercial, sem, search, advertising)
    • HumanScan : BioID : Downloads : BioID Face Database (tags: face, detection, image)
    • Face Detection (tags: facerecognition, opencv, face, links)
    • Building a (fast) Wikipedia offline reader (tags: django, wikipedia, compressed, textmining, howto)
    • gov: The Obama-Biden Transition Team | Join the Discussion: Healthcare (tags: textmining, opinion, comment, topic, government, queryminer)
    • UN General Assembly Voting Data (tags: un, voting, statistics, government)
    • NORB Object Recognition Dataset, Fu Jie Huang, Yann LeCun, New York University (tags: image, 3d)
    • Reddit’s Secret API (tags: reddit, api, json)
    • Amazon Web Services Public Datasets » Data Wrangling Blog (tags: amazon, ebs, ec2, s3, publicdata, hadoop)
    • Amazon Web Services (AWS) Hosted Public Datasets (tags: amazon, ebs, publicdata)
    • Executive PayWatch Database (tags: ceo, compensation, pay, economics, business, labor)
    • Research Datasets :: CID Data :: Center for International Development at Harvard University (CID) (tags: economics, international, development)
    • NACDA: Search Holdings (tags: aging, statistics, studies)
    • LIFE photo archive hosted by Google (tags: images, photo, pictures, search)
    • Main Task QA Data (tags: question, answering, trec, nlp, machinelearning)
    • ADL Gazetteer Development (tags: named_entity, location, placenames, geo, nlp)
    • The New York Times Annotated Corpus « YooName – named entity recognition (tags: named_entity, nytimes, corpus, people, organizations, locations)
    • downloading – flossmole – Google Code – How to get FLOSSmole data for your own use (tags: opensource, project, activity, mysql, dump)
    • Google Flu Trends | How does this work? (tags: google, health, trends, search, prediction, epidemiology, biodefence, queries, queryminer)
    • Multi-Domain Sentiment Dataset (tags: sentiment, review, product, amazon)
    • Chris Pound’s Name Generation Page (tags: bizarre, scifi, phrase, name, word, generators, random, perl)
    • TradingSolutions – Data Sources (tags: trading, finance, s, api, list)
    • Announcing the New York Times Campaign Finance API – Open – Code – New York Times Blog (tags: nyt, api, campaign, donations, fec)
    • Beautiful Data – WikiContent (tags: book, data, wiki, via:jhammerb)
    • public domain sounds | free sound library (tags: sound, publicdomain, audio)
    • Netflix API – Welcome to the Netflix Developer Network (tags: netflix, api, movie, mashup, netflixprize, ratings)
    • Data Catalog (tags: dc, government, feeds, transparency, opendata)
    • Open beats Closed: Best Buy’s new APIs – O’Reilly Radar (tags: retail, bestbuy, api)
    • Voter registration data; or, HERE IS YOUR HOPE, YOU FOOLS! « The Edge of the American West (tags: voter, registration, politics, 2008)
    • Tickermine (tags: custom, research, retail, finance, market, service, analyst)
    • Linked Movie Data Base (tags: rdf, movies, movie, api)
    • Big Huge Thesaurus API: Access 145,000 Words and Phrases (tags: webservice, api, thesaurus, textmining, nlp, rest)
    • import/parse/fec.py at master from aaronsw’s watchdog — GitHub (tags: fec, python, parser, government, campaign)
    • The Watchdog Project: volunteer (tags: government, transparency, parsing, election, python)
    • Dataset of the day: Where are the Obamacans? | Off the Map – Official Blog of FortiusOne (tags: obama, government, mashup, gis, geo, map, campaign, donations)
    • Activity Recognition: Datasets, Bibliography and others (tags: activity, recognition, intent)
    • Normalized Campaign Contribution Data (tags: cmu, politics, campaign, donations, fec, via:jhammerb, government)
    • YouTube Dataset (tags: youtube, research, crawl, socialnetwork, network, graph, web)
    • CRAWDAD (tags: wireless, RF, radio, signal, dartmouth, network)
    • API Documentation – Twitter Development Talk | Google Groups (tags: twitter, text, api)
    • Web FAQ collection | ILPS (tags: faq, question_answering, questions, web, crawl, corpus, xml, textmining)
    • Yahoo! Music API – YDN (tags: api, yahoo, music, artists)
    • Search Query Performance report – Google AdWords Help Center (tags: adwords, ppc, search, metrics, webanalytics, sem, query, queryminer)
    • Wordze Keyword Research Tool (tags: queryminer, keyword, tool, research, commercial, search, adwords)
    • Frontal Face Databases (tags: facerecognition, face, image, recognition)
    • Searchable Catalogs of Data (tags: links, catalogs, social)
    • Download Database – baseball1.com (tags: baseball, database, publicdata, statistics, sports)
    • radiohead – Google Code (tags: lidar, visualization, radiohead, google, video)
    • 80 Million Tiny Images (tags: images, words, english, search, visualization, imagemap)
    • Time Series Center | Harvard University (tags: timeseries, anomaly, detection, astronomical, physics)
    • OpenVisuals – Open Source Visualization Framework (tags: visualization, community, design, processing)
    • BGN: Domestic Names – State and Topical Gazetteer Download Files (tags: gis, usgs)
    • NGA: Country Files (tags: country, cities, geo)
    • Datasets (tags: benchmark, clustering, regression, machinelearning, list, statistics, mathematics)
    • Isomap Datasets (tags: nonlinear, dimensionality, reduction, faces, digits, images, manifold)
    • Yahoo! Search Blog: BOSS — The Next Step in our Open Search Ecosystem (tags: api, open, search, yahoo, BOSS, queryminer)
    • Download the Database – IP Address Lookup – Community Geotarget IP Project (tags: geocoding, geoip, internet, ip, ipaddress, mysql)
    • Airline Data Project (tags: airline, statistics, finance, revenue, location, travel)
    • Reddit.com: Ask Reddit: Where to download a DB dump of Reddit? (tags: reddit, socialnetwork, news, web)
    • Show Us a Better Way: What public data is already available? (tags: statistics, census, uk, school, news, publicdata)
    • Collaborative filtering dataset – dating agency (tags: collaborative, filtering, dating, rating, profiles, czech)
    • About Us – Predictify (tags: predictionmarket, tool, finance, buzz, advertising, marketing, startup, mmds, david_kellogg)
    • VGChartz.com | Video Games, Charts, News, Forums, Reviews, Wii, PS3, Xbox360, DS, PSP (tags: sales, ranking, videogames, retail)
    • Store Level Information (tags: retail, finance, sales, store)
    • Code for querying and downloading Flickr images (tags: image, python, code, flickr, matlab, recognition)
    • Image Parsing Datasets (tags: image, recognition)
    • TAGora » Data (tags: tag, tagging)
    • TAGora » Data (tags: netflixprize, imdb, sparql)
    • OHPI – Traffic Volume Trends (tags: government, traffic, statistics, trends, transportation)
    • PigTutorial – Pig Wiki (tags: search, log, query, web, excite, queries, hadoop, pig, tutorial, mapreduce, parallel, queryminer)
    • Quality of Life Grand Challenge Dataset: Kitchen Capture (tags: machinelearning, motion, capture, sensor)
    • Summize Twitter Search API (tags: api, buzz, opinion, trends, text, twitter, summize, search)
    • 2008 IEEE InfoVis Contest Dataset (tags: visualization, contest, scalability, motion, tracking, pedestrian, sensor)
    • IMDb Pro : Scary Movie 4: Box office (tags: movie, revenue, sales, box_office, imdb, commercial, movie_study)
    • Spider-Man 2 (2004) – Daily Box Office Results (tags: movie, revenue, box_office)
    • Live Search : xRank™ Celebrity — check out who’s hot and who’s not! (tags: search, query, volume, trends, celebrity, prediction, buzz, named_entity)
    • IMDbPro.com Free Trial Signup (tags: movie, revenue, timeseries, imdb, commercial, subscription)
    • Free time-series and micro-data to download (tags: economics, links)
    • PyGTrends: Python API for Google Trends Data (tags: google, trends, search, web, analytics, api, code, python, hack, keyword, query, forecasting, indicator, finance)
    • Official Google Blog: A new flavor of Google Trends (tags: google, trends, search, query, api, csv, keyword, timeseries)
    • Open Research – the Data: Lastfm-ArtistTags2007 – Duke Listens! (tags: fm, music, tagging, artists, tags, collaborative, filtering)
    • i2b2: Informatics for Integrating Biology & the Bedside (tags: medical, obesity)
    • Tiger Data Set Lecture (tags: tiger, gis, lectures)
    • Google To Launch Large Scale Geo-Services (tags: geo, google, gps, location, geolocation, cell, wifi, api, gis)
    • fm’s Playground (tags: celebrity, misspelling, spelling, names)
    • ImportGenius.com : U.S. Customs Database and Competitive Intelligence Tools (tags: commercial, shipping, imports, exports, finance, datamining)
    • Directory Listing of Betfair price files (tags: betting, prediction, betfair, price, csv, predictionmarket)
    • Reuters Spotlight – Article and Media API (tags: news, text, articles, api, content, media, xml, images, publicdata)
    • DataSets – Scikits – Trac (tags: scipy, python, machinelearning, statistics, resource)
    • [Wikitech-l] page counters (tags: wikipedia, pageviews, trends, textmining, seo, topic)
    • Wikipedia article traffic statistics (tags: via:chl, wikipedia, web, analytics, seo, topic, textmining, traffic)
    • Yahoo! Internet Location Platform – YDN (tags: yahoo, geo, geocoding, location, landmarks, gis)
    • How to find images on the internet « Random knowledge (tags: images, links, lists, archive)
    • Yahoo offers geographic data to Web sites | Tech news blog – CNET News.com (tags: gis, webservice, yahoo, api, location, landmark)
    • Instructions for Obtaining Search Engine Transaction Logs (tags: query, search, log, excite, altavista, alltheweb, transaction)
    • TechTC – Technion Repository of Text Categorization Datasets (tags: datamining, textmining, categorization, classification, odp, directory, text)
    • The TechTC-100 Test Collection for Text Categorization (tags: textmining, classification, category, odp, directory)
    • FEC Election Contributions: Download Detailed Files by Election Cycle (tags: individual, donations, government, election, publicdata, fec)
    • Juiced Google Analytics Python API: Juice Analytics (tags: search, statistics, keywords, analytics, api, python, web, seo, google, google_analytics, juice)
    • Country Name and ISO 3166 Code MySQL Import File (tags: mysql, states, countries, isocode)
    • Semantic Search the US Library of Congress (tags: via:inkdroid, libraries, mashup, rdf, semantic, search, semanticweb, books, api, webservice)
    • geocoded Hotels « GeoNames Blog (tags: hotels, geonames)
    • GeoNames webservice and data download (tags: locations, cities, countries, gis)
    • Index of /download/worldcities (tags: cities, gis)
    • ualberta dependency based thesaurus and word count data (tags: corpus, text, similarity, terms)
    • CommonCrawl – About (tags: web, crawler, bot)
    • Datasets and corpus / corpora for biological literature and text mining, information extraction and information retrieval and document classification (tags: bioinformatics, text, corpora, domainspecific, genomics, corpus)
    • Office of Defects Investigation (ODI), Flat File Downloads (tags: defect, recall, automobile, fightclub, nhtsa, safety)
    • p2psim – kingdata : DNS server latency network distance matrices (tags: distance, matrix, network, p2p, dns, latency, nmf, queryminer)
    • Sep Kamvar / Personalization / (tags: pagerank, web, matrix, matlab)
    • opentick.com (tags: opentick, trading, beta, feeds, finance)
    • WikiXMLDB: Querying Wikipedia with XQuery (tags: wikipedia, xml, ec2)
    • kiwitobes.com » Blog Archive » Walmart Growth Video (tags: walmart, visualization, video, freebase, store, retail, locations, opening)
    • Open Cell Id dataset – phone geolocation from GSM cellids (tags: gis, mobile, geolocation)
    • The Cornell Web Lab – The Cornell Web Lab (tags: cornell, web, archive, hadoop, crawl)
    • im2gps: estimating geographic information from a single image (tags: imagerecognition, via:csantos, gis, cmu, gps, imageprocessing, paper, hack, freaking_awesome)
    • Datasets: MUSCLE WP2 Evaluation, Integration and Standards (tags: image, video, audio, currency, sports, imagerecognition)
    • Open Economics – Store – Index (tags: economics, list)
    • welcome @ omdb (tags: free, movie, database, netflixprize)
    • Cogblog » Blog Archive » Cogmap APIs (tags: api, cogmap, person, name, organization, record_linkage)
    • Wal-Mart : Freebase – The World’s Database (tags: retail, locations, stores)
    • Cogmap: The Org Chart Wiki (tags: record_linkage, identity, name, organization, orgchart, marketing)
    • German English Parallel Corpus “de-news”, Daily News 1996-2000 (tags: german, translation, corpus, english, text, via:maxme)
    • Welcome to the CRCNS data sharing activity website — CRCNS (tags: neuroscience, patch, clamp, recordings, neuron, timeseries, patchclamp, data, neural, cortex, visual)
    • org: Free Redistributable Rich Datasets (tags: aggregator, links)
    • Frequent Itemset Mining Dataset Repository (tags: retail, clickstream, traffic, web, links, sales)
    • Dolores Labs Blog » Blog Archive » Our color names data set is online (tags: colormap, color, mechanicalturk)
    • TeradataUniversityNetwork.com -> Registration (tags: teradata, retail, transactional, database)
    • Pascal Learning Challenge Large Datasets (tags: large, competition, challenge, svm, machinelearning, scalability)
    • ECIS 2007 – The 15th European Conference on Information Systems (tags: retail, dillards, sams_club)
    • Alexa Web Search (tags: alexa, aws, web, search, api)
    • developerWorks Interviews: Massive data mining and the resurgent mainframe (tags: price, retail, transaction, sams_club, dillards)
    • University of Arkansas – Daily Headlines (tags: retail, dillards, uark)
    • Crime data bonanza!!! (tags: timeseries, crime, statistics, publicdata)
    • State and Federal Case Law (tags: creativecommons, court, legal, law, via:inkdroid)
    • Wikipedia:Lists of common misspellings/For machines – Wikipedia, the free encyclopedia (tags: spelling, misspelling, wikipedia)
    • Copyright Free and Public Domain Media (tags: images, audio, publicdata, maps, video, free)
    • Access to Web Research Collections VLC2/WT10g/WT2g (tags: blog, web, text)
    • Databases you can use for benchmarking (tags: image, vision, recognition)
    • Lyricsfly Lyrics API, database access to search for music artist and song title, protocol REST with XML document (tags: song, lyrics, database, api)
    • 2007 IEEE AVSS Detection and Tracking Algorithm Datasets (tags: tracking, video, detection, image, recognition, vehicle, pedestrian)
    • Eigenvector Research, Inc. : Datasets Available to Download (tags: NIR, spectra, chemistry, semiconductor, pharmaceutical, matlab)
    • OTCBVS (tags: image, recognition, detection, pedestrian, thermal, tracking, facerecognition, illumination)
    • 99 Wikipedia Sources Aiding the Semantic Web » AI3:::Adaptive Information (tags: links, directory, record_linkage, extraction, wikipedia, named_entity, recognition, textmining, semanticweb, paper)
    • UNdata (tags: UN, publicdata, government, statistics)
    • AudioScrobbler Data (tags: audioscrobbler, recommendation, collaborative, filtering, music)
    • The Linking Open Data dataset cloud (tags: directory, rdf, semantic, data, soup, graph)
    • Free Economic Data | Economic, Financial, and Demographic Data (tags: finance, economics, portal, links)
    • ::MLSP 2008::: MLSP competition (tags: machinelearning, trading, competition, backtest, matlab, code, finance, via:DeliciousRob)
    • Computer Vision Test Images (tags: computer, vision, image, ray, trace, fingerprint, stereo, detection, via:chl)
    • The Dataverse Network Project | The Dataverse Network Project (tags: statistics, repository, harvard)
    • DVN – Home (tags: harvard, repository, social, science, research, portal, links)
    • Ohio voter registration data (tags: voter, voting, politics, government, name, address, registration)
    • Voter List Data Files – Election Department, Clark County, Nevada (tags: voting, voter, registration, name, address, data, election, politics, government, nevada)
    • Temperature data (HadCRUT3 and CRUTEM3) (tags: climate, temperature, netcdf)
    • MNIST handwritten digit database, Yann LeCun and Corinna Cortes (tags: handwriting, mnist, image, recognition)
    • LFW : Labelled Faces in the Wild (tags: facerecognition, face, recognition, umass, image)
    • Making random contacts – (37signals) (tags: generator, names)
    • Test (Sample) Data Generators (tags: generator, tools, list, via:jd)
    • Compete – Compete Developer Resources (tags: compete, api, web, statistics, traffic, analytics, mashup)
    • Machine Learning (Theory) » The Peekaboom Dataset (tags: peekaboom, vision, image, large, human, computation, machinelearning, recognition)
    • Ocean Processes and Modeling: Ocean Data (tags: links, oceanography, satellite)
    • BlogoCenter datasets (tags: blog, ucla)
    • Tagged datasets for named entity recognition tasks (tags: nlp, corpus, tagged, named_entity, recognition, list)
    • icio.us stats – deli.ckoma (tags: del.icio.us)
    • The Financial Data Finder A – G (tags: finance, links)
    • Freebase Wikipedia Extraction (WEX) (tags: wikipedia, xml, structured, corpus)
    • The arXiv.org API (tags: arxiv, api, open, paper, academic)
    • England Football Results Betting Odds | Premiership Results & Betting Odds (tags: gambling, soccer, football, excel, statistics)
    • HughesData – Main – Hughes Lab (tags: rna, bioinformatics, microarray, expression, gene, machinelearning)
    • Stanford MicroArray Database (tags: bioinformatics, microarray, expression, gene, machinelearning, stanford)
    • ArrayExpress Home (tags: bioinformatics, microarray, expression, gene, machinelearning)
    • Gene Expression Omnibus (GEO) Main page (tags: bioinformatics, microarray, expression, gene, machinelearning)
    • Index of /courts.gov (tags: corpus, text, legal, law, court, ruling, opensource, publicdata)
    • Welcome to Openvest (tags: python, finance, edgar, pylons, matplotlib, sec, webservice, via:jolby)
    • Statistical Science Web: Datasets (tags: links, statistics)
    • Data Mining: Text Mining, Visualization and Social Media: TailRank, Spinn3r, TechMeme and TechCrunch: New Attention (tags: crawler, blog, corpus)
    • Aleix Face Database (tags: facerecognition, machinelearning, face, image)
    • Data Repository Evaluation (tags: umd, links, statistics, government, sports, via:rickladd)
    • PMC FTP Service (tags: biology, medicine, articles, text, journal, authors)
    • “uspop2002” data set (tags: music, similarity, machinelearning)
    • Internet Archive: Details: Amazon ASIN listing and similarity graph (tags: ASIN, amazon, recommendation, collaborative, filtering, via:keyvowel)
    • European Climate Assessment Daily Weather Data (tags: weather, europe, ascii, netcdf)
    • Poverty Datasets General Information (tags: poverty, statistics)
    • StatLib—Datasets Archive (tags: machinelearning, datamining, cmu, link, collection)
    • National Household Travel Survey (NHTS) Data (tags: driving, transportation, publicdata)
    • RealClearPolitics – Election 2008 – Democratic Presidential Nomination (tags: polls, politics)
    • Nielsen BookScan USA (tags: books, sales, commercial)
    • Pew Internet & American Life Project (tags: internet, demographics, online, web)
    • Home – Numbrary (tags: finance, data)
    • About – Numbrary (tags: searchengine, search, tagging, aggregator, numeric, extraction, tables, collaboration, web2.0, interface, billpoint)
    • Main Page – OpenTextMining (tags: textmining, open, nature, standards, search)
    • Metafilter Infodump (tags: metafilter, comments, network, via:chl)
    • WEBSPAM-UK2007 | Datasets | Web Spam Detection (tags: web, search, spam, crawler, yahoo)
    • Google to Host Terabytes of Open-Source Science Data | Wired Science from Wired.com (tags: google, article, openaccess)
    • Zillow – Labs – Neighborhood Boundaries (tags: neighborhoods, geo, gis, maps)
    • Trust network datasets – TrustLet (tags: socialnetwork, trustnetwork, trust)
    • Crime in the United States 2006 (tags: crime, fbi)
    • TaskForces/CommunityProjects/LinkingOpenData/DataSets – ESW Wiki (tags: opendata, semantic, rdf, collaboration)
    • Some Datasets Available on the Web » Data Wrangling Blog (tags: publicdata, links)
    • XML.com: GovTrack.us, Public Data, and the Semantic Web (tags: semanticweb, rdf, congress, politics, government)
    • CiteULike: Available datasets (tags: networks, research, graph, tags, paper, record_linkage)
    • Archive-It.org (tags: archive, internet, web, index)
    • Challenge: Synopsis – Causality Workbench (tags: competition, machinelearning, forecasting, contest)
    • Natural Language Processing (tags: microsoft, text, paraphrase, corpus)
    • LDC – Linguistic Data Consortium – Obtaining Data Resources (tags: nlp, text, corpus, ngram, google, commercial, license)
    • 1990 Census Name Files (tags: census, names, identity, frequency, record_linkage)
    • Given Name Frequency Project: Analysis of Given Name Popularity (tags: name, record_linkage, text, identity, code)
    • Email Datasets (tags: enron, names, identity, text, record_linkage)
    • ZoomInfo – Welcome to the ZoomInfo Developer API (tags: api, identity, people, webservice, record_linkage)
    • Ted Pedersen – Name Discrimination Data / Name Disambiguation Data / Name Ambiguity Data / Named Entity Resolution / Named Entity Disambiguation (tags: record_linkage, corpus, nlp, names)
    • Developers Area – eBay Market Data Documentation – eBay Market Data Documentation (tags: ebay, api, retail, price, code)
    • New SwetoDblp RDF dataset released with 11M triples (tags: name, authorship, rdf, record_linkage)
    • LSDIS : SwetoDblp (tags: bibliography, rdf, ontology, duplicate, name, record_linkage)
    • StrikeIron Super Data Pack Web Service 1.0 – StrikeIron Marketplace (tags: webservice, publicdata, datacleaning)
    • Vaccines: IIS/Tech/Deduplication Test Cases (tags: duplicate)
    • Duplicate Detection, Record Linkage, and Identity Uncertainty: Datasets (tags: duplicate, detection, record_linkage, datacleaning, text)
    • INFO 747 – Social and Economic Data (tags: datacleaning, record_linkage, video, lectures, course, cornell, economics, finance, publicdata)
    • Overstock.com Affiliate Program (tags: retail, overstock, sales, api, product, price, forecasting)
    • Amazon Web Services Developer Connection : Can Alexa WS provide detailed … (tags: finance, alexa, amazon, tech)
    • Market Data — eBay Developers Program (tags: ebay, retail, pricing, sales, api, product)
    • Health Data Tools and Statistics (tags: health, information, public, publicdata)
    • It’s a Pitch-by-Pitch Scouting Report, Minus the Scout – New York Times (tags: baseball, gameday)
    • opentick :: market data (tags: opentick, nasdaq, finance, stock)
    • Daily Kos: Obama helps us track $1,000,000,000,000 of federal spending (tags: corruption, government, politics, finance)
    • Welcome to USAspending.gov (tags: government, money, politics)
    • Campaign Finance Reports and Data (tags: campaign, politics, elections)
    • Machine Learning and Data Mining – Datasets (tags: face, image)
    • GIS for Schools (tags: epidemiology, gis, health)
    • Cardiac MRI dataset – York University (tags: mri, cardiac)
    • Google Trends API coming soon | Tech news blog – CNET News.com (tags: google, trends, api)
    • MIT Media Lab: Reality Mining (tags: social, activity, location, cell, gis)
    • RL Competition 2008 – Home (tags: machinelearning, reinforcement, agent, competition)
    • Vehicle Routing Datasets (tags: optimization, vehicle, routing)
    • EIA – Petroleum Data, Reports, Analysis, Surveys (tags: oil, energy, statistics, economics, petroleum)
    • DMOZ100k06 – Michael G. Noll (tags: search, pagerank, text, tags, content)
    • Grading (tags: machinelearning, CMU, course, projects, graphicalmodel, code, paper)
    • Carnegie Mellon University – CMU Graphics Lab – motion capture library (tags: gait, pedestrian, walk, motion)
    • Financial Forecast Center’s Historical Economic and Market Data (tags: exchangerate, dollar, economics)
    • Bureau of Labor Statistics Data (tags: economics, lumber, building, materials, homedepot)
    • Browse Business Cycle Indicators Data (tags: economics, indicators, time, series)
    • The Numbers Guy : Aspiring to Be the Wikipedia of Numbers (tags: finance, numberpedia, mechanicalturk, textmining, statistics)
    • Social characteristics of the Marvel Universe (tags: socialnetwork, graphs, comicbooks)
    • net: Word Lists Collection (tags: dictionary, words)
    • ERS/USDA Data – International Macroeconomic Data Set (tags: usda, economics, population, cpi, gdp, income)
    • State Agency Databases – GODORT (tags: government, directory, links, wiki, states)
    • The 2000 U.S. Census: 1 Billion RDF Triples (tags: gis, census, rdf, semantic, sparql)
    • See Who’s Editing Wikipedia – Diebold, the CIA, a Campaign (tags: wikipedia, authorship)
    • Dataset Generator – Perfect data for an imperfect world. (tags: tools, generator)
    • National Bureau of Economic Research: Data (tags: economics, links)
    • Entree Chicago Recommendation Data (tags: recommender, collaborative, restaurant)
    • community resource guide: i’ve been here before – show me the links (tags: demographics, maps, gis, statistics, links)
    • Social Science Data on the Net (tags: economics, social, government, health, labor, links)
    • NBI ASCII Files – Bridge – FHWA (tags: government, bridges, safety)
    • List of films: A – Wikipedia, the free encyclopedia (tags: netflix, netflixprize, movie, index, wikipedia)
    • The arXiv on your harddrive (tags: paper, corpus, arXiv)
    • Insanely Useful Websites | Sunlight Foundation (tags: links, transparency, government, politics, congress, reference)
    • Technophilia: Where to find public records online – Lifehacker (tags: public, records, links)
    • Junk email project (tags: corpus, email, spam, textmining)
    • Enron Email Dataset (tags: enron, corpus, email, text, social, network)
    • ftp://ftp.bls.gov/pub/special.requests/cpi/cpiai.txt (tags: finance, cpi, inflation, data)
    • GOS – Geospatial One Stop (tags: health, gis, epidemiology, links)
    • CIA Factbook Grep in Python (tags: cia, population, python, code, grep)
    • Miller Center of Public Affairs – Richard Nixon – Oval Office Recordings (tags: nixon, speech, tapes, audio, mp3, wav, flac)
    • Deborah Jeane Palfrey Legal Defense Fund (tags: phone, politics)
    • UC San Diego Data Mining Competition – 2007 – Datasets (tags: housing, refinance, mortgage)
    • package – MoinMaster
    • Retail Industry Financial Ratios & Benchmarks (tags: retail, finance, sales, sqft)
    • stores | POI Factory (tags: retail, location, poi)
    • GpsPasSion Forums – ** INDEX OF POI COLLECTIONS ** (tags: retail, poi, location, gis, gps)
    • GPS POI US : Home > Retail Stores (tags: retail, location, gis)
    • Collective Dynamics Group (tags: smallworld, networking, socialnetwork, graph)
    • Jester Data download page (tags: collaborative, filtering, jokes)
    • TricTrac: Video Dataset (tags: video)
    • Premium Business Information Databases – AlacraWiki (tags: links, finance, commercial)
    • Index of /edgar (tags: finance, xml, edgar, sec, code, perl)
    • Mail Index (tags: EDGAR, sec, mail, text)
    • metafy / AnthraciteIdioms (tags: finance, SEC, scrape, parse, commercial)
    • Advance Monthly Sales for Retail and Food Services – Time Series Data/Seasonal Factors – 1992 to Present (tags: retail, sales, census)
    • TDT (tags: categorization, textmining, detection, tools)
    • Volume of retail sales: Social Trends 33 (tags: retail, sales, uk)
    • generatedata.com (tags: tools, generator, random)
    • U.S. Company Filings and Annual Reports (tags: finance, links, sec)
    • FTP Information – EDGAR Database (tags: edgar, finance, sec, filing, ftp, instructions)
    • Data Mining For Investing (tags: investing, finance, datamining, announcement, sec, filing, links)
    • Melissa DATA – Lookups (tags: consumer, data, database, api)
    • FactSet: Data Maven – Kiplinger.com (tags: factset, finance)
    • IBES (Demo) (tags: finance, ibes, analyst, forecast, wharton)
    • Thomson Financial I/B/E/S Data (tags: finance)
    • Historical Quotes – Yahoo! Finance (tags: yahoo, finance, stock, price)
    • Network data (tags: network, links)
    • Bureau of Labor Statistics Home Page (tags: statistics, labor, government, consumer)
    • NAR: Research: EHS Data (tags: housing, sales, finance)
    • RFA – The Industry – Industry Statistics (tags: ethanol)
    • Chain Store Guide – Retail Locations (tags: retail, finance, store, locations, gis)
    • Press Releases – Directions Magazine (tags: retail, gis, store, locations)
    • Energy Information Administration – EIA – Official Energy Statistics from the U.S. Government (tags: finance, government, energy, historical, forecasts, fuel, oil)
    • Databases you can use for benchmarking (tags: links)
    • UPC Database: Downloads (tags: product, upc, database)
    • Web Crawling / Crawl Datasets at Tobias Escher at the OII (tags: crawler, benchmark, search, web, links)
    • TechTC – Technion Repository of Text Categorization Datasets (tags: corpus, text)
    • TMC data archive download site (tags: traffic, data)
    • http://www.volvis.org/ (tags: volumerendering)
    • Computational Vision: Archive (tags: vision, caltech, imagerecognition)
    • DC Pedestrian Classification Benchmark (tags: pedestrian, image, classification, detection)
    • opentick :: home (tags: finance, economics, feed, free, stock, trading, opentick, opensource)
    • Web as Corpus (tags: textmining, corpus, concordance, wordlist, n-gram)
    • .:[ packet storm ]:. – http://packetstormsecurity.org/ (tags: dictionary, hack, security, wordlist, password)
    • Enron Dataset (tags: data, mysql, email, energy, text, socialnetwork)
    • Splog Blog Dataset (tags: blog, corpus, spam)
    • Home Page for 20 Newsgroups Data Set (tags: corpus, text, newsgroup)
    • White Glove Tracking (tags: crowdsourcing, image, processing, algorithm, collaborative, distributed, web2.0, code, opensource)
    • NOAA Paleoclimatology Program – Coral and Sclerosponge Data (tags: paleoclimatology, climate, oceanography, coral, sponge, biology)
    • NAICS — North American Industry Classification System (tags: finance, economics, naics, industry, classifications)
    • Saving Democracy With Web 2.0 – (tags: democracy, web2.0, mashup, government, funding, article)
    • Congresspedia – Congresspedia (tags: collaborative, wiki, government, congress, politics, elections, web2.0, directory)
    • Population Estimates Datasets (tags: census, data, population, statistics)
    • CRAN Task View: Machine Learning & Statistical Learning (tags: statisticallearning, machinelearning, code, R, libraries, cran)
    • Data for Data Mining (tags: linkd, datamining, timeseries, text, extraction, socialnetwork)
    • PAIDA – Pure Python scientific analysis package (tags: python, visualization, library)
    • SUBDUE – Graph Based Knowledge Discovery (tags: machinelearning, network, graph)
    • AOL search data mirrors (tags: aol, search)
    • Python Cheese Shop : shakespeare 0.4 (tags: python, text)
    • AG’s corpus of news articles (tags: corpus, nlp, machinelearning, textmining)
    • Sampling Techniques for Massive Data – Google Video (tags: video, machinelearning, statistics, matrix, sampling, large, sparse, algorithm, experiment_design, towatch)
    • metachronistic » Mirror the Wikipedia (tags: wikipedia, laptop, install, dump)
    • LETOR: Benchmark Datasets for Learning to Rank (tags: ranking, search)
    • CN710: Comparative Analysis of Learning Systems (Spring 2006) – Class Project (tags: machinelearning, algorithm, ogi, bu, greyhound, finance)
    • UrbanSim Home (tags: python, urban, software, simulation, opensource, GIS, census)
    • System One – Wikipedia³ (tags: wikipedia, rdf)
    • System One – Labs (tags: wikipedia, rdf, tools)
    • Face Recognition Homepage – Databases (tags: face, algorithm, facerecognition, data, image)
    • CBCL SOFTWARE Face data set (tags: face, seung, algorithm, recognition, image)
    • Text Analytics Solutions from ClearForest (tags: extraction, finance, semantic, semanticweb, text)
    • 23C3 – Mining Search Queries – Google Video (tags: aol, search, video, talk, algorithm, informationretrieval, datamining, machinelearning)
    • Digital History Hacks: Keywords and Clues (tags: aol, search, query, analysis)
    • Digital History Hacks: Searching for History (tags: aol, search, query, analysis)
    • The Tom Kyte Blog: An interesting data set… (tags: aol, search, oracle, database, code)
    • KDD 2005 – KDD Cup 2005: Aug 21-24, Chicago, IL. USA (tags: query, categorization, algorithm, google)
    • Statistical NLP / corpus-based computational linguistics resources (tags: corpus, machinelearning, text)
    • Ph.D.-student Rasmus Elsborg Madsen (tags: text, machinelearning, context, matlab)
    • Intelligent Web Search and Mining: Tools & Resources (tags: machinelearning, code, links)
    • PageRank Datasets and Code (tags: pagerank, code, algorithm)
    • Official Google Research Blog: All Our N-gram are Belong to You (tags: linguistics, google, ngram, nlp, record_linkage)
    • Hyper-threaded Java – Java World (tags: clustering, algorithm, java, parallel)
    • Statistical Modeling, Causal Inference, and Social Science (tags: blog, econometrics, finance, machinelearning, math, statistics)
    • Structural Analysis of Discrete Data and Econometric Applications, by Charles F. Manski and Daniel L. McFadden, MIT Press, 1981. (tags: books, econometrics, economics, finance, ebook)
    • Kris Brower » Archives » Google Onpage Search Results Analysis (tags: google, ranking, aol, search, analytics)
    • CSE 250B Fall 2006 (tags: netflixprize, machinelearning, course)
    • Matrix Market (tags: matrixmarket, matrix)
    • Analysis of incomplete datasets: Estimation of mean values and covariance matrices and imputation of missing values (tags: imputation, matlab, missing, EM, machinelearning)
    • Face Detection (tags: face, image)
    • CSE 250B Project 4, Fall 2006 (tags: subset, netflixprize, dimensionality, reduction)
    • G3DATA (tags: extract, from, graphs, hack, google, trends)
    • cwm – a general purpose data processor for the semantic web (tags: python, processor, semantic, web, rdf)
    • WebBase Project (tags: link, analysis, structure, web, crawler, stanford)
    • sam roweis : data (tags: machine, learning, matlab, python, hackers, image)
    • Index of /data/sequence/mnist (tags: mnist, xml, format)
    • MNIST handwritten digit database (tags: mnist)
    • Book-Crossing Dataset (tags: data, set, collaborative, filtering, datamining, books, movie)
    • allmovie (tags: movie, netflixprize, source)
    • Submissions Guidelines for the Collectorz.com Online Movie Database (tags: movie, source)
    • Cinema.com (tags: plot, synopsis, movie, netflixprize, prize)
    • LUMIERE (tags: netflixprize, prize, european, movie, revenue)
    • Data dumps – Meta (tags: mediawiki, wikipedia, import, mysql, sql)
    • “phone ***” ” address *” “e-mail” intitle:”curriculum vitae” – Google Search (tags: resume, google)

