Scientists at the Cornell Center for Advanced Computing have used a supercomputer to download and analyse nearly 35 million Flickr photos from more than 300,000 photographers worldwide. Their main goal was to develop new methods to automatically organise and label large-scale collections of digital data. A secondary result of the research was a set of statistics on the world's most photographed cities and landmarks, gleaned from the analysis of the multi-terabyte photo collection: New York, London and San Francisco were the top three cities, and the Eiffel Tower, Trafalgar Square and London's Tate Modern the top three landmarks.
Cornell developed techniques to automatically identify places that people find interesting to photograph, showing results for thousands of locations at both city and landmark scales. 'We developed classification methods for characterising these locations from visual, textual and temporal features,' says Daniel Huttenlocher, the John P and Rilla Neafsey Professor of Computing, Information Science and Business and Stephen H Weiss Fellow. 'These methods reveal that both visual and temporal features improve the ability to estimate the location of a photo compared to using just textual tags.'
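The article does not show how such a classifier is assembled, but the idea can be sketched in a few lines of Python. The snippet below is a minimal illustration rather than the Cornell implementation: it concatenates TF-IDF features from textual tags with placeholder visual and temporal features and trains a linear classifier to predict a landmark label. The example photos, feature values, labels and the choice of LinearSVC are all assumptions made for illustration.

# Minimal sketch of location classification from combined features.
# The features and labels here are illustrative placeholders, not the
# exact representation used in the Cornell study.
from scipy.sparse import hstack, csr_matrix
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Toy examples: each photo has textual tags, a small visual descriptor
# (e.g. a colour histogram) and a temporal feature (hour of day).
photos = [
    {"tags": "eiffel tower paris night", "visual": [0.1, 0.7, 0.2], "hour": 21, "label": "eiffel_tower"},
    {"tags": "paris tower lights",       "visual": [0.2, 0.6, 0.2], "hour": 22, "label": "eiffel_tower"},
    {"tags": "trafalgar square london",  "visual": [0.5, 0.1, 0.4], "hour": 14, "label": "trafalgar_square"},
    {"tags": "london square fountain",   "visual": [0.6, 0.1, 0.3], "hour": 13, "label": "trafalgar_square"},
]

# Textual features: TF-IDF over the photo tags.
vectorizer = TfidfVectorizer()
X_text = vectorizer.fit_transform(p["tags"] for p in photos)

# Visual and temporal features, stacked alongside the text features.
X_visual = csr_matrix([p["visual"] for p in photos])
X_time = csr_matrix([[p["hour"] / 24.0] for p in photos])
X = hstack([X_text, X_visual, X_time])
y = [p["label"] for p in photos]

# A linear classifier over the combined feature vector; the study's
# finding was that adding visual and temporal features improves on
# textual tags alone.
clf = LinearSVC().fit(X, y)
print(clf.predict(X))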
Cornell's technique of finding representative images offers a practical way of summarising large photo collections. The method's scalability allows it to automatically mine the information latent in very large sets of images, raising the intriguing possibility of an online travel guidebook that could automatically identify the best sites to visit on your next vacation, as judged by the collective wisdom of the world's photographers.
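One simple way to realise the idea of a representative image (a sketch under assumptions, not necessarily the method used in the study) is to pick, for each location, the photo whose visual descriptor lies closest to the mean descriptor of all photos taken there. The helper below assumes precomputed descriptors; the values are placeholders.

# Sketch of picking a "representative" photo for a location: choose the
# image whose descriptor lies closest to the mean descriptor of the
# location's photos. Descriptors below are illustrative placeholders.
import numpy as np

def representative_photo(descriptors):
    """Return the index of the descriptor nearest the cluster mean."""
    descriptors = np.asarray(descriptors, dtype=float)
    centroid = descriptors.mean(axis=0)
    distances = np.linalg.norm(descriptors - centroid, axis=1)
    return int(np.argmin(distances))

# Toy visual descriptors for photos of one landmark.
cluster = [
    [0.10, 0.72, 0.18],
    [0.12, 0.70, 0.18],
    [0.40, 0.30, 0.30],   # an atypical shot, e.g. a close-up
]
print(representative_photo(cluster))   # index of the most typical view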
To perform the data analysis, the researchers used a mean shift procedure and ran their application on a 480-core Linux-based Dell PowerEdge 2950 supercomputer at the Cornell Center for Advanced Computing (CAC) called the 'Hadoop Cluster'. Hadoop is a framework used to run applications on large clusters of computers. It uses a computational paradigm called Map/Reduce to divide applications into small segments of work, each of which can be executed on any node of the cluster. 'As the creation of digital data accelerates,' says CAC Director David Lifka, 'supercomputers and high-performance storage systems will be essential in order to quickly store, archive, preserve, and retrieve large-scale data collections.'
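For readers unfamiliar with mean shift, the sketch below shows the basic step at toy scale: clustering photo geotags so that each cluster centre marks a heavily photographed spot. It uses scikit-learn's MeanShift rather than the researchers' Hadoop Map/Reduce pipeline, and the bandwidth and coordinates are illustrative assumptions rather than values from the study.

# Minimal sketch of the mean-shift step: cluster photo geotags so that
# each cluster centre marks a heavily photographed location. The
# coordinates are illustrative; the actual analysis ran as Map/Reduce
# jobs over tens of millions of photos on the Hadoop Cluster.
import numpy as np
from sklearn.cluster import MeanShift

# Toy (latitude, longitude) pairs: two tight groups of photos.
geotags = np.array([
    [48.8584, 2.2945], [48.8586, 2.2948], [48.8583, 2.2940],    # near the Eiffel Tower
    [51.5080, -0.1281], [51.5082, -0.1283], [51.5079, -0.1280], # near Trafalgar Square
])

# The bandwidth controls the spatial scale of a "place"; this value is
# an assumption for the sketch, not the one used in the study.
ms = MeanShift(bandwidth=0.05).fit(geotags)

for centre, size in zip(ms.cluster_centers_, np.bincount(ms.labels_)):
    print(f"hotspot at {centre.round(4)} with {size} photos")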