We are happy to announce that over the last month we have scraped more than 100,000 public procurement documents from the Greek Central Electronic Registry of Public Contracts (“Κεντρικό Ηλεκτρονικό Μητρώο Δημοσίων Συμβάσεων”, or ΚΗΜΔΗΣ in Greek). Using DEiXTo and Selenium, we managed to collect information on all tenders and contracts available since the launch of the public eProcurement platform almost a year ago. Please note that, besides the tenders RSS feed, a separate RSS feed is generated daily specifically for the latest contracts.
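For readers who would like to consume such a feed themselves, here is a minimal sketch of reading RSS items with Python's standard library. The sample feed, its URLs, and the helper names `parse_feed` and `new_items` are made up for illustration; the real ΚΗΜΔΗΣ feeds may expose different fields.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample feed: titles, links and dates are invented for this sketch.
SAMPLE_FEED = """
<rss version="2.0">
  <channel>
    <title>Latest contracts (sample)</title>
    <item>
      <title>Contract A</title>
      <link>https://example.org/contracts/1</link>
      <pubDate>Mon, 03 Feb 2014 10:00:00 +0200</pubDate>
    </item>
    <item>
      <title>Contract B</title>
      <link>https://example.org/contracts/2</link>
      <pubDate>Mon, 03 Feb 2014 11:30:00 +0200</pubDate>
    </item>
  </channel>
</rss>
"""


def parse_feed(xml_text):
    """Return one dict per <item> with its title, link and publication date."""
    root = ET.fromstring(xml_text.strip())
    items = []
    for item in root.iter("item"):
        items.append({
            "title": item.findtext("title", default="").strip(),
            "link": item.findtext("link", default="").strip(),
            "published": item.findtext("pubDate", default="").strip(),
        })
    return items


def new_items(items, seen_links):
    """Keep only the items whose link has not been processed before."""
    return [it for it in items if it["link"] not in seen_links]


if __name__ == "__main__":
    already_seen = {"https://example.org/contracts/1"}
    for entry in new_items(parse_feed(SAMPLE_FEED), already_seen):
        print(entry["title"], entry["link"])
```

Keeping a set of already seen links is one simple way to pick up only the documents published since the previous run.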
Fortunately, this significant amount of data has already been ingested by the popular yperdiavgeia.gr search engine (“ΥπερΔιαύγεια”), thus extending its reach beyond the Cl@rity Program (“Διαύγεια”). As a result, the lack of an API for ΚΗΜΔΗΣ and the limited search capabilities of the native e-procurement website have been offset to a large extent by a precise, DEiXTo-based scraping mechanism (which extracts the latest published documents) and the powerful full-text search functionality provided by yperdiavgeia.gr.
The volume of data at hand (we capture both the native HTML detail pages and the PDF files of tenders and contracts) has reached 34 GB of disk space and is expected to multiply in the coming year. For scalability and reliability reasons, we have chosen to store all these files in Amazon Glacier, a low-cost service that offers secure cloud storage for data archiving and backup. Glacier provides a simple REST web service that allows you to upload and retrieve data programmatically. Two great software packages that we used to interact with it were: a) mt-aws-glacier, a multithreaded client application written in Perl, and b) boto, a robust Python interface for AWS.
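To give a flavour of what archiving to Glacier with boto can look like, here is a minimal sketch rather than the exact script behind this archive. It assumes the legacy boto 2 package (its `boto.glacier` module), AWS credentials already configured for boto, and an existing vault. The vault name, region, directory path, and the helper names `iter_archive_files` and `upload_directory` are illustrative assumptions.

```python
import os


def iter_archive_files(root, suffixes=(".html", ".pdf")):
    """Yield paths under root whose extension matches, in a stable sorted order."""
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # make the traversal order deterministic
        for name in sorted(filenames):
            if name.lower().endswith(suffixes):
                yield os.path.join(dirpath, name)


def upload_directory(root, vault_name, region="eu-west-1"):
    """Upload every matching file to a Glacier vault; return {path: archive_id}."""
    # Imported here so the helper above works even without boto installed.
    import boto.glacier

    layer2 = boto.glacier.connect_to_region(region)
    vault = layer2.get_vault(vault_name)  # assumes the vault already exists

    archive_ids = {}
    for path in iter_archive_files(root):
        # Glacier descriptions should be printable ASCII, so plain file names
        # are assumed here.
        description = os.path.relpath(path, root)
        archive_ids[path] = vault.upload_archive(path, description=description)
    return archive_ids


if __name__ == "__main__":
    # Hypothetical directory and vault name.
    ids = upload_directory("/data/procurement", "procurement-archive")
    print("Uploaded %d archives" % len(ids))
```

It is worth persisting the returned archive IDs somewhere safe, since they are needed to request a retrieval later.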
Finally, we are excited to let you know that a new “alliance” is about to launch. An innovative research initiative called publicspending.net is going to be the next heavy client consuming e-procurement data retrieved with DEiXTo. We will elaborate on this in a future post. In conclusion, we are thrilled to see that our data extraction mechanism for ΚΗΜΔΗΣ, in its short history, has attracted interest from developers and sparked some new, challenging ideas. We hope that it will continue to be used creatively and will foster transparency in public expenditure.
Open data can be a goldmine!