Milestone: 100K e-procurement documents scraped

We are happy to announce that over the last month we have scraped more than 100,000 public procurement documents from the Greek Central Electronic Registry of Public Contracts (“Κεντρικό Ηλεκτρονικό Μητρώο Δημοσίων Συμβάσεων”, or ΚΗΜΔΗΣ in Greek). Through efficient use of DEiXTo and Selenium, we managed to collect information for all tenders and contracts available since the launch of the public eProcurement platform almost a year ago. Please note that, besides the tenders RSS feed, a separate RSS feed is generated daily specifically for the latest contracts.
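For readers who want to consume such a feed, here is a minimal sketch of how new items could be picked out of an RSS 2.0 document using only the Python standard library. The sample feed, its titles and links, and the `new_items` helper are hypothetical and for illustration only; the actual feed URLs and fields are not shown here, and this is not the DEiXTo-based mechanism itself.

```python
import xml.etree.ElementTree as ET
from email.utils import parsedate_to_datetime

# Hypothetical sample feed; the real contracts feed may carry different fields.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Latest contracts (hypothetical sample)</title>
    <item>
      <title>Contract A</title>
      <link>http://example.org/contracts/1</link>
      <pubDate>Mon, 03 Feb 2014 10:00:00 +0200</pubDate>
    </item>
    <item>
      <title>Contract B</title>
      <link>http://example.org/contracts/2</link>
      <pubDate>Tue, 04 Feb 2014 09:30:00 +0200</pubDate>
    </item>
  </channel>
</rss>"""


def parse_items(feed_xml):
    """Return a list of dicts (title, link, published) for each RSS item."""
    root = ET.fromstring(feed_xml)
    items = []
    for item in root.iter("item"):
        items.append({
            "title": item.findtext("title", default="").strip(),
            "link": item.findtext("link", default="").strip(),
            # RSS 2.0 dates use the RFC 822 format, which email.utils can parse.
            "published": parsedate_to_datetime(item.findtext("pubDate")),
        })
    return items


def new_items(feed_xml, seen_links):
    """Keep only the items whose link has not been processed before."""
    return [it for it in parse_items(feed_xml) if it["link"] not in seen_links]


if __name__ == "__main__":
    already_seen = {"http://example.org/contracts/1"}
    for it in new_items(SAMPLE_FEED, already_seen):
        print(it["published"].isoformat(), it["title"], it["link"])
```

Tracking the links already seen keeps a daily poll idempotent: running it twice over the same feed does not fetch the same document twice.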

Happily, this significant amount of data has already been ingested by the popular yperdiavgeia.gr search engine (“ΥπερΔιαύγεια”), thus extending its reach beyond the Cl@rity Program (“Διαύγεια”). The lack of an API for ΚΗΜΔΗΣ and the limited search capabilities of the native e-procurement website have been covered to a large extent by two components: a precise, DEiXTo-based scraping mechanism that extracts the latest published documents, and the powerful full-text search functionality provided by yperdiavgeia.gr.

The volume of data at hand (we capture both the native HTML detail pages and the PDF files of tenders and contracts) has reached 34GB of disk space and is expected to multiply in the coming year. For scalability and reliability reasons, we have chosen to store all these files in Amazon Glacier, a low-cost service that offers secure cloud storage for data archiving and backup. Glacier provides a simple REST web service that allows you to upload and retrieve data programmatically. Two great software packages that we used for interacting with it were: a) mt-aws-glacier, a multithreaded client application written in Perl, and b) boto, a robust Python interface for AWS.
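As a small illustration of what "programmatic" upload involves, Glacier's REST API expects each upload to carry a SHA-256 tree hash of the payload: the data is split into 1 MiB chunks, each chunk is hashed, and adjacent hashes are combined pairwise until one root remains. The sketch below is our own stand-alone rendering of that checksum and assumes the whole file fits in memory. Client libraries such as the ones mentioned above typically compute it for you, so the `tree_hash` function is illustrative rather than part of our pipeline.

```python
import hashlib

MIB = 1024 * 1024  # Glacier tree hashes are built over 1 MiB chunks


def tree_hash(data: bytes) -> str:
    """Return the hex SHA-256 tree hash of `data`."""
    # Leaf level: one SHA-256 digest per 1 MiB chunk (a single digest of
    # the empty string if there is no data at all).
    chunks = [data[i:i + MIB] for i in range(0, len(data), MIB)] or [b""]
    level = [hashlib.sha256(chunk).digest() for chunk in chunks]

    # Combine neighbouring digests pairwise until a single root remains.
    while len(level) > 1:
        next_level = []
        for i in range(0, len(level), 2):
            if i + 1 < len(level):
                pair = level[i] + level[i + 1]
                next_level.append(hashlib.sha256(pair).digest())
            else:
                # An unpaired digest is carried up to the next level unchanged.
                next_level.append(level[i])
        level = next_level
    return level[0].hex()


if __name__ == "__main__":
    payload = b"x" * (3 * MIB + 17)  # stand-in for a scraped PDF
    print(tree_hash(payload))
```

For payloads of 1 MiB or less, the tree hash is simply the plain SHA-256 of the data, which makes the function easy to sanity-check.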

Finally, we are excited to let you know that a new “alliance” is about to launch. An innovative research initiative called publicspending.net is going to be the next heavy client consuming e-procurement data retrieved with DEiXTo. We will elaborate on this in a future post. In conclusion, we are thrilled to see that our data extraction mechanism for ΚΗΜΔΗΣ, in its short history, has attracted interest from developers and sparked some new, challenging ideas. We hope that it will continue to be utilised creatively and will foster transparency in public expenditure. Open data can be a goldmine!
