“GUI DEiXTo” feature list
- user-friendly graphical interface – no programming required
- enhanced, tree-based extraction rules (wrappers)
- HTML tag filtering (sometimes, ignoring some tags makes life easier)
- tolerates structural variations in the HTML source code of the record instances
- fast, flexible, high-performance tree pattern matching algorithm
- most of the time, 100% precision and recall can be achieved
- automatic simple form submission
- multi-record, multi-page, many-URLs extraction modes
- regular expression support
- can follow “Next Page” links with adjustable crawling depth
- can create RSS feeds from any web source
- can export results to XML and tab-delimited formats
- can extract text, URLs and HTML source code
- XML-encoded wrapper project files (.wpf) – can be executed at will
- wrapper files are compatible with DEiXTo Executor
- command line execution, so extraction tasks can be scheduled (e.g. with the Windows Task Scheduler)
- last but not least, it’s freeware!
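
To illustrate what "adjustable crawling depth" means when following “Next Page” links, here is a minimal Python sketch. It is not DEiXTo's own code: the function `follow_next_pages`, the in-memory `SITE` dictionary and the `next=` marker are all hypothetical names made up for this example, and real pages would be fetched over HTTP and parsed as HTML.

```python
from typing import Callable, List, Optional, Tuple

def follow_next_pages(
    start_url: str,
    fetch: Callable[[str], str],
    find_next: Callable[[str], Optional[str]],
    max_depth: int,
) -> List[Tuple[str, str]]:
    """Visit start_url, then follow at most max_depth "Next Page" links."""
    visited: List[Tuple[str, str]] = []
    seen = set()          # guards against pagination loops
    url: Optional[str] = start_url
    depth = 0             # 0 = the start page itself
    while url is not None and url not in seen and depth <= max_depth:
        html = fetch(url)
        visited.append((url, html))
        seen.add(url)
        url = find_next(html)   # None when there is no further page
        depth += 1
    return visited

# Hypothetical four-page "site" used only for this demonstration.
SITE = {
    "page1": "records A, B | next=page2",
    "page2": "records C, D | next=page3",
    "page3": "records E | next=page4",
    "page4": "records F",
}

def fake_fetch(url: str) -> str:
    return SITE[url]

def fake_find_next(html: str) -> Optional[str]:
    marker = "next="
    if marker not in html:
        return None
    return html.split(marker, 1)[1].strip()

if __name__ == "__main__":
    for url, _ in follow_next_pages("page1", fake_fetch, fake_find_next, max_depth=2):
        print(url)
```

With `max_depth=2` the sketch visits the start page plus two further pages; with `max_depth=0` it stays on the start page only.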
“DEiXTo Executor” feature list
- portable, efficient and fast command line executor of GUI DEiXTo wrappers
- provides options and flexibility that you cannot get with GUI DEiXTo
- supports additional output formats such as CSV, Excel and OpenDocument Spreadsheet (.ods)
- provides database support via DBI (the Database independent interface for Perl) and a dbconfig file
- supports HTML output using an HTML template processor and an editable template file
- command line options can override those in .wpf files
- overwrite, append and prepend output modes for all supported formats
- proxy support
- can be scheduled to execute wrappers automatically (e.g. using cron in GNU/Linux)
- can sleep for random time intervals between HTTP requests to avoid making webmasters mad
- it is free and open source, distributed under the GNU General Public License (GPL) Version 3!
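
The idea of sleeping for random intervals between HTTP requests can be sketched in a few lines of Python. This is only an illustration of the technique, not DEiXTo Executor's implementation: `fetch_politely` and its parameters (`min_delay`, `max_delay`, `sleep`, `rng`) are hypothetical names chosen for this example, and the `fetch` callable stands in for whatever actually downloads a page.

```python
import random
import time
from typing import Callable, Iterable, List, Optional

def fetch_politely(
    urls: Iterable[str],
    fetch: Callable[[str], str],
    min_delay: float = 1.0,
    max_delay: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
    rng: Optional[random.Random] = None,
) -> List[str]:
    """Fetch each URL in turn, pausing a random time between requests."""
    if min_delay < 0 or max_delay < min_delay:
        raise ValueError("need 0 <= min_delay <= max_delay")
    rng = rng or random.Random()
    results: List[str] = []
    for index, url in enumerate(urls):
        if index > 0:
            # Pause before every request except the first one.
            sleep(rng.uniform(min_delay, max_delay))
        results.append(fetch(url))
    return results

if __name__ == "__main__":
    # Stand-in fetcher so the demo runs without network access.
    pages = fetch_politely(
        ["page1", "page2", "page3"],
        fetch=lambda url: "<html>" + url + "</html>",
        min_delay=0.1,
        max_delay=0.3,
    )
    print(len(pages), "pages fetched")
```

Passing `sleep` and `rng` as parameters keeps the sketch easy to test, since a fake `sleep` can record the delays instead of actually waiting.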
“I use DEiXTo almost as a daily tool and have hundreds of configuration files. DEiXTo is the swiss army knife of web scraping! I am very glad to have found DEiXTo and I hope the development continues.”