Open-source machine learning algorithms for data mining tasks.

Weka is an open-source, Java-based software package that uses machine learning algorithms to mine large datasets.

Weka provides tools for pre-processing, classification, regression, clustering, and association rule mining, along with tools to visualize the resulting insights.

The tool offers three main interfaces: a panel-based workbench called Explorer, the component-based Knowledge Flow, and a command line that developers can use to access further functionality.

To process data, Weka uses a file format called ARFF (Attribute-Relation File Format), which is essentially an ASCII text file describing a list of instances that share a set of attributes. Every entry in a dataset is an instance of the Java class 'weka.core.Instance', and each instance contains specific attributes, each of which describes one aspect of that instance.


Image: An example of instances and attributes in an ARFF file.

Once this file is imported into Weka, as shown in the following screenshot, it is clear that it contains fourteen instances of weather-related data and five attributes associated with these instances.
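To make the instance and attribute counts concrete, here is a minimal sketch of reading an ARFF file by hand in Python. It is not Weka's own parser, and it assumes the file resembles the nominal weather sample that ships with Weka (weather.nominal.arff); the function name parse_arff is a hypothetical helper, and quoted values, sparse data, and date attributes are deliberately not handled.

```python
# A minimal ARFF reader: an illustrative sketch, not Weka's parser.
# Assumption: the data below mirrors Weka's bundled nominal weather sample.
ARFF_TEXT = """\
@relation weather.symbolic

@attribute outlook {sunny, overcast, rainy}
@attribute temperature {hot, mild, cool}
@attribute humidity {high, normal}
@attribute windy {TRUE, FALSE}
@attribute play {yes, no}

@data
sunny,hot,high,FALSE,no
sunny,hot,high,TRUE,no
overcast,hot,high,FALSE,yes
rainy,mild,high,FALSE,yes
rainy,cool,normal,FALSE,yes
rainy,cool,normal,TRUE,no
overcast,cool,normal,TRUE,yes
sunny,mild,high,FALSE,no
sunny,cool,normal,FALSE,yes
rainy,mild,normal,FALSE,yes
sunny,mild,normal,TRUE,yes
overcast,mild,high,TRUE,yes
overcast,hot,normal,FALSE,yes
rainy,mild,high,TRUE,no
"""


def parse_arff(text):
    """Return (relation, attributes, instances) for a simple dense ARFF text."""
    relation = None
    attributes = []  # list of (name, type specification) pairs
    instances = []   # one dict per data row, keyed by attribute name
    in_data = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):  # skip blanks and comments
            continue
        if in_data:
            values = [v.strip() for v in line.split(",")]
            names = [name for name, _ in attributes]
            instances.append(dict(zip(names, values)))
            continue
        keyword, _, rest = line.partition(" ")
        keyword = keyword.lower()  # ARFF keywords are case-insensitive
        if keyword == "@relation":
            relation = rest.strip()
        elif keyword == "@attribute":
            name, _, type_spec = rest.strip().partition(" ")
            attributes.append((name, type_spec.strip()))
        elif keyword == "@data":
            in_data = True
    return relation, attributes, instances


relation, attributes, instances = parse_arff(ARFF_TEXT)
print(len(instances), "instances,", len(attributes), "attributes")
# prints: 14 instances, 5 attributes
```

Each parsed row plays the role that a 'weka.core.Instance' object plays inside Weka: one record whose values line up with the declared attributes.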


From here, data can be processed further to reveal interesting patterns via cluster analysis, outlying records through anomaly detection, dependent relationships by applying association rule mining, and much more.
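As a rough illustration of what association rule mining measures, the sketch below computes support and confidence for a single rule over the outlook and play columns of the nominal weather sample. It is a hand-rolled calculation under that assumption, not Weka's Apriori implementation, and rule_stats is a hypothetical helper name.

```python
# Support and confidence for one rule: an illustrative sketch only.
# Assumption: PAIRS holds the (outlook, play) columns of Weka's
# 14-instance nominal weather sample.
PAIRS = [
    ("sunny", "no"), ("sunny", "no"), ("overcast", "yes"), ("rainy", "yes"),
    ("rainy", "yes"), ("rainy", "no"), ("overcast", "yes"), ("sunny", "no"),
    ("sunny", "yes"), ("rainy", "yes"), ("sunny", "yes"), ("overcast", "yes"),
    ("overcast", "yes"), ("rainy", "no"),
]


def rule_stats(pairs, antecedent, consequent):
    """Return (support, confidence) for the rule antecedent => consequent."""
    n_antecedent = sum(1 for a, _ in pairs if a == antecedent)
    n_both = sum(1 for a, c in pairs if a == antecedent and c == consequent)
    support = n_both / len(pairs)  # share of all rows matching both sides
    confidence = n_both / n_antecedent if n_antecedent else 0.0
    return support, confidence


support, confidence = rule_stats(PAIRS, "overcast", "yes")
print(f"outlook=overcast => play=yes: support={support:.3f}, confidence={confidence:.2f}")
# prints: outlook=overcast => play=yes: support=0.286, confidence=1.00
```

Here every overcast day in the sample has play=yes, so the rule's confidence is 1.0 even though it covers only 4 of the 14 instances.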

Once you are done sifting through your data, Weka can plot the results for further analysis or publication. A series of custom visualization plugins is also available to help you tell your data's story in the best way possible.

Example: Mining Indian socio-economic data

Researchers at the Swami Vivekanand Institute of Technology and Management, Udgir used Weka to explore socio-economic data from the Latur district in the Indian state of Maharashtra.

To begin, they imported 2001 census data, socio-economic data, and data from the National Informatics Centre. The following screenshot illustrates that their ARFF file contains data instances from 729 villages with 25 different attributes.


The researchers then processed these instances to compare the male_percent_literacy, female_percent_literacy and sex_ratio attributes. From this, they were able to extract the impact of literacy on gender inequality, and visualize these insights graphically.
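One simple way to compare two numeric attributes such as female_percent_literacy and sex_ratio is a Pearson correlation coefficient. The sketch below is an assumption about how such a comparison could be done, not the researchers' documented method, and the rows are invented placeholders rather than values from the Latur dataset.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# Placeholder rows, invented for illustration only.
# These are NOT values from the study described above.
villages = [
    {"female_percent_literacy": 35.0, "sex_ratio": 910.0},
    {"female_percent_literacy": 42.0, "sex_ratio": 925.0},
    {"female_percent_literacy": 50.0, "sex_ratio": 918.0},
    {"female_percent_literacy": 58.0, "sex_ratio": 940.0},
]

literacy = [v["female_percent_literacy"] for v in villages]
ratio = [v["sex_ratio"] for v in villages]
r = pearson(literacy, ratio)
print(f"correlation on placeholder data: {r:.2f}")
```

A coefficient near +1 or -1 would suggest a strong linear relationship between the two attributes, while a value near 0 would suggest little linear association; on real data, it would take the full 729-village file to say anything meaningful.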


Image: Female literacy and Sex ratio values.

Want to learn more?

The University of Waikato runs a MOOC on using Weka for data mining, and several tutorial courses are also available on YouTube.

Visit the Weka website here.