Distributional Correspondence Indexing

Human Language Technologies (HLT), Istituto di Scienza e Tecnologie dell'Informazione "A. Faedo", Consiglio Nazionale delle Ricerche - Pisa, Italy


Overview

Distributional Correspondence Indexing (DCI) is a feature-representation-transfer method for domain adaptation (cross-domain and cross-lingual classification) that directly applies the Distributional Hypothesis to the concept of Pivot features.
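As a rough illustration of the idea, the sketch below represents each term by its correspondence to a small set of pivot terms, then builds document vectors from those term profiles. It is a toy under stated assumptions, not the implementation shipped in dci.jar or JATECS. The binary document-by-term matrix, the choice of cosine as the correspondence function, the L2 normalisation, and the function names `term_profiles` and `document_vector` are all illustrative.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def term_profiles(doc_term, pivot_idx):
    """Profile of each term: its correspondence to every pivot.

    doc_term  -- binary document-by-term matrix (list of rows); assumed input
    pivot_idx -- column indices of the pivot terms; assumed to be given
    """
    n_terms = len(doc_term[0])
    # A term's "distribution" here is simply the set of documents it occurs in.
    columns = [[row[j] for row in doc_term] for j in range(n_terms)]
    return [[cosine(columns[t], columns[p]) for p in pivot_idx]
            for t in range(n_terms)]


def document_vector(row, profiles):
    """Sum the profiles of the terms present in a document, then L2-normalise."""
    m = len(profiles[0])
    vec = [0.0] * m
    for t, present in enumerate(row):
        if present:
            for k in range(m):
                vec[k] += profiles[t][k]
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else vec


# Toy collection over the vocabulary [good, bad, excellent, awful];
# "good" (0) and "bad" (1) play the role of pivots.
docs = [
    [1, 0, 1, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 1, 0, 1],
    [1, 0, 0, 0],
]
profiles = term_profiles(docs, pivot_idx=[0, 1])
doc_vec = document_vector([0, 0, 1, 0], profiles)
```

In this toy collection "excellent" co-occurs only with the pivot "good", so its profile leans towards the first dimension, while "awful" leans towards the second. Since every profile is expressed against the same pivots, profiles computed separately on a source and a target collection would live in a common space, which is the property the method relies on for transfer.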

Source Code

The results reported in the paper were obtained with our implementation of the DCI algorithm within the JATECS framework (publicly available).

A (deprecated) stand-alone implementation of DCI is available here. The zip file contains two apps:

Requirements:

Example:

The following example illustrates how to use the English Books source domain to classify text documents from the German Books domain (cross-lingual adaptation). Note that all collections are encoded in UTF-8.

	java -Xmx5G -jar dci.jar
	-spath Datasets/WebisCorpora/EB                  #source domain path
	-str sourcetrain                                 #source labeled (train) collection 
	-su sourceunlabeled                              #source unlabeled collection
	-tpath Datasets/WebisCorpora/DB                  #target domain path
	-tts targettest                                  #target labeled (test) collection
	-tu targetunlabeled                              #target unlabeled collection
	-dist linear                                     #distributional model (others include pmi, mi, cosine, polynomial, and gauss)
	-s                                               #disable cross-consistency
	-m 100                                           #use 100 pivots
	-phi 30                                          #set pivot frequency support to 30  
	-nthread 8                                       #use 8 threads
	-clean                                           #discard words with fewer than 3 (occidental) characters
	-o outpath/EBDB                                  #set the output folder
	-d Datasets/WebisCorpora/dict/en_de_dict.txt     #specify the dictionary (available in Webis-CLS-10's site)
	

To train the classifier (SVMlight):

	/svm_learn.exe outpath/EBDB/training outpath/EBDB/model
	

To test the classifier:

	/svm_classify.exe outpath/EBDB/test outpath/EBDB/model outpath/EBDB/predictions
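
svm_classify already reports accuracy on the console, but it can be handy to recompute it from the files. The sketch below assumes the standard SVMlight conventions: each line of the test file starts with the target label, and the predictions file holds one decision value per line whose sign is the predicted class. The helper names are illustrative and not part of SVMlight or DCI.

```python
def read_gold_labels(lines):
    """Extract +1/-1 labels from SVMlight-formatted example lines."""
    labels = []
    for line in lines:
        # Drop comment lines and trailing "# info" strings.
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        target = float(line.split()[0])
        labels.append(1 if target > 0 else -1)
    return labels


def read_predictions(lines):
    """Turn SVMlight decision values into +1/-1 predictions by their sign."""
    return [1 if float(line) > 0 else -1 for line in lines if line.strip()]


def accuracy(gold, predicted):
    """Fraction of examples whose predicted label matches the gold label."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted differ in length")
    if not gold:
        raise ValueError("no examples to evaluate")
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)


# In-memory toy data standing in for the test and predictions files.
gold = read_gold_labels(["+1 1:0.5 3:0.2", "-1 2:0.1 # note", "# comment", "+1 4:1.0"])
pred = read_predictions(["0.73\n", "-1.2\n", "-0.1\n"])
acc = accuracy(gold, pred)
```

For the example above, the same functions can be applied to the lines read from outpath/EBDB/test and outpath/EBDB/predictions.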
	

Datasets

Publication


For any question, contact: A. Moreo, alejandro.moreo@isti.cnr.it