An Algorithm for Matching Heterogeneous Financial Databases: a Case Study for COMPUSTAT/CRSP and I/B/E/S Databases

Irene Rodriguez-Lujan, Ramon Huerta


Rigorous linking of financial databases is a necessary step for testing trading strategies that incorporate multimodal sources of information. This paper proposes a machine learning solution for matching companies across heterogeneous financial databases. Our method, named Financial Attribute Selection Distance (FASD), has two stages, one for each of the two interrelated tasks commonly involved in heterogeneous database matching problems: schema matching and entity matching. FASD's schema matching procedure is based on the Kullback-Leibler divergence of string and numeric attributes. FASD's entity matching solution relies on learning a company distance that is flexible enough to handle the numeric and string attribute links found by the schema matching algorithm and to incorporate different string matching approaches, such as edit-based and token-based metrics. The parameters of the distance are optimized using the F-score as the cost function. FASD is able to match the joint Compustat/CRSP and Institutional Brokers' Estimate System (I/B/E/S) databases with an F-score over 0.94 using only a hundred manually labeled company links.
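As an illustration of the building blocks named in the abstract, the following Python sketch computes a Kullback-Leibler divergence between two string attributes, one edit-based and one token-based string distance, and the F-score of a set of predicted company links. It is not the FASD implementation: the abstract does not say how attribute distributions are estimated or how the distances are combined, so the smoothed character-frequency model, the choice of Levenshtein and Jaccard distances, and all function names here are assumptions made for the example.

```python
import math
from collections import Counter

ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789 "  # assumed character set


def char_distribution(values, alphabet=ALPHABET, smoothing=1.0):
    """Smoothed character-frequency distribution of a string attribute.

    Assumption: the paper's estimator is not given in the abstract;
    additive smoothing keeps every probability strictly positive.
    """
    counts = Counter(ch for v in values for ch in v.upper() if ch in alphabet)
    total = sum(counts.values()) + smoothing * len(alphabet)
    return {ch: (counts[ch] + smoothing) / total for ch in alphabet}


def kl_divergence(p, q):
    """Kullback-Leibler divergence D(p || q) over a shared support."""
    return sum(p[k] * math.log(p[k] / q[k]) for k in p)


def levenshtein(a, b):
    """Edit-based distance: minimum number of single-character edits."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def jaccard_distance(a, b):
    """Token-based distance: 1 minus the Jaccard overlap of word sets."""
    ta, tb = set(a.upper().split()), set(b.upper().split())
    if not ta and not tb:
        return 0.0
    return 1.0 - len(ta & tb) / len(ta | tb)


def f_score(predicted, actual):
    """F-score (harmonic mean of precision and recall) of predicted links."""
    tp = len(predicted & actual)
    if tp == 0:
        return 0.0
    precision = tp / len(predicted)
    recall = tp / len(actual)
    return 2 * precision * recall / (precision + recall)
```

In a schema matching setting, one would compute `kl_divergence` between the distribution of an attribute in one database and each candidate attribute in the other, and link the pair with the smallest divergence. In entity matching, a learned distance could weight `levenshtein` and `jaccard_distance` over the linked attributes, with the weights tuned to maximize `f_score` on the labeled links. Both uses are hypothetical readings of the abstract.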








Applied Economics and Finance    ISSN 2332-7294 (Print)   ISSN 2332-7308 (Online)

Copyright © Redfame Publishing Inc.
