Enhancing Address Data Integrity using Transformer-Based Language Models / Dönüştürücü Tabanlı Dil Modelleri Kullanarak Adres Veri Bütünlüğünün Geliştirilmesi

2024-01-01
Kürklü, Ömer Faruk
Akagündüz, Erdem
Address data integrity is critical in numerous applications, yet address data is often plagued by inaccuracies and inconsistencies, particularly in non-standardized formats. This study explores a novel application of transformer-based language models, traditionally used in language translation tasks, to the standardization and correction of Turkish address data. Leveraging the capabilities of Mixtral-8x7B, a state-of-the-art large language model, this research introduces a unique, handcrafted dataset of Turkish addresses. The dataset, derived from the National Address Dataset and enriched through ChatGPT-4 to simulate human-like input errors, was then used to fine-tune both TowerInstruct and T5 models, turning them into tools capable of converting faulty, error-laden address lines into standardized, structured, and corrected formats.
32nd IEEE Conference on Signal Processing and Communications Applications, SIU 2024
Ö. F. Kürklü and E. Akagündüz, “Enhancing Address Data Integrity using Transformer-Based Language Models Dönüştürücü Tabanlı Dil Modelleri Kullanarak Adres Veri Bütünlüğünün Geliştirilmesi,” presented at the 32nd IEEE Conference on Signal Processing and Communications Applications, SIU 2024, Mersin, Türkiye, 2024. [Online]. Available: https://www.scopus.com/inward/record.uri?partnerID=HzOxMe3b&scp=85200859936&origin=inward.