Procode: the Swiss Multilingual Solution for Automatic Coding and Recoding of Occupations and Economic Activities

by Nenad Savic, et al.

Objective. Epidemiological studies require data aligned with the established classifications of occupations or economic activities. These classifications usually include hundreds of codes and titles, so manual coding of raw data is time consuming and may result in misclassification. The goal was to develop and test a web tool, named Procode, for coding free-texts against classifications and for recoding between different classifications. Methods. Three text classifiers, i.e. Complement Naive Bayes (CNB), Support Vector Machine (SVM) and Random Forest Classifier (RFC), were investigated using k-fold cross-validation. A total of 30 000 free-texts with manually assigned codes from the French classification of occupations (PCS) and the French classification of activities (NAF) were available. For recoding, Procode integrated a workflow that converts codes of one classification to another according to existing crosswalks. Since this is a straightforward operation, only the recoding time was measured. Results. Among the three investigated text classifiers, CNB resulted in the best performance, where the classifier predicted accurately 57-81 [...] to somewhat lower results (by 1-2 [...] the data. The coding operation required one minute per 10 000 records, while recoding was faster, i.e. 5-10 seconds. Conclusion. The algorithm integrated in Procode showed satisfactory performance, given that the tool had to assign the right code by choosing among 500-700 candidate codes. Based on the results, the authors decided to implement CNB in Procode. In the future, if another classifier shows superior performance, an update will include the required modifications.
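
The recoding step described in the Methods, converting codes of one classification to another through a crosswalk, can be illustrated with a minimal sketch. This is not Procode's actual implementation: the function names `build_crosswalk` and `recode`, the input normalization, and the codes in `CROSSWALK_ROWS` are all hypothetical placeholders, and no real PCS or NAF crosswalk entries are reproduced.

```python
from collections import defaultdict

# Hypothetical crosswalk rows (source_code, target_code). The codes are
# illustrative placeholders, not entries from a real classification.
CROSSWALK_ROWS = [
    ("A1", "X10"),
    ("A2", "X20"),
    ("A2", "X21"),  # one source code may map to several target codes
]


def build_crosswalk(rows):
    """Index crosswalk rows as {source_code: [target_code, ...]}."""
    table = defaultdict(list)
    for source, target in rows:
        if target not in table[source]:  # skip duplicate rows
            table[source].append(target)
    return dict(table)


def recode(codes, crosswalk):
    """Return, for each input code, its list of target codes.

    Input codes are stripped and upper-cased before lookup (an assumed
    normalization). Unmapped codes yield an empty list.
    """
    return [crosswalk.get(code.strip().upper(), []) for code in codes]


crosswalk = build_crosswalk(CROSSWALK_ROWS)
result = recode(["A1", " a2 ", "ZZ"], crosswalk)
print(result)
```

Because recoding under this scheme is a dictionary lookup per record, its cost grows linearly with the number of records, which is consistent with the abstract treating it as a straightforward operation for which only the run time was measured.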


