Abstract
A styled document is a document that uses formatting features such as varied fonts, colors, tables, and images, and is typically authored in a word processor (e.g., MS-WORD, Open Office). Compared with a plain-text document, a styled document lets a human reader easily recognize its logical structure, such as sections, subsections, and contents. A computer, however, has difficulty recognizing that structure unless the writer explicitly specifies the type of each element with the word processor's styling functions. This is an obstacle to improving document version management systems, which currently manage a document with the file as the unit rather than the individual document elements. This paper proposes a machine-learning-based approach to analyzing the logical structure of a styled document composed of sections, subsections, and contents. We first propose a feature vector for characterizing the elements of a styled document; it consists of eight features, such as font size, indentation, and period, each of which occurs frequently in styled documents. We then train machine learning classifiers, such as Random Forest and Support Vector Machine, on this feature vector. The trained classifiers are used to automatically identify the logical structure of a styled document. In our experiment on 50 styled documents, the approach achieved 92.78% precision and 94.02% recall.
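As a rough illustration of the pipeline described above, the following minimal sketch encodes a document element as a feature vector and computes per-label precision and recall. It is not the authors' implementation. The `Element` schema and the function names are hypothetical. Only three of the eight features (font size, indentation, period) are named in the abstract, so only those are encoded. The "period" feature is assumed here to mean "the element's text ends with a period".

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass
class Element:
    """One paragraph-level element of a styled document (hypothetical schema)."""
    text: str
    font_size: float    # font size in points
    indentation: float  # left indentation in points


def to_feature_vector(element: Element) -> List[float]:
    """Encode the three features named in the abstract.

    Assumption: the 'period' feature is 1.0 when the text ends with a period.
    The other five features are not listed in the abstract and are omitted.
    """
    ends_with_period = 1.0 if element.text.rstrip().endswith(".") else 0.0
    return [element.font_size, element.indentation, ends_with_period]


def precision_recall(
    y_true: Sequence[str], y_pred: Sequence[str], label: str
) -> Tuple[float, float]:
    """Return (precision, recall) for one label, e.g. 'section'."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall
```

In such a setup, the vectors from `to_feature_vector` would be passed, together with labels such as "section", "subsection", and "content", to a classifier of the kind the abstract mentions (Random Forest or Support Vector Machine). `precision_recall` would then be applied to the classifier's predictions on held-out documents.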
| Original language | English |
|---|---|
| Pages (from-to) | 1043-1056 |
| Number of pages | 14 |
| Journal | KSII Transactions on Internet and Information Systems |
| Volume | 11 |
| Issue number | 2 |
| State | Published - 2017.02.28 |
Keywords
- Document management system
- Feature vector
- Logical structure analysis
- Machine learning
Quacquarelli Symonds(QS) Subject Topics
- Computer Science & Information Systems
Paper title: A machine-learning based approach for extracting logical structure of a styled document