A machine-learning based approach for extracting logical structure of a styled document

  • Tae Young Kim
  • Suntae Kim*
  • Sangchul Choi
  • Jeong Ah Kim
  • Jae Young Choi
  • Jong Won Ko
  • Jee Huong Lee
  • Youngwha Cho

  *Corresponding author for this work

    Research output: Contribution to journal, Journal article, peer-review

    Abstract

    A styled document is a document that contains diverse decorative elements such as different fonts, colors, tables and images, and is generally authored in a word processor (e.g., MS-WORD, Open Office). Compared to a plain-text document, a styled document enables a human to easily recognize its logical structure, such as sections, subsections and contents. However, it is difficult for a computer to recognize the structure if the writer does not explicitly specify the type of each element by using the styling functions of the word processor. This is one of the obstacles to enhancing document version management systems, because they currently manage a document with the file, not the document element, as the unit of management. This paper proposes a machine-learning-based approach to analyzing the logical structure of a styled document composed of sections, subsections and contents. We first suggest a feature vector for characterizing the document elements of a styled document, composed of eight features such as font size, indentation and period, each of which is frequently found in styled documents. Then, we train machine learning classifiers such as Random Forest and Support Vector Machine using the suggested feature vector. The trained classifiers are used to automatically identify the logical structure of a styled document. Our experiment obtained 92.78% precision and 94.02% recall in analyzing the logical structure of 50 styled documents.
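
    As a rough illustration of the feature-vector idea, the sketch below maps a document element to a numeric vector using only the three features the abstract names (font size, indentation and period). The `Element` type, the `feature_vector` function, the normalization by body font size, and the reading of "period" as "the text ends with a period" are all assumptions made for this example; the remaining five features are not listed in the abstract and are omitted rather than guessed.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Element:
    """One paragraph-like element of a styled document (hypothetical type)."""
    text: str
    font_size_pt: float  # font size in points
    indent_pt: float     # left indentation in points


def feature_vector(element: Element, body_font_size_pt: float) -> List[float]:
    """Build a three-feature vector for one element.

    Assumptions (not taken from the paper): font size is expressed relative
    to the body font size, and "period" means the text ends with a period.
    """
    stripped = element.text.strip()
    return [
        element.font_size_pt / body_font_size_pt,  # relative font size
        element.indent_pt,                         # indentation
        1.0 if stripped.endswith(".") else 0.0,    # ends with a period
    ]


# Two made-up elements: a heading-like line and a body-like sentence.
elements = [
    Element("1. Introduction", 16.0, 0.0),
    Element("Styled documents are authored in a word processor.", 10.0, 12.0),
]
vectors = [feature_vector(e, body_font_size_pt=10.0) for e in elements]
```

    Vectors of this shape, paired with labels such as section, subsection and contents, could then be passed to the `fit` method of an off-the-shelf classifier (for example scikit-learn's `RandomForestClassifier` or `SVC`); which implementation the authors used is not stated in the abstract.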

    Original language: English
    Pages (from-to): 1043-1056
    Number of pages: 14
    Journal: KSII Transactions on Internet and Information Systems
    Volume: 11
    Issue number: 2
    DOIs
    State: Published - 2017.02.28

    Keywords

    • Document management system
    • Feature vector
    • Logical structure analysis
    • Machine learning

    Quacquarelli Symonds (QS) Subject Topics

    • Computer Science & Information Systems
