Identifying DNA N4-methylcytosine sites in the Rosaceae genome with a deep learning model relying on distributed feature representation

  • Jhabindra Khanal
  • Hilal Tayara
  • Quan Zou*
  • Kil To Chong

*Corresponding author for this work

Research output: Contribution to journal › Journal article › peer-review

Abstract

DNA N4-methylcytosine (4mC), an epigenetic modification found in prokaryotic and eukaryotic species, is involved in numerous biological functions, including host defense, transcription regulation, gene expression, and DNA replication. Previous computational studies of 4mC site identification have mostly relied on hand-crafted features. The field would therefore benefit from a computational approach that selects features automatically. We here report 4mC-w2vec, a computational method that learns discriminative features automatically in the Rosaceae genomes, specifically in Rosa chinensis (R. chinensis) and Fragaria vesca (F. vesca), based on distributed feature representation obtained with the word embedding technique ‘word2vec’. While a few bioinformatics tools are currently employed to identify 4mC sites in these genomes, their prediction performance is inadequate. Our system processes 4mC and non-4mC sites through a word embedding step that includes sub-word information from k-mer ‘biological words’; the resulting embeddings serve as features that are fed into a two-layer convolutional neural network (CNN), which classifies whether the sample sequences contain 4mC or non-4mC sites. Our tool demonstrated performance superior to current tools on the same genomic datasets. Additionally, 4mC-w2vec is effective for balanced and imbalanced class datasets alike, and the online web-server is currently available at: http://nsclbio.jbnu.ac.kr/tools/4mC-w2vec/.
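
To make the "biological words" idea in the abstract concrete, here is a minimal sketch of the k-mer tokenization step that would precede a word embedding model. It is illustrative only: the function names (`kmer_words`, `build_vocabulary`, `encode`), the choice of k = 3, and the stride of 1 are assumptions, not values taken from the paper, and the word2vec training and CNN classifier are not shown.

```python
def kmer_words(sequence, k=3, stride=1):
    """Split a DNA sequence into overlapping k-mer 'words'.

    k and stride are illustrative defaults, not the paper's settings.
    """
    if k <= 0 or stride <= 0:
        raise ValueError("k and stride must be positive")
    seq = sequence.upper()
    # Slide a window of width k across the sequence.
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]


def build_vocabulary(sequences, k=3):
    """Assign an integer index to every distinct k-mer, in order of first appearance."""
    vocab = {}
    for seq in sequences:
        for word in kmer_words(seq, k):
            vocab.setdefault(word, len(vocab))
    return vocab


def encode(sequence, vocab, k=3):
    """Map a sequence to the list of vocabulary indices of its k-mers."""
    return [vocab[word] for word in kmer_words(sequence, k)]


if __name__ == "__main__":
    samples = ["ACGTAC", "CGTA"]  # toy sequences, not real 4mC data
    vocab = build_vocabulary(samples)
    print(kmer_words(samples[0]))
    print(encode(samples[1], vocab))
```

In a pipeline of the kind the abstract describes, the resulting k-mer lists would play the role of sentences passed to a word embedding model, and the embedded sequences would then be the input to the convolutional classifier.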

Original language: English
Pages (from-to): 1612-1619
Number of pages: 8
Journal: Computational and Structural Biotechnology Journal
Volume: 19
State: Published - 2021.01

Keywords

  • Convolutional Neural Network
  • DNA N4-methylcytosine (4mC)
  • Sequence analysis
  • Web-server
  • Word embedding

Quacquarelli Symonds (QS) Subject Topics

  • Computer Science & Information Systems
  • Data Science
  • Biological Sciences
