RcppMeCab 0.0.1.1 has been released on GitHub. This version adds:

  • Parallelization with the Intel TBB library
  • A new join parameter
  • A new user_dic parameter
  • Code tidying

Parallelization

The package supports parallel computation through the Intel TBB library, which is bundled with the RcppParallel package. RcppMeCab does not yet make full use of RcppParallel, because RcppParallel doesn’t support CharacterVector conversion.
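To illustrate the CharacterVector limitation, here is a minimal sketch of one common workaround: copy the strings into a plain std::vector<std::string> first, then let each worker thread handle its own slice. This is not RcppMeCab's actual code. It uses std::thread from the standard library instead of TBB so that it compiles without R, and tag_stub and tag_all_parallel are hypothetical names invented for this example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <thread>
#include <vector>

// Hypothetical stand-in for a per-text tagger call (assumption: the real
// tagger would call MeCab here). It only appends a dummy tag.
std::string tag_stub(const std::string& text) {
    return text + "/X";
}

// Tags every element of `input` using up to `n_threads` worker threads.
// The input is a plain C++ vector, so no R object is touched by a worker.
std::vector<std::string> tag_all_parallel(const std::vector<std::string>& input,
                                          std::size_t n_threads) {
    std::vector<std::string> output(input.size());
    if (input.empty()) return output;

    // Use at least one thread and never more threads than elements.
    n_threads = std::max<std::size_t>(1, std::min(n_threads, input.size()));
    const std::size_t chunk = (input.size() + n_threads - 1) / n_threads;

    std::vector<std::thread> workers;
    for (std::size_t t = 0; t < n_threads; ++t) {
        const std::size_t begin = t * chunk;
        const std::size_t end = std::min(begin + chunk, input.size());
        if (begin >= end) break;
        assert(begin < end && end <= input.size());
        // Each worker writes only to its own slice of `output`,
        // so no locking is needed.
        workers.emplace_back([&input, &output, begin, end] {
            for (std::size_t i = begin; i < end; ++i) {
                output[i] = tag_stub(input[i]);
            }
        });
    }
    for (auto& w : workers) w.join();
    return output;
}
```

Results come back in input order because each index is written exactly once. When building this outside R, enable thread support in the compiler (for example -pthread with GCC or Clang).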

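With parallelization, morpheme analysis is now roughly 50 times faster than in NLP packages based on rJava. In the benchmark below, posParallel averages about 12 ms on a 100-element input, against about 651 ms for KoNLP: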

# temp is a 100-length CharacterVector
# SimplePos22 is KoNLP's POS function
# sapply(temp, function(x) pos(x, "")) is RcppMeCab's pos function with R loop
# posLoop(temp, "") is RcppMeCab's loop version which runs in C++
# posParallel(temp, "") is RcppMeCab's parallelized version with Intel TBB library

> microbenchmark(SimplePos22(temp), sapply(temp, function(x) pos(x, "")), posLoop(temp, ""), posParallel(temp, ""))
Unit: milliseconds
                                 expr        min         lq      mean     median
                    SimplePos22(temp) 579.159705 613.425725 650.97500 635.938074
 sapply(temp, function(x) pos(x, "")) 156.958030 163.132787 183.11562 174.995390
                    posLoop(temp, "")  74.441377  81.289487  90.00520  84.931590
                posParallel(temp, "")   7.406578   8.639878  12.43173   9.694593
        uq       max neval
 657.36901 1014.7603   100
 188.84118  356.2329   100
  90.50949  218.8578   100
  10.69478  213.4961   100

> sessionInfo()
R version 3.5.0 (2018-04-23)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS High Sierra 10.13.4

Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] microbenchmark_1.4-4 KoNLP_0.80.1         RcppParallel_4.4.0  
[4] Rcpp_0.12.16        

loaded via a namespace (and not attached):
 [1] digest_0.6.15   withr_2.1.2     DBI_1.0.0       magrittr_1.5   
 [5] RSQLite_2.1.1   stringi_1.2.2   blob_1.1.1      hash_2.2.6     
 [9] tau_0.0-20      devtools_1.13.5 tools_3.5.0     stringr_1.3.1  
[13] bit64_0.9-7     bit_1.1-14      compiler_3.5.0  rJava_0.9-10   
[17] memoise_1.1.0   Sejong_0.01   

# RcppMeCab itself was not loaded, so that the basic tagger and the loop version could be compared

The package depends on RcppParallel with further development in mind. RcppParallel doesn’t support parallelization over CharacterVector yet, but I expect that to be added at some point.

join Parameter

You can set join = FALSE in the pos and posParallel functions. With join = FALSE, the returned values are the morphemes alone, and the tags are returned as the names of the vector. For example,

> pos("Hi")
[1] "Hi/SL"
> pos("Hi", join = FALSE)
  SL 
"Hi" 
# SL is mecab-ko's POS tag for foreign-language words

User Dictionary

You can apply your compiled user dictionary to the pos and posParallel functions. To compile your CSV file, please refer to the GitHub repository. I’ll provide a full explanation of how to use mecab-dict-index in a later post.

# person.csv contains the following line:
# 폼페이오,,,,NNP,인명,F,폼페이오,*,*,*,*,*
# (폼페이오 = Pompeo, an American politician currently serving as Secretary of State)
# The file is compiled by `mecab-dict-index` with the `mecab-ko-dic` model and the CSV file.

> pos("폼페이오")
[1] "폼페이/NNG" "오/VCP+EC" 
> pos("폼페이오", user_dic = "~/user_dic.dic")
[1] "폼페이오/NNP"