SlideShare a Scribd company logo
1 of 20
Download to read offline
Quick Search algorithm
       and strstr


             Cybozu Labs
2012/3/31 MITSUNARI Shigeo(@herumi)
x86/x64 optimization seminar 3(#x86opti)
Agenda
 Quick Search
 vs. strstr of gcc on Core2Duo
 vs. strstr of gcc on Xeon
 fast strstr using strchr of gcc
 vs. my implementation on Xeon
   restriction
 vs. strstr of VC2011 beta on Core i7
 feature of pcmpestri
 range version of strstr



2012/3/31 #x86opti                       2 /20
Quick Search algorithm(1/2)
 Simplified and improved Boyer-Moore algorithm
   initialized table for "this is"
    char      't'    'h'   'I'   's'      ''   other
    skip     +7      +6    +2    +1       +3    +8
                                                           0   1   2   3   4   5   6   7   8   9   A   B   C   D   E   F
                                                       0   8   8   8   8   8   8   8   8   8   8   8   8   8   8   8   8
                                                       1   8   8   8   8   8   8   8   8   8   8   8   8   8   8   8   8
   How to initialize table for given                  2   3   8   8   8   8   8   8   8   8   8   8   8   8   8   8   8
    string [str, str + len)                            3
                                                       4
                                                           8
                                                           8
                                                               8
                                                               8
                                                                   8
                                                                   8
                                                                       8
                                                                       8
                                                                           8
                                                                           8
                                                                               8
                                                                               8
                                                                                   8
                                                                                   8
                                                                                       8
                                                                                       8
                                                                                           8
                                                                                           8
                                                                                               8
                                                                                               8
                                                                                                   8
                                                                                                   8
                                                                                                       8
                                                                                                       8
                                                                                                           8
                                                                                                           8
                                                                                                               8
                                                                                                               8
                                                                                                                   8
                                                                                                                   8
                                                                                                                       8
                                                                                                                       8
                                                       5   8   8   8   8   8   8   8   8   8   8   8   8   8   8   8   8
  int tbl_[256];                                       6   8   8   8   8   8   8   8   8   6   2   8   8   8   8   8   8
                                                       7   8   8   8   1   7   8   8   8   8   8   8   8   8   8   8   8
  void init(const char *str,           int len) {      8   8   8   8   8   8   8   8   8   8   8   8   8   8   8   8   8
                                                       9   8   8   8   8   8   8   8   8   8   8   8   8   8   8   8   8
      std::fill(tbl_, tbl_ +           256, len);      A   8   8   8   8   8   8   8   8   8   8   8   8   8   8   8   8
      for (size_t i = 0; i <           len; i++) {     B   8   8   8   8   8   8   8   8   8   8   8   8   8   8   8   8
                                                       C   8   8   8   8   8   8   8   8   8   8   8   8   8   8   8   8
          tbl_[str[i]] = len           - i;            D   8   8   8   8   8   8   8   8   8   8   8   8   8   8   8   8
      }                                                E   8   8   8   8   8   8   8   8   8   8   8   8   8   8   8   8
                                                       F   8   8   8   8   8   8   8   8   8   8   8   8   8   8   8   8
  }
2012/3/31 #x86opti                                                                                                         3 /20
Quick Search algorithm(2/2)
 Searching phase
   simple and fast
   see http://www-igm.univ-mlv.fr/~lecroq/string/node19.html

       const char *find(const char *begin, const char *end) {
           while (begin <= end - len_) {
               if (memcmp(str_, begin, len_) == 0) return begin;
               begin += tbl_[begin[len_]];
           }
           return end;
       };




2012/3/31 #x86opti                                                 4 /20
Benchmark
 2.13GHz Core 2 Duo + gcc 4.2.1 + Mac 10.6.8
   33MB UTF-8 text
                                  10
             cycle/Byte to find



                                   8               fast
                                   6
                                   4
                                   2                      strstr
                                   0                      org Qs




                                       substring
   Qs(Quick search) is faster for long substring
   Remark: assume text does not have ‘¥0’for strstr
2012/3/31 #x86opti                                                 5 /20
A little modification of Qs
 avoid memcmp
              const char *find(const char *begin, const char *end) {
                  while (begin <= end - len_) {
                      if (memcmp(str_, begin, len_) == 0) return begin;
                      begin += tbl_[begin[len_]];
                  }
                  return end; }



     const char *find(const char *begin, const char *end){
         while (begin <= end - len_) {
             for (size_t i = 0; i < len_; i++) {
                 if (str_[i] != begin[i]) goto NEXT;
             }
             return begin;
         NEXT:
             begin += tbl_[static_cast<unsigned char>(begin[len_])];
         }
         return end; }
2012/3/31 #x86opti                                                        6 /20
Benchmark again
 2.13GHz Core 2 Duo + gcc 4.2.1 + Mac 10.6.8
   33MB UTF-8 text
                                  10
             cycle/Byte to find



                                   8               fast
                                   6
                                   4
                                                          strstr
                                   2
                                                          org Qs
                                   0
                                                          Qs'



                                       substring

   modified Qs(Qs’) is more faster
   Should we use modified Qs'?
2012/3/31 #x86opti                                                 7 /20
strstr on gcc 4.6 with SSE4.2
 Xeon X5650 2.67Gz on Linux
 strstr with SSE4.2 is faster than Qs’
  for substring with length less than 9 byte
                                  5
             cycle/Byte to find




                                  4               fast
                                  3
                                  2
                                  1                      strstr
                                  0                      Qs'




                                      substring

 Is strstr of gcc is fastest implementation?
2012/3/31 #x86opti                                                8 /20
strstr implementation by strchr
 Find candidate of location by strchr at first,
  and verify the correctness
   strchr of gcc with SSE4.2 is fast

     const char *mystrstr_C(const char *str, const char *key) {
         size_t len = strlen(key);
         while (*str) {
             const char *p = strchr(str, key[0]);
             if (p == 0) return 0;
             if (memcmp(p + 1, key + 1, len - 1) == 0) return p;
             str = p + 1;
         }
         return 0;
     }



2012/3/31 #x86opti                                                 9 /20
strstr vs. mystrstr_C
 Xeon X5650 2.6GHz + gcc 4.6.1
 mystrstr_C is 1.5 ~ 3 times faster than strstr
   except for “ko-re-wa”(in UTF-8)
   maybe penalty for many bad candidates
                              10
         cycle/Byte to find




                               8        fast
                               6
                               4
                                               strstr
                               2
                                               Qs'
                               0
                                               my_strstr_C


                                   substring

2012/3/31 #x86opti                                           10 /20
real speed of SSE4.2(pcmpistri)
 my_strstr is always faster than Qs’
   2 ~ 4 times faster than strstr of gcc

                              10

                               8               fast
         cycle/Byte to find




                               6

                               4                      strstr
                                                      Qs'
                               2
                                                      my_strstr_C
                               0                      my_strstr



                                   substring

2012/3/31 #x86opti                                                  11 /20
Implementation of my_strstr(1/2)
 https://github.com/herumi/opti/blob/master/str_util.hpp
   written in Xbyak(for my convenience)
 Main loop
     // a : rax(or eax), c : rcx(or ecx)
     //       input a : ptr to text
     //             key : ptr to key
     //       use save_a, save_key, c

         movdqu(xm0, ptr [key]); // xm0 = *key
     L(".lp");
         pcmpistri(xmm0, ptr [a], 12);
               // 12(1100b) = [equal ordered:unsigned:byte]
         jbe(".headCmp");
         add(a, 16);
         jmp(".lp");
     L(".headCmp");
         jnc(".notFound");
2012/3/31 #x86opti                                            12 /20
Implementation of my_strstr(2/2)
 Compare tail in“headCmp”
         ...
         add(a, c); // get position
         mov(save_a, a); // save a
         mov(save_key, key); // save key
     L(".tailCmp");
         movdqu(xm1, ptr [save_key]);
         pcmpistri(xmm1, ptr [save_a], 12);
         jno(".next");
         js(".found");
         // rare case
         add(save_a, 16);
         add(save_key, 16);
         jmp(".tailCmp");
     L(".next");
         add(a, 1);
         jmp(".lp");
2012/3/31 #x86opti                            13 /20
Pros and Cons of my_strstr
 Pros
   very fast
   Is this implementation with Qs fastest?
     No, overhead is almost larger(variable address offset)
 Cons
   access max 16 bytes beyond of the end of text
     almost no problem except for page boundary
     allocate memory with margin
             4KiB readable page                                         not readable page
          FF7   FF8   FF9   FFA   FFB   FFC   FFD   FFE   FFF   000   001   002   003




                                                                 access
                       pcmpistri                                violation
                                        end of text
2012/3/31 #x86opti                                                                      14 /20
strstr of Visual Studio 11
 almost same speed as my_strstr
 of Couse safe to use
   i7-2620 3.4GHz + Windows 7 + VS 11beta
                               8
          cycle/Byte to find




                               6        fast
                               4

                               2               strstr
                                               Qs'
                               0
                                               my_strstr


                                   substring



2012/3/31 #x86opti                                         15 /20
All benchmarks on i7-2600
 find "ko-re-wa" in 33MiB text
   the results strongly depends on text and key


                                                                strstr(before SSE4.2)
                                                      fast
                                                                Qs(gcc)

                                                                Qs'(gcc)

                                                                strstr(gcc;SSE4.2)

                                                                strstr(VC;SSE4.2)

                                                                my_strstr(SSE4.2)


       0   1   2     3   4   5   6   7   8   9 10 11 12 13 14
                     rate for the timing of Qs(gcc)

2012/3/31 #x86opti                                                                      16 /20
range version of strstr
 strstr is not available for string including‘¥0’
 use std::string.find()
   but it is not optimized for SSE4.2
     naive but fast implementation by C
     const char *findStr_C(const char *begin, const char *end,
       const char *key, size_t keySize) {
       while (begin + keySize <= end) {
         const char *p = memchr(begin, key[0], end - begin);
         if (p == 0) break;
         if (memcmp(p + 1, key + 1, keySize - 1) == 0)return p;
         begin = p + 1;
       }
       return end; }

 str_util.hpp provides findStr with SSE4.2
   4 ~ 5 times faster than findStr_C on i7-2600 + VC11
2012/3/31 #x86opti                                                17 /20
feature of pcmpestri
 very complex mnemonics
xmm0 : head of key pcmpestri xmm0, ptr [p], 12
 rax : keySize
 p : pointer to text
                                  rcx : pos of key if found
 rdx : text size
                                  CF : if found
                                  ZF : end of text
                                  SF : end of key
                                  OF : all match
    L(".lp");
        pcmpestri(xmm0, ptr [p], 12);
        lea(p, ptr [p + 16]);
        lea(d, ptr [d - 16]);           do not change carry
        ja(".lp");
        jnc(".notFound");
        // compare leading str...
2012/3/31 #x86opti                                            18 /20
Difference between Xeon and i7
 main loop of my_strstr
    L(".lp");
        pcmpistri(xmm0, ptr [a], 12);
        if (isSandyBridge) {
            lea(a, ptr [a + 16]);
            ja(".lp");                      a little faster on i7
        } else {
            jbe(".headCmp");
            add(a, 16);                 1.1 times faster on Xeon
            jmp(".lp");
    L(".headCmp");
        }
        jnc(".notFound");
        // get position
        if (isSandyBridge) {
            lea(a, ptr [a + c - 16]);
        } else {
            add(a, c);
        }
2012/3/31 #x86opti                                                  19 /20
other features of str_util.hpp
 strchr_any(text, key)[or findChar_any]
   returns a pointer to the first occurrence of any
    character of key int the text
  // search character position of '?', '#', '$', '!', '/', ':'
  strchr_any(text,"?#$!/:");
   same speed as strchr by using SSE4.2
   max length of key is 16
 strchr_range(txt, key)[or findChar_range]
   returns a pointer to the first occurrence of a
    character in range [key[0], key[1]], [key[2], key[3]], ...
   also same speed as strchr and max len(key) = 16
  // search character position of [0-9], [a-f], [A-F]
  strchr_range(text,"09afAF");
2012/3/31 #x86opti                                               20 /20

More Related Content

What's hot

Intel AVX-512/富岳SVE用SIMDコード生成ライブラリsimdgen
Intel AVX-512/富岳SVE用SIMDコード生成ライブラリsimdgenIntel AVX-512/富岳SVE用SIMDコード生成ライブラリsimdgen
Intel AVX-512/富岳SVE用SIMDコード生成ライブラリsimdgenMITSUNARI Shigeo
 
realpathキャッシュと OPcacheの面倒すぎる関係
realpathキャッシュと OPcacheの面倒すぎる関係realpathキャッシュと OPcacheの面倒すぎる関係
realpathキャッシュと OPcacheの面倒すぎる関係Yoshio Hanawa
 
AWS で Presto を徹底的に使いこなすワザ
AWS で Presto を徹底的に使いこなすワザAWS で Presto を徹底的に使いこなすワザ
AWS で Presto を徹底的に使いこなすワザNoritaka Sekiyama
 
暗号技術の実装と数学
暗号技術の実装と数学暗号技術の実装と数学
暗号技術の実装と数学MITSUNARI Shigeo
 
Elasticsearch勉強会#44 20210624
Elasticsearch勉強会#44 20210624Elasticsearch勉強会#44 20210624
Elasticsearch勉強会#44 20210624Tetsuya Sodo
 
条件分岐とcmovとmaxps
条件分岐とcmovとmaxps条件分岐とcmovとmaxps
条件分岐とcmovとmaxpsMITSUNARI Shigeo
 
GoによるWebアプリ開発のキホン
GoによるWebアプリ開発のキホンGoによるWebアプリ開発のキホン
GoによるWebアプリ開発のキホンAkihiko Horiuchi
 
テスト文字列に「うんこ」と入れるな
テスト文字列に「うんこ」と入れるなテスト文字列に「うんこ」と入れるな
テスト文字列に「うんこ」と入れるなKentaro Matsui
 
Git Flowを運用するために
Git Flowを運用するためにGit Flowを運用するために
Git Flowを運用するためにShun Tsunoda
 
脱RESTful API設計の提案
脱RESTful API設計の提案脱RESTful API設計の提案
脱RESTful API設計の提案樽八 仲川
 
地理分散DBについて
地理分散DBについて地理分散DBについて
地理分散DBについてKumazaki Hiroki
 
RDB技術者のためのNoSQLガイド NoSQLの必要性と位置づけ
RDB技術者のためのNoSQLガイド NoSQLの必要性と位置づけRDB技術者のためのNoSQLガイド NoSQLの必要性と位置づけ
RDB技術者のためのNoSQLガイド NoSQLの必要性と位置づけRecruit Technologies
 
【TECH×GAME COLLEGE#32】ゼロからリアルタイムサーバーを作るまで
【TECH×GAME COLLEGE#32】ゼロからリアルタイムサーバーを作るまで【TECH×GAME COLLEGE#32】ゼロからリアルタイムサーバーを作るまで
【TECH×GAME COLLEGE#32】ゼロからリアルタイムサーバーを作るまでtechgamecollege
 
DXとかDevOpsとかのなんかいい感じのやつ 富士通TechLive
DXとかDevOpsとかのなんかいい感じのやつ 富士通TechLiveDXとかDevOpsとかのなんかいい感じのやつ 富士通TechLive
DXとかDevOpsとかのなんかいい感じのやつ 富士通TechLiveTokoroten Nakayama
 
RESTful Web アプリの設計レビューの話
RESTful Web アプリの設計レビューの話RESTful Web アプリの設計レビューの話
RESTful Web アプリの設計レビューの話Takuto Wada
 
オススメの標準・準標準パッケージ20選
オススメの標準・準標準パッケージ20選オススメの標準・準標準パッケージ20選
オススメの標準・準標準パッケージ20選Takuya Ueda
 
Dockerからcontainerdへの移行
Dockerからcontainerdへの移行Dockerからcontainerdへの移行
Dockerからcontainerdへの移行Kohei Tokunaga
 
組み込み関数(intrinsic)によるSIMD入門
組み込み関数(intrinsic)によるSIMD入門組み込み関数(intrinsic)によるSIMD入門
組み込み関数(intrinsic)によるSIMD入門Norishige Fukushima
 
日本語テストメソッドについて
日本語テストメソッドについて日本語テストメソッドについて
日本語テストメソッドについてkumake
 

What's hot (20)

Intel AVX-512/富岳SVE用SIMDコード生成ライブラリsimdgen
Intel AVX-512/富岳SVE用SIMDコード生成ライブラリsimdgenIntel AVX-512/富岳SVE用SIMDコード生成ライブラリsimdgen
Intel AVX-512/富岳SVE用SIMDコード生成ライブラリsimdgen
 
realpathキャッシュと OPcacheの面倒すぎる関係
realpathキャッシュと OPcacheの面倒すぎる関係realpathキャッシュと OPcacheの面倒すぎる関係
realpathキャッシュと OPcacheの面倒すぎる関係
 
AWS で Presto を徹底的に使いこなすワザ
AWS で Presto を徹底的に使いこなすワザAWS で Presto を徹底的に使いこなすワザ
AWS で Presto を徹底的に使いこなすワザ
 
暗号技術の実装と数学
暗号技術の実装と数学暗号技術の実装と数学
暗号技術の実装と数学
 
Elasticsearch勉強会#44 20210624
Elasticsearch勉強会#44 20210624Elasticsearch勉強会#44 20210624
Elasticsearch勉強会#44 20210624
 
条件分岐とcmovとmaxps
条件分岐とcmovとmaxps条件分岐とcmovとmaxps
条件分岐とcmovとmaxps
 
GoによるWebアプリ開発のキホン
GoによるWebアプリ開発のキホンGoによるWebアプリ開発のキホン
GoによるWebアプリ開発のキホン
 
テスト文字列に「うんこ」と入れるな
テスト文字列に「うんこ」と入れるなテスト文字列に「うんこ」と入れるな
テスト文字列に「うんこ」と入れるな
 
Git Flowを運用するために
Git Flowを運用するためにGit Flowを運用するために
Git Flowを運用するために
 
脱RESTful API設計の提案
脱RESTful API設計の提案脱RESTful API設計の提案
脱RESTful API設計の提案
 
地理分散DBについて
地理分散DBについて地理分散DBについて
地理分散DBについて
 
RDB技術者のためのNoSQLガイド NoSQLの必要性と位置づけ
RDB技術者のためのNoSQLガイド NoSQLの必要性と位置づけRDB技術者のためのNoSQLガイド NoSQLの必要性と位置づけ
RDB技術者のためのNoSQLガイド NoSQLの必要性と位置づけ
 
【TECH×GAME COLLEGE#32】ゼロからリアルタイムサーバーを作るまで
【TECH×GAME COLLEGE#32】ゼロからリアルタイムサーバーを作るまで【TECH×GAME COLLEGE#32】ゼロからリアルタイムサーバーを作るまで
【TECH×GAME COLLEGE#32】ゼロからリアルタイムサーバーを作るまで
 
DXとかDevOpsとかのなんかいい感じのやつ 富士通TechLive
DXとかDevOpsとかのなんかいい感じのやつ 富士通TechLiveDXとかDevOpsとかのなんかいい感じのやつ 富士通TechLive
DXとかDevOpsとかのなんかいい感じのやつ 富士通TechLive
 
RESTful Web アプリの設計レビューの話
RESTful Web アプリの設計レビューの話RESTful Web アプリの設計レビューの話
RESTful Web アプリの設計レビューの話
 
オススメの標準・準標準パッケージ20選
オススメの標準・準標準パッケージ20選オススメの標準・準標準パッケージ20選
オススメの標準・準標準パッケージ20選
 
C++ マルチスレッド 入門
C++ マルチスレッド 入門C++ マルチスレッド 入門
C++ マルチスレッド 入門
 
Dockerからcontainerdへの移行
Dockerからcontainerdへの移行Dockerからcontainerdへの移行
Dockerからcontainerdへの移行
 
組み込み関数(intrinsic)によるSIMD入門
組み込み関数(intrinsic)によるSIMD入門組み込み関数(intrinsic)によるSIMD入門
組み込み関数(intrinsic)によるSIMD入門
 
日本語テストメソッドについて
日本語テストメソッドについて日本語テストメソッドについて
日本語テストメソッドについて
 

Similar to Quick Search algorithm and strstr

Similar to Quick Search algorithm and strstr (15)

Sandu TAB
Sandu TABSandu TAB
Sandu TAB
 
Lagralita
LagralitaLagralita
Lagralita
 
Lagralita
LagralitaLagralita
Lagralita
 
Lagralita
LagralitaLagralita
Lagralita
 
Existing Utilities
Existing UtilitiesExisting Utilities
Existing Utilities
 
Grow By Learning 2009[1] Sept To Dec
Grow By Learning 2009[1] Sept To DecGrow By Learning 2009[1] Sept To Dec
Grow By Learning 2009[1] Sept To Dec
 
Grow By Learning[2] September December 2008
Grow By Learning[2] September   December 2008Grow By Learning[2] September   December 2008
Grow By Learning[2] September December 2008
 
MIS - Plant Performance
MIS - Plant PerformanceMIS - Plant Performance
MIS - Plant Performance
 
16193713 T2 Partners Presentation On The Mortgage Crisis
16193713 T2 Partners Presentation On The Mortgage Crisis16193713 T2 Partners Presentation On The Mortgage Crisis
16193713 T2 Partners Presentation On The Mortgage Crisis
 
16193713 T2 Partners Presentation On The Mortgage Crisis
16193713 T2 Partners Presentation On The Mortgage Crisis16193713 T2 Partners Presentation On The Mortgage Crisis
16193713 T2 Partners Presentation On The Mortgage Crisis
 
T2 Partners Presentation On The Mortgage Crisis
T2 Partners Presentation On The Mortgage CrisisT2 Partners Presentation On The Mortgage Crisis
T2 Partners Presentation On The Mortgage Crisis
 
T2 Partners Presentation On The Mortgage Crisis
T2 Partners Presentation On The Mortgage CrisisT2 Partners Presentation On The Mortgage Crisis
T2 Partners Presentation On The Mortgage Crisis
 
TEMS Total Energy Management Service
TEMS Total Energy Management ServiceTEMS Total Energy Management Service
TEMS Total Energy Management Service
 
Thoughts of a friend
Thoughts of a friendThoughts of a friend
Thoughts of a friend
 
HCI in IoT
HCI in IoTHCI in IoT
HCI in IoT
 

More from MITSUNARI Shigeo

範囲証明つき準同型暗号とその対話的プロトコル
範囲証明つき準同型暗号とその対話的プロトコル範囲証明つき準同型暗号とその対話的プロトコル
範囲証明つき準同型暗号とその対話的プロトコルMITSUNARI Shigeo
 
暗認本読書会13 advanced
暗認本読書会13 advanced暗認本読書会13 advanced
暗認本読書会13 advancedMITSUNARI Shigeo
 
深層学習フレームワークにおけるIntel CPU/富岳向け最適化法
深層学習フレームワークにおけるIntel CPU/富岳向け最適化法深層学習フレームワークにおけるIntel CPU/富岳向け最適化法
深層学習フレームワークにおけるIntel CPU/富岳向け最適化法MITSUNARI Shigeo
 
WebAssembly向け多倍長演算の実装
WebAssembly向け多倍長演算の実装WebAssembly向け多倍長演算の実装
WebAssembly向け多倍長演算の実装MITSUNARI Shigeo
 
Lifted-ElGamal暗号を用いた任意関数演算の二者間秘密計算プロトコルのmaliciousモデルにおける効率化
Lifted-ElGamal暗号を用いた任意関数演算の二者間秘密計算プロトコルのmaliciousモデルにおける効率化Lifted-ElGamal暗号を用いた任意関数演算の二者間秘密計算プロトコルのmaliciousモデルにおける効率化
Lifted-ElGamal暗号を用いた任意関数演算の二者間秘密計算プロトコルのmaliciousモデルにおける効率化MITSUNARI Shigeo
 
BLS署名の実装とその応用
BLS署名の実装とその応用BLS署名の実装とその応用
BLS署名の実装とその応用MITSUNARI Shigeo
 
LazyFP vulnerabilityの紹介
LazyFP vulnerabilityの紹介LazyFP vulnerabilityの紹介
LazyFP vulnerabilityの紹介MITSUNARI Shigeo
 
Intro to SVE 富岳のA64FXを触ってみた
Intro to SVE 富岳のA64FXを触ってみたIntro to SVE 富岳のA64FXを触ってみた
Intro to SVE 富岳のA64FXを触ってみたMITSUNARI Shigeo
 

More from MITSUNARI Shigeo (20)

範囲証明つき準同型暗号とその対話的プロトコル
範囲証明つき準同型暗号とその対話的プロトコル範囲証明つき準同型暗号とその対話的プロトコル
範囲証明つき準同型暗号とその対話的プロトコル
 
暗認本読書会13 advanced
暗認本読書会13 advanced暗認本読書会13 advanced
暗認本読書会13 advanced
 
暗認本読書会12
暗認本読書会12暗認本読書会12
暗認本読書会12
 
暗認本読書会11
暗認本読書会11暗認本読書会11
暗認本読書会11
 
暗認本読書会10
暗認本読書会10暗認本読書会10
暗認本読書会10
 
暗認本読書会9
暗認本読書会9暗認本読書会9
暗認本読書会9
 
暗認本読書会8
暗認本読書会8暗認本読書会8
暗認本読書会8
 
暗認本読書会7
暗認本読書会7暗認本読書会7
暗認本読書会7
 
暗認本読書会6
暗認本読書会6暗認本読書会6
暗認本読書会6
 
暗認本読書会5
暗認本読書会5暗認本読書会5
暗認本読書会5
 
暗認本読書会4
暗認本読書会4暗認本読書会4
暗認本読書会4
 
深層学習フレームワークにおけるIntel CPU/富岳向け最適化法
深層学習フレームワークにおけるIntel CPU/富岳向け最適化法深層学習フレームワークにおけるIntel CPU/富岳向け最適化法
深層学習フレームワークにおけるIntel CPU/富岳向け最適化法
 
私とOSSの25年
私とOSSの25年私とOSSの25年
私とOSSの25年
 
WebAssembly向け多倍長演算の実装
WebAssembly向け多倍長演算の実装WebAssembly向け多倍長演算の実装
WebAssembly向け多倍長演算の実装
 
Lifted-ElGamal暗号を用いた任意関数演算の二者間秘密計算プロトコルのmaliciousモデルにおける効率化
Lifted-ElGamal暗号を用いた任意関数演算の二者間秘密計算プロトコルのmaliciousモデルにおける効率化Lifted-ElGamal暗号を用いた任意関数演算の二者間秘密計算プロトコルのmaliciousモデルにおける効率化
Lifted-ElGamal暗号を用いた任意関数演算の二者間秘密計算プロトコルのmaliciousモデルにおける効率化
 
楕円曲線と暗号
楕円曲線と暗号楕円曲線と暗号
楕円曲線と暗号
 
HPC Phys-20201203
HPC Phys-20201203HPC Phys-20201203
HPC Phys-20201203
 
BLS署名の実装とその応用
BLS署名の実装とその応用BLS署名の実装とその応用
BLS署名の実装とその応用
 
LazyFP vulnerabilityの紹介
LazyFP vulnerabilityの紹介LazyFP vulnerabilityの紹介
LazyFP vulnerabilityの紹介
 
Intro to SVE 富岳のA64FXを触ってみた
Intro to SVE 富岳のA64FXを触ってみたIntro to SVE 富岳のA64FXを触ってみた
Intro to SVE 富岳のA64FXを触ってみた
 

Recently uploaded

The Zero-ETL Approach: Enhancing Data Agility and Insight
The Zero-ETL Approach: Enhancing Data Agility and InsightThe Zero-ETL Approach: Enhancing Data Agility and Insight
The Zero-ETL Approach: Enhancing Data Agility and InsightSafe Software
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDropbox
 
UiPath manufacturing technology benefits and AI overview
UiPath manufacturing technology benefits and AI overviewUiPath manufacturing technology benefits and AI overview
UiPath manufacturing technology benefits and AI overviewDianaGray10
 
Stronger Together: Developing an Organizational Strategy for Accessible Desig...
Stronger Together: Developing an Organizational Strategy for Accessible Desig...Stronger Together: Developing an Organizational Strategy for Accessible Desig...
Stronger Together: Developing an Organizational Strategy for Accessible Desig...caitlingebhard1
 
CNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In PakistanCNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In Pakistandanishmna97
 
Modernizing Legacy Systems Using Ballerina
Modernizing Legacy Systems Using BallerinaModernizing Legacy Systems Using Ballerina
Modernizing Legacy Systems Using BallerinaWSO2
 
WebRTC and SIP not just audio and video @ OpenSIPS 2024
WebRTC and SIP not just audio and video @ OpenSIPS 2024WebRTC and SIP not just audio and video @ OpenSIPS 2024
WebRTC and SIP not just audio and video @ OpenSIPS 2024Lorenzo Miniero
 
AI+A11Y 11MAY2024 HYDERBAD GAAD 2024 - HelloA11Y (11 May 2024)
AI+A11Y 11MAY2024 HYDERBAD GAAD 2024 - HelloA11Y (11 May 2024)AI+A11Y 11MAY2024 HYDERBAD GAAD 2024 - HelloA11Y (11 May 2024)
AI+A11Y 11MAY2024 HYDERBAD GAAD 2024 - HelloA11Y (11 May 2024)Samir Dash
 
Tales from a Passkey Provider Progress from Awareness to Implementation.pptx
Tales from a Passkey Provider  Progress from Awareness to Implementation.pptxTales from a Passkey Provider  Progress from Awareness to Implementation.pptx
Tales from a Passkey Provider Progress from Awareness to Implementation.pptxFIDO Alliance
 
WSO2 Micro Integrator for Enterprise Integration in a Decentralized, Microser...
WSO2 Micro Integrator for Enterprise Integration in a Decentralized, Microser...WSO2 Micro Integrator for Enterprise Integration in a Decentralized, Microser...
WSO2 Micro Integrator for Enterprise Integration in a Decentralized, Microser...WSO2
 
ERP Contender Series: Acumatica vs. Sage Intacct
ERP Contender Series: Acumatica vs. Sage IntacctERP Contender Series: Acumatica vs. Sage Intacct
ERP Contender Series: Acumatica vs. Sage IntacctBrainSell Technologies
 
Quantum Leap in Next-Generation Computing
Quantum Leap in Next-Generation ComputingQuantum Leap in Next-Generation Computing
Quantum Leap in Next-Generation ComputingWSO2
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Jeffrey Haguewood
 
API Governance and Monetization - The evolution of API governance
API Governance and Monetization -  The evolution of API governanceAPI Governance and Monetization -  The evolution of API governance
API Governance and Monetization - The evolution of API governanceWSO2
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FMESafe Software
 
JohnPollard-hybrid-app-RailsConf2024.pptx
JohnPollard-hybrid-app-RailsConf2024.pptxJohnPollard-hybrid-app-RailsConf2024.pptx
JohnPollard-hybrid-app-RailsConf2024.pptxJohnPollard37
 
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Victor Rentea
 
Working together SRE & Platform Engineering
Working together SRE & Platform EngineeringWorking together SRE & Platform Engineering
Working together SRE & Platform EngineeringMarcus Vechiato
 
Event-Driven Architecture Masterclass: Challenges in Stream Processing
Event-Driven Architecture Masterclass: Challenges in Stream ProcessingEvent-Driven Architecture Masterclass: Challenges in Stream Processing
Event-Driven Architecture Masterclass: Challenges in Stream ProcessingScyllaDB
 
Harnessing Passkeys in the Battle Against AI-Powered Cyber Threats.pptx
Harnessing Passkeys in the Battle Against AI-Powered Cyber Threats.pptxHarnessing Passkeys in the Battle Against AI-Powered Cyber Threats.pptx
Harnessing Passkeys in the Battle Against AI-Powered Cyber Threats.pptxFIDO Alliance
 

Recently uploaded (20)

The Zero-ETL Approach: Enhancing Data Agility and Insight
The Zero-ETL Approach: Enhancing Data Agility and InsightThe Zero-ETL Approach: Enhancing Data Agility and Insight
The Zero-ETL Approach: Enhancing Data Agility and Insight
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor Presentation
 
UiPath manufacturing technology benefits and AI overview
UiPath manufacturing technology benefits and AI overviewUiPath manufacturing technology benefits and AI overview
UiPath manufacturing technology benefits and AI overview
 
Stronger Together: Developing an Organizational Strategy for Accessible Desig...
Stronger Together: Developing an Organizational Strategy for Accessible Desig...Stronger Together: Developing an Organizational Strategy for Accessible Desig...
Stronger Together: Developing an Organizational Strategy for Accessible Desig...
 
CNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In PakistanCNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In Pakistan
 
Modernizing Legacy Systems Using Ballerina
Modernizing Legacy Systems Using BallerinaModernizing Legacy Systems Using Ballerina
Modernizing Legacy Systems Using Ballerina
 
WebRTC and SIP not just audio and video @ OpenSIPS 2024
WebRTC and SIP not just audio and video @ OpenSIPS 2024WebRTC and SIP not just audio and video @ OpenSIPS 2024
WebRTC and SIP not just audio and video @ OpenSIPS 2024
 
AI+A11Y 11MAY2024 HYDERBAD GAAD 2024 - HelloA11Y (11 May 2024)
AI+A11Y 11MAY2024 HYDERBAD GAAD 2024 - HelloA11Y (11 May 2024)AI+A11Y 11MAY2024 HYDERBAD GAAD 2024 - HelloA11Y (11 May 2024)
AI+A11Y 11MAY2024 HYDERBAD GAAD 2024 - HelloA11Y (11 May 2024)
 
Tales from a Passkey Provider Progress from Awareness to Implementation.pptx
Tales from a Passkey Provider  Progress from Awareness to Implementation.pptxTales from a Passkey Provider  Progress from Awareness to Implementation.pptx
Tales from a Passkey Provider Progress from Awareness to Implementation.pptx
 
WSO2 Micro Integrator for Enterprise Integration in a Decentralized, Microser...
WSO2 Micro Integrator for Enterprise Integration in a Decentralized, Microser...WSO2 Micro Integrator for Enterprise Integration in a Decentralized, Microser...
WSO2 Micro Integrator for Enterprise Integration in a Decentralized, Microser...
 
ERP Contender Series: Acumatica vs. Sage Intacct
ERP Contender Series: Acumatica vs. Sage IntacctERP Contender Series: Acumatica vs. Sage Intacct
ERP Contender Series: Acumatica vs. Sage Intacct
 
Quantum Leap in Next-Generation Computing
Quantum Leap in Next-Generation ComputingQuantum Leap in Next-Generation Computing
Quantum Leap in Next-Generation Computing
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
 
API Governance and Monetization - The evolution of API governance
API Governance and Monetization -  The evolution of API governanceAPI Governance and Monetization -  The evolution of API governance
API Governance and Monetization - The evolution of API governance
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
JohnPollard-hybrid-app-RailsConf2024.pptx
JohnPollard-hybrid-app-RailsConf2024.pptxJohnPollard-hybrid-app-RailsConf2024.pptx
JohnPollard-hybrid-app-RailsConf2024.pptx
 
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
 
Working together SRE & Platform Engineering
Working together SRE & Platform EngineeringWorking together SRE & Platform Engineering
Working together SRE & Platform Engineering
 
Event-Driven Architecture Masterclass: Challenges in Stream Processing
Event-Driven Architecture Masterclass: Challenges in Stream ProcessingEvent-Driven Architecture Masterclass: Challenges in Stream Processing
Event-Driven Architecture Masterclass: Challenges in Stream Processing
 
Harnessing Passkeys in the Battle Against AI-Powered Cyber Threats.pptx
Harnessing Passkeys in the Battle Against AI-Powered Cyber Threats.pptxHarnessing Passkeys in the Battle Against AI-Powered Cyber Threats.pptx
Harnessing Passkeys in the Battle Against AI-Powered Cyber Threats.pptx
 

Quick Search algorithm and strstr

  • 1. Quick Search algorithm and strstr Cybozu Labs 2012/3/31 MITSUNARI Shigeo(@herumi) x86/x64 optimization seminar 3(#x86opti)
  • 2. Agenda  Quick Search  vs. strstr of gcc on Core2Duo  vs. strstr of gcc on Xeon  fast strstr using strchr of gcc  vs. my implementation on Xeon  restriction  vs. strstr of VC2011 beta on Core i7  feature of pcmpestri  range version of strstr 2012/3/31 #x86opti 2 /20
  • 3. Quick Search algorithm(1/2)  Simplified and improved Boyer-Moore algorithm  initialized table for "this is" char 't' 'h' 'I' 's' '' other skip +7 +6 +2 +1 +3 +8 0 1 2 3 4 5 6 7 8 9 A B C D E F 0 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 1 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8  How to initialize table for given 2 3 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 string [str, str + len) 3 4 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 5 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 int tbl_[256]; 6 8 8 8 8 8 8 8 8 6 2 8 8 8 8 8 8 7 8 8 8 1 7 8 8 8 8 8 8 8 8 8 8 8 void init(const char *str, int len) { 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 9 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 std::fill(tbl_, tbl_ + 256, len); A 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 for (size_t i = 0; i < len; i++) { B 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 C 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 tbl_[str[i]] = len - i; D 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 } E 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 F 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 } 2012/3/31 #x86opti 3 /20
  • 4. Quick Search algorithm(2/2)  Searching phase  simple and fast  see http://www-igm.univ-mlv.fr/~lecroq/string/node19.html const char *find(const char *begin, const char *end) { while (begin <= end - len_) { if (memcmp(str_, begin, len_) == 0) return begin; begin += tbl_[begin[len_]]; } return end; }; 2012/3/31 #x86opti 4 /20
  • 5. Benchmark  2.13GHz Core 2 Duo + gcc 4.2.1 + Mac 10.6.8  33MB UTF-8 text 10 cycle/Byte to find 8 fast 6 4 2 strstr 0 org Qs substring  Qs(Quick search) is faster for long substring  Remark: assume text does not have ‘¥0’for strstr 2012/3/31 #x86opti 5 /20
  • 6. A little modification of Qs  avoid memcmp const char *find(const char *begin, const char *end) { while (begin <= end - len_) { if (memcmp(str_, begin, len_) == 0) return begin; begin += tbl_[begin[len_]]; } return end; } const char *find(const char *begin, const char *end){ while (begin <= end - len_) { for (size_t i = 0; i < len_; i++) { if (str_[i] != begin[i]) goto NEXT; } return begin; NEXT: begin += tbl_[static_cast<unsigned char>(begin[len_])]; } return end; } 2012/3/31 #x86opti 6 /20
  • 7. Benchmark again  2.13GHz Core 2 Duo + gcc 4.2.1 + Mac 10.6.8  33MB UTF-8 text 10 cycle/Byte to find 8 fast 6 4 strstr 2 org Qs 0 Qs' substring  modified Qs(Qs’) is more faster  Should we use modified Qs'? 2012/3/31 #x86opti 7 /20
  • 8. strstr on gcc 4.6 with SSE4.2  Xeon X5650 2.67Gz on Linux  strstr with SSE4.2 is faster than Qs’ for substring with length less than 9 byte 5 cycle/Byte to find 4 fast 3 2 1 strstr 0 Qs' substring  Is strstr of gcc is fastest implementation? 2012/3/31 #x86opti 8 /20
  • 9. strstr implementation by strchr  Find candidate of location by strchr at first, and verify the correctness  strchr of gcc with SSE4.2 is fast const char *mystrstr_C(const char *str, const char *key) { size_t len = strlen(key); while (*str) { const char *p = strchr(str, key[0]); if (p == 0) return 0; if (memcmp(p + 1, key + 1, len - 1) == 0) return p; str = p + 1; } return 0; } 2012/3/31 #x86opti 9 /20
  • 10. strstr vs. mystrstr_C  Xeon X5650 2.6GHz + gcc 4.6.1  mystrstr_C is 1.5 ~ 3 times faster than strstr  except for “ko-re-wa”(in UTF-8)  maybe penalty for many bad candidates 10 cycle/Byte to find 8 fast 6 4 strstr 2 Qs' 0 my_strstr_C substring 2012/3/31 #x86opti 10 /20
  • 11. real speed of SSE4.2(pcmpistri)  my_strstr is always faster than Qs’  2 ~ 4 times faster than strstr of gcc 10 8 fast cycle/Byte to find 6 4 strstr Qs' 2 my_strstr_C 0 my_strstr substring 2012/3/31 #x86opti 11 /20
  • 12. Implementation of my_strstr(1/2)  https://github.com/herumi/opti/blob/master/str_util.hpp  written in Xbyak(for my convenience)  Main loop // a : rax(or eax), c : rcx(or ecx) // input a : ptr to text // key : ptr to key // use save_a, save_key, c movdqu(xm0, ptr [key]); // xm0 = *key L(".lp"); pcmpistri(xmm0, ptr [a], 12); // 12(1100b) = [equal ordered:unsigned:byte] jbe(".headCmp"); add(a, 16); jmp(".lp"); L(".headCmp"); jnc(".notFound"); 2012/3/31 #x86opti 12 /20
  • 13. Implementation of my_strstr(2/2)  Compare tail in“headCmp” ... add(a, c); // get position mov(save_a, a); // save a mov(save_key, key); // save key L(".tailCmp"); movdqu(xm1, ptr [save_key]); pcmpistri(xmm1, ptr [save_a], 12); jno(".next"); js(".found"); // rare case add(save_a, 16); add(save_key, 16); jmp(".tailCmp"); L(".next"); add(a, 1); jmp(".lp"); 2012/3/31 #x86opti 13 /20
  • 14. Pros and Cons of my_strstr  Pros  very fast  Is this implementation with Qs fastest? No, overhead is almost larger(variable address offset)  Cons  access max 16 bytes beyond of the end of text almost no problem except for page boundary allocate memory with margin 4KiB readable page not readable page FF7 FF8 FF9 FFA FFB FFC FFD FFE FFF 000 001 002 003 access pcmpistri violation end of text 2012/3/31 #x86opti 14 /20
  • 15. strstr of Visual Studio 11  almost same speed as my_strstr  of Couse safe to use  i7-2620 3.4GHz + Windows 7 + VS 11beta 8 cycle/Byte to find 6 fast 4 2 strstr Qs' 0 my_strstr substring 2012/3/31 #x86opti 15 /20
  • 16. All benchmarks on i7-2600  find "ko-re-wa" in 33MiB text  the results strongly depends on text and key strstr(before SSE4.2) fast Qs(gcc) Qs'(gcc) strstr(gcc;SSE4.2) strstr(VC;SSE4.2) my_strstr(SSE4.2) 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 rate for the timing of Qs(gcc) 2012/3/31 #x86opti 16 /20
  • 17. range version of strstr  strstr is not available for string including‘¥0’  use std::string.find()  but it is not optimized for SSE4.2 naive but fast implementation by C const char *findStr_C(const char *begin, const char *end, const char *key, size_t keySize) { while (begin + keySize <= end) { const char *p = memchr(begin, key[0], end - begin); if (p == 0) break; if (memcmp(p + 1, key + 1, keySize - 1) == 0)return p; begin = p + 1; } return end; }  str_util.hpp provides findStr with SSE4.2  4 ~ 5 times faster than findStr_C on i7-2600 + VC11 2012/3/31 #x86opti 17 /20
  • 18. feature of pcmpestri  very complex mnemonics xmm0 : head of key pcmpestri xmm0, ptr [p], 12 rax : keySize p : pointer to text rcx : pos of key if found rdx : text size CF : if found ZF : end of text SF : end of key OF : all match L(".lp"); pcmpestri(xmm0, ptr [p], 12); lea(p, ptr [p + 16]); lea(d, ptr [d - 16]); do not change carry ja(".lp"); jnc(".notFound"); // compare leading str... 2012/3/31 #x86opti 18 /20
  • 19. Difference between Xeon and i7  main loop of my_strstr L(".lp"); pcmpistri(xmm0, ptr [a], 12); if (isSandyBridge) { lea(a, ptr [a + 16]); ja(".lp"); a little faster on i7 } else { jbe(".headCmp"); add(a, 16); 1.1 times faster on Xeon jmp(".lp"); L(".headCmp"); } jnc(".notFound"); // get position if (isSandyBridge) { lea(a, ptr [a + c - 16]); } else { add(a, c); } 2012/3/31 #x86opti 19 /20
  • 20. other features of str_util.hpp  strchr_any(text, key)[or findChar_any]  returns a pointer to the first occurrence of any character of key int the text // search character position of '?', '#', '$', '!', '/', ':' strchr_any(text,"?#$!/:");  same speed as strchr by using SSE4.2  max length of key is 16  strchr_range(txt, key)[or findChar_range]  returns a pointer to the first occurrence of a character in range [key[0], key[1]], [key[2], key[3]], ...  also same speed as strchr and max len(key) = 16 // search character position of [0-9], [a-f], [A-F] strchr_range(text,"09afAF"); 2012/3/31 #x86opti 20 /20