Data Engineering

AWS

by Noong_yoon 2023. 1. 9. 09:37

Elastic Stack

- A bundle of compatible versions of Kibana, Elasticsearch, and Logstash (integrated into one stack)

Beats

- Ship data from the source

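- Metrics have to be collected with Metricbeat and logs with Filebeat, each set up separately, whereas with Elastic Agent you only select what you want and it is collected automatically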

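Cluster : the largest unit that makes up an Elasticsearch system

Composed of multiple nodes

A single cluster can be spread across multiple servers, and conversely multiple clusters can be run on a single server

Node : a single process unit that makes up Elasticsearch

Holds multiple shards

Nodes configured with the same cluster name join the same cluster (provided they can discover each other)

Index : a collection of documents

The index is the unit of data storage; its plural form is "indices"

An index is split into shards, which are distributed across the nodes

Primary shards : the original shards of an index

Replica shards : copies of a primary shard

- Documents are stored in both primary and replica shards

- A primary shard and its replica are, as a rule, stored on different nodes

(In Dev Tools, GET _cat/shards?pretty&v&s=index shows which node each shard is stored on)

- The number of primary shards is set when the index is created (the default is 1; every shard carries overhead, so avoid creating too many)

- The number of replica shards can also be set (the default is 1 replica per primary shard)

- Replicas exist for high availability: if one node is disconnected the data survives, and a replica shard is promoted to primary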
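
The placement and promotion rules above can be sketched with a small toy model. This is not how Elasticsearch allocates shards internally; the round-robin placement, the Shard class, and the node names are assumptions made purely for illustration.

```python
from dataclasses import dataclass


@dataclass
class Shard:
    index: str
    number: int
    role: str  # "primary" or "replica"
    node: str


def allocate(index, n_primaries, nodes):
    """Toy round-robin placement: a primary and its replica never share a node."""
    if len(nodes) < 2:
        raise ValueError("need at least two nodes to place a replica")
    shards = []
    for i in range(n_primaries):
        shards.append(Shard(index, i, "primary", nodes[i % len(nodes)]))
        shards.append(Shard(index, i, "replica", nodes[(i + 1) % len(nodes)]))
    return shards


def fail_node(shards, dead_node):
    """Drop shards on the dead node and promote replicas whose primary was lost."""
    survivors = [s for s in shards if s.node != dead_node]
    have_primary = {s.number for s in survivors if s.role == "primary"}
    for s in survivors:
        if s.role == "replica" and s.number not in have_primary:
            s.role = "primary"
            have_primary.add(s.number)
    return survivors


shards = allocate("my-index", 2, ["node-1", "node-2", "node-3"])
after_failure = fail_node(shards, "node-1")
for s in after_failure:
    print(s.number, s.role, s.node)
```

After node-1 is removed, every shard number still has a primary, which is the high-availability property the notes describe.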

Elasticsearch - RDBMS

Index - Table

Mapping - Schema

Document - row

Field - column

_id - Unique ID

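No joins - Joins supported

Slow updates/deletes - Fast updates/deletes

- A search engine stores data as an inverted index: each term acts as the key, and the value is the list of documents containing it (e.g. doc1, doc2), so looking up a term immediately tells you which documents contain that word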
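
A minimal sketch of an inverted index, assuming whitespace tokenization and lowercase normalization (a real analyzer does far more). The document IDs and texts are made up for illustration.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


docs = {
    "doc1": "elastic search engine",
    "doc2": "search ranking",
    "doc3": "data engineering",
}
index = build_inverted_index(docs)
print(sorted(index["search"]))
```

Looking up "search" returns doc1 and doc2 without scanning the text of every document, which is why term lookups are fast.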

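<Search ranking>

- Why it matters: most people only look at the first results, and fetching a large result set puts a considerable load on the system

- Elasticsearch ranking algorithm : TF/IDF -> BM25 (BM25 has been the default since Elasticsearch 5.0)

TF/IDF : Term Frequency (the more often the search term appears in a document, the higher that document's relevance score)

Inverse Document Frequency: the more documents a term appears in across the whole corpus (i.e. the more common it is), the lower its score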
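
The two factors can be sketched with the textbook formula tf * log(N / df). This is a simplified illustration, not the exact scoring formula Lucene or BM25 uses, and the tiny corpus is made up.

```python
import math


def tf_idf(term, doc_tokens, corpus):
    """Textbook TF-IDF: raw term count times log(N / document frequency)."""
    tf = doc_tokens.count(term)
    df = sum(1 for doc in corpus if term in doc)
    if df == 0:
        return 0.0
    return tf * math.log(len(corpus) / df)


corpus = [
    ["cafe", "busan", "cafe"],
    ["busan", "beach"],
    ["busan", "cafe", "tour"],
]
for term, doc in [("busan", corpus[0]), ("cafe", corpus[0]), ("beach", corpus[1])]:
    print(term, round(tf_idf(term, doc, corpus), 3))
```

"busan" appears in every document, so its IDF (and score) is zero, while the rarer "beach" scores highest per occurrence.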

- Applying index aliasing in the Dev Tools console

PUT my-index-01/_doc/1
{
  "name":"IU",
  "age":31,
  "job":"singer"
}

POST _aliases
{
  "actions": [
    {
      "add": {
        "index": "my-index-*",
        "alias": "my-index"
      }
    }
  ]
}

PUT my-index-02/_doc/1
{
  "name":"JongSuk Lee",
  "age":35,
  "job":"actor"
}

# The wildcard in an alias action is resolved when the _aliases request runs; it ran only once, before my-index-02 existed, so my-index-02 is not included
GET my-index/_search

# So run the alias request again and check
POST _aliases
{
  "actions": [
    {
      "add": {
        "index": "my-index-*",
        "alias": "my-index"
      }
    }
  ]
}

# my-index-02 is now included
GET my-index/_search


# A path like _doc/1 creates the document with _id 1; if no ID is given after _doc/ (with POST), an ID is generated automatically
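
The behavior seen above (the wildcard only picks up indices that exist when the alias request runs) can be mimicked with a toy model. ToyCluster and its methods are hypothetical names for this sketch, not an Elasticsearch client API.

```python
from fnmatch import fnmatch


class ToyCluster:
    """Toy model: an alias stores concrete index names, not the wildcard itself."""

    def __init__(self):
        self.indices = set()
        self.aliases = {}  # alias name -> set of concrete index names

    def create_index(self, name):
        self.indices.add(name)

    def add_alias(self, pattern, alias):
        # The pattern is expanded against the indices that exist right now.
        matched = {name for name in self.indices if fnmatch(name, pattern)}
        self.aliases.setdefault(alias, set()).update(matched)

    def resolve(self, alias):
        return sorted(self.aliases.get(alias, set()))


cluster = ToyCluster()
cluster.create_index("my-index-01")
cluster.add_alias("my-index-*", "my-index")
cluster.create_index("my-index-02")
first = cluster.resolve("my-index")

cluster.add_alias("my-index-*", "my-index")
second = cluster.resolve("my-index")
print(first, second)
```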

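- Component templates are reusable

- Using Index Management in the Elastic UI (point and click) is more convenient than doing it in Dev Tools

<Data streams> - time series data

Data streams : store time series data across multiple indices

- Every data stream is made up of backing indices

- The rollover feature creates a new backing index

- Logs arrive only when an event occurs

- A data tier is a collection of nodes with the same hardware profile and the same data role

# There are five data tiers (Content, Hot, Warm, Cold, Frozen)

- Content tier: used for static data

- Hot tier: the most recent data, frequently searched and updated

- Warm tier: data that rarely needs updating (search dominates over indexing)

- Cold tier: no longer updated, searched less and less

- Frozen tier: no longer updated, searched rarely if at all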

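Using nori

GET _analyze
{
  "analyzer":"nori",
  "text":["동해물과 백두산이"]
}

user_dictionary_rules : a user-defined dictionary given as an array; words listed here are not split further and come out intact when a sentence is tokenized

- A tokenizer must be defined under "analysis"! In other words, to use a custom tokenizer you need an analysis block (in the Dev Tools request body)

- nori_part_of_speech token filter : removes tokens by part of speech; list the tags to remove in the stoptags option

- nori_number is a token filter that converts Korean numerals into regular numbers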
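
What a stoptags-based filter does can be sketched as follows. The tags attached to each token are written by hand (NNG for 아이 is an assumption), and DEFAULT_STOPTAGS is only an illustrative subset of nori's real default list.

```python
# Illustrative subset only; the real nori default stoptags list is longer.
DEFAULT_STOPTAGS = {"E", "IC", "J"}


def pos_filter(tagged_tokens, stoptags=None):
    """Drop every token whose part-of-speech tag is in stoptags."""
    tags = DEFAULT_STOPTAGS if stoptags is None else set(stoptags)
    return [token for token, tag in tagged_tokens if tag not in tags]


# Hand-tagged tokens for "다섯아이가" (tags assumed for this sketch).
tagged = [("다섯", "NR"), ("아이", "NNG"), ("가", "J")]

print(pos_filter(tagged, ["NR"]))
print(pos_filter(tagged, ["NR", "J"]))
print(pos_filter(tagged))
```

Passing an explicit stoptags list replaces the default list rather than extending it, which is why the particle 가 survives when only "NR" is given.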

# Day 3 <nori>
GET _analyze
{
  "tokenizer":"nori_tokenizer",
  "text":["동해물과 백두산이"]
}

PUT my_nori
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "my_nori_tokenizer":{  
          "type":"nori_tokenizer",
          "user_dictionary_rules":[
            "해물"
            ]
        }
      }
    }
  }
}
# my_nori_tokenizer is the name of the tokenizer

GET my_nori/_analyze
{
  "tokenizer":"my_nori_tokenizer",
  "text":["동해물과 백두산이"]
}
# With user_dictionary_rules, 동해물 is split as 동/해물 instead of 동해/물, so 해물 comes out intact


# Finally, an analyzer has to be declared
# It is named my_nori_analyzer and uses the tokenizer declared above
# The request below creates the index
PUT my_nori_2
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "my_nori_tokenizer":{  
          "type":"nori_tokenizer",
          "user_dictionary_rules":[
            "해물", "시체"
            ]
        }
      },
      "analyzer": {
        "my_nori_analyzer":{
          "type": "custom",
          "tokenizer":"my_nori_tokenizer"
        }
      }
    }
  },
  "mappings":{
    "properties": {
      "text":{
        "type":"text",
        "analyzer": "my_nori_analyzer"
      }
    }
  }
}

# Insert a document
# Index 서울시체육회 into my_nori_2 with ID 1
PUT my_nori_2/_doc/1
{
  "text":"서울시체육회"
}

# Search with a query
GET my_nori_2/_search
{
  "query":{
    "match":{
      "text":"시체"
    }
  }
}

# Parameters are passed after ? (here selecting specific fields); 1 is the document ID
GET my_nori_2/_termvectors/1?fields=*



# nori_part_of_speech
# NR below is the numeral tag (see the documentation)
# my_pos is the name of the filter
PUT my_pos
{
  "settings": {
    "analysis": {
      "filter": {
        "my_pos":{
          "type":"nori_part_of_speech",
          "stoptags":[
            "NR"
            ]
        }
      }
    }
  }
}

GET my_pos/_analyze
{
  "tokenizer": "nori_tokenizer",
  "filter":["my_pos"],
  "text":"다섯아이가"
}
# 다섯 is a numeral, so it is removed


# The previous result for 다섯아이가 left 아이 and 가; 가 is also unnecessary, so remove it (the POS tag of 가 is J; see the documentation)
# http://kkma.snu.ac.kr/documents/?doc=postag
# The link above lists part-of-speech tags
PUT my_pos2
{
  "settings": {
    "analysis": {
      "filter": {
        "my_pos":{
          "type":"nori_part_of_speech",
          "stoptags":[
            "NR", "J"
            ]
        }
      }
    }
  }
}

# Check
GET my_pos2/_analyze
{
  "tokenizer": "nori_tokenizer",
  "filter":["my_pos"],
  "text":"다섯아이가"
}
# Only 아이 is left

# When stoptags is not specified
PUT my_pos4
{
  "settings": {
    "analysis": {
      "filter": {
        "my_pos":{
          "type":"nori_part_of_speech"
        }
      }
    }
  }
}

# J is in the default stoptags, so 다섯 and 아이 remain while 가 (tagged J) is removed
GET my_pos4/_analyze
{
  "tokenizer": "nori_tokenizer",
  "filter":["my_pos"],
  "text":"다섯아이가"
}


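# Exercise
# Insert data (if this runs before PUT tour_nori below, the index is auto-created with dynamic mappings and the PUT then fails because the index already exists)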
POST tour_nori/_bulk
{ "index": {"_id": 1} }
{ "tour_name": "치산관광지", "tour_info": "수려하고 맑은물이 흐르는 치산계곡이 있어 여름철 30,000명 이상이 찾는 관광명소임" }
{ "index": {"_id": 2} }
{ "tour_name": "회산백련지", "tour_info": "동양 최대의 백련서식지, 수상유리온실, 수생식물생태관, 생태탐방로, 야외물놀이장, 오토캠핑장 등 다양한 시설을 갖추고 있어 체험과 관광을 동시에 즐길수 있음" }
{ "index": {"_id": 3} }
{ "tour_name": "마금산온천", "tour_info": "마금산온천은 약알칼리성 수질로 평균 수온이 55℃ 이상을 유지하고 있으며, 나트륨, 철, 칼슘, 라듐 등 20여종의 광물질을 포함하고 있다. 운동욕장, 수영장, 노천탕 등 보양온천 시설을 갖추고 치료와 요양, 휴양이 가능하다." }


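# Exercise 1 uses nori_part_of_speech,
# Exercise 2 uses the stop token filter
# Exercise 3 uses user_dictionary_rules of nori_tokenizer
# Exercise 4 uses the synonym filter
# Exercise 5 uses the nori_number token filter

# Finally, declare in the analyzer that all of these are used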
PUT tour_nori
{
  "settings": {
    "analysis": {
     "tokenizer": {
       "my_tokenizer":{
         "type":"nori_tokenizer",
         "user_dictionary_rules":[
           "물놀이","백련"
           ]
       }
     },
     "filter":{
       "my_stop":{
         "type":"stop",
         "stopwords":[
           "장","지","명","시설"
           ]
       },
       "my_synonym":{
         "type":"synonym",
         "synonyms":[
           "물놀이,수영,계곡"
           ]
       }
     },
     "analyzer": {
       "my_analyzer":{
         "type":"custom",
         "tokenizer":"my_tokenizer",
         "filter":[
           "nori_part_of_speech","my_stop","my_synonym","nori_number"
           ]
       }
     }
    }
  },
  "mappings": {
    "properties": {
      "tour_name":{
        "type":"text",
        "analyzer":"my_analyzer"
      },
      "tour_info":{
        "type":"text",
        "analyzer": "my_analyzer"
      }
    }
  }
}

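# nori_part_of_speech with no custom settings applies its default stoptags (which include J), which solves exercise 1
# Filters are applied in the order listed, so be careful!
# The mappings at the end state that these fields use this analyzer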
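
Why filter order matters can be shown with a toy token-filter chain. The filters below are simplified stand-ins (plain list operations), and the stopword 수영 is chosen only to make the difference visible; it is not part of the exercise settings.

```python
def stop_filter(stopwords):
    """Return a filter that drops tokens found in stopwords."""
    def apply(tokens):
        return [t for t in tokens if t not in stopwords]
    return apply


def synonym_filter(group):
    """Return a filter that expands any token in the group to the whole group."""
    def apply(tokens):
        out = []
        for t in tokens:
            if t in group:
                out.extend(group)
            else:
                out.append(t)
        return out
    return apply


def analyze(tokens, filters):
    """Apply the filters strictly in the order given."""
    for f in filters:
        tokens = f(tokens)
    return tokens


tokens = ["야외", "물놀이"]
stop = stop_filter({"수영"})
synonym = synonym_filter(["물놀이", "수영", "계곡"])

stop_then_synonym = analyze(tokens, [stop, synonym])
synonym_then_stop = analyze(tokens, [synonym, stop])
print(stop_then_synonym)
print(synonym_then_stop)
```

With the stop filter first, the synonym 수영 is added afterwards and survives; with the synonym filter first, 수영 is added and then removed.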

GET tour_nori/_search
{
  "query": {
    "multi_match": {
      "query": "오토캠핑장",
      "fields": ["tour_info","tour_name"]
    }
  },
  "highlight": {
    "fields":{
      "tour_info":{},
      "tour_name":{}
    }
  }
}
# highlight shows where the match occurred

GET tour_nori/_termvectors/2?fields=tour_info



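# The same example, but with the stopwords and synonyms saved in text files beforehand and loaded from there (put the three txt files in dictionaries.zip, upload the dictionaries file under "Manage this deployment", then reference them as in the request below)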

PUT tour_nori_2
{
  "settings": {
    "analysis": {
     "tokenizer": {
       "my_tokenizer":{
         "type":"nori_tokenizer",
         "user_dictionary":
           "user_dictionary.txt"
       }
     },
     "filter":{
       "my_stop":{
         "type":"stop",
         "stopwords_path":
           "stop.txt"
       },
       "my_synonym":{
         "type":"synonym",
         "synonyms_path":
           "stnonyms.txt"
       }
     },
     "analyzer": {
       "my_analyzer":{
         "type":"custom",
         "tokenizer":"my_tokenizer",
         "filter":[
           "nori_part_of_speech","my_stop","my_synonym","nori_number"
           ]
       }
     }
    }
  },
  "mappings": {
    "properties": {
      "tour_name":{
        "type":"text",
        "analyzer":"my_analyzer"
      },
      "tour_info":{
        "type":"text",
        "analyzer": "my_analyzer"
      }
    }
  }
}

# Insert data
POST tour_nori_2/_bulk
{ "index": {"_id": 1} }
{ "tour_name": "치산관광지", "tour_info": "수려하고 맑은물이 흐르는 치산계곡이 있어 여름철 30,000명 이상이 찾는 관광명소임" }
{ "index": {"_id": 2} }
{ "tour_name": "회산백련지", "tour_info": "동양 최대의 백련서식지, 수상유리온실, 수생식물생태관, 생태탐방로, 야외물놀이장, 오토캠핑장 등 다양한 시설을 갖추고 있어 체험과 관광을 동시에 즐길수 있음" }
{ "index": {"_id": 3} }
{ "tour_name": "마금산온천", "tour_info": "마금산온천은 약알칼리성 수질로 평균 수온이 55℃ 이상을 유지하고 있으며, 나트륨, 철, 칼슘, 라듐 등 20여종의 광물질을 포함하고 있다. 운동욕장, 수영장, 노천탕 등 보양온천 시설을 갖추고 치료와 요양, 휴양이 가능하다." }

GET tour_nori_2/_search
{
  "query":{
    "multi_match":{
      "query":"물놀이",
      "fields":["tour_info","tour_name"]
    }
  },
  "highlight":{
    "fields":{
      "tour_info":{},
      "tour_name":{}
    }
  }
}

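- Think of PUT as update and POST as insert (repeated PUTs to the same _id keep increasing _version, while each POST without an _id creates a new document, so _version stays at 1)

- Reindex: the source index is left as it is and a new copy is created

- Conceptually the same as reindex, but the index is rewritten onto itself (delete and insert)

- Deletes everything that matches a condition in one go

# Running the request below repeatedly always returns _version 1 (each call creates a new document with a generated _id)
# POST behaves like insert (adding something new)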
POST my-index-000001/_doc
{
  "test":"시험중"
}

# Running the request below repeatedly keeps increasing _version
# PUT behaves like update (modifying what already exists)
PUT my-index-000001/_doc/vDYn4UBhXOmFzW7P-A4
{
  "test":"시험중"
}
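
The _version behavior of the two requests above can be mimicked with a toy in-memory index. ToyIndex is a hypothetical class for this sketch, and the uuid-based ID only stands in for Elasticsearch's own ID generation.

```python
import uuid


class ToyIndex:
    """Toy model of per-document versioning."""

    def __init__(self):
        self.docs = {}  # _id -> (version, source)

    def put(self, doc_id, source):
        # Writing to an existing _id replaces the document and bumps its version.
        version = self.docs[doc_id][0] + 1 if doc_id in self.docs else 1
        self.docs[doc_id] = (version, source)
        return {"_id": doc_id, "_version": version}

    def post(self, source):
        # No _id given: a fresh one is generated, so this is always a new document.
        return self.put(uuid.uuid4().hex, source)


idx = ToyIndex()
post_results = [idx.post({"test": "시험중"}) for _ in range(3)]
put_results = [idx.put("fixed-id", {"test": "시험중"}) for _ in range(3)]
print([r["_version"] for r in post_results])
print([r["_version"] for r in put_results])
```

The difference comes from whether the _id is reused, not from the HTTP verb as such.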

# Use GET (or POST) for lookups

GET my-index-000001/_search

GET my-new-index-000001/_search

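# took is the time in milliseconds that Elasticsearch spent processing the request
# _reindex copies the documents of the source index into the destination index
# reindex is often used when the original index was created incorrectly or an existing index has to be changed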
POST _reindex
{
  "source":{
    "index":"my-index-000001"
  },
  "dest":{
    "index":"my-new-index-000001"
  }
}

- manipulate fields : set, remove, rename, dot_expander

- manipulate values : split/join, grok, dissect, gsub

- special operations : csv/json, geoip, user_agent, script, pipeline

- ingest node pipelines

- create pipelines : can be created through Logstash or Dev Tools

- pipeline (a set of processors) exercise
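
The idea of a pipeline as an ordered set of processors can be sketched in plain Python. The function names mirror a few of the processors listed above (set, rename, remove, split, gsub), but the implementations and the sample document are assumptions for illustration, not the ingest node's code.

```python
import re


def set_proc(field, value):
    def run(doc):
        doc[field] = value
        return doc
    return run


def rename_proc(field, target):
    def run(doc):
        doc[target] = doc.pop(field)
        return doc
    return run


def remove_proc(field):
    def run(doc):
        doc.pop(field, None)
        return doc
    return run


def split_proc(field, separator):
    def run(doc):
        doc[field] = doc[field].split(separator)
        return doc
    return run


def gsub_proc(field, pattern, replacement):
    def run(doc):
        doc[field] = re.sub(pattern, replacement, doc[field])
        return doc
    return run


def run_pipeline(processors, doc):
    """Apply each processor in order to a copy of the document."""
    doc = dict(doc)
    for processor in processors:
        doc = processor(doc)
    return doc


pipeline = [
    gsub_proc("phone", r"-", ""),
    split_proc("tags", ","),
    rename_proc("nm", "name"),
    set_proc("source", "csv"),
    remove_proc("tmp"),
]
result = run_pipeline(
    pipeline,
    {"nm": "cafe", "phone": "051-123-4567", "tags": "a,b", "tmp": 1},
)
print(result)
```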

[General reference]

Example: Parse logs in the Common Log Format | Elasticsearch Guide [master] | Elastic


- Grok Processor

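- Enrich data : when new data comes in, data from a source index is appended to it and merged in (i.e. you load one reference dataset, and incoming documents that match it are extended with its fields)

- Enrich policy: a policy built on top of the source data

-----> Stack Management - Edit processors: the processors written in code above (remove, etc.) can also be added here

- Tutorial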

Example: Enrich your data by matching a value to a range | Elasticsearch Guide [master] | Elastic


# How to write API calls: method /api ~~~

put /_enrich~~

# Exercise with downloaded real-world data

부산광역시 기장군_카페 현황_20221228.csv


# Exercise using the Gijang-gun (Busan) cafe dataset from the public data portal (공공데이터포털)
# 1. Upload the csv (UTF-8) as the source index
GET busan_cafe/_search

# If the 업소명 (business name) matches, attach all the other fields
# 2. Create the policy: based on the busan_cafe source index, attach the remaining fields when "업소명" matches
PUT /_enrich/policy/busan-policy
{
  "match": {
    "indices": "busan_cafe",
    "match_field": "업소명",
    "enrich_fields": ["업종","업소명","소재지(도로명)","소재지전화"]
  }
}

# 3. Execute the policy. This builds the enrich index from the source index, which the enrich processor needs for its lookups
POST /_enrich/policy/busan-policy/_execute

# When new data comes in, look up its keyword field; if it matches 업소명, create a cafe field and attach everything there
# This is the earlier version of step 4 below
PUT /_ingest/pipeline/busan_lookup
{
  "processors": [
    {
      "enrich":{
        "policy_name":"busan-policy",
        "field":"keyword",
        "target_field": "cafe",
        "max_matches": 1
      }
    }
  ]
}

# Index a document whose keyword is 산마루다방 through the pipeline
PUT /my-index-000001/_doc/busan?pipeline=busan_lookup
{
  "keyword":"산마루다방"
}

GET /my-index-000001/_doc/busan


# Appending _simulate and passing docs lets you try the pipeline without indexing anything
# Useful for checking that the pipeline works as intended
POST _ingest/pipeline/busan_lookup/_simulate
{
  "docs":[
    {
      "_source":{
        "keyword":"휴게음식점"
      }
    }
  ]
}

# 5. Test sending new data through the pipeline
POST _ingest/pipeline/busan_lookup/_simulate
{
  "docs":[
    {
      "_source":{
        "text":"산마루다방에는 사랑이 넘쳐요",
        "keyword":["산","마루","산마루","산마루다방","사랑"]
      }
    }
  ]
}

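# 4. Create the pipeline that uses the policy; inside processors, foreach loops over the array and runs enrich
# (inside foreach the current array element is available as _ingest._value)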
PUT _ingest/pipeline/busan_lookup
{
  "processors": [
    {
      "foreach": {
        "field": "keyword",
        "processor": {
          "enrich":{
            "policy_name": "busan-policy",
            "field":"keyword",
            "target_field": "cafe",
            "max_matches": 10
          }
        }
      }
    }
  ]
}

GET /my-index-000001/_doc/busan


POST _ingest/pipeline/busan_lookup/_simulate
{
  "docs":[
    {
      "_source":{
        "text":"산마루다방에는 사랑이 넘쳐요",
        "keyword":["산","마루","산마루","산마루다방","사랑"]
      }
    }
  ]
}
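
The lookup that the policy and pipeline perform can be sketched in plain Python. build_enrich_index and enrich_foreach are hypothetical helper names, and the 업종 value is a placeholder rather than a value taken from the real dataset.

```python
def build_enrich_index(source_docs, match_field, enrich_fields):
    """Mimic policy execution: a lookup table keyed by the match field."""
    return {
        doc[match_field]: {f: doc[f] for f in enrich_fields if f in doc}
        for doc in source_docs
    }


def enrich_foreach(doc, enrich_index, field, target_field):
    """Look up every element of doc[field]; attach all matches under target_field."""
    matches = [enrich_index[v] for v in doc.get(field, []) if v in enrich_index]
    if matches:
        doc = {**doc, target_field: matches}
    return doc


# Placeholder source row; the 업종 value is not real data.
source = [{"업소명": "산마루다방", "업종": "placeholder"}]
enrich_index = build_enrich_index(source, "업소명", ["업소명", "업종"])

incoming = {
    "text": "산마루다방에는 사랑이 넘쳐요",
    "keyword": ["산", "마루", "산마루", "산마루다방", "사랑"],
}
enriched = enrich_foreach(incoming, enrich_index, "keyword", "cafe")
unmatched = enrich_foreach({"keyword": ["사랑"]}, enrich_index, "keyword", "cafe")
print(enriched["cafe"])
```

Only the exact value 산마루다방 matches, since a match policy compares whole values rather than substrings.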

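Ex.

- For sentiment analysis or category-based classification: set up a pipeline from a source index, let incoming documents pass through it and get fields appended, then filter on those fields so that only matching items are shown

<Transforms, Rollup>

- Stack Management - Data: rollup jobs and transforms are similar

- Transforms summarize (aggregate) data by some criterion and keep the result --> pivot and latest, e.g. recording what happened over the last month (aggregation based on time + data)

- Rollup jobs have a watch-like notion: they run on an interval and cover nearby points in time spanning the present and the past

- Use Transforms when you want to aggregate on a given field and store the result as something new!

- Exercise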

Tutorial: Transforming the eCommerce sample data | Elasticsearch Guide [8.6] | Elastic


The aggregated results from this exercise can also be placed on the x or y axis to visualize them!

(Transforms are mainly used to pre-aggregate when there is a lot of data, e.g. more than a TB)
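
The two transform types mentioned above, pivot and latest, can be sketched as plain aggregations. The field names (customer, total, ts) and the sample documents are made up for illustration and do not come from the tutorial dataset.

```python
from collections import defaultdict


def pivot(docs, group_by, value_field):
    """Pivot-style summary: sum value_field per group_by key."""
    totals = defaultdict(float)
    for doc in docs:
        totals[doc[group_by]] += doc[value_field]
    return dict(totals)


def latest(docs, unique_key, sort_field):
    """Latest-style summary: keep only the newest document per unique_key."""
    newest = {}
    for doc in docs:
        key = doc[unique_key]
        if key not in newest or doc[sort_field] > newest[key][sort_field]:
            newest[key] = doc
    return newest


orders = [
    {"customer": "a", "total": 10.0, "ts": 1},
    {"customer": "a", "total": 5.0, "ts": 3},
    {"customer": "b", "total": 7.0, "ts": 2},
]
summary = pivot(orders, "customer", "total")
most_recent = latest(orders, "customer", "ts")
print(summary)
print(most_recent["a"])
```

The summarized result is much smaller than the raw documents, which is why pre-aggregating helps when the source data is large.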

# Using an aggregation in Transforms - busan_cafe_aggs
GET busan_cafe_aggs/_search


# Using the eCommerce data in Transforms (tutorial)
# https://www.elastic.co/guide/en/elasticsearch/reference/current/ecommerce-transforms.html

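: Gathering logs to find new approaches

- Scale up : raising the specs of a server (vertical scaling)

- Scale out : adding more physical resources such as servers (horizontal scaling) --> via a load balancer

- Previously Monitoring (simply looking at things) vs. Observability (looking at what they mean)

- Logs + Metrics (status of CPU, GPU, etc.) + Traces

- Elasticsearch fits best in a microservice architecture

- Alert when an event matches

- Cases : a request for another department to take a look, e.g. when an error occurs

- Monitors : health checks for servers

- User Experience - Dashboard : collects all kinds of information about users

- microservice demo - running the logging code downloads everything that is needed

- RUM : real user monitoring (collects user behavior) --> similar to Google Analytics

(RUM is the developer's view, GA is the marketer's view)

- RUM analysis requirements : e.g. where the longest network time is spent

- Whatever the programming language, entering the cloud URL is enough to pull data into Elastic

- Advantages of Elastic:

It starts from where the user is, i.e. the user's browser (including which language it uses) via RUM, rather than starting from the server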
