하나둘셋넷

Python Techniques] Word Cloud

하나둘셋넷_1234 2023. 12. 3. 00:19


Text preprocessing

ㆍ Read the file and check its contents

# Read the file (the with-block closes it automatically)
with open('Dream.txt', 'r', encoding='UTF-8') as file:
    text = file.read()

# Check the first 100 characters
text[:100]

ㆍ Use the split() method to cut the text into words and build a list

# Split into words using whitespace as the delimiter
wordList = text.split()

# Check the first 10 items
wordList[:10]

ㆍ Count how often each word occurs and store the counts in a dictionary

# Remove duplicate words
worduniq = set(wordList)

# Declare the dictionary
wordCount = {}

# Store the count for each word
for w in worduniq:
    wordCount[w] = wordList.count(w)

# Function words to exclude
del_word = ['the','a','is','are', 'not','of','on','that','this','and','be','to', 'from']

# Exclude them
for w in del_word:
    if w in wordCount:
        del wordCount[w]
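
The loop above calls wordList.count(w) once per unique word, so the whole list is rescanned each time. As a minimal sketch (not the post's own method), collections.Counter from the standard library produces the same kind of word-to-count mapping in a single pass. The sample sentence, the helper name count_words, and the case-insensitive stop-word matching are assumptions made for illustration.

```python
from collections import Counter

def count_words(text, stop_words):
    """Count whitespace-separated words, skipping stop words (case-insensitive)."""
    stop = {w.lower() for w in stop_words}
    words = text.split()
    # Counter tallies every remaining word in one pass over the list
    return Counter(w for w in words if w.lower() not in stop)

# Hypothetical sample text standing in for Dream.txt
sample = "I have a dream that one day this nation will rise up But the dream is not over"
wordCount = count_words(sample, ['the', 'a', 'is', 'not', 'that', 'this', 'but'])
print(wordCount.most_common(3))
```

A Counter is a dict subclass, so it can be used wherever the wordCount dictionary is used. Matching stop words in lower case also catches capitalized forms such as 'But' without listing them separately.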

Drawing the word cloud

# Install the package
!pip install wordcloud

# Import libraries
import matplotlib.pyplot as plt
from wordcloud import WordCloud
%config InlineBackend.figure_format='retina'

# Build the word cloud
wordcloud = WordCloud(font_path = 'C:/Windows/fonts/HMKMRHD.TTF', 
                      width=2000,
                      height=1000,
                     # colormap='Blues'
                      background_color='white').generate_from_frequencies(wordCount)

# Display it
plt.figure(figsize=(12, 6))
plt.imshow(wordcloud)
plt.axis("off")
plt.tight_layout(pad=0)
plt.show()

Removing more unnecessary words

# Additional function words to exclude
del_word = ['for','But','into','So', 'which','by','as','With','am','was','when','who', 'an', 'has', 'in']

# Exclude them
for w in del_word:
    if w in wordCount:
        del wordCount[w]

Drawing the word cloud

# Build the word cloud
wordcloud = WordCloud(font_path = 'C:/Windows/fonts/HMKMRHD.TTF',
                      width=2000, 
                      height=1000, 
                      background_color='white').generate_from_frequencies(wordCount)

# Display it
plt.figure(figsize=(12, 6))
plt.imshow(wordcloud)
plt.axis("off")
plt.tight_layout(pad=0)
plt.show()


Data Preprocessing, Dummy Encoding] one-hot encoding, pd.get_dummies

하나둘셋넷_1234 2023. 12. 2. 19:54


one-hot encoding

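- One-Hot Encoding:

Each category is represented as its own column: 1 if the row belongs to that category, 0 otherwise.

For example, if a 'color' feature has three categories, 'red', 'blue', and 'green', one-hot encoding turns it into three columns ('red', 'blue', 'green'). Each column is 1 if the row has that color and 0 otherwise.

- Dummy Variable Encoding:

Similar to one-hot encoding, but one category is chosen as the reference (baseline) category, and columns are created only for the remaining categories.

This avoids multicollinearity and is common in statistical modeling.

For example, applying dummy encoding to the same 'color' feature with 'red' as the reference creates columns only for 'blue' and 'green'.

Calling pd.get_dummies with drop_first=True performs one-hot encoding and then drops the first category, which corresponds to dummy variable encoding.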

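import pandas as pd
# x is assumed to be an existing DataFrame that contains these categorical columns
cat_cols = ['ShelveLoc', 'Education', 'US', 'Urban']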
x = pd.get_dummies(x, columns = cat_cols, drop_first = True)
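
To make the difference concrete, here is a small sketch on a toy DataFrame (the 'color' column and its values are made up for illustration). Note that drop_first=True drops the first category in sorted order, so the baseline here is 'blue' rather than a category of your choosing.

```python
import pandas as pd

# Hypothetical toy data: one categorical column with three categories
df = pd.DataFrame({'color': ['red', 'blue', 'green', 'blue']})

# One-hot encoding: one column per category
one_hot = pd.get_dummies(df, columns=['color'])

# Dummy encoding: the first category in sorted order ('blue') becomes the baseline
dummy = pd.get_dummies(df, columns=['color'], drop_first=True)

print(list(one_hot.columns))  # ['color_blue', 'color_green', 'color_red']
print(list(dummy.columns))    # ['color_green', 'color_red']
```

A row where every dummy column is 0 belongs to the dropped baseline category.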


Jupyter Markdown] Setting text color in Markdown

하나둘셋넷_1234 2023. 12. 2. 13:18


Setting text color in Jupyter Markdown

# <span style="font-style:italic; font-weight:bold;font-family:serif; font-size:1.5em;line-height:1.5em;color:rgba(255, 87, 51, 0.5);">Is adjusting the maximum team size, the number of winners, and the daily submission limit effective for increasing the number of participants?</span>

- font-style : normal, italic, oblique
- font-weight : normal, bold, bolder, lighter
- font-family : sans-serif, serif, cursive, fantasy
- font-size : 1em, 0.5em, 1.5em
- line-height : 1em, 0.5em, 1.5em
- color : orange, red, blue, rgba(255, 87, 51, 0.5), hsl(12, 100%, 60%)


Data Analysis Techniques] Chi-square test, ANOVA (f_oneway), t-test p-value, barplot

하나둘셋넷_1234 2023. 11. 30. 23:29


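Categorical feature -> numeric target

What the p-value means in the chi-square test, t-test, and ANOVA

ㆍ t-test

- Null hypothesis: there is no difference between the means of the two groups.

- p-value < 0.05: reject the null hypothesis; the group means differ significantly.

ㆍ Chi-square test

- Null hypothesis: the two categorical variables are independent.

- p-value < 0.05: reject the null hypothesis; the two categorical variables are not independent.

ㆍ ANOVA

- Null hypothesis: there is no difference among the means of the groups (three or more).

- p-value < 0.05: reject the null hypothesis; at least one group mean differs significantly.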
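
The post shows code for the t-test and ANOVA but not for the chi-square test, so here is a minimal sketch using scipy.stats.chi2_contingency on a contingency table. The DataFrame, its column names ('Gender', 'Passed'), and the values are all made up for illustration.

```python
import pandas as pd
import scipy.stats as spst

# Hypothetical data: two categorical columns (80 rows)
df = pd.DataFrame({
    'Gender': ['M', 'M', 'F', 'F', 'M', 'F', 'M', 'F'] * 10,
    'Passed': ['Y', 'N', 'Y', 'Y', 'N', 'Y', 'N', 'Y'] * 10,
})

# Contingency table of observed frequencies
table = pd.crosstab(df['Gender'], df['Passed'])

# Returns the test statistic, p-value, degrees of freedom, and expected frequencies
chi2, p_value, dof, expected = spst.chi2_contingency(table)

if p_value < 0.05:
    print("Reject H0: the two variables are not independent")
else:
    print("Fail to reject H0: no evidence against independence")
```

The chi-square test compares two categorical variables; for a categorical feature against a numeric target, the t-test and ANOVA are the ones that apply.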

(1) Gender

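import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as spst

# base_data is assumed to be an existing DataFrame
## Graphical analysis: barplot
plt.figure(figsize = (15,8))
sns.barplot(x='Gender', y='Score_diff_total', data = base_data)
plt.grid()
plt.show()

## Check the categories: value_counts()
base_data['Gender'].value_counts()

## Compare means: ttest_ind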

t_male = base_data.loc[base_data['Gender']=='M', 'Score_diff_total']
t_female = base_data.loc[base_data['Gender']=='F', 'Score_diff_total']

spst.ttest_ind(t_male, t_female)

3-2-2) Learning goal (학습목표)

## Graphical analysis: barplot

plt.figure(figsize = (15,8))
sns.barplot(x='학습목표', y='Score_diff_total', data = base_data)
plt.grid()
plt.show()

## Check the categories: value_counts()
base_data['학습목표'].value_counts()

## Analysis of variance: f_oneway (승진 = promotion, 자기계발 = self-development, 취업 = employment)

anova_1 = base_data.loc[base_data['학습목표']=='승진', 'Score_diff_total']
anova_2 = base_data.loc[base_data['학습목표']=='자기계발', 'Score_diff_total']
anova_3 = base_data.loc[base_data['학습목표']=='취업', 'Score_diff_total']

spst.f_oneway(anova_1, anova_2, anova_3)

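3-2-3) Learning method (학습방법)

## Graphical analysis: barplot

plt.figure(figsize = (15,8))
sns.barplot(x='학습방법', y='Score_diff_total', data = base_data)
plt.grid()
plt.show()

## Check the categories: value_counts()
base_data['학습방법'].value_counts()

## Analysis of variance: f_oneway (온라인강의 = online lectures, 오프라인강의 = offline lectures, 참고서 = reference books)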

anova_1 = base_data.loc[base_data['학습방법']=='온라인강의', 'Score_diff_total']
anova_2 = base_data.loc[base_data['학습방법']=='오프라인강의', 'Score_diff_total']
anova_3 = base_data.loc[base_data['학습방법']=='참고서', 'Score_diff_total']

spst.f_oneway(anova_1, anova_2, anova_3)

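3-2-4) Type of course material (강의 학습 교재 유형)

## Graphical analysis: barplot

plt.figure(figsize = (15,8))
sns.barplot(x='강의 학습 교재 유형', y='Score_diff_total', data = base_data)
plt.grid()
plt.show()

## Check the categories: value_counts()
base_data['강의 학습 교재 유형'].value_counts()

## Analysis of variance: f_oneway (four groups: general English text-based, video, news/issue-based, business role play)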

anova_1 = base_data.loc[base_data['강의 학습 교재 유형']=='일반적인 영어 텍스트 기반 교재', 'Score_diff_total']
anova_2 = base_data.loc[base_data['강의 학습 교재 유형']=='영상 교재', 'Score_diff_total']
anova_3 = base_data.loc[base_data['강의 학습 교재 유형']=='뉴스/이슈 기반 교재', 'Score_diff_total']
anova_4 = base_data.loc[base_data['강의 학습 교재 유형']=='비즈니스 시뮬레이션(Role Play)', 'Score_diff_total']

spst.f_oneway(anova_1, anova_2, anova_3, anova_4)

3-2-6) Awareness of weak areas (취약분야 인지 여부)

## Graphical analysis: barplot

plt.figure(figsize = (15,8))
sns.barplot(x='취약분야 인지 여부', y='Score_diff_total', data = base_data)
plt.grid()
plt.show()

## Check the categories: value_counts()

base_data['취약분야 인지 여부'].value_counts()

## Compare means: ttest_ind (알고 있음 = aware, 알고 있지 않음 = not aware)

t_yes = base_data.loc[base_data['취약분야 인지 여부']=='알고 있음', 'Score_diff_total']
t_no = base_data.loc[base_data['취약분야 인지 여부']=='알고 있지 않음', 'Score_diff_total']

spst.ttest_ind(t_yes, t_no)


Data Analysis Techniques] Correlation analysis: Pearson correlation coefficient, regplot, heatmap

하나둘셋넷_1234 2023. 11. 30. 18:13


Pearson correlation coefficient, regplot

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

import scipy.stats as spst   

target = '등록차량수'  # target column (number of registered vehicles)

# data (a DataFrame) and analyze_features (a list of numeric column names) are assumed to exist
for feature in analyze_features:

    print(f"[{feature}] Statistical and graphical analysis")

    # Statistical analysis: pearsonr

    print("   ***** Statistical analysis *****")
    result = spst.pearsonr(data[feature], data[target])
    print(feature, " vs ", target, " correlation analysis: ", result)
    
    if result[1] > 0.05:
        print(f"Result: no significant linear correlation between {feature} and {target}")
    else:
        print(f"Result: {feature} is significantly correlated with {target}")

    # Graphical analysis: regplot

    # plt.figure(figsize = (12,8))
    print("   ***** Graphical analysis *****")
    sns.regplot(x = feature, y= target, data = data)
    plt.grid()
    plt.show()
    
    print("")
    print("-"*50)
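
As a self-contained illustration of what spst.pearsonr returns, here is a small sketch on synthetic data (the arrays x and y and the fixed seed are assumptions, not the post's dataset). The first value is the correlation coefficient r and the second is the p-value for the null hypothesis of no linear correlation.

```python
import numpy as np
import scipy.stats as spst

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible
x = rng.normal(size=200)
y = 2 * x + rng.normal(scale=0.5, size=200)  # y is built to depend linearly on x

# pearsonr returns the correlation coefficient and the p-value
r, p_value = spst.pearsonr(x, y)
print(f"r = {r:.3f}, p-value = {p_value:.3g}")

if p_value < 0.05:
    print("Significant linear correlation")
else:
    print("No significant linear correlation")
```

A small p-value indicates a linear association; on its own it does not show that one variable causes the other.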

heatmap

## Heatmap of the correlation coefficients between columns

plt.figure(figsize = (20,12))
sns.heatmap(data[col_num].corr(),cmap="PiYG", annot=True)
plt.show()


Visualization matplotlib] "Drawing bar charts" plt.bar, plt.barh, ylim, ylabel, xticks, title, rotation

하나둘셋넷_1234 2023. 11. 30. 01:48


Visualization matplotlib] plt.bar, plt.barh

plt.bar

import matplotlib.pyplot as plt
%config InlineBackend.figure_format='retina'

plt.figure(figsize=(6,4))
plt.bar(x=tmp['AgeGrp'], height=tmp['Survived'])
plt.xlabel('AgeGrp')
plt.ylabel('Survived')
plt.ylim(0,1)
plt.show()

plt.bar, plt.xticks(rotation = number)

plt.figure(figsize = [20,15])
plt.bar(x = df_participate['월별'], height = df_participate['참가자 수'])
plt.xticks(rotation =45 )
plt.show()

plt.barh

import pandas as pd
import matplotlib.pyplot as plt
gongong = pd.read_csv('한국건강가정진흥원_다문화가족 이중언어코치 지역별 현황_20220831.csv', encoding = 'CP949')
gongong  # display the DataFrame

# Set a font that can render Korean text
plt.rc('font', family='Malgun Gothic') # For Windows
plt.rc('axes', unicode_minus=False)
plt.rcParams['font.family']

# The region labels are Korean, so horizontal bars are easier to read.
plt.barh(y=gongong['지역'].astype(str), width = gongong['합계 : 이중언어코치 인원(명)'], color = ['C4'], alpha = 0.7, 
         label = ' 인원(명)')  # label: headcount (persons)
plt.xticks(list(range(0,21,2)))
plt.title('이중언어코치의 수')  # title: number of bilingual coaches

plt.legend()
plt.show()


Visualization matplotlib, seaborn categorical] "Drawing countplot, bar chart, pie chart", sns.countplot, plt.pie, pd.Series.plot(kind='bar')

하나둘셋넷_1234 2023. 11. 29. 02:33


Visualization matplotlib, seaborn categorical] countplot, bar chart, pie chart

seaborn countplot

# sns.countplot(x=titanic['Pclass'])
sns.countplot(x='Pclass', data=titanic)
# sns.countplot(y='Pclass', data=titanic)
plt.grid()
plt.show()

Bar chart visualization, plot(kind='bar')

# Select 'Survived' before taking the mean so non-numeric columns are not averaged
train.groupby('Pclass')['Survived'].mean().plot(kind='bar')

pie chart

plt.pie(temp.values, labels = temp.index, autopct = '%.2f%%',
        startangle=90, counterclock=False,
        explode = [0.05, 0.05, 0.05], shadow=True)
plt.show()


Python Techniques] Email

하나둘셋넷_1234 2023. 11. 29. 02:02


Importing libraries

import smtplib
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText
from email.mime.base import MIMEBase
from email import encoders

Files

import os

def send_email(subject, body, recipient, files):
    sender = 'mail@mail.com'
    password = 'app password'

    server = smtplib.SMTP('smtp.gmail.com', 587)
    server.starttls()
    server.login(sender, password)

    message = MIMEMultipart()
    message['From'] = sender
    message['To'] = recipient
    message['Subject'] = subject
    message.attach(MIMEText(body, 'plain'))

    for file in files: # open each item in files
        part = MIMEBase('application', 'octet-stream') # generic binary MIME type, so many kinds of files can be attached
        with open(file, 'rb') as attachment: # the with-block closes the file after reading
            part.set_payload(attachment.read())
        encoders.encode_base64(part) # encode as ASCII (base64) so the data is not corrupted in transit
        part.add_header('Content-Disposition', 'attachment', filename=os.path.basename(file)) # send the file name only, not the path
        message.attach(part)

    # Send the email
    server.send_message(message)
    server.quit()

Using the function

import streamlit as st

if st.button('Sending email'): # streamlit send button
    send_email(
        subject = 'Subject',
        body = 'Check This File',
        recipient = 'mail@mail.com',
        files = ['./under.csv', './over.csv']
        )
    st.write('Complete')
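
To check how attachments end up in the message without logging in to an SMTP server, the message can be built and inspected on its own. The sketch below swaps in the standard library's email.message.EmailMessage instead of the MIMEMultipart/MIMEBase classes used above; the helper name build_message, the example addresses, and the temporary CSV file are assumptions for illustration.

```python
import os
import tempfile
from email.message import EmailMessage

def build_message(subject, body, sender, recipient, files):
    """Build (but do not send) an email with the given files attached."""
    message = EmailMessage()
    message['From'] = sender
    message['To'] = recipient
    message['Subject'] = subject
    message.set_content(body)
    for path in files:
        with open(path, 'rb') as f:  # the with-block closes the file
            data = f.read()
        # Attach raw bytes as a generic binary file, named by its base name only
        message.add_attachment(data, maintype='application', subtype='octet-stream',
                               filename=os.path.basename(path))
    return message

# Hypothetical temporary file so the sketch runs without real CSVs
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'under.csv')
    with open(path, 'w', encoding='utf-8') as f:
        f.write('a,b\n1,2\n')
    msg = build_message('Subject', 'Check This File',
                        'sender@example.com', 'recipient@example.com', [path])

attachment_names = [part.get_filename() for part in msg.iter_attachments()]
print(attachment_names)
```

The returned message can be passed to server.send_message() in the same way as the MIMEMultipart object.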


SQL SELECT] Programmers_Popular ice cream

하나둘셋넷_1234 2023. 11. 26. 22:42


Programmers_Popular ice cream

SELECT FLAVOR
    FROM FIRST_HALF
    ORDER BY TOTAL_ORDER DESC, SHIPMENT_ID;
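
The query sorts by TOTAL_ORDER in descending order and breaks ties by SHIPMENT_ID in ascending order. As a rough sketch of that behavior, the same query can be run against an in-memory SQLite table from Python; the column types and the four sample rows are made up for illustration and are not the Programmers dataset.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute(
    "CREATE TABLE FIRST_HALF (SHIPMENT_ID INTEGER, FLAVOR TEXT, TOTAL_ORDER INTEGER)"
)

# Hypothetical rows: chocolate and mint tie on TOTAL_ORDER
rows = [
    (101, 'chocolate', 3200),
    (102, 'vanilla', 2800),
    (103, 'mint', 3200),
    (104, 'strawberry', 1500),
]
conn.executemany("INSERT INTO FIRST_HALF VALUES (?, ?, ?)", rows)

query = """
SELECT FLAVOR
    FROM FIRST_HALF
    ORDER BY TOTAL_ORDER DESC, SHIPMENT_ID;
"""
flavors = [row[0] for row in conn.execute(query)]
conn.close()

print(flavors)  # ['chocolate', 'mint', 'vanilla', 'strawberry']
```

The two flavors with 3200 orders come first, and the lower SHIPMENT_ID (chocolate, 101) is listed before the higher one (mint, 103).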
