  • [python / pandas] Computing each variable's correlation coefficient and sorting in descending order
    python data analysis 2024. 4. 15. 01:55

    Loading the data

    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LinearRegression, Lasso, Ridge, LassoCV, BayesianRidge
    import statsmodels.formula.api as sm
    import matplotlib.pylab as plt
    
    from dmba import regressionSummary, exhaustive_search
    from dmba import backward_elimination, forward_selection, stepwise_selection
    from dmba import adjusted_r2_score, AIC_score, BIC_score
    
    Airfares = pd.read_csv('./--file_name--/Airfares.csv')
    
    Airfares.describe()

     

    The data is now stored in Airfares as a DataFrame.

     

    Choosing the predictors (independent variables) and one-hot encoding them (dummy variables)

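    # Note: the result listing further down also contains S_CITY_* and E_CITY_*
    # dummies, which only appear if 'S_CITY' and 'E_CITY' are added to this list.
    col_name = ['COUPON', 'NEW', 'VACATION', 'SW', 'HI', 'S_INCOME', 'E_INCOME',
                'S_POP', 'E_POP', 'SLOT', 'GATE', 'DISTANCE', 'PAX']
    X = pd.get_dummies(Airfares[col_name], drop_first=True)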
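
    To see what the one-hot encoding step does to column names, here is a minimal sketch on a toy DataFrame. The three rows are made up for illustration and are not taken from Airfares.csv; only the column names mirror the post.

```python
import pandas as pd

# Toy data; the values are made up for illustration.
toy = pd.DataFrame({
    "SW": ["Yes", "No", "Yes"],
    "SLOT": ["Free", "Controlled", "Free"],
    "DISTANCE": [312, 576, 364],
})

# Numeric columns pass through unchanged. Each text column becomes one
# indicator column per category, and drop_first=True drops the first
# category (in sorted order) so the dummies are not redundant.
dummies = pd.get_dummies(toy, drop_first=True)
print(list(dummies.columns))
```

    This is why names such as SW_Yes and SLOT_Free show up later in place of SW and SLOT: the dropped category ("No", "Controlled") becomes the baseline.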

     

     

    Setting the dependent variable, then computing and sorting the correlation coefficients

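    # Dependent variable y.
    # Assumption: the post never defines outcome; 'FARE' is the fare column
    # in the DMBA Airfares data.
    outcome = 'FARE'
    y = Airfares[outcome]
    
    from scipy.stats import pearsonr
    
    cor_p_li = []
    
    for col in X.columns:
        # Pearson correlation and its p-value; the cast to float keeps
        # boolean dummy columns (newer pandas versions) usable
        correlation, p_value = pearsonr(X[col].astype(float), y)
        
        # Store the name, the absolute correlation, and the p-value
        cor_p_li.append([col, abs(correlation), p_value])
    
    print('----------------------')
    # Sort by absolute correlation, largest first
    cor_p_li.sort(key=lambda x: x[1], reverse=True)
    
    # Print the sorted list
    for item in cor_p_li:
        print('Variable: {} - Correlation: {}, P-value: {}'.format(item[0], item[1], item[2]))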

     

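    Instead of building a new DataFrame, each result is appended to a list as its own inner list (variable name, correlation, p-value).

    The strength of a correlation is given by its absolute value, so

    1. abs() takes the absolute value of each coefficient,

    2. the list is sorted on that value in descending order,

    3. and the variables print from the strongest correlation to the weakest.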
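
    The same three steps can also be written with pandas alone. The sketch below is an alternative to the list-based loop, not the method used in the post. It runs on synthetic data (X_demo and y_demo are hypothetical names), and it returns only the coefficients, so pearsonr is still needed when p-values matter.

```python
import numpy as np
import pandas as pd

# Synthetic data, made up for illustration (not the Airfares file).
rng = np.random.default_rng(0)
n = 200
X_demo = pd.DataFrame({
    "strong": rng.normal(size=n),
    "weak": rng.normal(size=n),
    "noise": rng.normal(size=n),
})
y_demo = 3.0 * X_demo["strong"] - 0.5 * X_demo["weak"] + rng.normal(size=n)

# Pearson correlation of every column with y, then the absolute value,
# then a descending sort so the strongest relationship comes first.
ranked = X_demo.corrwith(y_demo).abs().sort_values(ascending=False)
print(ranked)
```

    The result is a Series indexed by variable name, which is convenient to slice (for example ranked.head(5)) or to plot.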

     

     

    Result

    ----------------------
    Variable: DISTANCE - Correlation: 0.670015994502263, P-value: 2.393445217607039e-84
    Variable: SW_Yes - Correlation: 0.5438127487418762, P-value: 2.205947129441277e-50
    Variable: COUPON - Correlation: 0.4965369601315234, P-value: 5.076170504583025e-41
    Variable: E_INCOME - Correlation: 0.32609228753446495, P-value: 2.84802618577061e-17
    Variable: E_POP - Correlation: 0.2850429850968601, P-value: 2.1606240131806804e-13
    Variable: VACATION_Yes - Correlation: 0.27686844846664727, P-value: 1.0851863657903474e-12
    Variable: SLOT_Free - Correlation: 0.20943763357050568, P-value: 9.342784525150931e-08
    Variable: S_INCOME - Correlation: 0.20913485291820177, P-value: 9.758218308530221e-08
    Variable: GATE_Free - Correlation: 0.20854008934428664, P-value: 1.0626816664580893e-07
    Variable: E_CITY_New York/Newark     NY - Correlation: 0.19988986082942195, P-value: 3.5710034675687336e-07
    Variable: E_CITY_San Francisco       CA - Correlation: 0.18044213462589395, P-value: 4.507639418661187e-06
    Variable: S_CITY_New York/Newark     NY - Correlation: 0.15599628631165136, P-value: 7.596723678741837e-05
    Variable: S_CITY_Las Vegas           NV - Correlation: 0.15390195641935986, P-value: 9.500355136280146e-05
    Variable: E_CITY_Reno                NV - Correlation: 0.15299163770235882, P-value: 0.00010460768984859742
    Variable: S_POP - Correlation: 0.14509707907385505, P-value: 0.00023576361493951768
    Variable: S_CITY_Minneapolis/St Paul MN - Correlation: 0.1422377455950115, P-value: 0.0003133212769592067
    Variable: S_CITY_Burbank             CA - Correlation: 0.13190510174839679, P-value: 0.0008383378859221293
    Variable: E_CITY_Washington          DC - Correlation: 0.13125944798489997, P-value: 0.0008895111764732214
    ...
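
    Because abs() was applied before sorting, the listing shows how strong each relationship is but not its direction. If the sign matters, one option is to keep the signed coefficients and sort by magnitude only. The sketch below uses made-up numbers, not values from the output above.

```python
import pandas as pd

# Made-up signed correlations, for illustration only.
signed = pd.Series({"a": -0.67, "b": 0.54, "c": -0.20, "d": 0.33})

# key=abs orders the entries by magnitude, while the stored values
# keep their original sign.
by_strength = signed.sort_values(key=abs, ascending=False)
print(by_strength)
```

    Negative entries then stay visibly negative while still being ranked by strength.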

     
