Causal inference overview | comparision with ab test

Overview of Causal inference

Causal inferenceis a powerful tool that is used in a variety of fields to establish cause-and-effect relationships between variables. Here are some examples of where causal inference has been used successfully:

Medical research: Causal inference is commonly used in medical research to determine the effectiveness of treatments. For example, a randomized controlled trial may be used to establish whether a particular drug is effective in treating a specific disease.

Public policy: Causal inference is used in public policy to evaluate the impact of interventions such as education programs or social policies. For example, a study may be conducted to determine whether a particular policy aimed at reducing poverty has had a measurable impact on poverty rates.

Economics: Causal inference is used in economics to estimate the causal effects of various economic policies or events. For example, an economist may use data on unemployment rates and inflation to determine whether there is a causal relationship between the two variables.

Marketing: Causal inference is used in marketing to determine the effectiveness of advertising campaigns. For example, an A/B test may be conducted to determine whether a particular advertisement leads to higher sales than another advertisement.

Environmental science: Causal inference is used in environmental science to determine the causes of environmental problems. For example, a study may be conducted to determine whether a particular chemical is responsible for the decline of a particular species of fish in a river.

What is overlap and differece between ab tesing and causal inference

A/B testing and causal inference are both methods used to understand cause-and-effect relationships between variables, but they differ in their goals, methods, and level of rigor.

Similarities:

Both A/B testing and causal inference are used to establish cause-and-effect relationships between variables.

Both A/B testing and causal inference require careful design and analysis to minimize bias and confounding variables.

Both A/B testing and causal inference can be used in a variety of fields, including marketing, healthcare, and social sciences.

Differences:

Goal: The goal of A/B testing is to evaluate the effectiveness of different treatments or interventions. The goal of causal inference is to establish a causal relationship between an independent variable and a dependent variable.

Method: A/B testing typically involves randomly assigning participants to one of two groups and measuring the difference in outcomes between the groups. Causal inference involves a more complex statistical analysis that controls for confounding variables to establish a causal relationship.

Level of rigor: A/B testing is often used in industry settings where the goal is to make quick decisions about product design or marketing campaigns. Causal inference is used in more rigorous research settings where the goal is to establish a causal relationship with a high degree of confidence.

Industry examples of Causal inference

Google example of causal inference

Mobile_app_AB_experiments

Goal :¶

Which group method have more impact on user to app download.
Which group method have more impact on user to spent more time(min).

In [ ]:

In [1]:

# data Manipulation - first we check information about data if any problems we will fix it.

# import data_manipulation from AB_test
from AB_experiment import data_manipulation

#create alias to call data_manipulation
dm=data_manipulation()

data='app_data.csv'
column1="group"
column2=["downloaded_app","time_spent(min)"]
quartile1=0.25
quartile3=0.75
info = True
download_df=False
filename='new'

dm.data_info(data,column1,column2,quartile1,quartile3,info,download_df,filename)

Out[1]:

{'1': ['dataframe_shape', {'Observations': 30000, 'Column': 4}],
 '2': ['missing_data_info', {'No missing values'}],
 '3': ['outliers_info',
  [{'variable_name time_spent(min)': 'No outliers present'}]],
 '4': ['data_types',
  [{'object_values': "['group', 'downloaded_app']"},
   {'float_values': '[]'},
   {'int_values': ['user_id', 'time_spent(min)']},
   {'bool_val': []}]],
 '5': ['numerical_Variables', ['user_id', 'time_spent(min)']],
 '6': ['Categorical_variables', ['group', 'downloaded_app']],
 '7': [{'Unique values count for variable':           group
   ad        17903
   referral  12097},
  {'Unique values count for variable':      downloaded_app
   Yes           18393
   No            11607},
  {'Unique values count for variable':     time_spent(min)
   20             2351
   18             2271
   19             2241
   12             2213
   13             2212
   17             2209
   14             2205
   10             2196
   15             2188
   11             2186
   16             2151
   8              1150
   6              1121
   7              1119
   5              1108
   9              1079}],
 '8': ['Descriptive statistics-numerical_Variables',
               user_id  time_spent(min)
  count   30000.000000     30000.000000
  mean   497244.479467        13.548800
  std    289220.271868         4.290116
  min        41.000000         5.000000
  25%    246691.000000        10.000000
  50%    495162.000000        14.000000
  75%    747418.250000        17.000000
  max    999979.000000        20.000000,
  '********************',
  'Descriptive statistics-Categorical_variables',
          group downloaded_app
  count   30000          30000
  unique      2              2
  top        ad            Yes
  freq    17903          18393,
  '********************'],
 '9': {'category_stats': [         time_spent(min)                                    
                      count median       mean       std min max
   group                                                       
   ad                 17903   13.0  12.533654  4.633040   5  20
   referral           12097   15.0  15.051170  3.177318  10  20]},
 '10': ['Dataframe',
     user_id     group downloaded_app  time_spent(min)
  0   784598        ad            Yes               13
  1   699052  referral            Yes               11
  2   218829        ad             No                7
  3   627414        ad            Yes                7
  4   190259  referral             No               10]}

In [ ]:

In [4]:

# Since categorical variable present we will convert it into numerical using categorical_encoding
# import data_manipulation from AB_test
from AB_experiment import data_manipulation

#create alias to call data_manipulation
dm=data_manipulation()

data='app_data.csv'
variables=['downloaded_app']
download_df=True
filename='new'

dm.categorical_encoding(data, variables, download_df, filename)

Out[4]:

[{'Before encoding': {'Variable_name': 'downloaded_app',
   'unique_values': array(['Yes', 'No'], dtype=object)},
  'After encoding': {'Variable_name': 'downloaded_app_coded',
   'unique_values': array([1, 0])}}]

In [12]:

#From above function we have converted variable into numeric variable hence we also convert its datatype into bool for better analysis.

data='new.csv'
change_variables=['downloaded_app_coded']
dtype=['bool']
drop_variables=[]
download_df=True
filename='new'

dm.change_variables(data,change_variables,dtype,drop_variables,download_df,filename)

Out[12]:

[{'Variable1': ['downloaded_app_coded', dtype('bool')]}]

In [13]:

# After changing data types we chacking agian data_info
# import data_manipulation from AB_test

data='new.csv'
column1="group"
column2=["downloaded_app","time_spent(min)"]
quartile1=0.25
quartile3=0.75
info = True
download_df=False
filename='new'

dm.data_info(data,column1,column2,quartile1,quartile3,info,download_df,filename)

Out[13]:

{'1': ['dataframe_shape', {'Observations': 30000, 'Column': 5}],
 '2': ['missing_data_info', {'No missing values'}],
 '3': ['outliers_info',
  [{'variable_name time_spent(min)': 'No outliers present'}]],
 '4': ['data_types',
  [{'object_values': "['group', 'downloaded_app']"},
   {'float_values': '[]'},
   {'int_values': ['user_id', 'time_spent(min)']},
   {'bool_val': ['downloaded_app_coded']}]],
 '5': ['numerical_Variables', ['user_id', 'time_spent(min)']],
 '6': ['Categorical_variables',
  ['group', 'downloaded_app', 'downloaded_app_coded']],
 '7': [{'Unique values count for variable':           group
   ad        17903
   referral  12097},
  {'Unique values count for variable':      downloaded_app
   Yes           18393
   No            11607},
  {'Unique values count for variable':     time_spent(min)
   20             2351
   18             2271
   19             2241
   12             2213
   13             2212
   17             2209
   14             2205
   10             2196
   15             2188
   11             2186
   16             2151
   8              1150
   6              1121
   7              1119
   5              1108
   9              1079},
  {'Unique values count for variable':        downloaded_app_coded
   True                  18393
   False                 11607}],
 '8': ['Descriptive statistics-numerical_Variables',
               user_id  time_spent(min)
  count   30000.000000     30000.000000
  mean   497244.479467        13.548800
  std    289220.271868         4.290116
  min        41.000000         5.000000
  25%    246691.000000        10.000000
  50%    495162.000000        14.000000
  75%    747418.250000        17.000000
  max    999979.000000        20.000000,
  '********************',
  'Descriptive statistics-Categorical_variables',
          group downloaded_app downloaded_app_coded
  count   30000          30000                30000
  unique      2              2                    2
  top        ad            Yes                 True
  freq    17903          18393                18393,
  '********************'],
 '9': {'category_stats': [         time_spent(min)                                    
                      count median       mean       std min max
   group                                                       
   ad                 17903   13.0  12.533654  4.633040   5  20
   referral           12097   15.0  15.051170  3.177318  10  20]},
 '10': ['Dataframe',
     user_id     group downloaded_app  time_spent(min)  downloaded_app_coded
  0   784598        ad            Yes               13                  True
  1   699052  referral            Yes               11                  True
  2   218829        ad             No                7                 False
  3   627414        ad            Yes                7                  True
  4   190259  referral             No               10                 False]}

In [ ]:

# From above output info we can say that in our data there is no outliers , no missing values present 
# and datatypes of all variables correct
#Now we findout sample size 

In [6]:

#fist we findout baseline conversion rate
# import stats_test from AB_test
from AB_experiment import stats_test

#create alias to call stats_test
st=stats_test()

data='new.csv'
column1="group"
column1_value='referral'

a = st.baseline_conversion_rate(data,column1,column1_value,column2='downloaded_app_coded')
b = st.baseline_conversion_rate(data,column1,column1_value,column2='time_spent(min)',bool_var=False,threshold=13.5)
print('downloaded_app',a,'/ntime_spent(min)',b)

downloaded_app_coded {'Baseline conversion rate(p1) of group referral': 0.4877} 
time_spent(min) {'Baseline conversion rate(p1) of group referral for greater than or equal to threshold value 13.5': 0.6419}

In [ ]:

In [16]:

#Sample size using baseline conversion rate.

p1= 0.4877
mde=0.02
alpha=0.05
power=0.8
n_side=2

# For variable downloaded_app_coded
a=st.sample_size(p1,mde,alpha,power, n_side)

# For variable time_spent(min)
p1=0.6419
b=st.sample_size(p1,mde,alpha,power, n_side)

print('downloaded_app',a,'/ntime_spent(min)',b)

downloaded_app_coded {'Sample size': 9806} 
time_spent(min) {'Sample size': 8985}

In [ ]:

In [29]:

# Now we check assumptions for all combinations to perform statistical tests for AB testing 

# import stats_test from AB_test
from AB_experiment import stats_test

#create alias to call stats_test
st=stats_test()

data='new.csv'
sample_size=9806
column1="group"
column1_value1='referral'
column1_value2='ad'
column2="downloaded_app_coded"
alpha=0.05
paired_data=False 

# For variable downloaded_app_coded
a=st.AB_Test_assumption(data, sample_size, column1, column1_value1, column1_value2, column2, alpha, paired_data)

# For variable time_spent(min)
sample_size=8985
column2="time_spent(min)"

b=st.AB_Test_assumption(data, sample_size, column1, column1_value1, column1_value2, column2, alpha, paired_data)

print('For downloaded_app variable/n',a,'/n',40*'*','/n For time_spent(min) variable/n',b)

For downloaded_app_coded variable
 ({'Target variable is boolean data type': 'Use Chi-Squared Test'}, {'Note': 'If our data involve time-to-event or survival analysis (e.g., time until a user completes a task), we can use methods such as the log-rank test'}) 
 **************************************** 
 For time_spent(min) variable
 ({'Assumption of Normality is not satisfied': 'Non-parametric test => Use Mann-Whitney U test.'}, {'Note': 'If we are comparing more than two groups, such as in an A/B/C testing scenario, we can use Kruskal-Wallis test.'})

C:/Users/VINAYAK/anaconda3/lib/site-packages/scipy/stats/morestats.py:1760: UserWarning: p-value may not be accurate for N > 5000.
  warnings.warn("p-value may not be accurate for N > 5000.")

In [ ]:

By checking assumptions we use Chi-Squared Test for variable downloaded_app¶

Define the null and alternative hypotheses :

Null hypothesis (H0) : There is no significant difference in the proportions of the two variables.
Alternative hypothesis(Ha) : There is significant difference in the proportions of the two variables.

By checking assumptions we perform Non-parametric test Mann-Whitney U test for variable time_spent(min)¶

Define the null and alternative hypotheses :

Null hypothesis (H0) : There is no difference between the distributions of the two groups.
Alternative hypothesis (H1) : There is significant difference between the distributions of the two groups.

In [30]:

# import stats_test from AB_test
from AB_experiment import stats_test

#create alias to call stats_test
st=stats_test()

# perform chi-square test
data='new.csv'
sample_size=9806
column1='group'
column1_value1='referral'
column1_value2='ad'
column2='downloaded_app_coded'
alpha=0.05
reverse_experiment=False

# For variable downloaded_app_coded
a=st.chi_squared_test(data, sample_size, column1, column1_value1, column1_value2, column2, alpha, reverse_experiment)

# For variable time_spent(min)
sample_size=8985
column2="time_spent(min)"

b=st.mann_whitney_U_test(data, sample_size, column1, column1_value1, column1_value2, column2, alpha, paired_data)

('For downloaded_app variable',a,40*'*','For time_spent(min) variable',b)

Out[30]:

('For downloaded_app variable',
 [{'Test name': 'Chi-square test',
   'Timestamp': '2023-08-11 13:45:42',
   'Sample size': 9806,
   'Status': 'We can reject H0 => group ad is more successful',
   'P-value': 1.602395622342239e-193,
   'alpha': 0.05,
   'Test Statistic': 880.6203723014223,
   'Confidence Interval': (-0.2217828294823734, -0.19490083358105723)},
  {'proportion1': 0.4884, 'proportion2': 0.6967}],
 '****************************************',
 'For time_spent(min) variable',
 {'Test name': 'Mann whitney U test',
  'Timestamp': '2023-08-11 13:45:45',
  'Sample size': 8985,
  'Status': 'We can reject H0 => group referral performs better',
  'P-value': 1.2118957304952622e-289,
  'alpha': 0.05,
  'Test Statistic': 52979345.0,
  'Confidence Interval': (2.0, 3.0)})

In [ ]:

Conclusion¶

From downloaded_app Variable

From above output we conclude that we reject null hypothesis and there is significant difference in the proportions of the two variables.
Group referral proportion is 0.4884 and group ad proportion is 0.6967.
Hence we conclude that group ad have more impact on user to app download.

From time_spent(min) Variable

From above output we conclude that we reject null hypothesis and there is significant difference between the distributions of the two groups.
Group referral median value is 15.0 and group ad median value is 13.0
Hence we conclude that group referral users have spent more time on app site.

In [ ]:

Causal inference in anks

Google is a company that uses data extensively in their products and operations. As a result, they also use causal inference to understand the impact of various factors on their business. Here are some examples of causal inference at Google:

Ad effectiveness: Google uses causal inference to evaluate the effectiveness of its advertising products. For example, they may conduct A/B tests to determine whether a particular ad format leads to higher click-through rates than another format.

Search ranking algorithms: Google uses causal inference to evaluate the impact of changes to their search ranking algorithms. For example, they may conduct experiments to determine whether a particular change leads to higher quality search results for users.

User engagement: Google uses causal inference to understand the factors that drive user engagement with their products. For example, they may conduct experiments to determine whether a particular feature or design change leads to higher user engagement.

Employee satisfaction: Google uses causal inference to understand the factors that contribute to employee satisfaction. For example, they may conduct surveys and experiments to determine whether a particular benefit or policy change leads to higher employee satisfaction.

Product performance: Google uses causal inference to evaluate the performance of their products. For example, they may conduct experiments to determine whether a particular feature or design change leads to higher user retention or conversion rates.

Topic

Home

Chatgpt

Asociation-to-understanding

Association-to-understanding

Observationstudy-fieldexperim

Research-methodology