Advanced Applied Statistics for Data Scientists

Applied statistics is the bedrock that every model rests on. Understand it well and you stop treating p-values, confidence intervals and distributions as magic numbers and start reasoning about them as the tools they are.

Why Advanced Applied Statistics Matters

Misusing statistical tools is how otherwise talented teams ship confident but wrong conclusions. Solid foundations protect you from spurious effects, under-powered studies and false certainty.

Define the random variables and their distributions precisely.
Choose estimators whose bias and variance you can reason about.
Quantify uncertainty with confidence or credible intervals.
Use p-values only as one piece of evidence, never the conclusion.
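
To make the point about bias and variance concrete, here is a minimal simulation sketch. It compares the two common variance estimators (dividing by n versus n - 1) on repeated small samples. The population parameters, sample size and number of repetitions are illustrative assumptions, not values from this course.

```python
import numpy as np

rng = np.random.default_rng(42)
true_var = 4.0          # assumed population variance (sigma = 2)
n, reps = 10, 20_000    # small samples make the bias easy to see

# Each row is one simulated sample of size n.
samples = rng.normal(loc=0.0, scale=np.sqrt(true_var), size=(reps, n))

var_mle      = samples.var(axis=1, ddof=0)   # divides by n
var_unbiased = samples.var(axis=1, ddof=1)   # divides by n - 1

for name, est in [("ddof=0", var_mle), ("ddof=1", var_unbiased)]:
    bias = est.mean() - true_var
    print(f"{name}: bias = {bias:+.3f}, variance of estimator = {est.var():.3f}")
```

The ddof=0 estimator should come out biased low but less variable, while ddof=1 is roughly unbiased but noisier. Which one you prefer depends on what you are optimising for, which is exactly the kind of trade-off the list above asks you to reason about.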

How Advanced Applied Statistics Shows Up in Practice

In a typical project, these techniques are combined with the rest of the Statistics & Probability toolkit. You rarely use any one technique in isolation; the real skill is knowing which combination fits the problem and being able to explain that choice to a non-technical stakeholder.

Apply this material in any experiment, A/B test, survey analysis or report that will be used to make a real-world decision.
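
Before running an A/B test, it helps to check whether the planned sample size can detect the effect you care about. The sketch below estimates power by simulation for a pooled two-proportion z-test. The conversion rates (10% vs 12%), the sample sizes and the helper name `simulated_power` are hypothetical choices made for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

def simulated_power(p_control, p_variant, n_per_arm, alpha=0.05, reps=5_000):
    """Fraction of simulated A/B tests in which a two-sided z-test rejects H0."""
    # Simulate conversion counts in each arm for `reps` independent experiments.
    x_c = rng.binomial(n_per_arm, p_control, size=reps)
    x_v = rng.binomial(n_per_arm, p_variant, size=reps)
    # Pooled two-proportion z-test, computed for all experiments at once.
    pooled = (x_c + x_v) / (2 * n_per_arm)
    se     = np.sqrt(pooled * (1 - pooled) * 2 / n_per_arm)
    z      = (x_v - x_c) / n_per_arm / se
    p_vals = 2 * stats.norm.sf(np.abs(z))
    return np.mean(p_vals < alpha)

# Hypothetical baseline 10% and variant 12% conversion rates.
for n in (500, 2_000, 8_000):
    print(f"n per arm = {n:>5}: power ~ {simulated_power(0.10, 0.12, n):.2f}")
```

If the estimated power at your planned sample size is low, a non-significant result says little about whether the effect exists. Setting `p_variant` equal to `p_control` gives a quick sanity check: the rejection rate should then sit close to `alpha`.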

Code Examples: Advanced Applied Statistics for Data Scientists (5 runnable snippets)

Copy any block into a file or notebook and run it end-to-end; each example stands alone.

Example 1: One-sample t-test with 95% CI

# Example 1: One-sample t-test with 95% CI -- Advanced Applied Statistics for Data Scientists
import numpy as np
from scipy import stats

rng    = np.random.default_rng(7)
sample = rng.normal(loc=102.4, scale=14.0, size=60)
mu0    = 100.0

t_stat, p_val = stats.ttest_1samp(sample, popmean=mu0)
ci_lo, ci_hi  = stats.t.interval(0.95, df=len(sample) - 1,
                                 loc=sample.mean(),
                                 scale=stats.sem(sample))

print(f"mean    : {sample.mean():.2f}")
print(f"95% CI  : ({ci_lo:.2f}, {ci_hi:.2f})")
print(f"t, p    : {t_stat:.3f}, {p_val:.4f}")
print("verdict :", "reject H0" if p_val < 0.05 else "fail to reject H0")
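
A 95% confidence interval is a statement about the procedure, not about any single interval. The following sketch checks this by simulation: it reuses the population parameters from Example 1 as an assumed ground truth, builds the same t-interval many times, and counts how often it contains the true mean.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(11)
true_mean, sigma, n, reps = 102.4, 14.0, 60, 10_000   # assumed ground truth

# Each row is one simulated study of size n.
samples = rng.normal(loc=true_mean, scale=sigma, size=(reps, n))
means   = samples.mean(axis=1)
sems    = samples.std(axis=1, ddof=1) / np.sqrt(n)
t_crit  = stats.t.ppf(0.975, df=n - 1)

# An interval "covers" when the true mean lies between its endpoints.
covered = (means - t_crit * sems <= true_mean) & (true_mean <= means + t_crit * sems)
print(f"empirical coverage: {covered.mean():.3f}")
```

With normally distributed data the empirical coverage should land close to 0.95. Repeating the exercise with skewed data and small n is a useful way to see when the t-interval starts to drift from its nominal level.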

Example 2: Bayesian Beta-Binomial update

# Example 2: Bayesian Beta-Binomial update -- Advanced Applied Statistics for Data Scientists
import numpy as np
from scipy import stats

prior_a, prior_b   = 2, 2          # weak Beta(2,2) prior
successes, trials  = 47, 80        # observed data

post_a = prior_a + successes
post_b = prior_b + (trials - successes)
post   = stats.beta(post_a, post_b)

print(f"posterior mean       : {post.mean():.3f}")
print(f"95% credible interval: "
      f"({post.ppf(0.025):.3f}, {post.ppf(0.975):.3f})")
print(f"P(p > 0.5 | data)    : {1 - post.cdf(0.5):.3f}")

samples = post.rvs(size=20_000, random_state=0)
print(f"Monte-Carlo check    : {samples.mean():.3f}")
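
The update in Example 2 is two additions because the Beta prior is conjugate to the Binomial likelihood. A short derivation, writing a and b for the prior parameters, k for the number of successes and n for the number of trials (notation introduced here for illustration):

```latex
\begin{aligned}
p &\sim \mathrm{Beta}(a, b), \qquad k \mid p \sim \mathrm{Binomial}(n, p) \\
\pi(p \mid k) &\propto p^{k}(1-p)^{n-k} \cdot p^{a-1}(1-p)^{b-1}
             = p^{a+k-1}(1-p)^{b+n-k-1} \\
\Rightarrow\; p \mid k &\sim \mathrm{Beta}(a + k,\; b + n - k)
\end{aligned}
```

With a = b = 2, k = 47 and n = 80 this gives Beta(49, 35), whose mean is 49/84, approximately 0.583.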

Example 3: Bootstrap CI for a robust statistic

# Example 3: Bootstrap CI for a robust statistic -- Advanced Applied Statistics for Data Scientists
import numpy as np

rng  = np.random.default_rng(1)
data = rng.lognormal(mean=1.2, sigma=0.4, size=200)

def bootstrap_ci(x, stat=np.median, B=5_000, alpha=0.05, rng=rng):
    """Percentile bootstrap CI: resample with replacement, recompute stat."""
    n = len(x)
    draws = np.empty(B)
    for b in range(B):
        idx      = rng.integers(0, n, n)   # n indices drawn with replacement
        draws[b] = stat(x[idx])
    lo, hi = np.quantile(draws, [alpha / 2, 1 - alpha / 2])
    return stat(x), lo, hi

point, lo, hi = bootstrap_ci(data, np.median)
print(f"median  = {point:.3f}  (95% CI: {lo:.3f}, {hi:.3f})")

Example 4: Two-sample Mann-Whitney U test

# Example 4: Two-sample Mann-Whitney U test -- Advanced Applied Statistics for Data Scientists
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
a   = rng.gamma(shape=2.0, scale=1.0, size=120)    # right-skewed
b   = rng.gamma(shape=2.0, scale=1.2, size=140)

u_stat, p = stats.mannwhitneyu(a, b, alternative="two-sided")
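# Recent SciPy versions (1.7+) report U for the first sample (a), so with
# this formula r > 0 means values in b tend to exceed values in a.
effect    = 1 - 2 * u_stat / (len(a) * len(b))     # rank-biserial r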

print(f"medians       : {np.median(a):.2f} vs {np.median(b):.2f}")
print(f"U statistic   : {u_stat:.0f}")
print(f"p-value       : {p:.4f}")
print(f"effect size r : {effect:+.3f}")

Example 5: Chi-squared test of independence

# Example 5: Chi-squared test of independence -- Advanced Applied Statistics for Data Scientists
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

# Observed: plan type vs. churn outcome
observed = pd.DataFrame(
    [[180,  20],
     [220,  80],
     [150, 150]],
    index=["basic", "standard", "premium"],
    columns=["retained", "churned"],
)

chi2, p, dof, expected = chi2_contingency(observed.values)
cramer_v = np.sqrt(chi2 / (observed.values.sum() *
                            (min(observed.shape) - 1)))

print(observed)
print(f"\nchi2 = {chi2:.2f}  dof = {dof}  p = {p:.4g}")
print(f"Cramer's V = {cramer_v:.3f}")
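
A significant chi-squared result says the rows and columns are associated, but not which cells drive the association. One common follow-up, sketched here on the same table as Example 5, is to inspect Pearson residuals: (observed - expected) / sqrt(expected). The variable name `pearson_resid` is just an illustrative choice.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Same counts as Example 5: rows = basic, standard, premium;
# columns = retained, churned.
observed = np.array([[180,  20],
                     [220,  80],
                     [150, 150]])

chi2, p, dof, expected = chi2_contingency(observed)

# Positive residual: more observations than independence predicts.
pearson_resid = (observed - expected) / np.sqrt(expected)
print(np.round(pearson_resid, 2))
```

Cells with large absolute residuals contribute most to the statistic; for this table (where no continuity correction applies) the squared residuals sum to the chi-squared value itself.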