Hugging Face Open-Model Pulse

Executed Notebook

This notebook asks what a dated Hugging Face public snapshot can and cannot say about open-model attention. A single snapshot supports a current adoption-proxy table; repeated snapshots are required before discussing momentum, acceleration, or retention.

The main outputs are a source card, a snapshot ranking, and, when enough dated snapshots exist, a decomposition of repeated download observations.

In [1]
from pathlib import Path

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from IPython.display import display

# Only the helpers this notebook actually calls are imported.
from examples.hot_trends.data import append_real_snapshot, fetch_huggingface_models
from examples.hot_trends.decomposition import (
    component_summary,
    decompose_table,
    editorial_priority,
    residual_event_table,
)
from examples.hot_trends.scoring import article_publication_phrasing

pd.set_option("display.max_columns", 80)
pd.set_option("display.max_rows", 80)
plt.rcParams.update({"axes.grid": True})

CACHE_DIR = Path("examples/hot_trends/cache")
OUTPUT_DIR = Path("examples/hot_trends/outputs")
CACHE_DIR.mkdir(parents=True, exist_ok=True)
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)

def save_table(df, name):
    """Write df to OUTPUT_DIR as <name>.csv and print the saved path."""
    path = OUTPUT_DIR / f"{name}.csv"
    df.to_csv(path, index=False)
    print(f"saved: {path.as_posix()}")

1. Fetch a model snapshot

In [2]
# Query parameters are named once so the recorded endpoint and the fetch call stay in sync.
HF_LIMIT = 50
HF_SORT = "downloads"
HF_DIRECTION = -1  # -1 requests descending order
hf_endpoint = f"https://huggingface.co/api/models?limit={HF_LIMIT}&sort={HF_SORT}&direction={HF_DIRECTION}"
snapshot = fetch_huggingface_models(limit=HF_LIMIT, sort=HF_SORT, direction=HF_DIRECTION)
snapshot.head(20)
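
The internals of fetch_huggingface_models are not shown in this notebook. As a rough sketch of what a dated snapshot fetch involves, the block below builds the same endpoint string with encoded query parameters and flattens a decoded JSON list into rows stamped with a snapshot date. The helper names build_models_url and normalize_models are hypothetical, the payload is made up, and the response field names (id, modelId, pipeline_tag, downloads, likes, lastModified) are assumptions about the Hub response rather than something the notebook confirms. No network call is made.

```python
from datetime import date
from urllib.parse import urlencode


# Hypothetical helpers for illustration; not the notebook's fetch_huggingface_models.
def build_models_url(limit, sort, direction):
    """Build the model-listing URL with encoded query parameters."""
    query = urlencode({"limit": limit, "sort": sort, "direction": direction})
    return f"https://huggingface.co/api/models?{query}"


def normalize_models(payload, snapshot_date):
    """Flatten a decoded JSON list into rows stamped with the snapshot date."""
    rows = []
    for item in payload:
        rows.append({
            "snapshot_date": snapshot_date,
            # Assumed field names; missing fields become None instead of raising.
            "model_id": item.get("id") or item.get("modelId"),
            "pipeline_tag": item.get("pipeline_tag"),
            "downloads": item.get("downloads"),
            "likes": item.get("likes"),
            "last_modified": item.get("lastModified"),
        })
    return rows


# Made-up payload standing in for a decoded API response.
sample_payload = [
    {"id": "org/model-a", "downloads": 1200, "likes": 30, "pipeline_tag": "text-generation"},
    {"id": "org/model-b", "likes": 4},  # downloads absent from this record
]

url = build_models_url(50, "downloads", -1)
rows = normalize_models(sample_payload, date(2024, 1, 1).isoformat())
print(url)
print(rows[1])
```

Stamping every row with the access date at fetch time is what later allows rows from different runs to be compared; a response without a date can only ever feed a cross-sectional table.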

2. Source card and snapshot audit

In [3]
# The notebook treats the first row's snapshot_date as the date of the whole response.
snapshot_date = snapshot["snapshot_date"].iloc[0]
source_card = pd.DataFrame([{
    "source": "Hugging Face Hub API",
    "endpoint": hf_endpoint,
    "access_date": snapshot_date,
    "query_params": f"limit={HF_LIMIT}; sort={HF_SORT}; direction={HF_DIRECTION}",
    "time_range": f"snapshot_date={snapshot_date}",
    "cache_path": (CACHE_DIR / "hf_model_snapshot_log.csv").as_posix(),
    "metric_semantics": "downloads and likes are public Hub metadata fields in a dated API response",
    "interpretation_scope": "single snapshot = current public adoption proxy; repeated snapshots required for momentum",
}])
# The audit row counts how many models and non-null metrics the response contained.
snapshot_audit = pd.DataFrame([{
    "snapshot_date": snapshot_date,
    "models": int(len(snapshot)),
    "non_null_downloads": int(snapshot["downloads"].notna().sum()),
    "non_null_likes": int(snapshot["likes"].notna().sum()),
    "source": "Hugging Face Hub API",
    "endpoint": hf_endpoint,
    "query_params": source_card.loc[0, "query_params"],
    "interpretation_scope": source_card.loc[0, "interpretation_scope"],
}])
display(source_card)
snapshot_audit

3. Append snapshot to a dated local log

Each log row records one model as observed in one dated Hugging Face API snapshot. The notebook deduplicates same-day (snapshot_date, model_id) rows before writing the log so that repeated runs on the same date do not create false momentum.
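
To make the deduplication rule concrete, here is a minimal sketch with a toy log and a same-day rerun. The values are made up and pd.concat stands in for append_real_snapshot, whose internals are not shown here. After deduplicating on (snapshot_date, model_id), the rerun leaves the row count unchanged, so each model still has exactly one observation for that date.

```python
import pandas as pd

# Toy rows with made-up values; column names follow the notebook's log.
existing_log = pd.DataFrame({
    "snapshot_date": ["2024-01-01", "2024-01-01"],
    "model_id": ["org/model-a", "org/model-b"],
    "downloads": [100, 50],
})
rerun_same_day = pd.DataFrame({
    "snapshot_date": ["2024-01-01", "2024-01-01"],
    "model_id": ["org/model-a", "org/model-b"],
    "downloads": [105, 50],
})

# pd.concat is a stand-in for the append step; newer rows come last.
combined = pd.concat([existing_log, rerun_same_day], ignore_index=True)

# keep="last" retains the most recently appended row for each key.
deduped = (
    combined
    .drop_duplicates(["snapshot_date", "model_id"], keep="last")
    .sort_values(["snapshot_date", "model_id"])
    .reset_index(drop=True)
)

print(len(combined), len(deduped))
print(deduped)
```

Without the deduplication step, the two runs would look like two observations of the same model, and any later per-model row count would overstate how many dates had been collected.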

In [4]
log_path = CACHE_DIR / "hf_model_snapshot_log.csv"
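# Within this run, keep one row per (snapshot_date, model_id), preferring the latest last_modified.
snapshot_for_log = snapshot.sort_values("last_modified").drop_duplicates(["snapshot_date", "model_id"], keep="last")
log = append_real_snapshot(snapshot_for_log, log_path)
# Deduplicate again across runs so a same-day rerun does not add rows to the log.
log = log.drop_duplicates(["snapshot_date", "model_id"], keep="last").sort_values(["snapshot_date", "model_id"])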
log.to_csv(log_path, index=False)
log.tail(20)

4. Convert repeated snapshots to a time series if enough data exists

In [5]
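# Parse dates and coerce downloads to numeric; values that fail to parse become NaN and are dropped.
log["snapshot_date"] = pd.to_datetime(log["snapshot_date"])
log["downloads"] = pd.to_numeric(log["downloads"], errors="coerce")
series_log = log.dropna(subset=["model_id", "downloads"]).sort_values(["model_id", "snapshot_date"])
# A model is ready for decomposition once it has at least 4 distinct snapshot dates.
ready_models = series_log.groupby("model_id")["snapshot_date"].nunique().loc[lambda s: s >= 4].index.tolist()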
ready_models[:10], len(ready_models)
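
The readiness filter above counts distinct dates, not rows. A small sketch on a made-up log shows why that matters: a model with three rows but only two distinct dates stays below the threshold. The constant name MIN_SNAPSHOT_DATES is introduced here for illustration only; the notebook uses the literal 4.

```python
import pandas as pd

MIN_SNAPSHOT_DATES = 4  # illustrative name for the notebook's threshold of 4

# Made-up log: model-a has four distinct dates, model-b has three rows on two dates.
toy_log = pd.DataFrame({
    "snapshot_date": pd.to_datetime([
        "2024-01-01", "2024-01-02", "2024-01-03", "2024-01-04",
        "2024-01-01", "2024-01-02", "2024-01-02",
    ]),
    "model_id": ["org/model-a"] * 4 + ["org/model-b"] * 3,
    "downloads": [100, 110, 125, 150, 40, 41, 41],
})

# nunique counts distinct dates per model, so duplicate same-day rows do not inflate depth.
date_counts = toy_log.groupby("model_id")["snapshot_date"].nunique()
ready = date_counts[date_counts >= MIN_SNAPSHOT_DATES].index.tolist()

print(date_counts.to_dict())
print(ready)
```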

5. Decompose only after repeated snapshots exist

The decomposition and its chart support a momentum reading only after the same API query has been collected on at least four distinct dates per model. Until then, the notebook publishes the current snapshot table and the collection-depth chart.
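
The behavior of decompose_table with method="MA_BASELINE" is not shown in this notebook. As a rough illustration of the general idea only, the sketch below applies a centered moving average in log1p space and takes the residual as observed minus trend. The function names, the edge handling, and the counts are assumptions made for this example and may differ from the library's method.

```python
import math


def centered_moving_average(values, window):
    """Centered moving average for an odd window; edges without a full window are None."""
    half = window // 2
    out = []
    for i in range(len(values)):
        if i - half < 0 or i + half >= len(values):
            out.append(None)  # not enough neighbours on one side
        else:
            chunk = values[i - half : i + half + 1]
            out.append(sum(chunk) / len(chunk))
    return out


def decompose_log1p(counts, window=3):
    """Split log1p(counts) into a moving-average trend and a residual."""
    observed = [math.log1p(c) for c in counts]
    trend = centered_moving_average(observed, window)
    residual = [None if t is None else o - t for o, t in zip(observed, trend)]
    return observed, trend, residual


# Made-up download counts on five snapshot dates, with one spike in the middle.
counts = [100, 110, 400, 120, 130]
observed, trend, residual = decompose_log1p(counts)

for c, t, r in zip(counts, trend, residual):
    print(c, None if t is None else round(t, 3), None if r is None else round(r, 3))
```

With only five points and a window of three, two of the five positions have no trend estimate at all, which is one reason a minimum number of snapshot dates is required before any residual is worth reading.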

In [6]
if ready_models:
    # Rename to the generic column names passed to the decomposition helpers.
    decomp_input = (
        series_log[series_log["model_id"].isin(ready_models)]
        .rename(columns={"snapshot_date": "date", "model_id": "series", "downloads": "count"})
        .dropna(subset=["count"])
    )
    components = decompose_table(
        decomp_input, entity_col="series", time_col="date", value_col="count",
        method="MA_BASELINE", period=7, trend_window=3, transform="log1p",
    )
    summary = editorial_priority(component_summary(components, entity_col="series", time_col="date"), entity_col="series")
    events = residual_event_table(components, entity_col="series", time_col="date", top_n=20, trim_edges=1)
else:
    # Not enough history yet: publish a status row instead of components.
    components = pd.DataFrame()
    summary = pd.DataFrame([{"status": "not_enough_snapshots", "required": "collect at least 4 snapshot dates per model before decomposition"}])
    events = pd.DataFrame()
summary

6. Snapshot ranking for immediate publication

This table is cross-sectional. The columns to read are downloads and likes as reported in the selected API response; do not read the ranking as momentum or model quality.
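
A toy comparison shows why a level ranking is not a momentum ranking. The model names and counts below are made up: the model with the most downloads on the latest date is not the one that grew fastest between the two dates, and a single snapshot cannot reveal the second ordering at all.

```python
# Made-up download counts on two snapshot dates for two hypothetical models.
history = {
    "org/model-a": {"2024-01-01": 1000, "2024-01-08": 1010},
    "org/model-b": {"2024-01-01": 200, "2024-01-08": 400},
}
earlier, latest = "2024-01-01", "2024-01-08"

# Cross-sectional view: order by the latest level only.
rank_by_level = sorted(history, key=lambda m: history[m][latest], reverse=True)

# Momentum view: order by the ratio between the two dates (needs both snapshots).
rank_by_growth = sorted(
    history,
    key=lambda m: history[m][latest] / history[m][earlier],
    reverse=True,
)

print("by level: ", rank_by_level)
print("by growth:", rank_by_growth)
```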

In [7]
snapshot_rank = snapshot.sort_values(["downloads", "likes"], ascending=False, na_position="last").head(25)
snapshot_rank[["model_id", "pipeline_tag", "downloads", "likes", "last_modified", "source"]]

Visualization: Hugging Face snapshot status

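The figure depends on collection depth. Once decomposition has run, each row shows one of the first four models in the priority summary: observed values and trend on the left, residuals on the right (red for non-negative, blue for negative). Before that, the left panel reports snapshot depth by model, with a dashed line at the four-date threshold required before decomposition, and the right panel reports current downloads from one dated snapshot, which is useful for a source table but not for retention claims.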

In [8]
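# Decomposition available: one row per model, observed and trend on the left, residuals on the right.
if not components.empty and "series" in summary.columns: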
    top_models = summary["series"].head(4).tolist()
    fig, axes = plt.subplots(len(top_models), 2, figsize=(11, max(3.0, 2.4 * len(top_models))), squeeze=False)
    for row, model_id in enumerate(top_models):
        panel = components.loc[components["series"].eq(model_id)].sort_values("date").copy()
        panel["date"] = pd.to_datetime(panel["date"])
        axes[row, 0].plot(panel["date"], panel["observed"], label="observed", linewidth=1.6)
        axes[row, 0].plot(panel["date"], panel["trend"], label="trend", linewidth=1.8)
        axes[row, 0].set_title(model_id)
        axes[row, 1].bar(panel["date"], panel["residual"], color=np.where(panel["residual"] >= 0, "tab:red", "tab:blue"))
        axes[row, 1].set_title("residual")
    axes[0, 0].legend(loc="best")
else:
    # No decomposition yet: show collection depth and the current download ranking instead.
    snapshot_depth = series_log.groupby("model_id")["snapshot_date"].nunique().sort_values(ascending=False).head(20)
    top_downloads = snapshot_rank.dropna(subset=["downloads"]).head(15).sort_values("downloads")
    fig, axes = plt.subplots(1, 2, figsize=(12, 5))
    snapshot_depth.sort_values().plot(kind="barh", ax=axes[0], color="tab:blue", title="Distinct snapshot dates per model")
    axes[0].axvline(4, color="tab:red", linestyle="--", linewidth=1.0, label="decomposition threshold")
    axes[0].legend(loc="lower right")
    top_downloads.plot(kind="barh", x="model_id", y="downloads", ax=axes[1], color="tab:green", legend=False, title="Top current downloads")
    axes[1].set_ylabel("")
plt.tight_layout()
plt.show()

7. Publication language

In [9]
phrasing = article_publication_phrasing()
phrasing

In [10]
save_table(source_card, "03_hf_source_card")
save_table(snapshot_audit, "03_hf_snapshot_audit")
save_table(snapshot_rank, "03_hf_snapshot_rank")
save_table(summary, "03_hf_decomposition_or_collection_status")
if not events.empty:
    save_table(events, "03_hf_residual_events")
save_table(phrasing, "03_hf_publication_phrasing")