Power of Truly Open Source AI. OLMo 7B. Nomic Embed. HuggingChat Assistants. Eagle 7B. Objective Driven AI. Markov Chains & LMs. Transformer Circuits. Time-LLM. Exphormer. SymbolicAI. MambaTab.
The Power of Truly Open Source AI. The spin doctors of some big closed-AI companies have been busy inflating the “AGI is here soon, AGI will be an existential risk” bubble. But thankfully that bubble is deflating quickly, and even backfiring on them.
In the meantime, the open source AI community keeps stubbornly releasing truly open source, efficient, smallish yet powerful AI models that match or beat the closed AI models from big companies.
The reaction from these big closed AI companies: “Oh! Open source AI models are dangerous, we need to regulate open source AI. And btw: we’re dropping the trousers on pricing for our closed models.” A recent report from Stanford HAI thoroughly debunks the myths about dangerous open source AI and the exaggerations coming from the closed AI companies.
Truly open source AI research and models are the only way forward to advance AI.
A new, truly open source language model. Two days ago, the Allen Institute for AI (AI2) released OLMo 7B, a truly open source SOTA language model trained with Databricks Mosaic Model Training. OLMo is released under the Apache 2.0 license and comes with:
- Full training data used, training code, training logs, and training metrics
- Full model weights and 500+ model checkpoints
- Fine-tuning code and adapted models
Check out the blogpost, repo & tech report here: How to Get Started with OLMo SOTA truly open source LM.
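If you just want to poke at OLMo from Python, the standard transformers pattern should do it. A rough sketch, assuming the allenai/OLMo-7B Hub id from the release and that you follow the repo’s install instructions (the checkpoint shipped with its own integration code at launch):

```python
# pip install transformers  (plus AI2's OLMo integration package, per the repo -- assumption)
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "allenai/OLMo-7B"  # Hub id assumed from the release announcement
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)

inputs = tokenizer("Truly open source language models are", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50, do_sample=True, top_p=0.95)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```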
A new, truly open source text embedding model. Also a few days ago, Nomic AI released Nomic Embed, a truly open source text embedding model that is SOTA on two main benchmarks. Nomic Embed has an 8192-token context length and beats OpenAI’s text-embedding-3-small. The model is released under the Apache 2.0 license and comes with the full training code, training data and model weights. Check out the blogpost, repo and tech report here: Introducing Nomic Embed: A Truly Open Text Embedding Model.
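As a quick taste, Nomic Embed can be driven through sentence-transformers. A minimal sketch, assuming the nomic-ai/nomic-embed-text-v1 checkpoint on the Hub and its task-prefix convention (check the model card for the exact prefixes):

```python
# pip install sentence-transformers
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("nomic-ai/nomic-embed-text-v1", trust_remote_code=True)

# Nomic Embed expects a task prefix on each input, e.g. search_document / search_query
docs = ["search_document: OLMo is a truly open source 7B language model."]
query = ["search_query: which open source LMs were released recently?"]

doc_emb = model.encode(docs, normalize_embeddings=True)
query_emb = model.encode(query, normalize_embeddings=True)
print(doc_emb @ query_emb.T)  # cosine similarity, since embeddings are normalized
```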
Want to learn more about Nomic Embed? Check out this vid from the folks at LangChain: How to build a long context RAG app with OSS components from scratch using Nomic Embed 8k, Mistral-instruct 32k and Ollama.
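For the impatient, the bare bones of such a local RAG setup look roughly like this. A sketch, not the exact pipeline from the video, assuming a running Ollama server with the mistral and nomic-embed-text models pulled, plus langchain-community and chromadb installed:

```python
# pip install langchain langchain-community chromadb
from langchain_community.chat_models import ChatOllama
from langchain_community.embeddings import OllamaEmbeddings
from langchain_community.vectorstores import Chroma
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate

# Tiny toy corpus; in the video this is real document chunks
docs = ["Eagle 7B is an attention-free RWKV-v5 model.",
        "Nomic Embed supports 8192-token inputs."]
vectorstore = Chroma.from_texts(docs, embedding=OllamaEmbeddings(model="nomic-embed-text"))
retriever = vectorstore.as_retriever()

prompt = ChatPromptTemplate.from_template(
    "Answer using this context:\n{context}\n\nQuestion: {question}")
llm = ChatOllama(model="mistral")

question = "Which model is attention-free?"
context = "\n".join(d.page_content for d in retriever.invoke(question))
answer = (prompt | llm | StrOutputParser()).invoke({"context": context, "question": question})
print(answer)
```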
And speaking of text embedding models, Salesforce Research just released the SFR-Embedding-Mistral model, now SOTA on the MTEB benchmark. The model was trained on top of two open source models: E5-mistral-7b-instruct and Mistral-7B-v0.1.
A new, fully open source SOTA multilingual model based on an RNN. Last week, a team of independent researchers backed by Stability AI and EleutherAI released Eagle 7B. The model beats all 7B open source models on the main multilingual benchmarks, and it’s super compute-efficient. The beauty of this model is that it’s an attention-free, linear transformer built on the RWKV-v5 architecture, which is based on an RNN. Check out the blogpost, repo, and demo here: Eagle 7B: Soaring past Transformers with 1 Trillion Tokens Across 100+ Languages (RWKV-v5).
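To make the “attention-free, RNN-style” point concrete: instead of a T×T attention matrix, these models carry a fixed-size state from token to token. Below is a deliberately toy, linear-attention-style recurrence, purely illustrative and not the actual RWKV-v5 code:

```python
# Toy RNN-style token mixing: O(T) time, constant-size state, no attention matrix.
import numpy as np

def recurrent_mix(keys, values, decay=0.9):
    """Process tokens one at a time with a fixed-size running key-value summary."""
    d_k, d_v = keys.shape[1], values.shape[1]
    state = np.zeros((d_k, d_v))                       # replaces the T x T attention matrix
    outputs = []
    for k_t, v_t in zip(keys, values):
        state = decay * state + np.outer(k_t, v_t)     # fold the new token into the state
        outputs.append(k_t @ state)                    # read out with the current key
    return np.stack(outputs)

T, d = 6, 4
out = recurrent_mix(np.random.randn(T, d), np.random.randn(T, d))
print(out.shape)  # (6, 4) -- one output per token
```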
Yesterday, Hugging Face released HuggingChat Assistants (blogpost, demo), a nice alternative to closed-model chat assistants that uses 6 top open source models. It’s still rather basic, but the idea is for the open source community to build out the several powerful features already planned.
This is such a cool open source AI project! ADeus: An Open-Source AI Wearable Device for less than $100 (repo, sw/hw list). It uses Ollama, Supabase and a Coral AI microcontroller (soon to be replaced by a Raspberry Pi Zero). Check out the intro vid:
Have a nice week.
10 Link-o-Troned
- Yann LeCun – Objective Driven AI: The Future of AI (video & slides)
- Markov Chains are the Original Language Models (see the toy sketch after this list)
- From Naive RAG to Advanced Agents
- The Ever-Growing Power of Small Models
- Four Approaches to ML Model Fitting: Gradient Flow
- [now open] AI Grant Batch 3 – Up to $2.5M
- [free e-book] ML for High-Risk Apps (469 pages)
- Hallucinating Law: Disturbing LLM Errors in Legal Tasks
- The Best Solution Write-ups from Kaggle 2023 Winners
- Anthropic – Ideas on Transformer Circuits & ML Interpretability
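Re the Markov chains link above: a word-level bigram “language model” really does fit in a dozen lines. A toy sketch:

```python
# Tiny word-level Markov chain LM: count bigrams, then sample the next word
# from the empirical next-word distribution. Toy corpus, purely illustrative.
import random
from collections import defaultdict

corpus = "the cat sat on the mat and the cat slept on the mat".split()

transitions = defaultdict(list)
for current_word, next_word in zip(corpus, corpus[1:]):
    transitions[current_word].append(next_word)

word = "the"
generated = [word]
for _ in range(8):
    word = random.choice(transitions[word])   # sample next word given the current one
    generated.append(word)
print(" ".join(generated))
```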
the ML Pythonista
- Programming Foundation Models with DSPy Explained
- A Simple Implementation of Mamba Selective State Spaces in PyTorch (see the sketch after this list)
- Phinetuning 2.0: How to Fine-tune Phi-2 with Synth Data & QLoRA
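Re the Mamba link above: the core idea is a state space recurrence whose parameters depend on the input (the “selective” part). Here’s a deliberately simplified, sequential PyTorch sketch of that recurrence, not the fused parallel-scan implementation from the linked repo:

```python
# Bare-bones selective SSM (Mamba-style) recurrence, with a toy discretization.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToySelectiveSSM(nn.Module):
    def __init__(self, d_model: int, d_state: int = 16):
        super().__init__()
        self.A = nn.Parameter(-torch.rand(d_model, d_state))   # negative = decaying state
        self.to_B = nn.Linear(d_model, d_state)                 # input-dependent B_t
        self.to_C = nn.Linear(d_model, d_state)                 # input-dependent C_t
        self.to_dt = nn.Linear(d_model, d_model)                # input-dependent step size

    def forward(self, x):                                       # x: (T, D)
        T, D = x.shape
        h = torch.zeros(D, self.A.shape[1])                     # hidden state per channel
        ys = []
        for t in range(T):
            dt = F.softplus(self.to_dt(x[t]))                   # (D,)
            A_bar = torch.exp(dt[:, None] * self.A)             # (D, N) discretized decay
            B_t, C_t = self.to_B(x[t]), self.to_C(x[t])         # (N,), (N,)
            h = A_bar * h + dt[:, None] * x[t][:, None] * B_t   # selective state update
            ys.append(h @ C_t)                                   # (D,) readout
        return torch.stack(ys)                                   # (T, D)

y = ToySelectiveSSM(d_model=8)(torch.randn(32, 8))
print(y.shape)  # torch.Size([32, 8])
```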
Deep & Other Learning Bits
- DeepMind – Transfer Learning for Text Diffusion Models
- Google Exphormer: Scaling Transformers for Graph-Structured Data
- Google – A Decoder-only Foundation Model for Time-series Forecasting
AI/DL ResearchDocs
- Time-LLM: SOTA Time Series Forecasting by Reprogramming LLMs
- SymbolicAI: Combining Probabilistic Programming and GenAI
- MambaTab: SOTA Model for Tabular Tasks with S-SSM (No Transformers)
MLOps Untangled
- MLOps: From Jupyter to Prod. (blog, vid, repo)
- MLOps at The Crossroads and New Tools
- Auto Signature Recognition MLOps Pipeline on AWS at CapGemini
data v-i-s-i-o-n-s
- Friends Don’t Let Friends Make Bad Graphs
- Dep Tree – Visualise the Entropy of Your Code Base in 3D
- [free e-book] Handbook of Graphs and Networks in People Analytics
AI startups -> radar
- Nabla – An AI-Copilot for Doctors
- Kode – A No-code Platform for AI Enterprise Apps
- Cohere Health – AI for Automating Health Plan Authorisations
ML Datasets & Stuff
- AutoMathText Dataset – 200 GB of Mathematical Texts
- OpenHermes-2.5 – 1 Million Chat Conversations (see the loading sketch after this list)
- Dolma Dataset – 3 trillion tokens from web, academic pubs, code, books
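A quick way to peek at these is the Hugging Face datasets library. A sketch, with the Hub repo ids below assumed from the release announcements; some of these need a config name and/or license acceptance, so check each dataset card:

```python
# pip install datasets
from datasets import load_dataset

# OpenHermes-2.5: ~1M chat samples, small enough to load normally
hermes = load_dataset("teknium/OpenHermes-2.5", split="train")
print(hermes[0])

# Dolma is ~3T tokens, so stream it instead of downloading everything
# (you may need to pass a config/version name -- see the dataset card)
dolma = load_dataset("allenai/dolma", split="train", streaming=True)
print(next(iter(dolma)))
```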
Postscript, etc
Keep up with the very latest in AI / Machine Learning research, projects & repos. A weekly digest packed with AI / ML insights & updates that you won’t find elsewhere.
Submit your suggestions, feedback, posts and links to: