Lone's notes



The ability to make analogies indicates intelligence. If it helps you understand things, don't be afraid of making too many analogies.

Mar 02, 2025 · 1 min read

    • thought



