WaveStackLLM - Initial Findings
I’m working on WaveStackLLM, a novel multi-lane architecture for large language models.
https://github.com/jefflinwood/wavestack-hybrid
The code, the design, the math behind it, my experiment log, and the current findings are all in the GitHub repository.
This is an exploratory LLM project that asks whether explicitly modeling language or code structure with defined mathematical representations can produce useful results. The hypothesis is that this could be more efficient than a completely implicit model.
WaveStackLLM Design
The basic idea is that we can use multiple lanes to process input data. Each lane applies a defined mathematical technique - right now I’m using:
- Chebyshev Polynomials
- Wavelets
- Fourier Analysis
I have some hypotheses to explore about why each of these techniques could be important for generating English text and computer code.
There are probably more mathematical techniques that could be used here as well. The important thing is that all three contribute to the end result rather than duplicating one another - I verified this by running lane-ablation experiments.
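To make the three lane techniques concrete, here is a minimal sketch of what each transform looks like on a 1-D signal. The function names, shapes, and the choice of a single-level Haar wavelet are illustrative assumptions, not the repository's actual implementation:

```python
import numpy as np

def chebyshev_features(x, order=4):
    # Chebyshev recurrence: T0 = 1, T1 = x, T_{k+1} = 2x*T_k - T_{k-1}
    # assumes x has been scaled into [-1, 1], the polynomials' natural domain
    T = [np.ones_like(x), x]
    for _ in range(2, order + 1):
        T.append(2 * x * T[-1] - T[-2])
    return np.stack(T, axis=-1)  # shape (..., order + 1)

def haar_wavelet_features(x):
    # one level of the Haar wavelet transform along the last axis
    # (assumes an even length): pairwise averages and differences
    even, odd = x[..., ::2], x[..., 1::2]
    return np.concatenate([even + odd, even - odd], axis=-1) / np.sqrt(2)

def fourier_features(x):
    # magnitudes of the real FFT along the last axis
    return np.abs(np.fft.rfft(x, axis=-1))
```

Each transform exposes a different kind of structure: Chebyshev polynomials capture smooth global trends, wavelets localize abrupt changes, and Fourier analysis captures periodicity - which is the intuition behind keeping all three lanes.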
The lane outputs are then combined in a lane mixer using a gating strategy. This differs from a "mixture-of-experts" router: every lane produces a representation, and the mixer combines them all rather than routing each token to a single expert.
Another key area of exploration is composability with the WaveStack model. If additional lanes can be added or removed at runtime, there may be no need to fine-tune the base model directly.
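One way to picture runtime composability is a registry of lane callables that can be attached or detached without touching the base model. This is a hypothetical sketch of the idea, not the repository's API:

```python
# Hypothetical lane registry: lanes are plain callables keyed by name,
# so they can be added or removed at runtime
lanes = {}

def add_lane(name, fn):
    lanes[name] = fn

def remove_lane(name):
    lanes.pop(name, None)

def run_lanes(x):
    # each currently active lane produces its own representation of the input
    return {name: fn(x) for name, fn in lanes.items()}
```

A soft mixer pairs naturally with this: because it renormalizes gate weights over whichever lanes are present, dropping or adding a lane changes the mixture without retraining the others.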
Current Findings
The result I’m most impressed with is that the WaveStackLLM model outperforms the PyTorch transformer baseline on the TinyStories dataset at the same parameter count. On the WikiText-2 dataset, WaveStackLLM is within 10% of the transformer.
With WaveStackLLM, many knobs remain to be tuned and adjusted, so I’m confident that performance can continue to improve.
Another key finding, although still preliminary, is that WaveStackLLM scales linearly with sequence length, while the transformer model scales super-linearly. This holds in the experiments I’ve run so far with a small model, at sequence lengths up to 4096.
This matches a hypothesis I had that WaveStackLLM could be more memory-efficient than transformers for large context sizes.
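The asymptotic intuition behind that hypothesis can be sketched with rough operation counts. This assumes the most expensive lane operation is FFT-like (roughly n log n), compared against dense self-attention's n-squared score matrix; the constants are placeholders, not measurements:

```python
import math

def attention_cost(n, d):
    # dense self-attention materializes an n x n score matrix: ~n^2 * d work
    return n * n * d

def fft_lane_cost(n, d):
    # an FFT-based lane touches each position ~log2(n) times: ~n*log2(n)*d work
    return n * math.log2(n) * d

# the cost ratio grows with sequence length n: n / log2(n)
for n in (512, 1024, 2048, 4096):
    print(n, attention_cost(n, 64) / fft_lane_cost(n, 64))
```

At n = 4096 the ratio is 4096 / 12, about 341x, which is why the gap should widen further at the large context sizes this hypothesis targets.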
Conclusion
WaveStackLLM is a research project - large language models are a fascinating field, and I’m curious about where they might go in the future.
The multi-lane approach might turn out to be a dead end in terms of practicality or applicability, especially once I try scaling it to larger parameter counts or larger training datasets. That’s OK! I’m learning a lot by exploring all of this.
It might also turn out that combining this approach with attention creates the best solution, either as an additional lane (or lanes) or at the mixer level - this would be another area to explore.