AbstractPhila
======================================================================
COMPLETE
======================================================================
Best val acc: 93.8%
Time: 979s (8.2s/epoch)
Conv: 4,251,200  Cells: 366,176  Head: 167,946  Total: 4,785,322
Comparison:
- SpectralCell standalone (D=16 V=16 h=256 +conv +aug): 79.1%  926K params   1.2 s/ep
- ConduitBattery backbone (GPT trainer, ep 55/120):     88.7%  ~2M params    ? s/ep
- Conv + SpectralCell inline:                           93.8%  4,785,322     8.2 s/ep
By default, transfer learning from these batteries is not going to be as effective as, say, raw pixel transfer.
However, a pure-noise model can achieve nearly 72% accuracy on CIFAR-100 using just Freckles-256 (256 patches), trained purely on noise with CrossEntropy, Conv, and direct bottleneck ingestion - BEFORE the conduit-svd was introduced.
With conduit-svd the transfer-potential of the transformer will expand this behavior exponentially with QKV, treating the QKV as a uniquely differentiable format - specifically aligned to the geometric battery-state itself.
This is only possible due to the increased accuracy from the geolip.linalg.eigh structure and speed of the geolip.linalg.svd.
Without them, degenerate eigh and SVD states cannot form, and the full structural awareness will never coalesce internally; without enough degenerate EIGH and SVD, the structural basin for the miniature patchwork accuracy will never coalesce into opinions.
Odd, I know, but it's required. Degenerate SVDs create a void response that is highly difficult to measure - one I at first tried to patch out, until direct analysis showed the CM is definitely preserving the structure, just in an unexpected series of ways. Near-degenerate and degenerate states are a predominant structural learning signal, so when a huge influx of these structural boundaries forms into a utilizable shape, the resulting structure behaves in a uniformly geometric format that can be analyzed.
I didn't expect it either.
By clamping the CM above near-degenerate to guarantee non-degenerate volumes, the structure shows that the volumes aren't in fact there most of the time. It's predominantly directions, and almost all magnitude is devoid.
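To make the clamping idea concrete, here is a minimal numpy sketch (not the geolip implementation; the `eps` floor is an illustrative choice): clamp singular values above a near-degenerate floor, and report which directions had essentially no magnitude to begin with.

```python
import numpy as np

def clamp_degenerate(M, eps=1e-6):
    """Clamp singular values above a floor so no direction fully collapses.

    U and Vt carry pure directions; s carries all the magnitude, so the
    mask reports which directions are (near-)degenerate.
    """
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    near_degenerate = s < eps          # directions whose magnitude has vanished
    s_clamped = np.maximum(s, eps)     # guarantee non-degenerate "volumes"
    M_clamped = (U * s_clamped) @ Vt   # reconstruct with clamped spectrum
    return M_clamped, near_degenerate

# A rank-1 matrix: almost all magnitude lives in one direction.
M = np.outer(np.arange(1, 4, dtype=float), np.arange(1, 4, dtype=float))
M_c, degen = clamp_degenerate(M)
print(degen.sum())  # 2: two of the three singular values are near zero
```

On a rank-1 input like this, the mask shows exactly the situation described above: the directions are there, but almost all the magnitude is devoid.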
I've spent multiple days preparing the substructure, scaling, testing, and expanding the system. The conduit is meant to reorganize data. Just like the SVAE prototypes, they are meant to sort and organize, not compress and compact.
The organization is nearly ready. The resulting structure will produce projection-capable, geometrically aligned memory, compacted and transformed into a utilizable token set. The remaining structural components are specifically SVD-related utilities, and each of those utilizes the variant nature of how difficult, how dispersed, and so on, each component is as it's learned over time.
The SVAE components were perfect for testing this playground. They appear larger when analyzed; however, their representations are meant to represent huge vocabularies. A 16x16 patch expanded upward to 768 is meant to encapsulate the behavior of near-pi upscaled, condensed into a considerably simpler, smaller form.
This model is behaving perfectly. It does not encode in the traditional sense; it analyzes and produces geometric opinions throughout its structure. Each of them proved, one after the other, that the model could not only learn, but perfectly reconstruct, and with that produce utility-driven expansion capacity directly.
- Fresnel -> effective image-analysis battery.
- Johanna -> effective noise analysis.
- Grandmaster -> Johanna finetuned with sigma restoration using Fresnel's opinions.
- Freckles -> massive noise-analysis array (4096 to 16k tokens).
Geometric batteries.
Cayley rotation is meant to encapsulate that potential and expand it, allowing further differentiation down the chain of model structural behavioral events.
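For readers unfamiliar with the term, the classic Cayley transform maps a skew-symmetric matrix to a rotation, which keeps the learnable parameter unconstrained while the output stays orthogonal. A minimal numpy sketch of that standard construction (not the actual conduit implementation):

```python
import numpy as np

def cayley_rotation(A):
    """Map a skew-symmetric matrix A (A = -A.T) to an orthogonal matrix.

    Q = (I - A) @ inv(I + A). I + A is always invertible for skew-symmetric
    A (its eigenvalues are 1 + i*lambda), so gradients can flow through an
    unconstrained parameterization while Q stays a rotation.
    """
    n = A.shape[0]
    I = np.eye(n)
    return (I - A) @ np.linalg.inv(I + A)

# Any square matrix can be skew-symmetrized, so A is freely learnable.
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 4))
A = 0.5 * (X - X.T)                      # skew-symmetric part
Q = cayley_rotation(A)
print(np.allclose(Q @ Q.T, np.eye(4)))   # True: Q is orthogonal
```

Because the map is differentiable in A, rotations built this way can be chained and trained end to end, which is presumably what "further differentiation down the chain" relies on.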
Suffice it to say, this is the geometric transformer's evolved state. These will exist as conduits throughout the models, the expanded behavioral attenuation units meant to provide geometric analysis internally within models for data-oriented CV alignment.
I see a potential answer to the over-compaction SVD problem, one that enables Omega self-solving tokenization while leaving the system rigid and difficult to learn from or directly manipulate. Everything short of reconstituting SVD in its entirety falls short of the correct measure. SVD is one of the most powerful compressors not because it compresses, but because it's a series of mathematical conduits meant to streamline mathematics into a point of high-complexity solution.
Each tool is a nearly perfect compaction system, and they all form utility-driven token structures that, when fully extracted, will produce an immense amount of information. They are fully functional self-solving Omega structures - however, bigger in this case isn't better. We want SMALLER. Bigger makes the problem larger rather than adding additional resolution, which is what the models need: additional resolution at higher accuracy.
The core, absolute problem: the bottleneck cannot be processed by any other device. The formula is far too rigid, the structure deviant, and the internals unique with MLP-based structures. As aligned as they are, without a conduit for capture it's just a closed circuit. I must open this circuit.
These formulas are NOT fully deterministic in computation. There is error, and it has everything to do with rounding error, iterative decomposition, and curative rejuvenation. Every single one of these processes can be analyzed, learned from, and solved for - but they cannot be replaced directly to achieve SVD's outcome.
This formula originated in an era before computers. The men and women of the time NEEDED legitimate mechanistic and autonomous calculation mechanisms to solve large-scale problems and produce utility-based outcomes within reasonable amounts of time. Otherwise they could spend potentially years calculating ONE thing, instead of a day or two plugging away at their special autosolver function that lets the mathematician debug the problem autonomously.
SVD is to be deconstructed and a new theorem created in its stead, using every formal mathematical principle required to reconstruct SVD while introducing direct shunts of deviation for learning-offset adjudication.
I see traces of SVD everywhere in AI, and the more I learn about SVD the more I see them. The people who forged AI KNEW of the power of SVD. There is no doubt in my mind that I'm on a paved road, and yet I see... a path that I believe few have dared to take, and I believe with Claude, GPT, Grok, and Gemini we CAN solve this problem. It must be organized, adjudicated, understood, and solved accordingly. I believe this is a potential route and I will directly explore it. This is no longer a problem of choosing a formula; this is now a problem of how we use it to encode and decode information.
An MSE of 0.0000004 on ImageNet with SVAE-Fresnel-64-t256 says it is more than worth the time. I CAN'T keep training Omega models that simply crush or replace information; I must discover how to control the solver instead of just letting it operate.
I have many theories. My first target: framed adjudicated trajectory resonance flow alignment prediction.
Put simply: how well the structured flow resonates with the structural resonance surrounding it. This will require SVD information.
I will publish my findings based on heavy sweep-driven notebook experimentation.
Here are the first three surge trained experts. They should encompass almost any need if used correctly.
The image line.
The specific image-trained SVAE structures are dubbed:
- SVAE-Fresnel
tiny - 64x64
small - 128x128
base - 256x256 <- cooking current MSE=0.000181 -> Operating CV: 0.3769
large - 512x512 <- upcoming
xl - 1024x1024 <- upcoming
xxl - 2048x2048 <- upcoming
giant - 4096x4096 <- upcoming
The initial Fresnel shows the model can reconstruct images far out of scope at entirely different sizes; entirely unseen images can be fully reconstructed within the same MSE spectrum as the trained images.
Tests show:
- The Fresnel models can piecemeal images back together at higher accuracy and a lower error rate than running the full model. Tested up to 1024x1024 with near-perfect reconstruction: 0.0000029 MSE.
- Fresnel CANNOT reconstruct noise directly: ~1.0 MSE.
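The piecemeal approach is simple to sketch: tile a large image into model-sized patches, round-trip each patch through the autoencoder, stitch the outputs back, and measure MSE. The Fresnel weights aren't shown here, so the sketch below stands in an identity function for the trained model; `reconstruct_piecemeal` and `patch=64` are illustrative names and sizes, not the actual pipeline.

```python
import numpy as np

def reconstruct_piecemeal(image, model_fn, patch=64):
    """Tile `image` into patch x patch tiles, run each through the
    autoencoder round-trip `model_fn`, and stitch the results back.
    Assumes image dimensions are multiples of `patch`."""
    H, W, C = image.shape
    out = np.empty_like(image)
    for y in range(0, H, patch):
        for x in range(0, W, patch):
            tile = image[y:y+patch, x:x+patch]
            out[y:y+patch, x:x+patch] = model_fn(tile)
    return out

def mse(a, b):
    return float(np.mean((a - b) ** 2))

# Identity placeholder standing in for a trained SVAE-Fresnel round-trip.
identity_model = lambda tile: tile

img = np.random.default_rng(0).random((1024, 1024, 3)).astype(np.float32)
recon = reconstruct_piecemeal(img, identity_model, patch=64)
print(mse(img, recon))  # 0.0 for the identity placeholder
```

With a real model in place of the identity, the per-patch MSEs can also be inspected individually, which is how the piecemeal-vs-full-model comparison above would be made.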
The 256x256 variant is cooking right now. The MSE is dropping rapidly, and it's nearly as accurate as the 128x128 counterpart with only partial cooking.
The noise line. The specific noise-trained SVAE structures:
- SVAE-Johanna
This model is capable of learning and reconstructing noise, and it will train a noise compressor that can deconstruct/reconstruct any noise automatically.
tiny - 64x64 <- first train faulted; tried 16 types of noise out of the gate, going to restart with curriculum training.
small - 128x128 <- gaussian prototype ready = 0.012 MSE <- back in the oven, 16-spectrum noise
small - 128x128 - 16 noise <- MSE=0.053170 CV=0.4450 -> learning 16 noise types
base - 256x256 <- upcoming
large - 512x512 <- upcoming
xl - 1024x1024 <- upcoming, POSSIBLE if large works
Johanna is being trained on 12 types of noise. The MSE is dropping as expected, and the noises are in fact being learned and represented for replication.
The text line is exactly the same as the others.
-SVAE-Alexandria
Alexandria is meant to encode/decode text in a perfect or near-perfect reconstruction capacity.
AbstractPhil/geolip-SVAE
Epoch 1: test recon error 0.0064
Epoch 2: 0.0022
Epoch 8: 0.000294
Epoch 12: 0.000206
Epoch 14: 0.000190
Epoch 18: 0.000187
Epoch 24: 0.000117
Epoch 30 (landmark): 0.000099
There are NO EXPERTS HERE. This is pure self-learning. The model learns the entire behavioral set within 1 epoch, reconstructing ImageNet's test set to a useful state. By epoch 12 a recon error of 0.000202 is measured. This means 99.99% accuracy at RECONSTRUCTING the test set through the bottleneck, while simultaneously leaving a trail of centerwise extraction as rich or richer.
ONE epoch. Just one.
It took about 10 minutes to train an already-converged epoch, and I set it up for 200 epochs. This model will not need 200 epochs. I'd be surprised if it needs 3.
What you're looking at here is the emergence of surge resonance: the power of a single epoch when the geometric CV alignment hits the tuning fork of absolute resonant perfection, counterpointed with the concerto's dissonant harmonic response.
I give you, surge resonance.
The metrics will be ready by morning and I'll begin building utilities to figure out what went right and what went wrong.
This model is rewarded when it stays within the geometric spectrum and doubly punished when it leaves. There is no benefit to straying, and the benefit of staying keeps the model within the validated CV band.
This allows the model to exist perfectly within the tuning fork resonance structure.
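The "rewarded inside, doubly punished outside" scheme can be sketched as a piecewise band penalty. The band edges and weight below are illustrative placeholders, not the trained values:

```python
import numpy as np

def cv_band_penalty(cv, lo=0.291, hi=0.292, weight=2.0):
    """Zero penalty inside the validated CV band; a weighted quadratic
    penalty once the measured CV leaves it ("dual punished when leaving").
    `lo`, `hi`, and `weight` are illustrative, not the actual settings."""
    below = np.clip(lo - cv, 0.0, None)   # distance below the band
    above = np.clip(cv - hi, 0.0, None)   # distance above the band
    return weight * (below ** 2 + above ** 2)

print(cv_band_penalty(0.2915))       # 0.0: inside the band, no penalty
print(cv_band_penalty(0.30) > 0)     # True: drifting above costs loss
```

Inside the band the gradient is exactly zero, so the model is free to refine; outside it, the quadratic walls pull it back toward the tuning-fork region.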
The model CONTINUES to refine even when the CV has begun to drift away from home. The model has left home and is now seeking new proximity.
Upcoming training will be the 256x256, 512x512, 1024x1024, and larger if the model holds. Each will be named.
I see the answer. The behavioral sweep shows a CV of 0.29154; values between 0.291 and 0.292 fall within a very special band of variations.
1024v, 24d - the entire operating spectrum of the T5-series embeddings when alignment is differentiated by the configuration. This is effectively a threshold of what works operationally; going beyond it causes degraded behavioral response without attenuated compensation.
So I've finally managed to ask the right questions to discover the connection between the fly in the ointment that kept returning and the structural systems responsible for curating the behavior around it.
Finding: geometrically controlled structures do not require a CV loss if D is within the expected band. To compensate for the dimensional difference with the measured CV, the CV loss must be adjusted to the distillation target.
The vocabulary, once established as geometrically valid, remains so throughout its lifecycle. The CV loss is only attuned and useful when running distillation paradigms; the current CV loss has no impact on the measured CV capacity of the embeddings, consistent or pretrained.
This effectively allows compartmentalization to any vectorized locality as accumulated throughout a structure.
I'll make this brief and to the point.
GEOLIP is an observer system at its core. It watches, triangulates, and assists with correct answers.
Many experiments worked very well; many fell down and turned into a pile of broken circuits. The recent geometric transformer, one of my biggest fumbles, still taught me many things about what I'm TRULY trying to accomplish here.
**Save money and lives**. Less hardware use for less need at inference. Train more calculations into a more reusable and accurate structure for near instant zero-shot or sequential inference.
In the process, v8 unlocked a missing puzzle piece: EMA trajectory-alignment compensation. I'm doing my best to build something that works.
The geolip distillation system is very powerful but requires much experimentation still.
* Genetic experiments were successful
* Data transfer experiments successful
* Analysis experiments successful - and expand large model accuracy
* Many distillation experiments were successful.
* The largest successes being the kernels, the distillation tools, and the geometric analysis systems.
With the good comes the bad: the faulty ViTs, the simultaneous trains that fault, the internalized confusion that happens occasionally.
*** The observer NEEDS something to OBSERVE. If the observer observes the progressive development of point cloud structures, it learns how to observe THAT LEARNING PROCESS - drifting fault assessment.
*** In the process it DOES NOT learn how to improve the CE relations by embedding and compensating with anchored triangulation opinions.
BIGGEST CONCLUSION. Staged curriculum training.
These components must be DECOUPLED. One must be a compounding structural awareness beacon, the other must be an informationally aligned composition in a utilizable fashion.
This means stage-by-stage freeze/unfreeze processing. Independent task-oriented structural alignment.
As of right now I don't know how to reduce to fp16 without a massive dip. I'm thinking it's possible to utilize integers directly instead of high-accuracy fp64 or fp32 floats. I'll do some exploration.
Reducing this to fp16 or bf16 capacity would greatly improve performance, and if the output values are close enough despite mantissa cross-contamination, it could be worth it for the semi-accurate speed alone.
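One standard way to "utilize integers directly" is symmetric int8 quantization: store an int8 payload plus a single fp32 scale, and bound the round-trip error by the step size. A minimal sketch of that general technique (not the system's actual plan):

```python
import numpy as np

def quantize_int8(x):
    """Symmetric per-tensor int8 quantization: one fp32 scale, int8 payload."""
    scale = float(np.max(np.abs(x))) / 127.0 or 1.0   # avoid zero scale
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

x = np.linspace(-1.0, 1.0, 9, dtype=np.float32)
q, s = quantize_int8(x)
x_hat = dequantize(q, s)
print(np.max(np.abs(x - x_hat)) < s)  # True: worst-case error under one step
```

The appeal over fp16 is that the error model is explicit: every value is within half a quantization step, rather than depending on where the mantissa happens to lose bits.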
Per-instance allocation for max_n, max_batch (B):
WORKING STORAGE:
A_work : [B, max_n, max_n] # working copy (destroyed)
V_accum : [B, max_n, max_n] # eigenvector accumulator
householder : [max_n-2, B, max_n] # stored reflectors (padded)
d : [B, max_n] # tridiagonal diagonal
e : [B, max_n-1] # tridiagonal off-diagonal
Subtotal: ~3 × max_n² × B floats
D&C TREE (depth = ⌈log₂(max_n)⌉ levels):
FOR each level l (0 to depth-1):
num_sub = 2^l
sub_size = max_n // 2^l (padded up to power of 2)
delta : [B, num_sub, sub_size] # merged eigenvalues
z_vec : [B, num_sub, sub_size] # merge vectors
rho : [B, num_sub] # coupling strengths
mask : [B, num_sub, sub_size] # valid element mask
# Newton state (per root):
lam : [B, num_sub, sub_size] # current root estimates
lo : [B, num_sub, sub_size] # bracket lower
hi : [B, num_sub, sub_size] # bracket upper
f_val : [B, num_sub, sub_size] # secular function value
converge: [B, num_sub, sub_size] # convergence mask
# Eigenvector fragments:
V_frag : [B, num_sub, sub_size, sub_size]
Subtotal per level: ~(9 × sub_size + sub_size²) × num_sub × B
Total across levels: since num_sub × sub_size = max_n at every level,
≈ (9 × max_n + max_n²) × depth × B
≈ max_n² × depth × B (the V_frags dominate)
CONCRETE NUMBERS (fp32, 4 bytes each):
max_n=8, B=4096: ~8² × 3 × 3 × 4096 × 4 ≈ 9 MB
max_n=32, B=1024: ~32² × 5 × 3 × 1024 × 4 ≈ 60 MB
max_n=64, B=512: ~64² × 6 × 3 × 512 × 4 ≈ 144 MB
max_n=128, B=256: ~128² × 7 × 3 × 256 × 4 ≈ 336 MB
max_n=256, B=128: ~256² × 8 × 3 × 128 × 4 ≈ 768 MB
max_n=6, B=8192: ~6² × 3 × 3 × 8192 × 4 ≈ 10 MB ← your CM case
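The table above follows max_n² × depth × B with an extra factor of ~3 folding in the working storage (A_work, V_accum, householder). A small estimator under that reading of the formula (the factor of 3 is my interpretation of the subtotals, so treat it as approximate):

```python
import math

def eigh_workspace_mb(max_n, B, bytes_per=4, work_factor=3):
    """Estimate batched divide-and-conquer eigh workspace in MiB:
    ~ max_n^2 * depth * work_factor * B floats, depth = ceil(log2(max_n)).
    work_factor ~3 folds in the A_work / V_accum / householder storage."""
    depth = math.ceil(math.log2(max_n))
    return max_n**2 * depth * work_factor * B * bytes_per / 2**20

print(round(eigh_workspace_mb(32, 1024)))   # 60, matching the table
print(round(eigh_workspace_mb(256, 128)))   # 768, matching the table
```

Since num_sub × sub_size = max_n at every tree level, the V_frag term dominates and the whole budget scales as n² log n per batch element, which is why halving max_n buys roughly a 4x batch increase.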
Alignment in these systems is NOT a series of opinions, nor is it some sort of structural behavior, nor is it whether the model is inherently "good" or "bad".
Alignment is specifically a geometric process that enables direct resonant oscillation; with that resonance perfectly aligned, the substructure learns internal alignment to that behavior. The curves look like jagged, broken waveform lines, and when the model comes out, it's forged in steel.
More opinions simultaneously will yield more experimental waveform potentials. I will find the most ideal conditions for self learning and then the findings will be published in many languages, with hundreds of citations, countless experiments leading from A to B, and a massive series of optimizations required to reach this point from where I began.
A trained omega predictor will allow heavy task-refined LLM protections of the geometric lookup tables.
This will include multiple curriculum operations for finetunes such as medical processes, law practices, multilingual shared-vocabulary learning, multistructural lookups for cross-tool comparison and utility, and many other useful rapid-learning processes that can be directly compartmentalized, snapped on, snapped off, and so on - similar to the methodology of a LoRA.
Except this is... this is no LoRA. This is far deeper, and when perfected it will train far faster, as shown by the Bertenstein, ViT x3, ViT x34, CLIP L and CLIP G ctx extensions, and the CaptionBert models. They converge rapidly and retain their cohesion. This system will allow those very models to stand on their own without the experts present, while simultaneously learning rapid-alignment R@1 recall capacity within the trained model itself.
They not only converged with R@1 at 100% recall capacity; multimodal variations such as Bertenstein showed you can deviate those using standard tokenization techniques with embeddings and encodings.
The mid-level experiments show:
- Student models DID require teachers to CONTINUE TRAINING.
- But the students DID NOT require teachers to INFERENCE at full capacity.
The InfoNCE memory bank aligned through geometric distillation alignment processing allowed the students to not only stand - but stand on their own without the soups or teachers used to teach them.
This CaptionBert distillation is not a toy; it has genuine pragmatic use. By the time these experiments conclude, CaptionBert and the entire chain of models trained will be able to train without experts and learn from a MASSIVE number of sources, SPECIFICALLY meant to RETAIN that data for utility without catastrophic forgetting. This will have its own transformer structure hoisting the models up hand-in-hand with current-scale transformers and models as a cooperative companion.
These are purely cooperative collectives, not competition nor adversarial trainings at their core. Adversarial destroys the very subtlety of the instruction set, so it must be cooperative.
Omega is a very touchy formula; without very specific measures protected by very specific structural boundaries, the omega structure will not predict correctly.
Omega must be computed in fp64, and the computation is miniscule compared to the full structure that sets it up. Everything must be orderly though, and everything orderly must be sterile.
Most of the CONTEXT elemental systems can be represented in FP8, while the majority of the geometric side still requires a minimum of FP32 due to the way eigs and SVD are calculated. Scatterpoint can reduce this, but it will have performance dips without the eigs and SVD matching.
I'm currently working out an eig and eigh kernel meant to operate within a high degree of optimization for these use cases. This will evolve over time. When paired with the SVD kernel, it will provide massive performance boosts for the direct use case without impacting the overarching linear-algebraic structure required for full solidity.
The WideRouter will enable multiple core new features; the predominant two for our next experiment are as follows.
1. Directly integrated multi-opinion constellation structures. This will enable dynamic compiled expansions internally within the structure for huge performance gains.
2. Controllable stage-by-stage compilation. Each stage can be compiled or not. SVD is notoriously compiler-unfriendly due to the linalg eigen ops; I will be addressing this particular function DIRECTLY soon. There will be no quarter for graph breaks.
If the WideRouter causes any major bugs or breaks with your code - bad calculations, incorrect deviated gradients, twisted or contorted dtype outputs, or any major compilation errors - please don't hesitate to open a pull request. Claude and I will promptly solve any major issues.
Once everything is perfectly in-line and the graph matches, the transformer will have massive geometric performance boosts for huge structural basins with multiple layers of depth.
I will be addressing linalg.eig+eigh directly, in conjunction with multiple argsort functions that are causing huge performance dips, as well as every single use of .item() that can present itself in the compiler's path.
After this, the ensemble topological transformer will be a go. It will enable quaternion, FlowMagnitude, FlowAlignment, FlowVelocity, FlowVelocityQuaternion, FlowVelocityOrbital, FlowVelocityPentachoron, and multiple other flow-matching systems that will improve performance by dominating amounts inline, with minimal overhead cost due to the precomputed geometric structure.
The ensembles will feature multiple simultaneous batched and segmented forms of learning meant to train the oscillation omega predictor "Beatrix".
Self-distillation has shown improvement. Most importantly, I've discovered a core component that can be utilized as geometric attention: the quaternion MHA. The constellation produces all the necessary information to allow the quaternion MHA to benefit from it in a directly utilizable fashion.
The quaternion MHA is quite the vessel. It's bulky, has multiple MHA structures, and is shockingly effective in the process. I'll be refining this head in the coming days as a composite Procrustes alignment tool.
Geometric structure has a very high amount of informational accumulation potential, so a multi-series of MHA can capture a great amount of informational processing from those elements, if the elements are curated correctly and within the specifications.
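The quaternion machinery under such a head ultimately rests on the Hamilton product, which composes rotations and is fully differentiable. A minimal numpy sketch of that standard algebra (illustrative only; the actual quaternion MHA is not shown here):

```python
import numpy as np

def hamilton(q, r):
    """Hamilton product q ⊗ r of quaternions with components (w, x, y, z)."""
    w1, x1, y1, z1 = q
    w2, x2, y2, z2 = r
    return np.array([
        w1*w2 - x1*x2 - y1*y2 - z1*z2,
        w1*x2 + x1*w2 + y1*z2 - z1*y2,
        w1*y2 - x1*z2 + y1*w2 + z1*x2,
        w1*z2 + x1*y2 - y1*x2 + z1*w2,
    ])

def rotate(v, q):
    """Rotate 3-vector v by unit quaternion q: q ⊗ (0, v) ⊗ q*."""
    q_conj = q * np.array([1.0, -1.0, -1.0, -1.0])
    return hamilton(hamilton(q, np.concatenate([[0.0], v])), q_conj)[1:]

# 90° rotation about z: q = (cos 45°, 0, 0, sin 45°) maps x-axis to y-axis.
q = np.array([np.cos(np.pi/4), 0.0, 0.0, np.sin(np.pi/4)])
print(np.round(rotate(np.array([1.0, 0.0, 0.0]), q), 6))  # [0. 1. 0.]
```

Because the product is bilinear, a head can learn the quaternion components directly and still compose orientation information, which is what makes it attractive as a Procrustes-style alignment tool.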
I've taken the model's benchmarks from 50% to 86-93% Spearman utilizing a quaternion-oriented attention head.
This is getting dangerously close to 99.9% mutation-detection accuracy, with a model deemed 50% accurate - all by extracting geometric features from the constellation and training the ensemble head with the correct rules.
These are Spearman result logits. They are in fact detecting the results.
This is the power of what I'm doing: from 50% to 90% in 48 hours with a single GPU.
Training your own alignment only requires a piece of the dataset you wish to run and about 8 hours. Run it, fall asleep, check on it in the morning; it'll be ready. Extract features, train your head in minutes. The Spearman will be nearly perfect.
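For reference, the Spearman score reported throughout is just the Pearson correlation of the ranks of the head's predictions against the targets. A minimal numpy version (ties omitted for brevity; a full implementation such as scipy's averages tied ranks):

```python
import numpy as np

def spearman(pred, target):
    """Spearman rank correlation: Pearson correlation of the ranks.
    (Assumes no ties; production code should average tied ranks.)"""
    def ranks(x):
        r = np.empty(len(x), dtype=float)
        r[np.argsort(x)] = np.arange(len(x), dtype=float)
        return r
    return float(np.corrcoef(ranks(pred), ranks(target))[0, 1])

# A head whose scores merely ORDER the labels correctly scores 1.0,
# even when the raw logits are far from calibrated.
target = np.array([0.1, 0.4, 0.2, 0.9, 0.7])
pred   = np.array([-3.0, 1.0, -1.0, 8.0, 2.5])  # monotone w.r.t. target
print(spearman(pred, target))  # 1.0
```

This is why "Spearman result logits" is a meaningful readout: the metric only cares that the head orders the mutations correctly, not that its raw values are calibrated.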
I'm currently preparing what I consider to be the final head that will need to be created: the quaternion head, which will be predictive based on an ensemble of four divergent-methodology heads, each specifically tasked with solving the SVD in conjunction with the features. This system should extract any little bit of differentiation that exists. The imaginary head is the most crucial; explaining it requires an entire paper of its own.
I call this imaginary head the "Cletus" head, as it has inherently lesser accuracy relative to the others. However, without it the combination does not coalesce correctly; without the Cletus, the model does not reach full cohesion. This head is the most crucial because it has the hardest job. It's the one who returned from the battlefield with the blueprint to describe everything it saw.
I expect the sheer geometric alignment alone to yield a new form of Adam tuning specific to introspective analytical alignment and with that a new format of optimizer dedicated to geometric preservation in conjunction with informational data accumulation. I also expect a new methodology for larger-buffer data movement kernel-wise, a structural boundary for SVD limitations within full spectrum, a substructure measured collapse state of SVD when projected, and multiple other models that will have hiccups and growing pains.
These tools are all building to the end-state format, which will express everything simultaneously in order to combine the necessary data from many many forms of models together, without requiring direct tooling to each model simultaneously.
Such finalized tools will include a reusable pretrained geometric patchwork that exhibits all the necessary traits of a geometric structure in its frozen state, capable of being finetuned quickly into any other state, or simply utilized as a lookup beacon with the correct geometric transformer alignment.
The geometric transformer is a revamped transformer format intentionally designed with the structural preservation of the overarching structure in mind, rather than falling directly into the naturalistic entropy of immediate solution over larger-scale contribution. This system will not replace RoPE; it will contribute to the concept of long-concept preservation and work hand-in-hand with systems like RoPE, attention, and original transformers simultaneously. RoPE-based models will benefit most from this structure, as they are already trained intrinsically with alignment and rotation at their cores.
The geometric transformer by design takes n inputs as variant states, and those are transformed internally. Utilizing it in its default state will yield results by design, but it will require tuning and curation for specific use cases no matter the case. This is conceptually familiar to those who use transformers, and simultaneously intimidating to those who understand what I'm describing, I'd think. I myself am a little intimidated that I'm this close as-is.
There are multiple other prototypes at work all leading to the geometric transformer, which will be both an empirically superior utility to any of the utilities I currently use, and embody the very essence of the geometric structure that I'm currently working with as a full trainable data mutation operation - meant to directly attenuate the structure of the observation, to the expectation of the autograd and gradients.
Getting pretty close to a few pieces, but not there yet.

