Modeling the World with Tensors and Neural Networks
We walk through the evolution from simple feedforward networks to powerful transformers and graph models, showing how richer data structures demand more sophisticated architectures. More importantly, we uncover how embedding spaces allow different modalities to interact.

Intersection of various data modalities and related neural architectures explained.
Every moment of perception teems with detail: millions of hues shimmer across a sunset sky, yet the language we use can only hint at their true richness. When we digitize such scenes, we're forced to compress that richness into discrete values: 8-bit or 16-bit color channels, each a coarse approximation of the gradients our eyes naturally perceive. Even at higher bit depths, the digital palette remains finite. Likewise, video frame rates fail to capture the full continuity of motion: sampling gaps force interpolation, introducing blur or stutter where once there was fluidity.
The limitations extend deeper still. Even the simplest real number often can’t be represented exactly in binary, with rounding errors, overflows, and underflows lurking beneath the surface of every computation. In transforming reality into arrays of numbers, we inevitably lose information.
Yet with the right data structures (vectors for tabular data, sequences for time series, grids for images, higher-order tensors for audio spectrograms or video) and carefully designed neural architectures, we can do more than just approximate the world; we can enhance it, even generate it. Convolutional networks, transformers, graph and set models: these tools enable representations that, while not exact replicas, are rich in structure and inferential power.
What’s more, AI is beginning to unlock new dimensions of creativity, not by replacing human imagination but by amplifying it through strange and powerful combinations of modalities and algorithms. Text becomes image, image becomes sound, and models dream in forms we hadn’t thought to conjure. Perhaps it isn't art in the traditional sense; it lacks intention, emotion, and backstory. And yet there is a kind of joy in asking for something impossible and watching it take shape: a surreal painting, a synthetic voice, a world that never was. The results may be random, generative, even uncanny, but they are uniquely ours in the moment of creation, shaped by our prompts, our hopes, and our curiosity.
This fusion of perception, data, and machine learning doesn’t restore every nuance we lose, but it opens new creative frontiers: envisioning colors no one has named, hearing sounds no one has recorded, and solving problems that once lay beyond our grasp.
This article walks through the major modalities (tabular, time series, images, video, audio as waveforms and spectrograms, graphs, sets, and point clouds) with clear examples and points to the relevant deep learning architectures. By the end, you’ll see how increasing tensor rank reflects richer data structure and drives the choice of specialized networks.
François Chollet, in his book "Deep Learning with Python", explains how data is encoded into tensors of various ranks (shapes) to make it amenable to training deep learning models. Below is a similar but extended summary of data modalities, tensor ranks, example shapes, and typical model architectures; a short NumPy sketch after the notation list makes the shapes concrete. If you are interested in reading more, see Section 2.2, "Data representations for neural networks", in Deep Learning with Python. The shapes below use the following notation:
- samples – number of examples (the batch dimension)
- features – independent variables of a tabular dataset
- timesteps / time_steps – steps of a time series or audio sequence
- H – height (in pixels)
- W – width (in pixels)
- C – channels (e.g., 3 for RGB images, 1 for mono audio)
- frames – image frames from a video
- frequency_bins – frequency bins of a spectrogram
- nodes – nodes of a graph
- adjacency – adjacency matrix of a graph
- elements – elements of a set
- coordinates – triplets of x, y, and z values in a point cloud
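To make these conventions concrete, here is a minimal NumPy sketch that builds placeholder tensors for several of the modalities covered below. The sizes are arbitrary, chosen only so the ranks and axis order are visible.

```python
import numpy as np

# Arbitrary sizes, chosen only to illustrate the ranks and axis conventions.
tabular      = np.zeros((1000, 20))              # rank 2: (samples, features)
time_series  = np.zeros((1000, 128, 6))          # rank 3: (samples, timesteps, features)
images       = np.zeros((1000, 224, 224, 3))     # rank 4: (samples, H, W, C)
video        = np.zeros((100, 16, 224, 224, 3))  # rank 5: (samples, frames, H, W, C)
waveforms    = np.zeros((1000, 16000, 1))        # rank 3: (samples, time_steps, channels)
spectrograms = np.zeros((1000, 64, 400, 1))      # rank 4: (samples, frequency_bins, time_steps, 1)

for name, arr in [("tabular", tabular), ("time_series", time_series),
                  ("images", images), ("video", video),
                  ("waveforms", waveforms), ("spectrograms", spectrograms)]:
    print(f"{name:12s} rank={arr.ndim}  shape={arr.shape}")
```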
Tabular Data (Rank 2)
- Tabular datasets are stored as rank‑2 tensors of shape (samples, features), where each row is a feature vector.
- Standard multi-layer perceptrons (MLPs) process these vectors, but specialized models like TabNet use sequential attention to select important features at each decision step.
- The FT‑Transformer treats columns as “tokens” and applies self‑attention, closing the gap with tree‑based methods.
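As a baseline for this setup, here is a minimal Keras MLP on a rank-2 tabular tensor. The 20 features, 3 classes, and layer widths are arbitrary assumptions for illustration; TabNet and FT-Transformer come from separate libraries and are not shown here.

```python
import numpy as np
from tensorflow import keras

# Placeholder tabular data: 1000 rows, 20 features, 3 arbitrary classes.
X = np.random.rand(1000, 20).astype("float32")   # rank 2: (samples, features)
y = np.random.randint(0, 3, size=(1000,))

mlp = keras.Sequential([
    keras.layers.Input(shape=(20,)),
    keras.layers.Dense(64, activation="relu"),
    keras.layers.Dense(64, activation="relu"),
    keras.layers.Dense(3, activation="softmax"),
])
mlp.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
mlp.fit(X, y, epochs=2, batch_size=32, verbose=0)   # short run, just to show the API
```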
Time Series Data (Rank 3)
- Time series live in rank‑3 tensors (samples, timesteps, features), encoding sequences of feature vectors.
- RNNs (LSTM/GRU) maintain hidden state over time but can be slow on long sequences.
- Temporal Convolutional Networks (TCNs) use causal, dilated convolutions to capture long‑range dependencies with parallelism.
- The Temporal Fusion Transformer combines gating, recurrent layers, and attention for multi‑horizon forecasting.
- For very long sequences, Informer reduces self‑attention complexity, achieving near‑linear scaling.
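To show how a rank-3 sequence tensor flows through a convolutional forecaster, here is a simplified, TCN-flavoured stack of causal, dilated Conv1D layers in Keras. It omits the residual blocks of a full TCN, and the 128-step, 6-feature input and one-step forecast head are assumptions for illustration.

```python
from tensorflow import keras

# Causal, dilated 1-D convolutions over (samples, timesteps, features).
tcn = keras.Sequential([
    keras.layers.Input(shape=(128, 6)),   # 128 timesteps, 6 features (arbitrary)
    keras.layers.Conv1D(32, 3, padding="causal", dilation_rate=1, activation="relu"),
    keras.layers.Conv1D(32, 3, padding="causal", dilation_rate=2, activation="relu"),
    keras.layers.Conv1D(32, 3, padding="causal", dilation_rate=4, activation="relu"),
    keras.layers.GlobalAveragePooling1D(),
    keras.layers.Dense(1),                # e.g. one-step-ahead forecast
])
tcn.compile(optimizer="adam", loss="mse")
tcn.summary()
```

Doubling the dilation rate at each layer grows the receptive field exponentially while keeping every convolution cheap and parallelizable.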
Image Data (Rank 4)
- Images are rank‑4 tensors (samples, height, width, channels) under the channels‑last convention.
- CNNs remain dominant: EfficientNet scales depth, width, and resolution for top ImageNet performance with fewer FLOPS.
- Vision Transformers, especially Swin Transformer, apply self‑attention hierarchically over patches, surpassing many CNNs on classification and detection.
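As a toy illustration of how a rank-4 image tensor moves through a channels-last CNN, here is a small Keras network; it is nowhere near EfficientNet or Swin in capacity, and the 10-class head is an arbitrary assumption.

```python
from tensorflow import keras

# A small channels-last CNN on (samples, height, width, channels).
cnn = keras.Sequential([
    keras.layers.Input(shape=(224, 224, 3)),
    keras.layers.Conv2D(32, 3, activation="relu"),
    keras.layers.MaxPooling2D(),
    keras.layers.Conv2D(64, 3, activation="relu"),
    keras.layers.MaxPooling2D(),
    keras.layers.GlobalAveragePooling2D(),
    keras.layers.Dense(10, activation="softmax"),   # 10 arbitrary classes
])
cnn.compile(optimizer="adam", loss="sparse_categorical_crossentropy")

# A much stronger, pretrained alternative ships with Keras itself:
# backbone = keras.applications.EfficientNetB0(weights="imagenet")
```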
Video Data (Rank 5)
- Videos are rank‑5 tensors (samples, frames, height, width, channels), combining spatial and temporal dimensions.
- SlowFast Networks use a slow pathway for spatial semantics and a fast one for fine‑grained motion, excelling in action recognition.
- TimeSformer relies solely on space‑time self‑attention, matching or beating 3D CNNs on benchmarks like Kinetics‑400.
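The sketch below is a toy stand-in for SlowFast or TimeSformer: a rank-5 video tensor (samples, frames, height, width, channels) passed through 3-D convolutions. Frame count, resolution, and the Kinetics-400-sized output are assumptions made only to show the tensor flow.

```python
from tensorflow import keras

# 3-D convolutions mix the spatial and temporal dimensions of a video clip.
video_net = keras.Sequential([
    keras.layers.Input(shape=(16, 112, 112, 3)),   # 16 frames of 112x112 RGB (arbitrary)
    keras.layers.Conv3D(16, kernel_size=3, activation="relu"),
    keras.layers.MaxPooling3D(pool_size=2),
    keras.layers.Conv3D(32, kernel_size=3, activation="relu"),
    keras.layers.GlobalAveragePooling3D(),
    keras.layers.Dense(400, activation="softmax"),  # e.g. a Kinetics-400-sized label space
])
video_net.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```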
Audio Data
Raw Waveforms (Rank 3)
- Raw audio is a rank‑3 tensor (samples, time_steps, channels), with channels = 1 for mono.
- WaveNet is an autoregressive dilated‑convolution model predicting each sample from its history, achieving human‑like speech synthesis.
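To give a feel for WaveNet's building blocks, here is a single, greatly simplified gated residual block in Keras operating on a rank-3 waveform tensor. The channel counts and the 256-way output (mimicking 8-bit mu-law classes) are assumptions; a real WaveNet stacks many such blocks and is trained autoregressively.

```python
from tensorflow import keras

inp = keras.Input(shape=(16000, 1))                      # ~1 s of 16 kHz mono audio
x = keras.layers.Conv1D(32, 2, padding="causal")(inp)    # project to 32 channels

# Gated activation unit: a tanh "filter" modulated by a sigmoid "gate".
f = keras.layers.Conv1D(32, 2, padding="causal", dilation_rate=2, activation="tanh")(x)
g = keras.layers.Conv1D(32, 2, padding="causal", dilation_rate=2, activation="sigmoid")(x)
gated = keras.layers.Multiply()([f, g])

skip = keras.layers.Conv1D(32, 1)(gated)                 # 1x1 conv on the gated output
res = keras.layers.Add()([x, skip])                      # residual connection
out = keras.layers.Conv1D(256, 1, activation="softmax")(res)  # per-step class distribution

wavenet_block = keras.Model(inp, out)
wavenet_block.summary()
```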
Spectrograms (Rank 4)
- Spectrograms, such as log‑mel spectrograms, are rank‑4 tensors (samples, frequency_bins, time_steps, 1); the sketch after this list shows how one is computed.
- PANNs (CNN14) trained on AudioSet provide strong audio tagging performance.
- The Audio Spectrogram Transformer (AST) uses pure self‑attention on spectrogram patches, setting new state‑of‑the‑art on AudioSet and ESC‑50.
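For completeness, here is one way to turn a waveform into a rank-4 spectrogram batch, assuming the librosa package is available; any STFT/mel implementation would do, and the random waveform stands in for real audio.

```python
import numpy as np
import librosa   # assumed installed; used here only for the mel transform

sr = 16000
y = np.random.randn(sr * 2).astype(np.float32)        # 2 s of placeholder audio

mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=64)  # (frequency_bins, time_steps)
log_mel = librosa.power_to_db(mel)

# Add batch and channel axes: rank 4, (samples, frequency_bins, time_steps, 1),
# ready for an image-style CNN (e.g. PANNs) or a patch-based transformer (AST).
batch = log_mel[np.newaxis, :, :, np.newaxis]
print(batch.shape)
```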
Graphs, Sets, & Point Clouds
Graphs
- Graphs combine node feature tensors (nodes, features) with adjacency structures.
- GCNs aggregate neighbor information via simplified spectral filters.
- GATs add attention weights per edge, and GraphSAGE enables inductive learning on large graphs.
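The essence of a graph convolution fits in a few lines of NumPy. Below is one propagation step in the style of a GCN, H' = ReLU(D^(-1/2) (A + I) D^(-1/2) H W), on a toy four-node graph with random weights standing in for learned parameters.

```python
import numpy as np

A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 1],
              [0, 1, 0, 0],
              [0, 1, 0, 0]], dtype=float)   # adjacency matrix (nodes, nodes)
H = np.random.rand(4, 3)                    # node features (nodes, features)
W = np.random.rand(3, 8)                    # weights (features, hidden), learned in practice

A_hat = A + np.eye(4)                                   # add self-loops
D_inv_sqrt = np.diag(1.0 / np.sqrt(A_hat.sum(axis=1)))  # degree normalization
A_norm = D_inv_sqrt @ A_hat @ D_inv_sqrt

H_next = np.maximum(0, A_norm @ H @ W)      # aggregate neighbours, transform, ReLU
print(H_next.shape)                         # (4, 8): updated features per node
```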
Sets
- Unordered sets of elements form rank‑2 tensors (elements, features) without intrinsic order.
- Deep Sets enforce permutation invariance through sum‑and‑transform functions, provably covering all invariant mappings.
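A minimal NumPy sketch of the Deep Sets recipe: apply the same function phi to every element, pool with a sum so order cannot matter, then transform the pooled vector with rho. The random matrices here stand in for small learned MLPs.

```python
import numpy as np

def phi(x, W1):   # per-element encoder
    return np.maximum(0, x @ W1)

def rho(z, W2):   # set-level readout
    return z @ W2

rng = np.random.default_rng(0)
elements = rng.random((5, 3))                      # a set of 5 elements, 3 features each
W1, W2 = rng.random((3, 16)), rng.random((16, 1))  # stand-ins for learned weights

output = rho(phi(elements, W1).sum(axis=0), W2)    # sum pooling => permutation invariance

# Shuffling the elements leaves the output unchanged:
shuffled = rng.permutation(elements, axis=0)
assert np.allclose(output, rho(phi(shuffled, W1).sum(axis=0), W2))
```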
Point Clouds
- 3D point clouds are unordered sets of (x, y, z) coordinates, stored as a rank‑3 tensor (samples, points, 3).
- PointNet++ hierarchically groups points into local regions and applies PointNet recursively for fine‑grained feature learning; the core shared-MLP-plus-pooling idea is sketched below.
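Here is that core idea in a few lines of NumPy: a shared transformation applied to every point independently, followed by a symmetric max-pool so the descriptor does not depend on point order. The random weight matrix is a stand-in for a learned shared MLP, and PointNet++ repeats this within local neighbourhoods.

```python
import numpy as np

rng = np.random.default_rng(1)
points = rng.random((1024, 3))             # one cloud: (points, 3) with x, y, z coordinates
W = rng.random((3, 64))                    # shared per-point weights (learned in practice)

per_point = np.maximum(0, points @ W)      # (1024, 64) point-wise features
global_feature = per_point.max(axis=0)     # (64,) order-invariant cloud descriptor
print(global_feature.shape)
```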
By matching data modality to tensor rank and choosing specialized architectures (attention‑based, convolutional, or graph‑based), we can effectively tackle a wide spectrum of machine‑learning problems.
Bridging Modalities Through Embeddings: Modern Information Alchemy

Different data types (images, text, audio, tabular values) may look nothing alike on the surface, but neural networks allow us to project them into shared embedding spaces, where they can begin to “understand” and relate to each other. These embeddings are like compressed summaries: dense vectors that capture the essence of the input. They allow models to compare, align, or condition one modality using another.
Take CLIP (Contrastive Language Image Pretraining) as an example. Trained on millions of image-caption pairs, CLIP learns to map both text and images into a joint embedding space where a caption like “a dog playing frisbee” lands close to actual pictures of dogs catching frisbees. This enables zero-shot classification, where the model can recognize new concepts without having seen explicit training labels, just by comparing text and image embeddings.
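The sketch below shows the zero-shot classification logic in the CLIP style: normalize the embeddings, score each caption against the image by cosine similarity, and pick the best match. The random vectors are placeholders; in a real pipeline they would come from CLIP's image and text encoders (available, for example, through the open_clip or transformers packages).

```python
import numpy as np

rng = np.random.default_rng(2)
image_emb = rng.normal(size=(512,))     # placeholder image embedding
text_embs = rng.normal(size=(3, 512))   # placeholder caption embeddings
labels = ["a dog playing frisbee", "a cat sleeping", "a car on a road"]

def normalize(v):
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

sims = normalize(text_embs) @ normalize(image_emb)   # cosine similarities
probs = np.exp(sims) / np.exp(sims).sum()            # softmax over candidate captions
print(labels[int(np.argmax(probs))])                 # best-matching caption
```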
We see similar ideas everywhere:
- Word embeddings (like Word2Vec or GloVe) let models capture the nuanced relationships between words: king - man + woman ≈ queen in vector space (see the sketch after this list).
- Categorical embeddings in tabular data turn discrete variables (like “region” or “job title”) into vectors that reflect their statistical influence, helping MLPs treat them like continuous features.
- Waveforms and spectrograms are two views of the same sound: one temporal, one frequency-based. Models can embed both into common spaces to learn more robust representations.
- Multimodal video understanding combines it all: audio, music, images, language, even action, all spread across time. Embedding each stream lets the model “weave” them together: aligning the emotional tone of music with scene visuals, syncing lip movements with speech, or conditioning visual creativity with narrative structure.
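For the word-analogy bullet above, here is the arithmetic spelled out with made-up 4-dimensional vectors; real Word2Vec or GloVe embeddings typically have 100 to 300 dimensions and are learned from large corpora.

```python
import numpy as np

vectors = {   # toy vectors, hand-picked so the analogy works out exactly
    "king":  np.array([0.9, 0.8, 0.1, 0.7]),
    "man":   np.array([0.9, 0.1, 0.1, 0.2]),
    "woman": np.array([0.1, 0.1, 0.9, 0.2]),
    "queen": np.array([0.1, 0.8, 0.9, 0.7]),
    "apple": np.array([0.5, 0.5, 0.5, 0.0]),
}

query = vectors["king"] - vectors["man"] + vectors["woman"]

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

best = max((w for w in vectors if w not in {"king", "man", "woman"}),
           key=lambda w: cosine(query, vectors[w]))
print(best)   # queen
```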
At the heart of this is the idea that embeddings act as translation layers. They reduce each modality to a common geometric space where patterns can be discovered, compared, and reasoned across. Even when the raw data formats are incompatible, their embeddings can become fluent in each other's semantics.
In a sense, this is how we bridge the gap between fragmented digital signals and the holistic richness of human perception. It’s also what powers many of today’s most advanced AI models, from text-to-image generators like DALL·E to speech-aware avatars and cross-modal retrieval systems.
In many modern models, we also see a fascinating asymmetry: starting from compact, low-bit representations like text, and expanding them into rich, high-dimensional outputs like images, audio, or even 3D scenes. Although a caption occupies far fewer bits than a full-resolution image, techniques like diffusion, denoising, and up-sampling allow models to inject plausible detail and texture, synthesizing outputs that align with the abstract essence encoded in the text. It's as if, through training, the model compresses the vast space of all images corresponding to a concept into a navigable embedding. At inference time, it spends compute to "inflate" that compressed idea back into vivid, coherent reality. In this way, neural networks don't just translate across modalities; they actively enrich sparse information, breathing life into minimal prompts.
Curious about how to harness the power of multimodal deep learning models, embeddings, generative models, and advanced AI architectures in your own projects? Reach out to DialectAI, our expertise in custom AI solutions can help you bridge the gap between raw data and rich, actionable intelligence, fueling the next wave of innovation in your organization.
Feel free to share this article with your colleagues or reach out in the comments below if you have any questions or would like to explore specific Generative AI solutions.