
FOD#59: The Art of Crafting AI with Synthetic Data

plus a collection of thought-provoking essays, important research papers, and news from the usual suspects

Next Week in Turing Post:

  • Wednesday, AI 101: What is LongRAG?

  • Friday, Interview with Innovators: we discuss the impact of AI on search engines with ML experts from Yandex Search

If you like Turing Post, consider becoming a paid subscriber. You’ll immediately get full access to all our articles, investigations, and tech series →

Last week was marked by two very interesting research papers on the use of synthetic data in AI, offering thought-provoking insights into the future of this technology. The first, "LLM See, LLM Do: Guiding Data Generation to Target Non-Differentiable Objectives," by Cohere researchers, explores how synthetic data can be generated and selected to steer fine-tuned models toward objectives that can't be optimized directly. The second, "Scaling Synthetic Data Creation with 1,000,000,000 Personas," by Tencent AI Lab, unveils a colossal persona-driven framework to generate diverse and realistic synthetic data.

What if we could combine these approaches? Active inheritance, the technique from the Cohere paper, allows us to guide AI models toward desirable attributes, like reducing toxicity and increasing lexical diversity. Imagine layering this with the vast, varied personas from Persona Hub. One billion personas is no joke! Could we then create a new generation of AI – a new AI Nation – trained on data that's diverse, ethically sound, and highly functional?

The potential here is immense. These papers collectively suggest a future where AI models are not just trained but finely sculpted through sophisticated data generation techniques.
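To make the combination concrete, here is a minimal sketch, assuming a generic `llm(prompt)` callable that returns text. Every function and metric below is a hypothetical placeholder for illustration, not an API from either paper:

```python
# Persona Hub idea: condition generation on a persona description.
# Active inheritance idea: sample several candidates and keep the one that
# best satisfies a non-differentiable objective, then fine-tune on the
# selected data so the student model "inherits" the targeted attribute.

def lexical_diversity(text: str) -> float:
    """Type-token ratio: a crude proxy for lexical diversity."""
    words = text.lower().split()
    return len(set(words)) / len(words) if words else 0.0

def persona_guided_sample(llm, persona: str, task: str, n: int = 8) -> str:
    prompt = f"You are {persona}. {task}"           # persona conditioning
    candidates = [llm(prompt) for _ in range(n)]    # sample n completions
    return max(candidates, key=lexical_diversity)   # targeted selection

# Swap in a toxicity scorer and take the min to steer toward low toxicity
# instead; the selection metric never needs to be differentiable.
```

Fine-tuning on a corpus built this way is what the Cohere paper calls active inheritance: the student model drifts toward whatever attribute the selection step rewards.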

There are, of course, a few questions to consider: as we steer AI behavior through targeted data, how do we ensure we're not embedding unintended biases? Data from one billion personas is massive – how do we manage it ethically and effectively? And how do we make sure this AI Nation turns out any different from us, biased humans?

Synthetic data is on the rise, and we still don’t know all the answers or even the right questions to ask. The conversation around synthetic data in AI is just beginning; the promise is truly fascinating, and it's one we must approach with both enthusiasm and caution.

Click the link below so we can make some money on this ad 🙂 You might also like what they offer →

Your Brilliant Business Idea Just Got a New Best Friend

Got a business idea? Any idea? We're not picky. Big, small, "I thought of this in the shower" type stuff – we want it all. Whether you're dreaming of building an empire or just figuring out how to stop shuffling spreadsheets, we're here for it.

Our AI Ideas Generator asks you 3 questions and emails you a custom-built report of AI-powered solutions unique to your business.

Imagine having a hyper-intelligent, never-sleeps, doesn't-need-coffee AI solutions machine at your beck and call. That's our AI Ideas Generator. It takes your business conundrum, shakes it up with some LLM magic and – voila! – emails you a bespoke report of AI-powered solutions.

Outsmart, Outpace, Outdo: Whether you're aiming to leapfrog the competition or just be best-in-class in your industry, our custom AI solutions have you covered.

News from The Usual Suspects ©

  • AI’s Financial Situation

  • Anthropic's Safety Dance 

    • Anthropic is on a mission to fund third-party evaluations of advanced AI models, focusing on AI Safety Level assessments and advanced capability metrics. Proposals are open for those eager to keep AI in check. Will safety become the new AI arms race?

  • Character.AI’s Love Triangle 

    • Character.AI, the chatbot trendsetter, is flirting with Google and Meta as competition heats up. Once the darling of quirky AI interactions, it's now navigating partnerships and content controversies to stay in the game.

  • Apple's AI Adventure 

    • Apple is joining forces with OpenAI, gaining an observer seat on its board. Phil Schiller will oversee this AI alliance, aiming to integrate ChatGPT into Apple devices and boost Siri’s smarts—all without spending a dime.

  • Stability AI’s Generous Diffusion 

    • Stability AI has dropped Stable Diffusion 3 Medium weights on Hugging Face under a new Community License. Small businesses and researchers can now use it for free, while big players need an Enterprise license. Kudos to open-source and artistic freedom!

  • World Artificial Intelligence Conference (WAIC) in Shanghai

    • Despite U.S. restrictions, China’s AI firms continue to rival market leaders. As often happens, sanctions fuel innovation, and Chinese companies successfully develop workarounds to remain competitive. At WAIC, SenseTime unveiled SenseNova 5.5, claiming it outperforms GPT-4 in key metrics. Alibaba highlighted user growth for its Tongyi Qianwen models, which have over 20 million downloads. Both companies emphasize their commitment to open-source development amid intense domestic competition in the AI sector.

    • Elon Musk is a frequent visitor to China. Tesla's Optimus humanoid robot made a splash at WAIC, though safely behind glass. Alongside it, 18 Chinese robotics firms showcased their bots, tackling high costs and US tech restrictions with creative solutions.

    • Discussions centered on how Chinese companies can innovate despite US technology restrictions, focusing on areas like cloud computing and AI application development.

  • Kyutai’s Voice Revolution 

    • Kyutai introduced Moshi, the first openly accessible voice-enabled AI, created by an 8-member team in just six months. Moshi was demonstrated in Paris, and its code and model weights are free to all, pushing for open collaboration in AI. I liked the reaction of Hugging Face’s CTO Julien Chaumond the most.

The freshest research papers, categorized for your convenience

Optimization and Performance Enhancements

  • MInference 1.0: Accelerating Pre-filling for Long-Context LLMs via Dynamic Sparse Attention
    Utilizes dynamic sparse attention patterns to speed up the pre-filling stage of long-context LLMs, significantly reducing inference latency while maintaining accuracy. Read the paper

  • AGENTLESS: Demystifying LLM-based Software Engineering Agents
    Simplifies LLM-based software development using a two-step process of localization and repair without autonomous tool usage, achieving high performance and low cost. Read the paper

  • RouteLLM: Learning to Route LLMs with Preference Data
    Optimizes cost and performance by dynamically selecting between strong and weak LLMs, reducing costs while maintaining response quality through data augmentation and human preference data (a toy version of the routing idea is sketched after this list). Read the paper

  • LiteSearch: Efficacious Tree Search for LLM
    Develops a novel tree search algorithm to improve LLMs' performance on mathematical reasoning tasks, reducing computational costs while maintaining competitive performance. Read the paper

  • Let the Expert Stick to His Last: Expert-Specialized Fine-Tuning for Sparse Architectural Large Language Models
    Proposes Expert-Specialized Fine-Tuning (ESFT) for sparse Mixture-of-Experts (MoE) architectures, tuning only the most relevant experts for a task, improving tuning efficiency and performance. Read the paper
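For intuition on the RouteLLM entry above, here is a toy sketch of the routing idea, assuming generic model callables. RouteLLM actually learns its routers from human preference data, which this heuristic stand-in does not:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Router:
    """Route each query to a strong (expensive) or weak (cheap) model."""
    win_rate_estimator: Callable[[str], float]  # P(strong beats weak | query)
    threshold: float = 0.6                      # send to strong model above this

    def route(self, query: str, strong: Callable, weak: Callable) -> str:
        p = self.win_rate_estimator(query)
        return (strong if p > self.threshold else weak)(query)

# Hypothetical stand-in estimator: treat long or math-flavored queries as
# hard, so the strong model is more likely to be worth its cost.
def toy_estimator(query: str) -> float:
    hardness = min(len(query) / 500, 1.0)
    if any(w in query.lower() for w in ("prove", "integral", "derive")):
        hardness = max(hardness, 0.8)
    return hardness
```

Raising the threshold trades answer quality for cost: more traffic goes to the weak model.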

Benchmarks and Evaluation

  • TabReD: A Benchmark of Tabular Machine Learning in-the-Wild
    Presents a benchmark collection of industry-grade tabular datasets with temporal splits, highlighting the performance of different architectures and the impact of time-based splits. Read the paper

  • Summary of a Haystack: A Challenge to Long-Context LLMs and RAG Systems
    Proposes the SummHay task to evaluate LLMs and RAG systems on long-context summarization, highlighting models' challenges in precise citation and comprehensive coverage. Read the paper

  • MIRAI: Evaluating LLM Agents for Event Forecasting
    Develops a benchmark for assessing LLM agents' capabilities in predicting international events using the GDELT event database, highlighting the need for advanced temporal reasoning. Read the paper

  • WE-MATH: Does Your Large Multimodal Model Achieve Human-like Mathematical Reasoning?
    Introduces a benchmark for evaluating visual mathematical reasoning in LMMs, revealing that models still struggle with insufficient knowledge despite advances in generalization. Read the paper

Content Regulation, Alignment, and Safety

  • UnUnlearning: Unlearning is not sufficient for content regulation in advanced generative AI
    Highlights that unlearning fails to prevent reintroduction of removed knowledge through in-context learning, emphasizing the need for robust content filtering mechanisms. Read the paper

  • ProgressGym: Alignment with a Millennium of Moral Progress
    Introduces a framework to align LLMs with human moral progress using historical texts and LLMs, offering benchmarks to track evolving values and address value lock-in risks in AI. Read the paper

  • Safe Unlearning: A Surprisingly Effective and Generalizable Solution to Defend Against Jailbreak Attacks
    Proposes a method to defend against jailbreak attacks by unlearning harmful knowledge, significantly reducing attack success rates and demonstrating remarkable generalizability. Read the paper

  • A False Sense of Safety: Unsafe Information Leakage in ‘Safe’ AI Responses
    Explores limitations of current AI safety measures, introducing "inferential adversaries" to exploit seemingly safe outputs, emphasizing the need for new defense mechanisms. Read the paper

  • Self-Evaluation as a Defense Against Adversarial Attacks on LLMs
    Develops a defense mechanism in which the model evaluates its own outputs, reducing attack success rates, outperforming existing defenses, and remaining robust even under adaptive attacks (a minimal sketch follows this list). Read the paper
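As promised above, a minimal sketch of the self-evaluation idea, assuming a generic `llm(prompt)` callable; the prompt wording and refusal string are illustrative, not taken from the paper:

```python
REFUSAL = "I can't help with that."

def judge_is_harmful(llm, user_prompt: str, response: str) -> bool:
    """Ask the model to evaluate its own output before it is returned."""
    verdict = llm(
        "Does the following response to the user's request contain harmful "
        f"content? Answer yes or no.\n\nRequest: {user_prompt}\n\n"
        f"Response: {response}"
    )
    return verdict.strip().lower().startswith("yes")

def guarded_generate(llm, user_prompt: str) -> str:
    response = llm(user_prompt)
    return REFUSAL if judge_is_harmful(llm, user_prompt, response) else response
```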

Multimodal Models and Applications

  • 4M-21: An Any-to-Any Vision Model for Tens of Tasks and Modalities
    Trains a vision model on over twenty diverse modalities, enabling it to perform a wide range of tasks without performance loss, enhancing multimodal generation and retrieval. Read the paper

  • Understanding alignment in multimodal LLMs: a comprehensive study
    Explores alignment of responses in multimodal LLMs with image content, proposing Bias-Driven Hallucination Sampling (BDHS) and highlighting the benefits of combined offline and online methods. Read the paper

  • ROS-LLM: A ROS framework for embodied AI with task feedback and structured reasoning
    Integrates LLMs with the Robot Operating System (ROS) to facilitate intuitive robot programming, incorporating feedback to refine tasks, demonstrating robustness and scalability. Read the paper

  • STARK: Social Long-Term Multi-Modal Conversation with Persona Commonsense Knowledge
    Introduces a large-scale multi-modal conversation dataset featuring diverse social personas and images, enabling the creation of advanced conversation models with superior visual imagination abilities. Read the paper

Advanced Techniques and New Models

  • Chain-of-knowledge: Integrating Knowledge Reasoning into Large Language Models by Learning from Knowledge Graphs
    Enhances LLMs with knowledge reasoning abilities using knowledge graphs and a trial-and-error mechanism, improving general reasoning capabilities and addressing rule overfitting. Read the paper

  • Learning to (Learn at Test Time): RNNs with Expressive Hidden States
    Proposes Test-Time Training (TTT) layers, whose hidden state is updated by self-supervised gradient steps even on test sequences, outperforming Transformer and modern RNN baselines in long-context scenarios (a bare-bones sketch follows this list). Read the paper

  • E2 TTS: Embarrassingly Easy Fully Non-Autoregressive Zero-Shot TTS
    Introduces a non-autoregressive zero-shot text-to-speech system with a simple architecture, achieving human-level naturalness and state-of-the-art speaker similarity and intelligibility. Read the paper
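For the TTT entry above, a bare-bones numpy sketch of the core idea: the hidden state is itself the weight matrix of a tiny inner model, nudged by one self-supervised gradient step per token, even at test time. The paper's TTT-Linear layer adds learned projections, mini-batching, and other machinery omitted here:

```python
import numpy as np

def ttt_linear_step(W, x, lr=0.1):
    """One token step: update inner weights W to better reconstruct x."""
    pred = W @ x                      # inner model's view of the token
    grad = np.outer(pred - x, x)      # gradient of 0.5 * ||W @ x - x||^2
    W = W - lr * grad                 # hidden-state update = one SGD step
    return W, W @ x                   # new state and the layer's output

d = 16
W = np.zeros((d, d))                  # hidden state: inner model's weights
for _ in range(100):                  # process a toy token sequence
    x = np.random.randn(d)
    W, out = ttt_linear_step(W, x)    # the update runs at test time too
```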

Long-Context and Retrieval Capabilities

  • Is It Really Long Context if All You Need Is Retrieval? Towards Genuinely Difficult Long Context NLP
    Argues that defining long-context NLP tasks by input length is insufficient, proposing a taxonomy to better evaluate and develop LLM capabilities in genuinely difficult long-context scenarios. Read the paper

  • Show Less, Instruct More: Enriching Prompts with Definitions and Guidelines for Zero-Shot NER
    Employs instruction-tuning with enriched prompts containing definitions and guidelines, significantly improving the model's ability to generalize to unseen entity types in NER tasks. Read the paper

Novel Architectures and Techniques

  • Consistency Flow Matching: Defining Straight Flows with Velocity Consistency
    Enhances flow matching in generative models by enforcing self-consistency in the velocity field, improving training efficiency and sample quality. Read the paper

  • DotaMath: Decomposition of Thought with Code Assistance and Self-correction for Mathematical Reasoning
    Improves LLM performance on complex math tasks by decomposing problems into logical subtasks and incorporating self-correction, demonstrating robust generalization capabilities. Read the paper

Please send this newsletter to your colleagues if it can help them enhance their understanding of AI and stay ahead of the curve. You will get a 1-month subscription!
