<h1>Building a Hacker News ChatGPT Plugin</h1>
<p>I recently received access to develop and use ChatGPT plugins, and embarked on a project to build a Hacker News integration as a learning exercise. My goal was to enable retrieval of content from HN to answer questions and produce insights in conversations with ChatGPT.</p>
<p>You can experience the plugin in one of three ways:</p>
<ul>
<li>Check out the <a href="https://hn.kix.in">simple demo</a> which approximates parts of the plugin.</li>
<li>(or) add ‘hn.kix.in’ as a plugin — if you have ChatGPT plugin access.</li>
<li>(or) watch this short video:</li>
</ul>
<video controls="" src="https://user-images.githubusercontent.com/37190/236521903-da8eb5a6-3b8e-4125-a8c0-64b869d47f55.mp4"></video>
<p><button class="pure-button pure-button-accent" onclick="window.location.href='https://hn.kix.in'">
<ion-icon name="bulb"></ion-icon>
<b>Simple Demo</b>
</button>
<button class="pure-button pure-button-accent" onclick="window.location.href='https://github.com/anantn/hn-chatgpt-plugin'">
<ion-icon name="logo-github"></ion-icon>
<b>Source Code</b>
</button></p>
<p>In this blog post I’ll cover the process of building this plugin. If you are interested in learning about how ChatGPT plugins work, the Hacker News API and dataset, or building a semantic search index through use of embeddings — read on!</p>
<h2 id="what-are-chatgpt-plugins">What are ChatGPT plugins?</h2>
<p><a href="https://openai.com/blog/chatgpt-plugins">Plugins are a new feature</a> announced by OpenAI that allows ChatGPT to extend its functionality by calling external APIs. This unlocks key capabilities such as web browser and code execution, but also allows for bringing various data sources into the large language model. No more “knowledge cutoff” problems!</p>
<p>The <a href="https://platform.openai.com/docs/plugins/introduction">official documentation</a> goes into the process of building a plugin in much more detail, but at a high level, you can build a ChatGPT plugin by:</p>
<ul>
<li><strong>Describing an existing (or new) API</strong> to ChatGPT in plain English. The specific format is the “OpenAPI” (formerly Swagger) spec, but the most important fields in the spec are the description fields, which ChatGPT reads to understand your API.</li>
<li><strong>Processing API calls</strong> when ChatGPT makes them. The system decides when to invoke your API given your description and the user’s utterance, and it generally does a pretty good job at this. Data returned by your API will then be processed by ChatGPT as part of the “prompt” in order to do whatever the user is asking - whether that is taking an action or answering a question.</li>
</ul>
<p>There <a href="https://openai.com/waitlist/plugins">is a waitlist</a> for both using and creating plugins, if you aren’t signed up already.</p>
<h2 id="tips-for-plugin-development">Tips for plugin development</h2>
<p>While in theory you can just “plug and play” an existing API by providing an OpenAPI specification for it, I’ve found that to get the most out of the integration, it helps to tailor the API to the specific style in which ChatGPT invokes it. A few things I’ve learned:</p>
<ul>
<li><strong>Fewer calls with more arguments are better than many calls with fewer arguments</strong>. My initial design for the Hacker News API involved individual endpoints for stories, comments, polls, etc. Simplifying it to just <code class="language-plaintext highlighter-rouge">/items</code> and <code class="language-plaintext highlighter-rouge">/users</code>, with many query parameters to further control the output, worked much better.</li>
<li><strong>Learn what functionality to add by iteration</strong>. I found that ChatGPT would sometimes hallucinate parameters that don’t exist in your API. My initial API did not have a <code class="language-plaintext highlighter-rouge">sort_order</code> parameter, but I kept seeing ChatGPT add it for certain types of queries. That was a good hint for me to just implement it! You can (and should) <a href="https://platform.openai.com/docs/plugins/getting-started/running-a-plugin">run a plugin API on localhost</a> first, which makes iteration fairly quick and easy.</li>
<li><strong>Be as terse as possible</strong>. This holds true for both your OpenAPI specification and the actual API responses. You do need to be descriptive, but short, to-the-point descriptions actually stick better than lengthy, flowery language. I’ve noticed that if your actual API responses are too long, it increases the chances of hallucinations, or of the model just ignoring your response. This is likely related to context window limits for the GPT models.
<ul>
<li>The official documentation states the limit for API responses is 100,000 characters - in practice you’ll want to be well below it.</li>
<li>Some plugin authors have found a trick: forgo JSON as an output format altogether. Plain text responses work just as well and save quite a few characters!</li>
</ul>
</li>
<li><strong>Be tolerant of inputs, more than usual</strong>. ChatGPT is a very language-driven model and is not as precise when it comes to numbers. Avoid things like UNIX timestamps in your APIs; it’s often better to receive standardized date formats like ISO8601, and even better to accept natural language (a small sketch of this kind of tolerant input handling follows this list).
<ul>
<li>Using parsers like <a href="https://github.com/scrapinghub/dateparser"><code class="language-plaintext highlighter-rouge">dateparser</code></a> in python for processing natural language dates and times can be helpful.</li>
<li>ChatGPT often inserts comments into its <code class="language-plaintext highlighter-rouge">POST</code> requests. If you handle JSON as payload, use a parser like <a href="https://json5.org/"><code class="language-plaintext highlighter-rouge">json5</code></a> to be tolerant of this.</li>
</ul>
</li>
<li><strong>Set reasonable defaults</strong>. I fluctuated on the default <code class="language-plaintext highlighter-rouge">limit</code> value for the <code class="language-plaintext highlighter-rouge">/items</code> endpoint, going from 1 up to 10 and back down, before settling on 3: the magic number that allowed the response to be as long as possible without throwing ChatGPT off the rails while still being useful enough to summarize any given topic.</li>
<li><strong>Use ChatGPT itself to help you</strong>! Not only can ChatGPT write code for the implementation of your API, it’s also very good at creating terse descriptions of APIs from lengthy documentation. That’s often a great starting point - I started my project by throwing the <a href="https://github.com/HackerNews/API">Hacker News Firebase API documentation</a> at it.</li>
</ul>
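<p>To make the last two tips above concrete, here is a minimal sketch of tolerant input handling, assuming the <code class="language-plaintext highlighter-rouge">dateparser</code> and <code class="language-plaintext highlighter-rouge">json5</code> packages are installed. The helper names are illustrative, not the plugin’s actual code:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Hypothetical helpers illustrating tolerant input handling.
import dateparser
import json5

def parse_time(value):
    """Accept ISO8601 or natural language like '2 days ago'."""
    parsed = dateparser.parse(value)
    if parsed is None:
        raise ValueError(f"could not parse time: {value!r}")
    return parsed

def parse_payload(raw_body):
    """json5 tolerates the trailing commas and // comments that
    ChatGPT sometimes inserts into POST bodies."""
    return json5.loads(raw_body)

print(parse_time("2 days ago"))  # resolves to a datetime
print(parse_payload('{"query": "rust", /* model comment */}'))
</code></pre></div></div>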
<p>With these guidelines in mind, here is a rough sketch of what we want a Hacker News API to look like:</p>
<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="s">/items</span> <span class="c1"># find and retrieve stories, comments, polls, or jobs</span>
<span class="s">query</span> <span class="c1"># search for items matching this text</span>
<span class="s">type</span> <span class="c1"># story, comment, poll, job</span>
<span class="s">by</span> <span class="c1"># filter by author</span>
<span class="s">after_time</span> <span class="c1"># content submitted after this (natural language ok)</span>
<span class="s">before_time</span>
<span class="s">min_comment</span> <span class="c1"># minimum number of comments for a story</span>
<span class="s">max_comment</span>
<span class="s">min_score</span> <span class="c1"># minimum score for a story</span>
<span class="s">max_score</span>
<span class="s">sort_by</span> <span class="c1"># relevance, score, time, or number of comments</span>
<span class="s">sort_order</span> <span class="c1"># asc or desc</span>
<span class="s">limit</span> <span class="c1"># maximum number of items to return</span>
<span class="s">offset</span> <span class="c1"># offset into the results to page through</span>
<span class="s">/users</span> <span class="c1"># find and retrieve users</span>
<span class="s">...</span> <span class="c1"># similar API as above</span>
</code></pre></div></div>
<p>You can see the full API I <a href="https://hn.kix.in/docs">ended up with here</a>. The directions you give the plugin in the <code class="language-plaintext highlighter-rouge">ai-plugin.json</code> manifest file through the <code class="language-plaintext highlighter-rouge">description_for_model</code> are even more important than the individual <code class="language-plaintext highlighter-rouge">description</code> lines you put in your OpenAPI schema. This part will likely take a lot of tweaking for you to find something that works optimally. For the Hacker News plugin, here is the prompt I ended up using:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Retrieve stories, comments, polls, and jobs from the Hacker News (HN) community in real-time. Follow these guidelines:
General rules:
1. You MAY provide natural language for dates, but ONLY after converting spelled-out numbers into their numerical equivalents. For instance, 'a couple of days' should become '2 days' and 'few weeks later' should become '3 weeks later'.
2. ALWAYS attempt to provide the hacker news URL (hn_url) and original URL (url) in your response.
3. ONLY incorporate API response data in your output.
4. Utilize the 'text' and 'top_comments' fields from API responses to answer questions, provide insights, and generate summaries.
Using find_items:
1. Search for user requested topics with find_items.
2. Remove 'Ask HN' prefix from user queries when providing them as the 'query' argument.
3. Use 'text' and 'top_comments' fields to answer questions or provide summaries.
4. Request a minimum of 3 stories for summarizing or searching a topic.
Using get_item:
1. Obtain more comments for any story using this endpoint.
2. Provide an ID obtained from find_items.
Using get_user and find_users:
1. Use get_user to access detailed information about a single user.
2. Employ find_users to search for users based on specific criteria.
</code></pre></div></div>
<p>Now let’s talk about the best way to implement this API!</p>
<h2 id="search-index-considerations">Search index considerations</h2>
<p>There are really two main pieces to our API. The first is retrieving content that matches a certain set of filters, which maps straightforwardly to a SQLite database or even directly to the Hacker News Firebase API.</p>
<p>The second, more interesting part is implementing the <code class="language-plaintext highlighter-rouge">query</code> argument on the <code class="language-plaintext highlighter-rouge">/items</code> endpoint. Plugin users are likely to want to retrieve many kinds of content from Hacker News using natural language.</p>
<p>Hacker News already has a <a href="https://github.com/HackerNews/API">Firebase API</a> to retrieve the raw data, but this by itself is insufficient, as you need a search index in order to properly rank and retrieve only a <em>subset</em> of documents for any given user query.</p>
<p>There are basically two options for building such a search index:</p>
<ol>
<li><strong>Traditional keyword search</strong>. This is the classic information retrieval technique refined over a couple of decades, and services like ElasticSearch and Algolia make it easy to create such indexes. Algolia already has a <a href="https://hn.algolia.com">great HN search index</a> that can “plug and play” with ChatGPT plugins for the most part.</li>
<li><strong>Semantic search</strong>. With all the attention on AI recently, a fairly old technique called “embeddings” has received renewed interest and enthusiasm. Embeddings are a way to generate an n-dimensional vector for any input content, such that similar pieces of content end up near each other in this n-dimensional space (a toy illustration follows this list).</li>
</ol>
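<p>As a toy illustration of what “near each other” means in practice (made-up 3-dimensional vectors; real embeddings have hundreds of dimensions):</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Toy example: cosine similarity as a measure of "nearness" between vectors.
import numpy as np

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

doc_a = np.array([0.9, 0.1, 0.3])    # pretend: "Show HN: a fast JSON parser"
doc_b = np.array([0.8, 0.2, 0.4])    # pretend: "Benchmarking JSON libraries"
doc_c = np.array([-0.2, 0.9, -0.5])  # pretend: "My sourdough starter journey"

print(cosine_similarity(doc_a, doc_b))  # high: similar topics sit close together
print(cosine_similarity(doc_a, doc_c))  # low: unrelated content is far away
</code></pre></div></div>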
<p>I first built a plugin with the <a href="https://github.com/anantn/hn-chatgpt-plugin/tree/main/algolia">Algolia search API</a>. It performed relatively well, especially when the questions were “keyword-y” in nature, like asking for more information about a specific project or person. However, there was room for improvement on more generic questions and on long, conversational queries. There is no publicly available API or dataset for embeddings on the HN corpus, so it was time to roll up my sleeves and build one!</p>
<h2 id="downloading-the-dataset">Downloading the dataset</h2>
<p>⬇️ <a href="https://huggingface.co/datasets/anantn/hacker-news/tree/main">Download the SQLite DB from HuggingFace</a> 🤗</p>
<p>The first step was to download the Hacker News corpus onto my computer. As of April 2023, HN contained just under 36 million items (an item can be a story, comment, job, or poll) and just under 900k users. That’s small enough to download and process on a single computer but large enough to make it a non-trivial and interesting exercise!</p>
<p>I wrote implementations in <a href="https://github.com/anantn/hn-chatgpt-plugin/tree/main/hn-to-sqlite/node">node</a>, <a href="https://github.com/anantn/hn-chatgpt-plugin/tree/main/hn-to-sqlite/go">go</a>, and <a href="https://github.com/anantn/hn-chatgpt-plugin/tree/main/hn-to-sqlite/python">python</a> to see which one would perform best. Node turned out to be the most reliable because it uses the Firebase SDK, while the go and python versions used the REST API (a simplified sketch of the REST approach follows the list below). I ended up using the node version, taking the performance hit for better reliability. To make the download faster, I simply parallelized it over 32 AWS spot instances:</p>
<ul>
<li><a href="https://github.com/anantn/hn-chatgpt-plugin/blob/main/hn-to-sqlite/node/fetch.js">fetch.js</a> is the core download script.</li>
<li><a href="https://github.com/anantn/hn-chatgpt-plugin/blob/main/hn-to-sqlite/node/run.sh">run.sh</a> is a quick-and-dirty user-script to parallelize the download on AWS EC2. Note the hard-coded number of machines.</li>
<li><a href="https://github.com/anantn/hn-chatgpt-plugin/blob/main/hn-to-sqlite/node/fetch-users.js">fetch-users.js</a> is a script to fetch user data profiles, can be done on a single machine and is fairly quick.</li>
<li><a href="https://github.com/anantn/hn-chatgpt-plugin/blob/main/hn-to-sqlite/python/merge.py">merge.py</a> can be used to merge each partition into a single sqlite file.</li>
</ul>
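<p>Here is what the REST approach looks like, greatly simplified (illustrative filename and a reduced column set; real code needs batching, retries, and error handling):</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Simplified sketch of fetching HN items over REST into SQLite.
import sqlite3
import requests

BASE = "https://hacker-news.firebaseio.com/v0"
conn = sqlite3.connect("hn.db")  # illustrative filename
conn.execute(
    "CREATE TABLE IF NOT EXISTS items "
    "(id INTEGER PRIMARY KEY, type TEXT, by TEXT, time INTEGER, "
    "text TEXT, score INTEGER)"
)

for item_id in range(1, 101):  # a tiny slice; the real job covered ~36 million IDs
    item = requests.get(f"{BASE}/item/{item_id}.json").json()
    if item:  # deleted items come back as null
        conn.execute(
            "INSERT OR REPLACE INTO items VALUES (?, ?, ?, ?, ?, ?)",
            (item["id"], item.get("type"), item.get("by"),
             item.get("time"), item.get("text"), item.get("score")),
        )
conn.commit()
</code></pre></div></div>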
<p>The full download cost around $4 and took an hour to run. The final DB is around 32GB on disk, but compresses down to 6.5GB. Check out <a href="https://github.com/anantn/hn-chatgpt-plugin/blob/main/playground.ipynb">this python notebook</a> for quick and dirty ways to visualize data from this SQLite table.</p>
<aside>
<p>⚠️ Note: the database on Hugging Face includes indexes on various columns for efficient queries. I added these indexes after the bulk download and inserts, which is much more efficient. Incremental inserts are now slower due to these indexes, but the volume is low enough not to matter.</p>
</aside>
<h2 id="constructing-embeddings">Constructing embeddings</h2>
<p>The next step is to take all this content and generate embeddings from it. To do this, we have three main questions to answer: which embedder to use, how to structure the input content, and where to store the embeddings for retrieval.</p>
<h3 id="embedder-options">Embedder options</h3>
<p>There are many ways to construct embeddings from all kinds of data (text, images, even video). Our focus is on text, so a good place to start is by looking at the <a href="https://huggingface.co/spaces/mteb/leaderboard">“Massive Text Embedding Benchmark” (MTEB)</a>. You can filter the leaderboard by various criteria to find the right embedder for your use-case.</p>
<p>Note that some embedding services run in the cloud behind an API call, such as OpenAI’s <a href="https://beta.openai.com/docs/guides/embeddings/types-of-embedding-models">ada-002</a> or <a href="https://docs.cohere.com/docs/embeddings">Cohere</a>. Most embedders, though, can be downloaded and run locally, normally in python. <a href="https://python.langchain.com/en/latest/index.html">LangChain</a> is a good way to quickly experiment with different embedders on small datasets.</p>
<p>After a bunch of experimentation, I decided to pick <a href="https://github.com/HKUNLP/instructor-embedding"><code class="language-plaintext highlighter-rouge">instructor-large</code></a> as it gave me a good balance of quality and speed of generation, plus the ability to run locally and leverage <a href="https://twitter.com/anantn/status/1641672926687801344">my new NVIDIA GPU</a>.</p>
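<p>Generating an embedding with <code class="language-plaintext highlighter-rouge">instructor-large</code> looks roughly like this. The instruction string below is illustrative; the ones the plugin actually uses live in the <code class="language-plaintext highlighter-rouge">embedder.py</code> linked later in this post:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Rough sketch of embedding text with instructor-large.
from InstructorEmbedding import INSTRUCTOR

model = INSTRUCTOR("hkunlp/instructor-large")  # downloads weights on first use

# instructor models take (instruction, text) pairs as input
pairs = [["Represent the forum discussion for retrieval:",
          "Ask HN: What are you working on this weekend?"]]
embeddings = model.encode(pairs)
print(embeddings.shape)  # (1, 768)
</code></pre></div></div>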
<p>When selecting an embedder, also keep in mind that whatever model you use to embed your corpus is the same one you will need at runtime to embed user queries!</p>
<h3 id="structuring-input-documents">Structuring input documents</h3>
<p>Most embedding models have a maximum token length they will accept as input, so we need to think about how to represent our data. The default for <code class="language-plaintext highlighter-rouge">instructor-large</code> is <code class="language-plaintext highlighter-rouge">512</code> tokens, but it can be extended to around <code class="language-plaintext highlighter-rouge">1024</code> with only a slight dip in quality.</p>
<p>The naive approach would be to simply embed each item in our table (story or comment), giving us around 35 million embeddings. But most of these items are much smaller than 512 tokens, and not every piece of content is worth embedding in the first place - some of it is spam or of low relevance.</p>
<p>Since comments on a story are usually pretty related to the story itself, a smarter way is to group all the comments for a story along with the story text and treat it as a single “document”. By adding an additional filter for stories that have at least 20 upvotes and 3 comments to make each document meaningful (and weed out spam), we get:</p>
<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">sqlite</span><span class="o">></span> <span class="k">select</span> <span class="k">count</span><span class="p">(</span><span class="o">*</span><span class="p">)</span> <span class="k">from</span> <span class="n">items</span> <span class="k">where</span> <span class="n">score</span> <span class="o">>=</span> <span class="mi">20</span> <span class="k">and</span> <span class="n">descendants</span> <span class="o">>=</span> <span class="mi">3</span><span class="p">;</span>
<span class="mi">402007</span>
</code></pre></div></div>
<p>That’s a much more manageable ~400k documents. Keep in mind that since we are grouping comments together with the story, a document can get much longer than 1024 tokens. To solve this, we chunk each document into “pages” of up to 1024 tokens each (a sketch of this pagination follows the query below). On average, each document works out to around 7.5 pages:</p>
<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">sqlite</span><span class="o">></span> <span class="k">select</span> <span class="k">count</span><span class="p">(</span><span class="o">*</span><span class="p">)</span> <span class="k">from</span> <span class="n">embeddings</span><span class="p">;</span>
<span class="mi">3050324</span>
</code></pre></div></div>
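<p>The pagination itself is conceptually simple. A rough sketch, using whitespace-separated words as a stand-in for real tokenizer counts (the helper is hypothetical):</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Chunk a story plus its comments into "pages" of up to max_tokens each.
def paginate(story_text, comments, max_tokens=1024):
    words = story_text.split()
    for comment in comments:
        words.extend(comment.split())
    return [
        " ".join(words[start:start + max_tokens])
        for start in range(0, len(words), max_tokens)
    ]
</code></pre></div></div>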
<h3 id="storing-embeddings-for-retrieval">Storing embeddings for retrieval</h3>
<p>Embeddings are just an array of floats. The length of this array is known as the <em>dimensionality</em> of our vector. instructor-large produces embeddings of <code class="language-plaintext highlighter-rouge">768</code> dimensions. A floating point number can be represented in 32 or 64 bits. Assuming we use 32-bit floats, one embedding would be just over <code class="language-plaintext highlighter-rouge">3kb</code> (768 x 4 = 3,072 bytes).</p>
<p>Note that the size of the output embedding is fixed regardless of input size. This is one of the reasons we tried to maximize the number of tokens per embedding when constructing our input documents. For 3 million embeddings, that’s around 10GB of data - nothing a SQLite table can’t handle, so let’s just store them there (a round-trip sketch follows the schema below).</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>sqlite> select * from sqlite_schema;
table|embeddings|embeddings|2|CREATE TABLE embeddings (
id INTEGER PRIMARY KEY AUTOINCREMENT,
story INTEGER,
part_index INTEGER,
embedding BLOB,
UNIQUE (story, part_index)
)
</code></pre></div></div>
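<p>Since SQLite has no native vector type, one straightforward encoding is raw float32 bytes in the BLOB column. A round-trip sketch against the schema above (the story ID and filename are illustrative):</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Store and retrieve a float32 vector in the embeddings BLOB column.
import sqlite3
import numpy as np

conn = sqlite3.connect("hn-embeddings.db")  # illustrative filename
vec = np.random.rand(768).astype(np.float32)

conn.execute(
    "INSERT OR REPLACE INTO embeddings (story, part_index, embedding) "
    "VALUES (?, ?, ?)",
    (8863, 0, vec.tobytes()),  # 768 floats -> 3,072 raw bytes
)

row = conn.execute(
    "SELECT embedding FROM embeddings WHERE story = ? AND part_index = ?",
    (8863, 0),
).fetchone()
restored = np.frombuffer(row[0], dtype=np.float32)  # back to a 768-dim vector
assert np.array_equal(vec, restored)
</code></pre></div></div>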
<p>Generating 3 million embeddings took almost an entire day on my RTX 4090, so I let the job run overnight. Each embedding itself takes a fraction of a second to generate, but the work is not parallelizable on my gaming GPU, as there is not enough VRAM to load multiple model instances.</p>
<h2 id="vector-search--indexing">Vector search & indexing</h2>
<p>Now that we have generated embeddings for the most interesting stories and comments, we can use it as the basis of a semantic search engine. This process boils down to:</p>
<ol>
<li>Generate an embedding for the query.
<ul>
<li><code class="language-plaintext highlighter-rouge">instructor-large</code> accepts an instruction argument while generating an embedding, note that we give <a href="https://github.com/anantn/hn-chatgpt-plugin/blob/main/embeddings/embedder.py#L13">different instructions</a> for query embedding than we do for document embedding.</li>
</ul>
</li>
<li>Find <code class="language-plaintext highlighter-rouge">k</code> vectors nearest to the query embedding.</li>
<li>Rank these k vectors to obtain results with the highest relevance to your input query.
<ul>
<li>Limit this list to the top <code class="language-plaintext highlighter-rouge">n</code> results, retrieve supporting metadata for each item and return them.</li>
</ul>
</li>
</ol>
<p>Step 2 is the most interesting part of this flow. Generally, if you have fewer than a million embeddings, the naive approach of comparing your query embedding to <em>every</em> vector in your dataset is <a href="https://twitter.com/anantn/status/1647048626752090114">quite feasible</a>. Computing the distance between two vectors is quick, and sorting by distance in ascending order is an easy way to find the most relevant documents for a query. This approach is known as k-NN (k-nearest neighbors).</p>
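<p>The brute-force pass is only a few lines with numpy; a sketch, assuming all embeddings fit in a single matrix:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Naive k-NN: compare the query against every vector in the corpus.
import numpy as np

def knn(query, corpus, k=10):
    # corpus: (N, 768) float32 matrix, query: (768,) vector
    distances = np.linalg.norm(corpus - query, axis=1)  # L2 distance to all N
    nearest = np.argsort(distances)[:k]                 # indices of the k closest
    return nearest, distances[nearest]
</code></pre></div></div>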
<p>With more than a million embeddings the brute force approach breaks down and starts taking too long. We have to find some way to reduce the number of vectors we have to compare with. There are a few strategies to do this:</p>
<ul>
<li><strong>Compress your embeddings</strong> down to fewer dimensions. There are a handful of smart ways to employ lossy compression while minimizing the drop in accuracy. In this approach you don’t reduce the number of embeddings to compare against per se, but rather reduce the size of each vector, so as to reduce the time taken for each comparison.
<ul>
<li>As an example, if you compress <code class="language-plaintext highlighter-rouge">768</code> dimensions down to <code class="language-plaintext highlighter-rouge">384</code>, you can now do 2 million vector comparisons by brute force in a reasonable amount of time. <strong>Quantization</strong> is one common way to compress vectors. Google’s <a href="https://github.com/google-research/google-research/tree/master/scann">ScaNN library</a> is a popular choice.</li>
</ul>
</li>
<li><strong>Cluster your embeddings</strong>. You can pre-process your dataset into <code class="language-plaintext highlighter-rouge">n</code> clusters and compare the query embedding to the centroid of every cluster. Then you only have to compare the query embedding to the vectors in the closest cluster.
<ul>
<li>There are small variations of this where you have a large number of smaller clusters and compare the vector to everything in a few adjacent clusters. Facebook’s <a href="https://github.com/facebookresearch/faiss">FAISS library</a> has a few implementations of this general type of technique.</li>
</ul>
</li>
<li><strong>Small world graphs</strong>. This is another way to partition your dataset following the intuition that vectors based on real data will follow “small world” clustering rules similar to the real world (e.g. <a href="https://en.wikipedia.org/wiki/Six_Degrees_of_Kevin_Bacon">Six Degrees of Kevin Bacon</a>).
<ul>
<li>In this technique we navigate the graph finding “small world” clusters. More complex implementations (like HNSW - hierarchical navigable small world) add other techniques to make this more robust. The FAISS library mentioned above also has an HNSW implementation.</li>
</ul>
</li>
<li><strong>Partitioning using trees</strong>. One technique to partition your vectors is to pick two random vectors and split by a plane equidistant between them. This is effectively a random split and can be repeated multiple times until the number of vectors in each leaf node is low enough. One might also construct a “forest” of binary trees with different random split paths taken.
<ul>
<li>In practice, this works very well when you have a small number of dimensions (less than 100). Spotify’s <a href="https://github.com/spotify/annoy">Annoy library</a> is a popular implementation of this technique.</li>
</ul>
</li>
</ul>
<p>These strategies are known as “Approximate Nearest Neighbors” or “ANN”. I <a href="https://towardsdatascience.com/comprehensive-guide-to-approximate-nearest-neighbors-algorithms-8b94f057d6b6">recommend this guide</a> if you want to dive deeper on any of these.</p>
<p>One of the primary benefits of a ChatGPT plugin is the ability to access real-time data, so for our use-case we need to be able to update the embedding index with new data periodically. I settled on a simple <code class="language-plaintext highlighter-rouge">IndexIVFFlat</code> implementation using FAISS. This is a type of clustering based on assigning vectors to <a href="https://en.wikipedia.org/wiki/Voronoi_diagram">Voronoi cells</a>. The cells are determined once at boot based on the initial set of embeddings; newly inserted embeddings are assigned to an existing cell.</p>
<p>All embeddings are loaded in memory; for around 3M embeddings this takes around 16GB of RAM (there is some overhead due to clustering and metadata). FAISS has an option to use a disk-based index, but this was small enough to fit on my 32GB machine.</p>
<p>The full implementation of this is a very short <a href="https://github.com/anantn/hn-chatgpt-plugin/blob/main/embeddings/search.py">80-line python program</a>!</p>
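<p>For reference, the core of such an IVF index in FAISS looks roughly like this (a minimal sketch with random stand-in data, not the actual program linked above):</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Minimal FAISS IndexIVFFlat setup mirroring the approach described above.
import faiss
import numpy as np

dim, nlist = 768, 1024  # nlist = number of Voronoi cells
vectors = np.random.rand(100_000, dim).astype(np.float32)  # stand-in data

quantizer = faiss.IndexFlatL2(dim)
index = faiss.IndexIVFFlat(quantizer, dim, nlist)
index.train(vectors)  # cells are determined once from the initial vectors
index.add(vectors)    # later inserts get assigned to an existing cell

index.nprobe = 10     # number of cells to scan per query
query = np.random.rand(1, dim).astype(np.float32)
distances, ids = index.search(query, 5)  # 5 nearest neighbors
</code></pre></div></div>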
<h3 id="ranking">Ranking</h3>
<p>FAISS will return the <code class="language-plaintext highlighter-rouge">k</code> embeddings nearest to your query, ranked by distance. For our Hacker News plugin, story upvotes and time of submission are also pretty important factors. Relying only on distance would often surface stories with low scores or very old submissions at the very top, which was undesirable.</p>
<p>Semantic search based on embeddings also has the drawback of not being great at exact keyword matching, particularly when a word doesn’t occur often in the corpus. To compensate for this somewhat, I also introduced a notion of “topicality”, where we boost stories whose titles have words matching the query.</p>
<p>Once you normalize these four values on a 0-1 scale, you can pick weights to associate with each attribute. Through trial and error, I landed on something like this:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Compute topicality
</span><span class="n">query_words</span> <span class="o">=</span> <span class="nb">set</span><span class="p">(</span><span class="n">word</span><span class="p">.</span><span class="n">lower</span><span class="p">()</span> <span class="k">for</span> <span class="n">word</span> <span class="ow">in</span> <span class="n">query</span><span class="p">.</span><span class="n">split</span><span class="p">())</span>
<span class="n">title_words</span> <span class="o">=</span> <span class="p">[</span><span class="n">word</span><span class="p">.</span><span class="n">lower</span><span class="p">()</span> <span class="k">for</span> <span class="n">word</span> <span class="ow">in</span> <span class="n">title</span><span class="p">.</span><span class="n">split</span><span class="p">()]</span>
<span class="n">topicality</span> <span class="o">=</span> <span class="n">calculate_topicality</span><span class="p">(</span><span class="n">query_words</span><span class="p">,</span> <span class="n">title_words</span><span class="p">)</span>
<span class="c1"># Weights for score, distance, story age, and topicality
</span><span class="n">w1</span><span class="p">,</span> <span class="n">w2</span><span class="p">,</span> <span class="n">w3</span><span class="p">,</span> <span class="n">w4</span> <span class="o">=</span> <span class="mf">0.2</span><span class="p">,</span> <span class="mf">0.25</span><span class="p">,</span> <span class="mf">0.35</span><span class="p">,</span> <span class="mf">0.2</span>
<span class="n">score_rank</span> <span class="o">=</span> <span class="p">(</span>
<span class="n">w1</span> <span class="o">*</span> <span class="n">normalized_scores</span><span class="p">[</span><span class="n">i</span><span class="p">]</span>
<span class="o">+</span> <span class="n">w2</span> <span class="o">*</span> <span class="n">normalized_distances</span><span class="p">[</span><span class="n">i</span><span class="p">]</span>
<span class="o">+</span> <span class="n">w3</span> <span class="o">*</span> <span class="n">normalized_ages</span><span class="p">[</span><span class="n">i</span><span class="p">]</span>
<span class="o">+</span> <span class="n">w4</span> <span class="o">*</span> <span class="n">topicality</span>
<span class="p">)</span>
</code></pre></div></div>
<p>You can see we give the most importance to the story age, followed by the vector distance, and finally account for story upvotes and topicality. The full ranker implementation <a href="https://github.com/anantn/hn-chatgpt-plugin/blob/main/api-server/search.py#LL163C12">can be found here</a>.</p>
<h2 id="keeping-the-data--index-updated">Keeping the data & index updated</h2>
<p>Moving on to the next piece of the puzzle — keeping the data updated. Luckily for us, the Firebase API helps us keep things real-time, simply by subscribing to the <a href="https://github.com/HackerNews/API#changed-items-and-profiles">changes endpoint</a>. This endpoint is updated roughly every 15 to 30 seconds and typically contains a dozen or so item and user profile changes.</p>
<p>Fetching these items and profiles on every update, then inserting them into the SQLite table, was <a href="https://github.com/anantn/hn-chatgpt-plugin/blob/main/embeddings/updater.py#L253">fairly straightforward</a>. What’s more complex is what we do with our embeddings index — simply adding the items to the table isn’t enough, since the API won’t be able to find them through a text search.</p>
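<p>The shape of that update loop, roughly (a sketch that polls the endpoint; the real updater subscribes via Firebase, and the handler names are hypothetical):</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Poll the HN updates endpoint for changed items and profiles.
import time
import requests

BASE = "https://hacker-news.firebaseio.com/v0"

def poll_updates(handle_items, handle_profiles, interval=30):
    while True:
        changes = requests.get(f"{BASE}/updates.json").json()
        handle_items(changes.get("items", []))        # changed item IDs
        handle_profiles(changes.get("profiles", []))  # changed usernames
        time.sleep(interval)
</code></pre></div></div>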
<p>I refactored the code that did the initial embedding pass to also run on individual documents. Recall that generating an embedding is a single-threaded sequential process (because of my limited VRAM). Generating embeddings every time a story was updated risked completely starving incoming queries, which also need to be embedded.</p>
<p>To solve this problem, I employed two techniques:</p>
<ul>
<li>While the data updates are processed in real-time, we batch the embedding updates <a href="https://github.com/anantn/hn-chatgpt-plugin/blob/main/embeddings/updater.py#L17">every 15 minutes</a>. This allows us to collect a bunch of changes to an active story (comments are added rapidly and upvoted) and process them together.</li>
<li>I implemented a <a href="https://github.com/anantn/hn-chatgpt-plugin/blob/main/embeddings/embedder.py#L24">priority queue</a> in the embedder service such that embedding an incoming query is always processed ahead of embedding an updated document (see the sketch after this list).</li>
</ul>
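<p>The priority queue boils down to something like this sketch (simplified; a single worker owns the GPU, and the names are illustrative):</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Queries always jump ahead of document updates in the embedding queue.
import queue
import threading

QUERY, DOCUMENT = 0, 1  # lower value = higher priority
jobs = queue.PriorityQueue()
_counter = 0            # tie-breaker keeps equal-priority jobs in FIFO order

def submit(priority, text):
    global _counter
    _counter += 1
    jobs.put((priority, _counter, text))

def worker(embed_fn):
    while True:
        _priority, _, text = jobs.get()
        embed_fn(text)  # the single GPU worker embeds one item at a time
        jobs.task_done()

# threading.Thread(target=worker, args=(model.encode,), daemon=True).start()
# submit(QUERY, "what does HN think of rust?")       # served first
# submit(DOCUMENT, "story text + updated comments")  # served when idle
</code></pre></div></div>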
<p>This gave us a good balance between keeping our data fresh and not compromising the machine’s ability to respond to incoming queries.</p>
<h2 id="api-server--qa">API server + Q&A</h2>
<p>All of this is brought together by a <a href="https://github.com/anantn/hn-chatgpt-plugin/blob/main/api-server/main.py#L54">FastAPI server</a> that implements the API spec we defined earlier. This was pretty easy to do with SQLAlchemy.</p>
<p>We run two independent processes: one for the data update and embedder service, and another for the FastAPI server. The update/embedder service holds write locks on the SQLite databases, while the FastAPI server opens the DB in read-only mode.</p>
<p>The first time our embedding server starts, we do a quick “catchup” on any missed stories or embedding updates to keep the database fresh even if the server was offline for any reason.</p>
<p>The final step is to take the text and comments from the results returned by the <code class="language-plaintext highlighter-rouge">/items</code> API call — and optionally generate an answer using GPT-3.5-turbo. This is pretty simple to do; we just take the text and prompt the model with something like:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Given the following hacker news discussions:
<story titles>
<comments>
Answer the question: {user-query}
</code></pre></div></div>
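<p>That answer-generation step, sketched with the <code class="language-plaintext highlighter-rouge">openai</code> package (the 0.x-era chat completions API; prompt assembly simplified, and the field names on <code class="language-plaintext highlighter-rouge">stories</code> are assumptions):</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Generate an answer from retrieved stories with gpt-3.5-turbo.
# Assumes OPENAI_API_KEY is set in the environment.
import openai

def answer(user_query, stories):
    context = "\n".join(
        s["title"] + "\n" + "\n".join(s.get("top_comments", []))
        for s in stories
    )
    prompt = (
        "Given the following hacker news discussions:\n"
        + context
        + "\nAnswer the question: " + user_query
    )
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    return response["choices"][0]["message"]["content"]
</code></pre></div></div>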
<p><a href="https://github.com/anantn/hn-chatgpt-plugin/blob/main/api-server/utils.py#L114">This functionality</a> powers what you see on the simple demo page and is an approximation of the plugin experience right in ChatGPT.</p>
<h2 id="closing-thoughts">Closing thoughts</h2>
<p>Hope this was a useful tutorial on building a non-trivial ChatGPT plugin and helped with your understanding of embeddings and semantic search. My advice for anyone dipping their toes in this space is to:</p>
<ul>
<li><strong>Focus on the first principles of what you are building.</strong> There is a lot of buzz around embeddings and AI, but having a conceptual understanding of these tools will help you navigate the landscape. You don’t need to know <em>how</em> these tools work as long as you know <em>what</em> they do, and <em>why</em> you need them.</li>
<li><strong>Keep things simple and beware premature optimization!</strong> I’ve seen a few examples that are built for hyper-scale from day one, but it’s usually a better idea to start small and only add layers of complexity as you need them. The entirety of this particular project is around 2500 lines of python code, including boilerplate.</li>
<li><strong>Use ChatGPT liberally.</strong> You’d be surprised at how much this tool can help you, right from writing API descriptions and specs, to full-fledged server code, to helping debug issues when they occur. $20/mo for ChatGPT Plus is an absolute bargain.</li>
</ul>
<p>Happy hacking!</p>
<h1>Fine-tuning with LoRA: create your own avatars & styles!</h1>
<p>Remember <a href="https://land.prisma-ai.com/magic-avatars/">Magic Avatars</a> in the Lensa app that were all the rage a few months ago? The custom AI generated avatars from just a few photos of your face were a huge hit!</p>
<div class="pure-g">
<div class="pure-u-1 pure-u-md-1-2 imgholder">
<p><a href="/images/2023/anant_reference.jpg"><img src="/images/2023/anant_reference.jpg" alt="Reference Portrait Image" /></a>
<em>Reference portrait image of me</em></p>
</div>
<div class="pure-u-1 pure-u-md-1-2 imgholder">
<p><a href="/images/2023/lensa_avatar.jpg"><img src="/images/2023/lensa_avatar.jpg" alt="Example Lensa Magic Avatars" /></a>
<em>One of my Lensa “Magic Avatars”</em></p>
</div>
</div>
<p>The technology behind this product is the open source <a href="https://en.wikipedia.org/wiki/Stable_Diffusion">Stable Diffusion</a> image generation model. You can run this model on your own computer, on a cloud GPU instance, or even a <a href="https://colab.research.google.com/">free colab notebook</a> — to generate your own avatars. In this post I’ll walk you through the process of doing exactly that!</p>
<p>Before diving in, it would help to review the <a href="/2023/04/01/txt2img">AI Primer: Image Models 101</a> post if you are new to the world of image models in general. I also wrote a quick guide on getting Stable Diffusion <a href="/2023/04/05/quick-sd-install-guide">set up on your computer</a> — I’ll assume you’ve already done that.</p>
<h2 id="what-is-fine-tuning">What is fine-tuning?</h2>
<p>I previously argued that one of the main advantages Stable Diffusion has over a proprietary model like Midjourney is the ability to customize it. While Midjourney produces stunning imagery with very little effort, it is going to have a difficult time producing photos of your likeness or in a niche style. This is getting better with their image-to-image features, but for any work requiring a high degree of flexibility and control, it is hard to beat Stable Diffusion’s capabilities.</p>
<p><a href="/images/2023/lora_collage.jpg"><img src="/images/2023/lora_collage.jpg" alt="Example of AI-generated characters and style" /></a>
<em>Examples of AI-generated images with Stable Diffusion, after fine-tuning</em></p>
<p>This level of customization is unlocked by the concept of <em>fine-tuning</em>. At a high level, fine-tuning is the process of taking a large pre-trained model and training it on your own data to achieve a specific result. This process is becoming increasingly popular in the ML community as the pre-trained models get larger and much more capable. It provides the last mile tuning you need to get dramatically improved performance on your specific problem — with much less effort than it would take to build a whole new model from scratch.</p>
<p>Fine-tuning has been successfully applied in many realms such as <a href="https://ai.stackexchange.com/questions/39023/are-gpt-3-5-series-models-based-on-gpt-3">ChatGPT</a> and <a href="https://github.com/tatsu-lab/stanford_alpaca">Alpaca</a> for text. In the image generation space, it is typically used to teach models to generate images featuring custom characters, objects, or specific styles — especially those that the large pre-trained model has not encountered before.</p>
<h2 id="types-of-fine-tuning">Types of fine-tuning</h2>
<p>The <em>classic</em> way to fine-tune image models is conceptually simple: provide a large dataset of labeled image and caption pairs, then re-run training using the existing model weights as a prior. This process is very similar to how the pre-trained model learned concepts in the first place. Lambda Labs wrote a <a href="https://lambdalabs.com/blog/how-to-fine-tune-stable-diffusion-how-we-made-the-text-to-pokemon-model-at-lambda">very good article</a> on how they produced the “text-to-pokemon” model by doing this.</p>
<p>This type of fine-tuning, while substantially cheaper and easier to do than building a new model from scratch, still requires a lot of data (on the order of hundreds of images) and GPU compute time. Since then, there have been many innovative techniques published by researchers on how to make this process even more effective and efficient. I’ll briefly touch on the three most popular ones:</p>
<h3 id="textual-inversion">Textual inversion</h3>
<p><a href="https://textual-inversion.github.io/">This paper</a> from researchers at Tel Aviv University and NVIDIA proposed a way to learn a new concept from as little as 3-5 example images, and notably does <em>not</em> require changing the base pre-trained model in any way. I won’t go into the details of how this works here, but paperspace has a <a href="https://blog.paperspace.com/dreambooth-stable-diffusion-tutorial-part-2-textual-inversion/">good tutorial</a> on this process, and there is a page maintained in the Automatic1111 UI Wiki on how to use and train using <a href="https://github.com/AUTOMATIC1111/stable-diffusion-webui/wiki/Textual-Inversion">textual inversion</a>.</p>
<p>The ultimate output of this process is a very small “embeddings” file that is typically less than 100 kilobytes. This embedding can be used in tandem with the base model to generate the learned concepts in new ways.</p>
<h3 id="dreambooth">Dreambooth</h3>
<p><a href="https://dreambooth.github.io/">This paper</a> from researchers at Google also only requires 3-5 example images and is able to learn a new character or style. However, Dreambooth does this by directly modifying the pre-trained models and updates the weights. This is a more powerful technique, but the output of the process is a whole new model that is roughly the same size as the pre-trained one. For Stable Diffusion 1.5 that means your newly produced model will be around 4.5 gigabytes!</p>
<p>There is evidence to suggest that commercial products like Lensa’s Magic Avatars use this technique with great results. Replicate has a <a href="https://replicate.com/blog/dreambooth-api">good blog post</a> on how to train a model using Dreambooth, and their service makes it easy to make one if you don’t have access to a powerful GPU.</p>
<h3 id="lora">LoRA</h3>
<p>This type of fine-tuning is based on the paper <a href="https://arxiv.org/abs/2106.09685">“Low-Rank Adaptation of Large Language Models”</a> — which as the name suggests — was originally a technique used to fine-tune large language models like GPT-3.</p>
<p>While the general technique predates both Textual Inversion and Dreambooth, its application to diffusion models for image generation is very new, kicked off early this year by <a href="https://github.com/cloneofsimo/lora">cloneofsimo</a>. This method produces an output that is between 50 and 200 megabytes in size, and does not require modifying the pre-trained model.</p>
<h3 id="pros--cons">Pros & Cons</h3>
<p>Let’s summarize these three techniques.</p>
<table width="100%" class="pure-table pure-table-bordered">
<thead>
<tr>
<th>Type</th>
<th># of Examples</th>
<th>Output Size</th>
<th>Notes</th>
</tr>
</thead>
<tbody>
<tr>
<td>Textual Inversion</td>
<td>3-5 minimum, ideally 20-30</td>
<td>< 100 KB embeddings file</td>
<td><b>OK quality</b>, works better for remixing of existing concepts in the base model.<br /><br />You can combine multiple embeddings at runtime to generate multiple concepts, a single embedding usually represents a single concept or style.</td>
</tr>
<tr>
<td>Dreambooth</td>
<td>3-5 minimum, ideally 20-30</td>
<td>~4.5 GB full model</td>
<td><b>Great quality</b>, works well for both characters and styles.<br /><br />Since it produces a new full model, you have to train multiple characters in a single training. Mixing styles can be tricky to accomplish.</td>
</tr>
<tr>
<td>LoRA</td>
<td>20-50 for characters, 50-200 for styles</td>
<td>~50-200 MB tensor files</td>
<td><b>Good quality</b>, somewhere between Textual Inversion and Dreambooth. You can apply multiple LoRAs at runtime (just like embeddings), and they are very flexible to mix and match.</td>
</tr>
</tbody>
</table>
<p>Note: I won’t discuss another technique called <a href="https://github.com/AUTOMATIC1111/stable-diffusion-webui/discussions/2670">“hypernetworks”</a> here (this is <em>NOT</em> the same as the technique popularized by <a href="https://arxiv.org/abs/1609.09106">this 2016 paper</a>), primarily because vetted knowledge of how to make one optimally is hard to come by.</p>
<p>In this tutorial, I will <strong>focus on the LoRA fine-tuning technique</strong>. In my own experimentation, I’ve found it gave me results that were higher quality than textual inversion, but with an output not as heavyweight as Dreambooth’s. This let me build a number of characters and styles fairly quickly.</p>
<h2 id="basics-of-a-lora-setup">Basics of a LoRA setup</h2>
<p>Training a LoRA model itself takes only around 10 minutes, but expect the whole process including setting up and preparing training data to take around 2 hours.</p>
<p>There are two main options you have for LoRA training. The original implementation by <a href="https://github.com/cloneofsimo/lora">cloneofsimo</a> seems to have worked well for many; however, I had two issues with it:</p>
<ul>
<li>The default parameters did not work well with my own training data.</li>
<li>The output of this implementation is not compatible with the Automatic1111 UI (that I recommended in my post on <a href="/2023/04/05/quick-sd-install-guide">Installing Stable Diffusion</a>). If you do want to try out this original implementation, replicate once again has an <a href="https://replicate.com/blog/lora-faster-fine-tuning-of-stable-diffusion">easy step-by-step tutorial</a> on how to do it.</li>
</ul>
<p>I opted for <a href="https://github.com/kohya-ss/sd-scripts">kohya’s implementation</a> because it produces outputs compatible with the Automatic1111 UI and offers additional options for adjusting the fine-tuning process. I ran this locally on my (<a href="https://twitter.com/anantn/status/1641672926687801344">brand new!</a>) RTX 4090, but the process can be run on a machine with as little as 6 GB of VRAM. This guide is focused on fine-tuning locally with an NVIDIA card. If you don’t have one, you can use <a href="https://colab.research.google.com/">Google Colab</a>, which has a generous free tier, to train your models - <a href="https://github.com/Linaqruf/kohya-trainer">this notebook</a> uses many of the same techniques I’ll talk about here.</p>
<p>There is a very handy UI wrapper on top of kohya’s training scripts called <a href="https://github.com/bmaltais/kohya_ss"><code class="language-plaintext highlighter-rouge">kohya_ss</code></a> that we will be using. It makes it easier to manage various configurations and has some nice utilities, like auto-captioning, that will come in handy.</p>
<p>We’ll use <code class="language-plaintext highlighter-rouge">pyenv</code> again to manage our Python environment. Let’s make a new one for the <code class="language-plaintext highlighter-rouge">kohya_ss</code> GUI, as the requirements here differ slightly from the ones needed for the Automatic1111 UI.</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>pyenv <span class="nb">install </span>3.10.10
<span class="nv">$ </span>git clone git@github.com:bmaltais/kohya_ss.git
<span class="nv">$ </span><span class="nb">cd </span>kohya_ss
<span class="nv">$ </span>pyenv <span class="nb">local </span>3.10.10
</code></pre></div></div>
<p>Now, let’s install the requirements for the GUI and training scripts. Execute in this specific order, so you have the most optimized version of pytorch (you basically want the one with CUDA 11.8 support):</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>pip <span class="nb">install</span> <span class="nt">-r</span> requirements.txt
<span class="nv">$ </span>pip <span class="nb">install </span>torch torchvision <span class="nt">--extra-index-url</span> https://download.pytorch.org/whl/cu118
<span class="nv">$ </span>pip <span class="nb">install</span> <span class="nt">-U</span> xformers
</code></pre></div></div>
<p>Finally, let’s configure <code class="language-plaintext highlighter-rouge">accelerate</code>. The defaults work just fine, the only setting that I changed was to use <code class="language-plaintext highlighter-rouge">bf16</code> for mixed precision. All Ampere architecture cards (RTX 3000 or higher) support this format, which results in speedier training runs. Use <code class="language-plaintext highlighter-rouge">fp16</code> if you care about backward compatibility, e.g. you want to run inference on your fine-tuned model on cards that don’t support <code class="language-plaintext highlighter-rouge">bf16</code>.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ accelerate config
In which compute environment are you running?
This machine
Which type of machine are you using?
No distributed training
Do you want to run your training on CPU only (even if a GPU is available)? [yes/NO]:
NO
Do you wish to optimize your script with torch dynamo?[yes/NO]:
NO
Do you want to use DeepSpeed? [yes/NO]:
NO
What GPU(s) (by id) should be used for training on this machine as a comma-seperated list? [all]:
all
Do you wish to use FP16 or BF16 (mixed precision)?
bf16
</code></pre></div></div>
<p>Let’s test that everything worked by starting the GUI:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># From the kohya_ss directory</span>
<span class="nv">$ </span>python kohya_gui.py
Load CSS...
Running on <span class="nb">local </span>URL: http://127.0.0.1:7860
To create a public <span class="nb">link</span>, <span class="nb">set</span> <span class="sb">`</span><span class="nv">share</span><span class="o">=</span>True<span class="sb">`</span> <span class="k">in</span> <span class="sb">`</span>launch<span class="o">()</span><span class="sb">`</span><span class="nb">.</span>
</code></pre></div></div>
<p>Open up the printed URL in your browser. If that worked, we are ready to begin the process of fine-tuning. I’ll walk you through what I did to train a LoRA for my own face.</p>
<h2 id="training-data-preparation">Training data preparation</h2>
<p>Machine learning makes the age-old computer science concept of <a href="https://en.wikipedia.org/wiki/Garbage_in,_garbage_out">“Garbage in, garbage out”</a> painfully obvious. Expect to spend a good chunk of time preparing your training data; this is time well invested towards ensuring a good quality output! I can imagine this workflow getting more and more automated over time, but for now, we’ll have to do it manually.</p>
<h3 id="image-selection">Image selection</h3>
<p>My first stop was Google Photos, where I grabbed the 50 most recent images that included my face. A few guidelines to keep in mind:</p>
<ul>
<li>Focus on <strong>high resolution, high quality</strong> images. If a photo is blurry or low resolution, err on the side of not including it in the training set.</li>
<li>You need to be able to <strong>crop the photo</strong> such that you are the only face in the photo after doing the cropping.</li>
<li>A good variety of <strong>close-up photos</strong> and some full-body shots are ideal. Avoid photos where you are really far away; there’s not much for the model to learn from these.</li>
<li>Try to shoot for <strong>variety</strong> in terms of lighting, poses, backgrounds, and facial expressions. The greater the diversity, the more flexible your model will be.</li>
</ul>
<aside>
<p>💡Tip: Sometimes blurry or low-res images can be salvaged through the process of upscaling. There is a built-in upscaler in the Automatic1111 UI, <a href="https://www.reddit.com/r/StableDiffusion/comments/xkjjf9/upscale_to_huge_sizes_and_add_detail_with_sd/">here is a tutorial</a> on how to use it. I got good results using the <a href="https://github.com/JingyunLiang/SwinIR">SwinIR upscaling model</a>.</p>
</aside>
<p>Applying these guidelines, I got a set of 25 usable images. I then proceeded to crop them — the default Windows Photos tool worked great, and <a href="https://www.birme.net/">birme</a> is another useful online tool. Don’t worry about cropping to a specific size or aspect ratio; just crop so your face and/or body are the predominant entity in the image, with some background detail but ideally no other people. Make sure all your cropped images are in a single directory and have unique names (excluding the file extension).</p>
<h3 id="captioning">Captioning</h3>
<p>The next step in the process is to caption each image. In my experiments, there was a marked difference in quality when training with captions compared to without, so I highly recommend doing this.</p>
<p>Fortunately, as I described in my image models primer post, you can use automated techniques to generate image descriptions, which can significantly reduce your workload! One of the reasons we are using the <code class="language-plaintext highlighter-rouge">kohya_ss</code> GUI instead of the original training scripts directly is for the captioning feature, so let’s fire that up.</p>
<p>Open up the printed URL in your browser, head over to the <code class="language-plaintext highlighter-rouge">Utilities</code> tab and <code class="language-plaintext highlighter-rouge">BLIP Captioning</code>. Use the following settings:</p>
<ul>
<li>Image folder to caption: should point to the directory containing the cropped images from the previous step.</li>
<li>Caption file extension: the convention is to have a <code class="language-plaintext highlighter-rouge">.txt</code> file next to each image file containing the description for it.</li>
<li>Prefix to add to BLIP caption: it helps to prefix each description with <code class="language-plaintext highlighter-rouge">photo of <token>,</code> — so the model learns to associate the token with your likeness.</li>
<li>Increase “Number of beams” to 10, and set “Min length” to 25, so you get captions with at least two sentences.</li>
</ul>
<p>The page should look something like this:</p>
<p><a href="/images/2023/kohya_ss_captioning.jpg"><img src="/images/2023/kohya_ss_captioning.jpg" alt="BLIP captioning" /></a></p>
<aside>
<p>⚠️ Caution: when selecting an appropriate token to prefix your captions with, consider how common your name is. In my case I chose <code class="language-plaintext highlighter-rouge">anantn</code> because it is a fairly uncommon token. If you pick something like <code class="language-plaintext highlighter-rouge">john</code> or <code class="language-plaintext highlighter-rouge">mary</code> your results will be diluted by what the model has already learned about these names, mainly from celebrities and public figures.</p>
<p>You can review the <a href="https://huggingface.co/runwayml/stable-diffusion-v1-5/raw/main/tokenizer/vocab.json">frequency of tokens</a> by their appearance in the base Stable Diffusion 1.5 model, ideally you should pick one that isn’t on that list.</p>
</aside>
<p>Click “Caption images” and let it run! It shouldn’t take more than a few minutes; you can follow the progress on the command line where you launched <code class="language-plaintext highlighter-rouge">kohya_gui.py</code>.</p>
<p>Once the process is complete, open up the folder of training images, where you should now see a <code class="language-plaintext highlighter-rouge">.txt</code> file with the same filename as each image. Do a quick once-over to make sure the captions are sensible; feel free to make edits and save them back to the same text file. Here are some general guidelines for good captions:</p>
<ul>
<li>Your goal is to associate your likeness with the token you chose. Characteristics that are the same in every photo (in my case, “black hair” and “brown skin”) should not be in the caption.</li>
<li>Explicitly caption things that are different in each photo, e.g. “wearing sunglasses” or “wearing a white shirt”.</li>
<li>Describe the type of photo, e.g. “close up” or “full body”.</li>
<li>If the photo contains other elements such as a background, describe them as well, e.g. “in a kitchen” or “in a park”.</li>
<li>If the photo was taken in the iPhone portrait mode, call it out, and include tags like “blurry background”.</li>
</ul>
<p>Remember, the more accurate and descriptive your captions are, the easier it will be for the model to be flexible in image generation (e.g., swapping out black hair for purple hair).</p>
<div class="pure-g">
<div class="pure-u-1 pure-u-md-1-2 imgholder">
<p><a href="/images/2023/anant_captioned.jpg"><img src="/images/2023/anant_captioned.jpg" alt="Example caption" /></a>
<em>Example auto-generated caption: “photo of anantn, a man in an orange shirt and sunglasses sitting on a rock in the middle of the desert”</em></p>
</div>
</div>
<h2 id="hyperparameter-selection">Hyperparameter selection</h2>
<p>I spent a lot of time playing with the knobs you have at your disposal when fine-tuning. I’ll discuss the relevant hyperparameters in more detail below, but if you’re just interested in the optimal configuration I found, jump ahead to the <a href="#training">training section</a>. In fact, I’d recommend you do that first, and then come back here to read about the hyperparameters to further refine your LoRA.</p>
<h3 id="training-steps">Training steps</h3>
<p>The total number of training steps your fine-tuning run will take is dependent on 4 variables:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>total_steps = (num_images * repeats * max_train_epochs) / train_batch_size
</code></pre></div></div>
<p>Your goal is to end up with a step count between 1500 and 2000 for character training. The number you can pick for <code class="language-plaintext highlighter-rouge">train_batch_size</code> depends on how much VRAM your GPU has, and the higher the number, the faster your training goes. However, I wouldn’t pick a number higher than 2, and for most cases the default of 1 works just fine: a higher <code class="language-plaintext highlighter-rouge">train_batch_size</code> means you need more training images, and the training time is pretty fast as it is.</p>
<p>Generally, you also want more repeats than epochs — since there is the option to checkpoint your fine-tuning every epoch — and you’ll want to make use of that to see how learning is progressing. Epochs in the 5 to 20 range are reasonable, adjust your repeats accordingly.</p>
<p>In my case, recall that I had 25 example images. I went with:</p>
<ul>
<li><code class="language-plaintext highlighter-rouge">train_batch_size</code> = 1</li>
<li><code class="language-plaintext highlighter-rouge">repeats</code> = 15</li>
<li><code class="language-plaintext highlighter-rouge">max_train_epochs</code> = 5</li>
</ul>
<p>These values work out to a step count of <strong>1875</strong>.</p>
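<p>As a quick sanity check, here’s the same arithmetic in Python:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Step-count sanity check for my configuration (25 training images).
num_images, repeats, max_train_epochs, train_batch_size = 25, 15, 5, 1
total_steps = (num_images * repeats * max_train_epochs) // train_batch_size
print(total_steps)  # 1875, within the 1500-2000 target for character training
</code></pre></div></div>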
<aside>
<p>❗NOTE: In the <code class="language-plaintext highlighter-rouge">kohya_ss</code> GUI, you can only specify the <code class="language-plaintext highlighter-rouge">batch_size</code> and <code class="language-plaintext highlighter-rouge">num_epochs</code> parameters. The <code class="language-plaintext highlighter-rouge">num_repeats</code> parameter is implicitly specified by naming your training images folder a certain way: <code class="language-plaintext highlighter-rouge"><repeats>_<token></code>. Thus, in my case, I renamed my folder to <code class="language-plaintext highlighter-rouge">15_anantn</code>.</p>
</aside>
<h3 id="learning-rates">Learning rates</h3>
<p>The learning rate hyperparameter controls how quickly the model absorbs changes from the training images. Under the hood, there are really two components to learning: the “text encoder” and the “UNET”. To oversimplify their roles:</p>
<ul>
<li>The “text encoder” learning rate (<code class="language-plaintext highlighter-rouge">text_encoder_lr</code>) controls how quickly the captions are absorbed.</li>
<li>The “UNET” learning rate (<code class="language-plaintext highlighter-rouge">unet_lr</code>) controls how quickly the visual artifacts are absorbed.</li>
</ul>
<p>Through repeated experimentation, the community has concluded that it is better to learn these at different rates — “text encoder” should be learned at a slower rate than the “UNET”.</p>
<p>The default values here are pretty sane: set <code class="language-plaintext highlighter-rouge">text_encoder_lr</code> to <code class="language-plaintext highlighter-rouge">5e-5</code> and <code class="language-plaintext highlighter-rouge">unet_lr</code> to <code class="language-plaintext highlighter-rouge">1e-4</code>. You can specify them in scientific notation or written out in decimal as <code class="language-plaintext highlighter-rouge">0.00005</code> and <code class="language-plaintext highlighter-rouge">0.0001</code> respectively.</p>
<p>Note that if you specify learning rates for the text encoder and UNET separately as suggested above, the global <code class="language-plaintext highlighter-rouge">learning_rate</code> parameter is ignored.</p>
<h3 id="scheduler--optimizer">Scheduler & Optimizer</h3>
<p>The next option you have to tweak is the <code class="language-plaintext highlighter-rouge">lr_scheduler_type</code>. There are basically three good options here:</p>
<ul>
<li><code class="language-plaintext highlighter-rouge">constant</code>: the learning rate is constant throughout the training process.</li>
<li><code class="language-plaintext highlighter-rouge">cosine_with_restarts</code>: the learning rate oscillates between the initial value and 0.</li>
<li><code class="language-plaintext highlighter-rouge">polynomial</code>: the learning rate decreases polynomially from the initial value to 0.</li>
</ul>
<p>Empirically, <code class="language-plaintext highlighter-rouge">polynomial</code> worked best for me, but many in the community swear by <code class="language-plaintext highlighter-rouge">cosine_with_restarts</code>. If you find the model isn’t really learning as quickly as you’d like, <code class="language-plaintext highlighter-rouge">constant</code> is worth a try. It’s hard to be prescriptive about the right option here as it seems to be very dependent on the shape, size, and quality of your training data. Since each training run only takes on the order of 10 minutes, this is one of the settings that’s worth experimenting with and seeing what works best for you.</p>
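<p>If it helps to build intuition, here’s an illustrative Python sketch of the approximate shape of each schedule. This is a simplification for intuition only, not the exact <code class="language-plaintext highlighter-rouge">kohya_ss</code> implementation:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import math

def lr_at(progress, base_lr=1e-4, schedule="polynomial", num_cycles=3):
    """Approximate learning rate at `progress` (0.0 to 1.0) through training."""
    if schedule == "constant":
        return base_lr
    if schedule == "cosine_with_restarts":
        # oscillates between base_lr and 0, restarting num_cycles times
        return base_lr * 0.5 * (1 + math.cos(math.pi * ((progress * num_cycles) % 1.0)))
    if schedule == "polynomial":
        # decays from base_lr to 0 (shown with power 1 for simplicity)
        return base_lr * (1 - progress)

for p in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(p, {s: round(lr_at(p, schedule=s), 6)
              for s in ("constant", "cosine_with_restarts", "polynomial")})
</code></pre></div></div>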
<p>A related setting is the <code class="language-plaintext highlighter-rouge">optimizer_type</code>. The default is <code class="language-plaintext highlighter-rouge">AdamW8bit</code>, which works wonderfully for most cases. In theory, the full <code class="language-plaintext highlighter-rouge">AdamW</code> optimizer offers higher precision but requires more VRAM and is only supported by newer GPUs; in practice I didn’t find the improvement noticeable.</p>
<h3 id="network-rank--alpha">Network Rank & Alpha</h3>
<p>This is likely the most controversial setting. The “network rank” (interchangeably called “network dimensions”, represented as the <code class="language-plaintext highlighter-rouge">network_dim</code> parameter) is a proxy for how detailed your fine-tuning model can get. More dimensions mean more layers available to absorb the finer details of your training set. Be warned, too many layers without enough quantity or diversity in your training data could lead to bad results. Network alpha (<code class="language-plaintext highlighter-rouge">network_alpha</code>) is a dampening effect that controls how quickly the layers absorb new information.</p>
<p>This is where the controversy arises. ML theory suggests that you’d normally need only 8 dimensions (small number of layers) with an alpha of 1 (heavy dampening) to achieve good results, and these are in fact the defaults in both the <code class="language-plaintext highlighter-rouge">cloneofsimo</code> and <code class="language-plaintext highlighter-rouge">kohya</code> LoRA implementations. These values result in a further dampening effect on the learning rates we chose above, on the order of <code class="language-plaintext highlighter-rouge">1 / 8 = 0.125</code>.</p>
<p>In practice, I and others have found these settings result in extremely weak effective learning rates, producing fine-tuned models whose images don’t resemble the training data at all. You could, in theory, counteract this by increasing the learning rates themselves. What I’ve found works better is to crank up the number of dimensions and set alpha <em>equal</em> to that number. This results in <em>NO</em> dampening effect, but it works since our learning rates were already conservative to begin with. I settled on:</p>
<ul>
<li><code class="language-plaintext highlighter-rouge">network_dim</code> = 128</li>
<li><code class="language-plaintext highlighter-rouge">network_alpha</code> = 128</li>
</ul>
<p>I’m no ML expert by any means, but I can say these values empirically worked much better than the defaults for my training set. If any knowledgeable folks can provide insights on the theory behind these values, that would be very valuable for the community!</p>
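<p>To make the dampening arithmetic concrete, here’s my understanding of the scaling in a few lines of Python (a sketch, assuming the implementation scales each LoRA update by <code class="language-plaintext highlighter-rouge">alpha / dim</code> as described above):</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def lora_scale(network_dim, network_alpha):
    # LoRA weight updates are multiplied by alpha / dim, so a low alpha
    # relative to dim dampens how much the network absorbs per step.
    return network_alpha / network_dim

print(lora_scale(8, 1))      # defaults: 0.125, heavy dampening
print(lora_scale(128, 128))  # my settings: 1.0, no dampening
</code></pre></div></div>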
<h3 id="adaptive-optimizers">Adaptive optimizers</h3>
<p>There are two interesting new “adaptive” optimizers available. While they didn’t work as well for me, they are worth discussing. Both the <code class="language-plaintext highlighter-rouge">DAdaptation</code> and <code class="language-plaintext highlighter-rouge">AdaFactor</code> optimizers find the best learning rate automatically as training progresses!</p>
<p>In my experimentation, I found that they settled on a rather low learning rate, producing models that made high quality images with no signs of overfitting at all; however, the likeness of my face never got as close as with the other optimizers. You might have better luck with them, and I could be missing some other key factor. For instance, some have suggested that for AdaFactor to work properly you need a lot more training steps and epochs than usual.</p>
<div class="pure-g">
<div class="pure-u-1 pure-u-md-1-2 imgholder">
<p><a href="/images/2023/dadaptation_example.jpg"><img src="/images/2023/dadaptation_example.jpg" alt="DAdaptation" /></a>
<em>DAdaptation output was pretty underfit</em></p>
</div>
</div>
<p>Both these optimizers work best with a high network rank, and an alpha equal to rank. However, it seems that optimal settings for these two optimizers require a lot more training steps than a classic <code class="language-plaintext highlighter-rouge">AdamW8bit</code> optimizer would. This makes a 10-minute training run into a 20 or 30-minute operation, and I just couldn’t justify the additional time relative to the quality improvement.</p>
<p>It’s very likely I am missing something though. If you have any insights on getting better results from these optimizers, please share them in the comments! If you want to try either of these optimizers out, keep the following caveats in mind:</p>
<h4 id="dadaptation">DAdaptation</h4>
<p><a href="https://github.com/facebookresearch/dadaptation">This optimizer</a> was developed by Meta and is intended as a drop-in replacement for <code class="language-plaintext highlighter-rouge">Adam</code>. Just run <code class="language-plaintext highlighter-rouge">pip install dadaptation</code>, and for best results pair it with the following settings:</p>
<ul>
<li><code class="language-plaintext highlighter-rouge">scheduler</code> must be <code class="language-plaintext highlighter-rouge">constant</code>.</li>
<li><code class="language-plaintext highlighter-rouge">learning_rate</code> must be <code class="language-plaintext highlighter-rouge">1</code>.</li>
<li><code class="language-plaintext highlighter-rouge">text_encoder_lr</code> must be <code class="language-plaintext highlighter-rouge">0.5</code>.</li>
<li>You have to provide the following “Optimizer extra arguments”: <code class="language-plaintext highlighter-rouge">decouple=True</code>, which instructs the optimizer to learn the UNET and text encoder at different rates (text at half the rate of the UNET, given the values above). You can also optionally provide the <code class="language-plaintext highlighter-rouge">weight_decay=0.05</code> argument, but I couldn’t really tell if it made a difference.</li>
</ul>
<h4 id="adafactor">AdaFactor</h4>
<p><a href="https://arxiv.org/abs/1804.04235">This optimizer</a> has shown very promising results in the language model community. It comes with its own scheduler that you must use:</p>
<ul>
<li><code class="language-plaintext highlighter-rouge">scheduler</code> must be set to <code class="language-plaintext highlighter-rouge">adafactor</code>.</li>
<li><code class="language-plaintext highlighter-rouge">learning_rate</code> must be <code class="language-plaintext highlighter-rouge">1</code>.</li>
<li><code class="language-plaintext highlighter-rouge">text_encoder_lr</code> must be <code class="language-plaintext highlighter-rouge">0.5</code>.</li>
<li>You have to provide the following “Optimizer extra arguments”: <code class="language-plaintext highlighter-rouge">relative_step=True scale_parameter=True warmup_init=False</code>. You can set <code class="language-plaintext highlighter-rouge">warmup_init=True</code> for smoother learning rate convergence, however, I’ve found you need a <em>lot</em> more training steps and epochs with this setting.</li>
</ul>
<h2 id="training">Training</h2>
<p>To recap, here are the hyperparameters I finally settled on. I suggest you start your first run with these, and go back to the previous sections for other options if you’re not happy with the results. The values below are for a training set of 25 images; adjust the number of repeats and epochs accordingly if you have more or fewer images.</p>
<table width="100%" class="pure-table pure-table-bordered">
<thead>
<tr>
<th>Hyperparameter</th>
<th>Recommended value</th>
</tr>
</thead>
<tbody>
<tr>
<td>batch_size</td>
<td>1</td>
</tr>
<tr>
<td>repeats</td>
<td>15</td>
</tr>
<tr>
<td>max_train_epochs</td>
<td>5</td>
</tr>
<tr>
<td>text_encoder_lr</td>
<td>5e-5</td>
</tr>
<tr>
<td>unet_lr</td>
<td>1e-4</td>
</tr>
<tr>
<td>lr_scheduler_type</td>
<td>polynomial</td>
</tr>
<tr>
<td>optimizer_type</td>
<td>AdamW8bit</td>
</tr>
<tr>
<td>network_dim</td>
<td>128</td>
</tr>
<tr>
<td>network_alpha</td>
<td>128</td>
</tr>
</tbody>
</table>
<p>Recall that you specify the number of <code class="language-plaintext highlighter-rouge">repeats</code> implicitly by naming your training images folder in the form <code class="language-plaintext highlighter-rouge"><repeats>_<token></code>. I recommend sharing the same training data and logs directories across multiple training runs, while creating a new output directory for each hyperparameter configuration you want to experiment with. Something like this could work, for example:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>|- lora
| |- training
| | |- 15_anantn
| | | |- 1.jpg
| | | |- 1.txt
| | | |- ...
| | | |- 25.jpg
| | | |- 25.txt
| |- outputs
| | |- v1
| | | |- config.json
| | |- v2
| | | |- config.json
| | |- ...
| |- logs
</code></pre></div></div>
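<p>A few lines of Python can scaffold this layout if you like (the base path, token, and repeat count are examples; substitute your own):</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import os

base = os.path.expanduser("~/lora")  # example base path
for d in ("training/15_anantn", "outputs/v1", "logs"):
    os.makedirs(os.path.join(base, d), exist_ok=True)
</code></pre></div></div>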
<p>For the base model, I recommend sticking with Stable Diffusion v1.5 (v2.0+ isn’t widely adopted by the community yet). Even if you plan on doing inference with models derived from v1.5, it is still beneficial to perform your initial training on just the vanilla model for maximum flexibility with a variety of downstream models. You should already have a copy of this model if you <a href="/2023/04/05/quick-sd-install-guide">installed the Automatic1111 Stable Diffusion UI</a>.</p>
<p>Now all we have to do is plug these folder paths and hyperparameters into the UI. Make sure to select the (confusingly named) <strong>Dreambooth LoRA</strong> tab at the very top!</p>
<p><a href="/images/2023/kohya_ss_folders.jpg"><img src="/images/2023/kohya_ss_folders.jpg" alt="Kohya GUI folders" /></a>
<em>Folder configuration</em></p>
<p><a href="/images/2023/kohya_ss_config.jpg"><img src="/images/2023/kohya_ss_config.jpg" alt="Kohya Configuration" /></a>
<em>Hyperparameter configuration</em></p>
<p>Double-check all the values again. A couple more settings to keep in mind:</p>
<ul>
<li>Don’t forget to set the “Caption extension” value to <code class="language-plaintext highlighter-rouge">.txt</code>!</li>
<li>For max resolution, if most of your images are larger than 768x768, then you can set <code class="language-plaintext highlighter-rouge">768,768</code> as the value. If not, leave it at the default <code class="language-plaintext highlighter-rouge">512,512</code>.</li>
</ul>
<p>Click the “Train model” button, and wait for the training to complete. You can follow progress on the terminal where you started <code class="language-plaintext highlighter-rouge">kohya_gui.py</code>. For 1875 steps on my RTX 4090, this took less than 10 minutes.</p>
<aside>
<p>💡Tip: You can run <code class="language-plaintext highlighter-rouge">tensorboard --logdir /path/to/logs</code> to pull up useful graphs on how your training went. If your <code class="language-plaintext highlighter-rouge">loss</code> ever hits <code class="language-plaintext highlighter-rouge">NaN</code>, it means your fine-tune is burned, and you should double-check your hyperparameters.</p>
</aside>
<h2 id="testing">Testing</h2>
<p>Now comes the fun part! Let’s test our model to see how we did.</p>
<p>For testing, I used <a href="https://huggingface.co/darkstorm2150/Protogen_x5.3_Official_Release">Protogen x5.3</a>, which is a fine-tuned derivative of Stable Diffusion v1.5. You can certainly use the plain v1.5 model, but models like Protogen give you a lot more creative tools and prompting options.</p>
<p>To use a model like Protogen, just download the safetensors file from HuggingFace and place it in the <code class="language-plaintext highlighter-rouge">models/Stable-diffusion/</code> directory in your Automatic1111 installation. At this point, also copy (or symlink) the output of your LoRA fine-tuning run into <code class="language-plaintext highlighter-rouge">models/Lora/</code>. The directory structure inside <code class="language-plaintext highlighter-rouge">stable-diffusion-webui</code> should now look something like this:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>|- models
| |- Lora
| | |- last-000001.safetensors
| | |- last-000002.safetensors
| | |- last-000003.safetensors
| | |- last-000004.safetensors
| | |- last.safetensors
| |- Stable-diffusion
| | |- ProtoGen_X5.3.safetensors
</code></pre></div></div>
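<p>If you’d rather symlink than copy, here’s a small Python sketch that links every checkpoint from a training run into place (both paths are hypothetical; adjust them to your setup):</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import glob, os

src = os.path.expanduser("~/lora/outputs/v1")                     # kohya output dir
dst = os.path.expanduser("~/stable-diffusion-webui/models/Lora")  # webui LoRA dir
for f in glob.glob(os.path.join(src, "*.safetensors")):
    link = os.path.join(dst, os.path.basename(f))
    if not os.path.exists(link):
        os.symlink(f, link)  # or shutil.copy(f, link) to copy instead
</code></pre></div></div>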
<p>Fire up <code class="language-plaintext highlighter-rouge">launch.py</code> (see my installation post for the best command-line arguments) and think of a test prompt. I chose something like:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>a portrait of anantn, red hair, handsome face, oil painting, artstation, <lora:last:1.0>
Negative prompt: ugly, poorly drawn hands, poorly drawn feet, poorly drawn face, out of frame, extra limbs, disfigured, deformed, watermark, signature, bad art, blurry, blurred
Steps: 35, Sampler: Euler a, CFG scale: 7, Seed: 363423421, Size: 512x512
</code></pre></div></div>
<p>You’ll obviously want to swap out the token <code class="language-plaintext highlighter-rouge">anantn</code> for whatever value you used during training. Click generate! If all went well, you should see an oil painting of someone with your likeness and red hair 😊</p>
<aside>
<p>💡Tip: Copy the prompt above into the text box in your UI, and click the blue arrow under “Generate” to auto-populate the negative prompt and other fields.</p>
</aside>
<p><a href="/images/2023/automatic_ui_first_test.jpg"><img src="/images/2023/automatic_ui_first_test.jpg" alt="First test" /></a></p>
<h3 id="testing-tips">Testing tips</h3>
<p>Here are a few guidelines to keep in mind as you test your newly baked LoRA!</p>
<h4 id="xyz-plots">X/Y/Z Plots</h4>
<p>This is a very handy feature in Automatic1111 that lets you generate batches of images varying some aspect of your prompt. In the prompt we used above, the token <code class="language-plaintext highlighter-rouge"><lora:last:1.0></code> referred to only the last epoch at full strength. Varying this value through an X/Y plot can help you understand how training progressed and if a different epoch at a different strength produced more desirable results.</p>
<p>To generate something like that, just select “X/Y/Z plot” from the drop-down under “Script”. For both “X type” and “Y type”, select “Prompt S/R”, which stands for Prompt Search/Replace.</p>
<ul>
<li>Set “X values” to <code class="language-plaintext highlighter-rouge">last, last-000001, last-000002, last-000003, last-000004</code></li>
<li>Set “Y values” to <code class="language-plaintext highlighter-rouge">1.0, 0.9, 0.8, 0.7, 0.6, 0.5</code></li>
</ul>
<p>This will cycle through all epochs you trained as well as give you a sense of how applying the LoRA at different strengths affects the output. I used X/Y/Z plots extensively to compare outputs not just between epochs, but also between different hyperparameter configurations. For example, here is an X/Y plot that compares the <code class="language-plaintext highlighter-rouge">polynomial</code>, <code class="language-plaintext highlighter-rouge">cosine_with_restarts</code>, and <code class="language-plaintext highlighter-rouge">constant</code> scheduler type at various strengths:</p>
<p><a href="/images/2023/automatic_ui_xy_plot.jpg"><img src="/images/2023/automatic_ui_xy_plot.jpg" alt="X/Y/Z plot" /></a></p>
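<p>Under the hood, Prompt S/R is just textual substitution: the first value in each list is the search term, and every value (including the first) is swapped in for it in turn. Roughly, in Python, with a shortened example prompt:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>prompt = "a portrait of anantn, red hair, <lora:last:1.0>"
xs = ["last", "last-000001", "last-000002", "last-000003", "last-000004"]
ys = ["1.0", "0.9", "0.8", "0.7", "0.6", "0.5"]
# One image per (x, y) pair: a 5 x 6 grid of 30 generations.
for x in xs:
    for y in ys:
        print(prompt.replace(xs[0], x).replace(ys[0], y))
</code></pre></div></div>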
<h4 id="underfit-or-overfit">Underfit or overfit?</h4>
<p>What you look for in a good LoRA can depend a lot on your particular training set and your own artistic sensibility. In my specific case, I was really looking for two main things:</p>
<ul>
<li>Likeness: the output should look like me! I comb my hair somewhat unconventionally, from right to left, so the model reproducing that was a good sign that it captured my likeness. Note how in the grid above, the combing direction changes at strength <code class="language-plaintext highlighter-rouge">1.0</code> for <code class="language-plaintext highlighter-rouge">polynomial</code> and <code class="language-plaintext highlighter-rouge">0.8</code> for <code class="language-plaintext highlighter-rouge">constant</code>.</li>
<li>Adaptability: an easy test was to see if the model could turn my (naturally black) hair bright red. Hair turning blackish even though the prompt says red hair is a sign of overfitting. There can be other signs too: for example, at strength <code class="language-plaintext highlighter-rouge">1.0</code> with <code class="language-plaintext highlighter-rouge">constant</code>, you can see elements in the background that come from my training set but aren’t in my prompt.</li>
</ul>
<p>It might be helpful for you to identify your own criteria based on the details of your likeness and face, both on the low end (outputs that barely resemble you, looking more like a generic stock photo) and on the high end (outputs that look too similar to the training data, losing prompting flexibility).</p>
<div class="pure-g">
<div class="pure-u-1 pure-u-md-1-2 imgholder">
<p><a href="/images/2023/underfit_example.jpg"><img src="/images/2023/underfit_example.jpg" alt="Example of underfit" /></a>
<em>AdaFactor produced underfit: likeness not strong enough, beard still present</em></p>
</div>
<div class="pure-u-1 pure-u-md-1-2 imgholder">
<p><a href="/images/2023/overfit_example.jpg"><img src="/images/2023/overfit_example.jpg" alt="Example of overfit" /></a>
<em>Constant with high LR produced overfit: background is smudged</em></p>
</div>
</div>
<h4 id="generalizability">Generalizability</h4>
<p>Once you have found a particular epoch and strength that you like on a simple prompt like the one above, you can move on to testing the flexibility and generalizability of your LoRA.</p>
<p>A good LoRA will give you good results in a variety of conditions; I encourage you to experiment with things like varying hair colors, adding accessories, changing outfits, and trying out both artistic and realistic environments and backgrounds. Here are a few that I tried:</p>
<div class="pure-g">
<div class="pure-u-1 pure-u-md-1-2 imgholder">
<p><a href="/images/2023/anant_lora_1.jpg"><img src="/images/2023/anant_lora_1.jpg" alt="LoRA Anant Example 1" /></a>
<em>In military uniform, steampunk style</em></p>
</div>
<div class="pure-u-1 pure-u-md-1-2 imgholder">
<p><a href="/images/2023/anant_lora_2.jpg"><img src="/images/2023/anant_lora_2.jpg" alt="LoRA Anant Example 2" /></a>
<em>Realistic, jester clothes, swiss town</em></p>
</div>
<div class="pure-u-1 pure-u-md-1-2 imgholder">
<p><a href="/images/2023/anant_lora_3.jpg"><img src="/images/2023/anant_lora_3.jpg" alt="LoRA Anant Example 2" /></a>
<em>Working in a ramen shop, as a boy</em></p>
</div>
<div class="pure-u-1 pure-u-md-1-2 imgholder">
<p><a href="/images/2023/anant_lora_4.jpg"><img src="/images/2023/anant_lora_4.jpg" alt="LoRA Anant Example 3" /></a>
<em>In an astronaut suit</em></p>
</div>
</div>
<p>I’m pretty pleased with the results - this feels like a solid, baked, generalizable LoRA 👍 While I was a happy “Magic Avatars” customer, these results far surpass what I got back from most commercial services offering AI-generated avatars! And I can generate a near-infinite number of these, limited only by my imagination and prompt-writing ability.</p>
<aside>
<p>💡Tip: If your images come out generally well, except for just one or two elements (like the face), you can fix this using inpainting rather than generating a whole new image with a different seed. Just click the “Send to img2img” button to use the <a href="https://github.com/AUTOMATIC1111/stable-diffusion-webui/wiki/Features#inpainting">inpainting feature</a>.</p>
</aside>
<p>Keep in mind the work shown in this post took me several tries over several days to achieve successful results. If you didn’t get great results on your first attempt, go back to the section on <a href="#hyperparameter-selection">hyperparameter selection</a> and see if tweaking these values helps. I’d also encourage keeping a training diary to keep track of your experiments and results.</p>
<h2 id="objects--styles">Objects & Styles</h2>
<p>Now that you have a repeatable process for creating new LoRAs, you can try them for different characters, objects, and styles. Applying the same process to a training set consisting of photos of my dog, I got impressive results:</p>
<div class="pure-g">
<div class="pure-u-1 pure-u-md-1-2 imgholder">
<p><a href="/images/2023/samahan_lora_1.jpg"><img src="/images/2023/samahan_lora_1.jpg" alt="LoRA Samahan Example 1" /></a>
<em>Wearing metallic armor</em></p>
</div>
<div class="pure-u-1 pure-u-md-1-2 imgholder">
<p><a href="/images/2023/samahan_lora_2.jpg"><img src="/images/2023/samahan_lora_2.jpg" alt="LoRA Samahan Example 2" /></a>
<em>Ghibli style</em></p>
</div>
</div>
<p>My friend <a href="https://twitter.com/vikrum5000">Vikrum</a> had an awesome idea to create art resembling the <a href="https://hyperallergic.com/649642/india-vibrant-idiosyncratic-truck-art/">style found on Indian trucks</a>. This art style is quite unique, but very niche and hence not present in any mainstream image generation model. The ubiquitous phrase <a href="https://www.atlasobscura.com/articles/the-origins-of-horn-ok-please-indias-most-ubiquitous-phrase">“Horn OK Please”</a> is painted on the back of nearly every truck in India, surrounded by art in a distinctive style.</p>
<p><a href="/images/2023/horn_ok_please.jpg"><img src="/images/2023/horn_ok_please.jpg" alt="Indian truck art" /></a>
<em>Typical art style found on the back of Indian trucks</em></p>
<p>We had trouble reproducing this style in Midjourney, but that makes it a perfect use-case for LoRAs with Stable Diffusion. Once again, following the process above, we produced a LoRA that can represent any concepts in the style of Indian truck art:</p>
<div class="pure-g">
<div class="pure-u-1 pure-u-md-1-2 imgholder">
<p><a href="/images/2023/lora_eagle_truck.jpg"><img src="/images/2023/lora_eagle_truck.jpg" alt="Eagle in Indian truck style" /></a>
<em>American bald eagle in Indian truck style</em></p>
</div>
<div class="pure-u-1 pure-u-md-1-2 imgholder">
<p><a href="/images/2023/lora_mona_truck.jpg"><img src="/images/2023/lora_mona_truck.jpg" alt="Mona Lisa in Indian truck style" /></a>
<em>Mona Lisa in Indian truck style</em></p>
</div>
</div>
<p>This LoRA was trained with only 30 training images; I suspect we can do substantially better with more training data. Anecdotally, styles transfer best with 100+ training images. Here is <a href="https://www.reddit.com/r/StableDiffusion/comments/11r2shu/i_made_a_style_lora_from_a_photoshop_action_i/">a reddit post</a> on creating a style LoRA from a Photoshop action, which could be another good resource.</p>
<h2 id="lycoris-locon--loha">LyCORIS (LoCon / LoHa)</h2>
<p>There have been some extensions to the core LoRA algorithm, called “LoCon” and “LoHA”, which you might see in the dropdown options in the <code class="language-plaintext highlighter-rouge">kohya_ss</code> GUI. These are built on the learning algorithms <a href="https://github.com/KohakuBlueleaf/LyCORIS">in this repository</a>.</p>
<p>I gave these a try but was unable to produce results that got anywhere close to the quality of the original LoRA, even with the suggested parameters of fewer than 32 dimensions and a low alpha (1 or lower). It’s worth keeping an eye on these methods as they evolve, but for now I suggest sticking with conventional LoRA.</p>
<h2 id="summary">Summary</h2>
<p>LoRA is a powerful and versatile fine-tuning method for creating custom characters, objects, or styles. While the training data and captioning process is rather cumbersome today, I imagine large parts of it will be automated in the coming months. It wouldn’t surprise me to see many LoRA-based commercial applications pop up in the near future.</p>
<p>Let me know how this method worked for you, or if you have any questions or comments on the process!</p>
</div>anantQuick guide to installing Stable Diffusion2023-04-05T00:00:00+00:002023-04-05T00:00:00+00:00https://www.kix.in/2023/04/05/quick-sd-install-guide<p>In my last post <a href="/2023/04/01/txt2img/">“AI primer: Image Models 101”</a>, I recommended Midjourney and Stable Diffusion as my top two choices. If your goal is to make beautiful images and art, Midjourney is the best choice, as they have a very easy to use service and can produce high quality output with minimal effort.</p>
<p>However, if you are a hobbyist and tinkerer, or just want to invest a bit of time to go deeper in this space, Stable Diffusion is a great choice. It’s a bit more involved to set up, but has several advantages:</p>
<ul>
<li>It is one of the few image generation models that you can run fully offline on your own computer - it’s free and no cloud service required!</li>
<li>You can generate images that you have a hard time generating with models like Midjourney, such as images with an obscure or very specific style.</li>
<li>The model weights are open to inspect, and you can fine-tune these weights to achieve various effects (which we will talk about in a future post).</li>
<li>It has a powerful suite of image to image models — such as for upscaling, inpainting, and outpainting.</li>
<li>There’s a large community of open source developers and hobbyists who work on a wide range of tools and applications around it.</li>
</ul>
<p>Sold? Let’s get started!</p>
<h2 id="easy-1-click-options">Easy 1-click options</h2>
<p>The open source community has made it easy to set up a basic Stable Diffusion setup in just a few clicks. You can use these options if you want something quick and easy to work with.</p>
<ul>
<li><a href="https://nmkd.itch.io/t2i-gui">NMKD UI</a> is the easiest 1-click option for Windows users.</li>
<li><a href="https://diffusionbee.com/">Diffusion Bee</a> is a popular option for Mac users.</li>
<li><a href="https://stable-diffusion-ui.github.io/">Easy Diffusion</a> has installers for Windows and Linux that are fairly easy to use.</li>
</ul>
<p>The option I would most recommend however, is the <a href="https://github.com/AUTOMATIC1111/stable-diffusion-webui">Automatic1111 Web UI</a>. This UI has the most functionality and is actively used by the community for the most cutting-edge work in this space. Installing and using it is a little more daunting compared to the other three options above, but if ease of use were your goal, you might be better off with something like Midjourney anyway.</p>
<p>So, if working with the command line and getting a bit in the weeds doesn’t faze you, read on!</p>
<h2 id="pre-requisites">Pre-requisites</h2>
<p>Large AI models typically require immense compute capability, particularly in the form of GPUs. Even though Stable Diffusion is light enough to run on your own computer, you’ll still get the best results from having a fairly beefy setup:</p>
<ul>
<li>PC with 8 GB or more RAM, and at least 10 GB of free disk space.</li>
<li>A good GPU with at least 6 GB of VRAM. NVIDIA is preferred though AMD can work too.</li>
<li>If you are on a Mac, then any of the Apple Silicon (M1 or higher) laptops will work, thanks to <a href="https://machinelearning.apple.com/research/stable-diffusion-coreml-apple-silicon">CoreML optimizations</a> made by the Apple ML research team!</li>
</ul>
<p>If you are on a Windows PC, I recommend installing WSL2 first. WSL2 is basically an easy way to run a full Linux installation within Windows. While Stable Diffusion is fully supported on Windows natively, I’ve only set it up in WSL2 and there are a lot of nice Unix tools that make my workflows easier.</p>
<p>Installing WSL is <a href="https://learn.microsoft.com/en-us/windows/wsl/install">fairly straightforward</a>, open a Windows terminal or PowerShell and type:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">></span> wsl <span class="nt">--install</span>
</code></pre></div></div>
<p>This installs the latest version of Ubuntu, which is what I use. Most of the instructions that follow are common to Windows, Linux, and Mac.</p>
<p>Python is the <em>lingua franca</em> of all things AI, so we’ll need an installation. Python is notorious for a very messy package management system that is rife with version conflicts, especially so for commonly used ML packages that aren’t always compatible with every Python version. I don’t recommend using the standard system version for this reason.</p>
<p>A lot of projects suggest using python virtual environments (<a href="https://docs.python.org/3/library/venv.html">venv</a>), but I go a different route. venvs are annoying because you have to manually activate them, and sometimes I do want two projects to use the same set of python packages. So I went with <a href="https://github.com/pyenv/pyenv">pyenv</a> which is a lightweight way to manage multiple python versions. Installing pyenv is easy:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>curl https://pyenv.run | bash
</code></pre></div></div>
<p>Make sure to add the required lines to your <code class="language-plaintext highlighter-rouge">~/.bashrc</code> or <code class="language-plaintext highlighter-rouge">~/.zshrc</code> file as instructed. For bash this looks something like:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span><span class="nb">echo</span> <span class="s1">'export PYENV_ROOT="$HOME/.pyenv"'</span> <span class="o">>></span> ~/.bashrc
<span class="nv">$ </span><span class="nb">echo</span> <span class="s1">'command -v pyenv >/dev/null || export PATH="$PYENV_ROOT/bin:$PATH"'</span> <span class="o">>></span> ~/.bashrc
<span class="nv">$ </span><span class="nb">echo</span> <span class="s1">'eval "$(pyenv init -)"'</span> <span class="o">>></span> ~/.bashrc
</code></pre></div></div>
<p>You can test this worked by opening a new terminal and typing:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>pyenv versions
</code></pre></div></div>
<p>Don’t worry about installing a Python version for now; we’ll do it while installing the Stable Diffusion UI. Now is a good time to install <code class="language-plaintext highlighter-rouge">git</code> if you haven’t already:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span><span class="nb">sudo </span>apt <span class="nb">install </span>git
</code></pre></div></div>
<p>On the Mac, git should already be installed if you have ever installed Xcode. If you haven’t, I recommend installing Homebrew first, and then installing git:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>brew <span class="nb">install </span>git
</code></pre></div></div>
<p>You should now have all the prerequisites!</p>
<h2 id="installing-stable-diffusion-webui">Installing stable-diffusion-webui</h2>
<p>Automatic1111’s UI has everything you need to get going. I recommend installing it by cloning the Github repository as it makes keeping up to date with changes easy:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>git clone https://github.com/AUTOMATIC1111/stable-diffusion-webui.git
</code></pre></div></div>
<p>Because we are using pyenv, the installation instructions will differ somewhat compared to the official ones. But feel free to use the official installer scripts if you prefer using the <code class="language-plaintext highlighter-rouge">venv</code> setup, they work just as well.</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># Make sure you are in the webui folder</span>
<span class="nv">$ </span><span class="nb">cd </span>stable-diffusion-webui
<span class="c"># Python version 3.10.6 works best</span>
<span class="nv">$ </span>pyenv <span class="nb">install </span>3.10.6
<span class="c"># We set this specific version to activate whenever we are in this directory</span>
<span class="nv">$ </span>pyenv <span class="nb">local </span>3.10.6
</code></pre></div></div>
<p>Now we can install the UI:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># IGNORE if you are on WSL or Linux, only do this on Mac:</span>
<span class="nv">$ </span><span class="nb">source </span>webui-macos-env.sh
<span class="c"># launch.py also installs any dependencies the first time you use it</span>
<span class="nv">$ </span>python launch.py
</code></pre></div></div>
<p>This will take some time the first time you run it, as it will also install all your python dependencies into pyenv, as well as fetch the Stable Diffusion 1.5 model weights. Once it’s done, you should see something like this:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Running on local URL: http://127.0.0.1:7860
To create a public link, set `share=True` in `launch()`.
Startup time: 8.2s (import torch: 1.0s, import gradio: 0.9s, import ldm: 0.3s, other imports: 1.0s, load scripts: 0.3s, load SD checkpoint: 4.6s, create ui: 0.1s).
</code></pre></div></div>
<p>Launch that URL in a browser and you should see something like this:</p>
<p><a href="/images/2023/sd-default-ui.jpg"><img src="/images/2023/sd-default-ui.jpg" alt="Default stable-diffusion-webui" /></a>
<em>Default stable-diffusion-webui</em></p>
<h2 id="generating-your-first-image">Generating your first image</h2>
<p>The UI can be a bit overwhelming at first, but you only need a few key inputs to get started. On the top left, the UI should have automatically selected the “v1-5-pruned-emaonly.safetensors” checkpoint. This is the default Stable Diffusion v1.5 image generation model. The real power of Stable Diffusion comes from community-generated variants of this base model, which you can download and select from this same dropdown. However, for now, let’s stick with the default.</p>
<p>The two big text boxes are where you type your prompt and negative prompt. Stable Diffusion is unique in its ability to accept negative prompts, which suppress undesirable elements in an image. AI models particularly struggle with drawing accurate human hands, faces, and limbs, which can sometimes lead to deformed images; negative prompts are a great way to help with this.</p>
<p>For now, type this into your prompt text box:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>masterpiece, best quality, cartoon illustration of a corgi, happy, running through a beautiful green field, flowers, sunrise in the background
</code></pre></div></div>
<p>and in the negative prompt:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>low quality, worst quality, bad anatomy, bad composition, poor, low effort
</code></pre></div></div>
<p>Make sure “Sampling method” is left at <code class="language-plaintext highlighter-rouge">Euler a</code>. Set “Sampling steps” to <code class="language-plaintext highlighter-rouge">40</code>.
Set “Seed” to <code class="language-plaintext highlighter-rouge">1481517414</code>. Click the big “Generate” button!</p>
<p><a href="/images/2023/sd-first-image.jpg"><img src="/images/2023/sd-first-image.jpg" alt="Your first SD image" /></a></p>
<p>If all went well, you should see an image that is identical or very close to the one pictured above! That’s because we used the same base model, sampler, sampling steps, and most importantly the same <code class="language-plaintext highlighter-rouge">seed</code> number.</p>
<p>Congratulations on generating your first image with Stable Diffusion!</p>
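<p>As an aside, the same determinism holds outside the UI. Here’s a minimal sketch using Hugging Face’s <code class="language-plaintext highlighter-rouge">diffusers</code> library (assuming <code class="language-plaintext highlighter-rouge">pip install diffusers transformers</code> and a CUDA GPU); it won’t reproduce the webui image exactly since the default sampler differs, but running it twice with the same seed yields the same output:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16).to("cuda")
generator = torch.Generator("cuda").manual_seed(1481517414)  # fixed seed
image = pipe(
    "cartoon illustration of a corgi, happy, running through a green field",
    negative_prompt="low quality, worst quality, bad anatomy",
    num_inference_steps=40,
    generator=generator,
).images[0]
image.save("corgi.png")
</code></pre></div></div>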
<h2 id="tips-for-nvidia-gpus">Tips for NVIDIA GPUs</h2>
<p>If you have any NVIDIA GPU, you will get better performance by starting the UI with the <code class="language-plaintext highlighter-rouge">xformers</code> argument:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>python launch.py <span class="nt">--xformers</span>
</code></pre></div></div>
<p>For this to work you have to install the proper NVIDIA video drivers. If you are on WSL, make sure to install the drivers on <em>Windows</em> and not Linux. There is an annoying bug in WSL currently that <em>also</em> requires you to create a few symbolic links for <code class="language-plaintext highlighter-rouge">libcuda.so</code> - <a href="https://github.com/microsoft/WSL/issues/5548#issuecomment-1363676648">the workaround is described here</a>, essentially you run this in a Windows terminal or PowerShell:</p>
<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">></span> C:
<span class="o">></span> <span class="nb">cd</span> <span class="se">\W</span>indows<span class="se">\S</span>ystem32<span class="se">\l</span>xss<span class="se">\l</span>ib
<span class="o">></span> del libcuda.so
<span class="o">></span> del libcuda.so.1
<span class="o">></span> mklink libcuda.so libcuda.so.1.1
<span class="o">></span> mklink libcuda.so.1 libcuda.so.1.1
</code></pre></div></div>
<p>You’ll have to do this every time you update your NVIDIA drivers until Microsoft fixes the bug in WSL.</p>
<p>Confusingly, if you have an NVIDIA RTX 40 series card, my observation is that you will get the best performance by <strong>not using</strong> <code class="language-plaintext highlighter-rouge">xformers</code> - but instead updating to the latest version of <code class="language-plaintext highlighter-rouge">torch</code> with CUDA 11.8 support:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>pip <span class="nb">install </span><span class="nv">torch</span><span class="o">==</span>2.0.0 torchvision <span class="nt">--extra-index-url</span> https://download.pytorch.org/whl/cu118
</code></pre></div></div>
<p>Then launch stable-diffusion-webui with the following arguments:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>python launch.py <span class="nt">--opt-sdp-no-mem-attention</span> <span class="nt">--opt-channelslast</span>
</code></pre></div></div>
<p>You can substitute <code class="language-plaintext highlighter-rouge">--opt-sdp-no-mem-attention</code> with just <code class="language-plaintext highlighter-rouge">--opt-sdp-attention</code> for even faster performance at the cost of some non-determinism (you may not be able to recreate the exact image even with the same seed).</p>
<p>To squeeze even more performance out of Stable Diffusion, if you are on WSL, consider disabling <a href="https://www.majorgeeks.com/content/page/hardware_accelerated_gpu_scheduling.html">“Hardware-accelerated GPU scheduling”</a> from your Windows Settings. This may impact performance of games running natively on Windows though, YMMV.</p>
<h2 id="things-to-try">Things to try</h2>
<p>Now that you have a basic set up, you can start to explore all the knobs presented to you in the UI. Here are some things to try:</p>
<ul>
<li>The dice icon next to “Seed” sets the value to -1 which will use a random number every time you click generate. Try this a few times to see how you can generate an almost infinite variety of images from the same prompt.</li>
<li>The “CFG Scale” slider above it sets how directive you want your prompt to be. Low values give the model freedom to be much more creative in its interpretation of your prompt, while high values make it more likely to generate images that stick very closely to your prompt.
<ul>
<li>At first glance, it might seem like a good idea to set this to higher values, but I’ve realized I’m not very good at writing detailed descriptions of images. Lower values can be quite helpful to fill in details you didn’t think of, but are natural in the output 😉</li>
</ul>
</li>
<li>Setting “Batch size” to 4 will generate a set of four images for every prompt, so you can generate more variations in parallel (pair with Seed of -1).</li>
<li>The “Sampling methods” are fun to play with. Almost all of them will converge on the same image given enough steps. Euler converges to a good image in a reasonably small number of steps (around 30), while others can generate higher quality images but need more steps to fully realize.
<ul>
<li>The exceptions to convergence are the methods suffixed with <code class="language-plaintext highlighter-rouge">a</code>, such as <code class="language-plaintext highlighter-rouge">Euler a</code> or <code class="language-plaintext highlighter-rouge">DPM 2 a</code>. These samplers converge on a different, somewhat more random image than the ones without the suffix. That’s great for creativity and variety, but if you want determinism then samplers like <code class="language-plaintext highlighter-rouge">Euler</code> or <code class="language-plaintext highlighter-rouge">DDIM</code> are reliable and fast. Check out <a href="https://www.reddit.com/r/StableDiffusion/comments/x4zs1r/comparison_between_different_samplers_in_stable/">this reddit post</a> comparing the different samplers!</li>
</ul>
</li>
<li>Of course, the prompt and negative prompts are the most important input into this process. You can get help from <a href="https://chat.openai.com/chat">ChatGPT</a> to help you create prompts or get inspiration from other creators (googling for “stable diffusion prompts” can get you started).</li>
</ul>
<p>That covers most of the basic functionality of text-to-image generation with Stable Diffusion. There’s a lot of power to uncover through the various tabs at the top and custom scripts or extensions, but we’ll leave that for another time.</p>
<p>Have fun and let your creativity loose!</p>anantAI Primer: Image Models 1012023-04-01T00:00:00+00:002023-04-01T00:00:00+00:00https://www.kix.in/2023/04/01/txt2img<p>It’s spring of 2023, and if you aren’t paying attention to what’s happening in the world of AI, you really should. The world is about to change in big ways!</p>
<p>I’ll save the general discussion on AI for another post — I’m hoping for this article to be the first in a multipart series where we dive into the practical aspects of each major area of advancement. My goal is to give readers a taste of what’s going on, what the capabilities of these AI models are, and why you should care.</p>
<p>Let’s start with image generation models, as they’re the oldest in the recent wave of innovation. Skip ahead to the <a href="#model-comparison">model comparison and summary section</a> if you are already familiar with text-to-image models!</p>
<h2 id="text-to-image">Text-to-image</h2>
<p>The most basic capability of image generation models is to convert a plain language description into a picture. Let’s start there:</p>
<blockquote>
<p>“the indian warrior arjuna, riding a golden chariot, pulled by four white horses, in a battlefield, highly detailed, digital art”</p>
</blockquote>
<p>This type of text is called a “prompt” in the world of text-to-image models. There are a number of these models commercially available today, and here are a few results from the ones I tried:</p>
<div class="pure-g">
<div class="pure-u-1 pure-u-md-1-2 imgholder">
<p><a href="/images/2023/01_runwayml.jpg"><img src="/images/2023/01_runwayml.jpg" alt="Runway ML" /></a>
<em>RunwayML</em></p>
</div>
<div class="pure-u-1 pure-u-md-1-2 imgholder">
<p><a href="/images/2023/02_dalle.jpg"><img src="/images/2023/02_dalle.jpg" alt="Dall-E 2" /></a>
<em>Dall-E 2</em></p>
</div>
<div class="pure-u-1 pure-u-md-1-2 imgholder">
<p><a href="/images/2023/03_firefly.jpg"><img src="/images/2023/03_firefly.jpg" alt="Adobe Firefly" /></a>
<em>Firefly</em></p>
</div>
<div class="pure-u-1 pure-u-md-1-2 imgholder">
<p><a href="/images/2023/04_bing_dalle2.jpg"><img src="/images/2023/04_bing_dalle2.jpg" alt="Bing Dall-E 2 Exp" /></a>
<em>Bing (Dall-E 2 Exp)</em></p>
</div>
<div class="pure-u-1 pure-u-md-1-2 imgholder">
<p><a href="/images/2023/05_diffusion.jpg"><img src="/images/2023/05_diffusion.jpg" alt="Stable Diffusion" /></a>
<em>Stable Diffusion</em></p>
</div>
<div class="pure-u-1 pure-u-md-1-2 imgholder">
<p><a href="/images/2023/06_midjourney.jpg"><img src="/images/2023/06_midjourney.jpg" alt="Midjourney" /></a>
<em>Midjourney</em></p>
</div>
</div>
<p>Well, isn’t that neat? It’s remarkable that these images never existed until now. The AI made them on the fly based on my prompt, drawing on its vast knowledge of all the images it has “seen”.</p>
<p>A discerning viewer might look at these and not be impressed - you can easily spot a number of defects: the horses aren’t quite lined up right, one model painted only one horse instead of four, the human faces are pretty blurry and sometimes creepy looking, and so on.</p>
<p>However, I attempted to list the images in order of how impressive they are (yes, I know, art is subjective, bear with me). They come from:</p>
<ul>
<li><a href="https://runwayml.com/">RunwayML</a>, which seems to be running a very early version of the Stable Diffusion open source generation models (also listed below).</li>
<li><a href="https://openai.com/product/dall-e-2">Dall-E 2</a>, which is a commercial model from OpenAI that kick-started the modern age of image generation models.</li>
<li><a href="https://www.adobe.com/sensei/generative-ai/firefly.html">Firefly</a>, the most recent entrant from Adobe that focuses on commercial use and sourcing training material from licensed images.</li>
<li><a href="https://en.wikipedia.org/wiki/Stable_Diffusion">Stable Diffusion</a>, an open source model that kicked off a whole revolution in image generation we will cover in the next post.</li>
<li><a href="https://www.bing.com/create">Bing Create</a>, a partnership between OpenAI and Microsoft on the next version of “Dall-E 2”, lovingly called “Dall-E 2 Exp”.</li>
<li><a href="https://midjourney.com/">Midjourney v5</a>, the latest commercial model from their research lab, perhaps the most impressive of the bunch with a very active and fast-growing community.</li>
</ul>
<p>Even if you aren’t impressed by any of these individual images, I’d instead direct your attention to how rapid the rate of improvement has been. All the services above have a free tier, so feel free to try them out with your own prompts!</p>
<p>One model not listed above but worth keeping an eye on is <a href="https://imagen.research.google/">Imagen from Google</a>. You can see a few examples of what it’s capable of on their website. It remains to be seen if Google will eventually release a version of this publicly.</p>
<h3 id="prompt-engineering">Prompt Engineering</h3>
<p>Another reason this capability is impressive is that the images shown above came from a prompt I didn’t spend very much time thinking about. It was just the first sentence that popped into my head. People have been able to achieve vastly superior quality by really crafting their prompts to better direct the AI.</p>
<p>Here’s a couple of <em>photo-realistic</em> AI-generated images that have been doing the rounds lately:</p>
<div class="pure-g">
<div class="pure-u-1 pure-u-md-1-2 imgholder">
<p><a href="/images/2023/fake-pope.jpg"><img src="/images/2023/fake-pope.jpg" alt="Fake Pope" /></a></p>
</div>
<div class="pure-u-1 pure-u-md-1-2 imgholder">
<p><a href="/images/2023/fake-trump.jpg"><img src="/images/2023/fake-trump.jpg" alt="Fake Trump" /></a></p>
</div>
</div>
<p>Do I have your attention yet? These are <strong>FAKE</strong> AI-generated images, but it takes quite a bit of scrutiny to find out. In the <a href="https://www.buzzfeednews.com/article/chrisstokelwalker/pope-puffy-jacket-ai-midjourney-image-creator-interview">Pope’s image</a>, note how his hand that is holding what seems to be a Starbucks cup appears mangled. In <a href="https://www.bbc.com/news/world-us-canada-65069316">Trump’s case</a>, the photo is missing a finger.</p>
<p>Both of these images were made by Midjourney v5, and I suspect by the time v6 rolls out, it will be increasingly hard to tell which images are fake and which are real. We’ll have to rely on the sources they come from and other ways of confirming visual evidence instead.</p>
<p>Here are a couple examples of beautiful artistic images that further showcase how high quality image generation can get:</p>
<div class="pure-g">
<div class="pure-u-1 pure-u-md-1-2 imgholder">
<p><a href="/images/2023/good_dragon-mj5.jpg"><img src="/images/2023/good_dragon-mj5.jpg" alt="Enchanted Dragon" /></a>
<em>“character concept, cute baby dragon in the woods, dungeon and dragons, fantasy, medieval”, from Midjourney</em></p>
</div>
<div class="pure-u-1 pure-u-md-1-2 imgholder">
<p><a href="/images/2023/elephant_tux-bing.jpg"><img src="/images/2023/elephant_tux-bing.jpg" alt="Elephant in a Tux" /></a>
<em>“a renaissance painting of an elephant in a tuxedo”, from Bing Create</em></p>
</div>
</div>
<p>So, what’s the takeaway? With sufficient effort put into your prompt, you can create very high quality <strong>photo-realistic</strong> as well as <strong>artistic</strong> images!</p>
<p>The process of writing great prompts is often referred to as “prompt engineering”. While over time we expect image generation models to produce fantastic outputs just from simple descriptions, their usage today predominantly relies on creative use of prompts to obtain the desired results.</p>
<p>It’s also noteworthy that prompts that work well in one model don’t necessarily translate to another. I recommend sticking with a single tool and refining your prompts there, learning through iteration. Midjourney’s interface is particularly good for this - the main way to make images with them is by joining their Discord. There you can watch <em>other</em> users go through the process of creating images, which is sure to inspire you. And you will learn a few tricks for making great prompts along the way!</p>
<h2 id="image-to-image">Image-to-image</h2>
<p>Turning text into pictures is the most common use of image generation models. However, they can also be applied to existing images. This is a very useful capability to further refine images you made through prompting, or to enhance photos or art that already exist.</p>
<p>There are four main ways you can use this capability.</p>
<h3 id="upscaling">Upscaling</h3>
<p>This is the simplest application of image-to-image models. You can take an image that is of low resolution (say 512x512 pixels) and upscale it to a higher resolution (say 1024x1024 pixels). This isn’t just resizing the image to make it bigger, that would just result in a blurry and ugly mess.</p>
<p>What the model does is repaint the image at the higher resolution, adding detail and clarity along the way to make it look good at a bigger size. It can only do this by accurately predicting what detail should be added - given the context of the image. Because modern image generation models have “seen” so many images, they are able to do this very well.</p>
<p>I made a favicon for this site a long time ago and still only have the low resolution 64x64 version of it. It looks pretty blurry when scaled up to 256x256, let’s see how the upscaler does:</p>
<div class="pure-g">
<div class="pure-u-1 pure-u-md-1-2">
<p><a href="/images/2023/favicon_small.jpg"><img src="/images/2023/favicon_small.jpg" alt="Low res favicon" /></a></p>
</div>
<div class="pure-u-1 pure-u-md-1-2">
<p><a href="/images/2023/favicon_upscaled.jpg"><img src="/images/2023/favicon_upscaled.jpg" alt="High res favicon" /></a></p>
</div>
</div>
<p>Nice! Very crisp, and it picked up on the grunge font details.</p>
<h3 id="style-transfer">Style transfer</h3>
<p>This process takes the style from one image and applies it to another. The classic example is the “Mona Lisa” in the style of “Starry Night”:</p>
<div class="pure-g">
<div class="pure-u-1 pure-u-md-1-2 imgholder">
<p><a href="/images/2023/mona-lisa_starry-night.jpg"><img src="/images/2023/mona-lisa_starry-night.jpg" alt="Mona Lisa in the style of Starry Night" /></a></p>
</div>
</div>
<p>Style transfer technology (known as neural style transfer) has actually existed for a while, and predates modern text-to-image models. These are basically “filters” that Instagram, Snapchat, and TikTok have made so popular. However, advancements in text-to-image models have made image-to-image style transfers much more versatile and flexible.</p>
<p>Previously available techniques required a model to be pre-trained for a specific style. Present-day style transfer can be done much more fluidly by first generating an image (e.g. by using prompting as we discussed above) in a particular style, then mixing it with an image of a subject. This is a much more flexible approach, and opens the door to mixing and matching any style with any subject. Remixing is now limited only by your imagination!</p>
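<p>One common open-source approximation of this mix-and-match workflow is Stable Diffusion’s “img2img” mode, which redraws a subject image under a style described in text. A hedged sketch, where the model id, filenames, and settings are assumptions for illustration:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Sketch: style transfer via img2img; the subject image is redrawn
# according to a style prompt rather than a pre-trained style network.
import torch
from diffusers import StableDiffusionImg2ImgPipeline
from PIL import Image

pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

subject = Image.open("mona-lisa.jpg").convert("RGB").resize((512, 512))
styled = pipe(
    prompt="front page caricature of the New Yorker, illustration",
    image=subject,
    strength=0.6,        # 0 = return the input, 1 = ignore it entirely
    guidance_scale=7.5,  # how strongly to follow the prompt
).images[0]
styled.save("mona-lisa_styled.jpg")
</code></pre></div></div>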
<div class="pure-g">
<div class="pure-u-1 pure-u-md-1-2 imgholder">
<p><a href="/images/2023/mona-lisa_new-yorker.jpg"><img src="/images/2023/mona-lisa_new-yorker.jpg" alt="Mona Lisa in the style of New Yorker" /></a>
<em>Mona Lisa in the style of a New Yorker front page caricature, illustration</em></p>
</div>
</div>
<h3 id="inpainting">Inpainting</h3>
<p>Inpainting is the process of replacing a portion of an image with something else. This process works similarly to the text-to-image process we talked about earlier. The only additional complexity is that the generated image is aware of the image content surrounding it and tries to blend in well.</p>
<p>Like style transfer, inpainting as a concept and technology has existed for a while and also predates modern text-to-image models by several years. However, because of these new image generation capabilities, you are able to go far beyond the classical use-cases of inpainting, in a much more flexible and generalizable way.</p>
<h4 id="fixing-defects">Fixing defects</h4>
<p>The simplest application of inpainting is used to restore damaged parts of images. The canonical example here is to remove scratches from scanned analog pictures. This capability has been available from even early versions of <a href="https://opencv.org/">OpenCV</a>:</p>
<div class="pure-g">
<div class="pure-u-1 pure-u-md-1-2 imgholder">
<p><a href="/images/2023/inp_damaged.jpg"><img src="/images/2023/inp_damaged.jpg" alt="Damaged Photo" /></a>
<em>Damaged Version</em></p>
</div>
<div class="pure-u-1 pure-u-md-1-2 imgholder">
<p><a href="/images/2023/inp_restored.jpg"><img src="/images/2023/inp_restored.jpg" alt="Restored Photo" /></a>
<em>Restored Version</em></p>
</div>
</div>
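<p>This classical repair is available directly in OpenCV, no deep learning required. A minimal sketch (the filenames are placeholders, and the mask is assumed to mark the damaged pixels in white):</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Sketch: classical inpainting with OpenCV (pip install opencv-python).
# The algorithm propagates surrounding pixels into the masked region.
import cv2

damaged = cv2.imread("inp_damaged.jpg")
# Grayscale mask: white pixels mark the scratches to repair
mask = cv2.imread("scratch_mask.png", cv2.IMREAD_GRAYSCALE)

# 3 = radius of the neighborhood considered around each damaged pixel
restored = cv2.inpaint(damaged, mask, 3, cv2.INPAINT_TELEA)
cv2.imwrite("inp_restored.jpg", restored)
</code></pre></div></div>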
<h4 id="removal">Removal</h4>
<p>Modern inpainting can do a lot more than that. A potent example of this capability is <a href="https://blog.google/products/photos/magic-eraser-android-ios-google-one/">“Magic Eraser”</a>, recently launched by Google in Google Photos, which can remove arbitrary subjects or objects from images.</p>
<blockquote class="twitter-tweet">
<p lang="en" dir="ltr">Pixel 6 Magic Eraser tool at work <a href="https://t.co/dyeUa76HYe">pic.twitter.com/dyeUa76HYe</a></p>
<p>— Shevon Salmon (@its_shevi) <a href="https://twitter.com/its_shevi/status/1452687247313100808?ref_src=twsrc%5Etfw">October 25, 2021</a>
<script async="" src="https://platform.twitter.com/widgets.js" charset="utf-8"></script></p>
</blockquote>
<p>Check out this page of <a href="https://www.androidpolice.com/google-pixel-magic-eraser-list/">real world examples</a> of magic eraser at work - you can remove not just people, but almost every object in the frame.</p>
<h4 id="replacement">Replacement</h4>
<p>An even more powerful application is replacing subjects or objects with something else. Note how the text-to-image models we used above can sometimes struggle to generate detailed, natural-looking human faces. Well, inpainting just the face and iterating through a few more generations with models optimized for generating good human faces can help fix that!</p>
<p>Let’s try this technique on the image we generated earlier of Arjuna on a horse using Stable Diffusion:</p>
<div class="pure-g">
<div class="pure-u-1 pure-u-md-1-2 imgholder">
<p><a href="/images/2023/inp_face_mark.jpg"><img src="/images/2023/inp_face_mark.jpg" alt="Mark inpainting area" /></a>
<em>Mark portion of image to replace</em></p>
</div>
<div class="pure-u-1 pure-u-md-1-2 imgholder">
<p><a href="/images/2023/inp_face_final.jpg"><img src="/images/2023/inp_face_final.jpg" alt="Final output" /></a>
<em>Newly generated face blended in!</em></p>
</div>
</div>
<p>Now we’re really getting somewhere! We can use the same inpainting techniques to fix other issues with the image, such as the hand holding the reins.</p>
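<p>For the curious, this mask-and-prompt workflow takes only a few lines with open tooling. A sketch with the diffusers inpainting pipeline, where the filenames and prompt are illustrative rather than exactly what produced the images above:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Sketch: mask-and-prompt inpainting; pixels under the white mask are
# regenerated from the prompt and blended with their surroundings.
import torch
from diffusers import StableDiffusionInpaintPipeline
from PIL import Image

pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "runwayml/stable-diffusion-inpainting", torch_dtype=torch.float16
).to("cuda")

image = Image.open("arjuna.jpg").convert("RGB").resize((512, 512))
mask = Image.open("face_mask.png").convert("RGB").resize((512, 512))

fixed = pipe(
    prompt="detailed natural face of a warrior prince, sharp focus",
    image=image,
    mask_image=mask,
).images[0]
fixed.save("arjuna_fixed.jpg")
</code></pre></div></div>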
<h3 id="outpainting">Outpainting</h3>
<p>OpenAI was first to apply the inpainting technique in a creative way — to extend the boundaries of an existing image. Outpainting can be thought of as inpainting, but with a blank canvas extending beyond the original boundaries. This lets you dream up how an image might be extended in a way that matches the original style and blends in.</p>
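<p>In code, outpainting really is just the inpainting pipeline pointed at a bigger canvas: paste the original into a larger blank image and mask everything outside it. A sketch under the same assumptions as the inpainting example above:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Sketch: outpainting = inpainting on an enlarged canvas.
import torch
from diffusers import StableDiffusionInpaintPipeline
from PIL import Image

pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "runwayml/stable-diffusion-inpainting", torch_dtype=torch.float16
).to("cuda")

original = Image.open("painting.jpg").convert("RGB")  # e.g. 512x512
canvas = Image.new("RGB", (1024, 1024))
canvas.paste(original, (256, 256))  # center the original

# Mask: white = area for the model to invent, black = keep untouched
mask = Image.new("L", (1024, 1024), 255)
mask.paste(0, (256, 256, 768, 768))

extended = pipe(
    prompt="dutch golden age oil painting, dark interior, soft light",
    image=canvas, mask_image=mask, height=1024, width=1024,
).images[0]
extended.save("painting_outpainted.jpg")
</code></pre></div></div>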
<p>The artist August Kamp collaborated with OpenAI and used Dall-E to outpaint the classic Dutch painting “Girl with a Pearl Earring” by Johannes Vermeer:</p>
<div class="pure-g">
<div class="pure-u-1 pure-u-md-1-2 imgholder">
<p><a href="/images/2023/outp_pearl.jpg"><img src="/images/2023/outp_pearl.jpg" alt="Girl with a Pearl Earring" /></a>
<em>Original Painting by Johannes Vermeer</em></p>
</div>
<div class="pure-u-1 pure-u-md-1-2 imgholder">
<p><a href="/images/2023/outp_pearl_final.jpg"><img src="/images/2023/outp_pearl_final.jpg" alt="Outpainted version by August Kamp" /></a>
<em>Imaginative Expansion by August Kamp</em></p>
</div>
</div>
<p>You can watch the full process of outpainting in action <a href="https://openai.com/blog/dall-e-introducing-outpainting">in the original blog post from OpenAI</a>.</p>
<h2 id="image-to-text">Image-to-text</h2>
<p>One interesting possibility is that of applying these image models — <em>in reverse</em> — to generate descriptive text of any picture.</p>
<p>This technique is not to be confused with Optical Character Recognition (OCR for short), which is a method to extract any text found in pictures into a digital form — for example scanning the phone number from a photo of a receipt.</p>
<p>Instead, I’m talking about using these models to <em>describe</em> an image in natural language, essentially captioning them. As an example, let’s take the following image and feed it to <a href="https://github.com/pharmapsychotic/clip-interrogator">“clip-interrogator”</a>, a popular Python library that combines the two most-used image captioning models (<a href="https://openai.com/research/clip">CLIP from OpenAI</a> and <a href="https://github.com/salesforce/BLIP">BLIP from Salesforce</a>).</p>
<div class="pure-g">
<div class="pure-u-1 pure-u-md-1-2 imgholder">
<p><a href="/images/2023/gorillaz.jpg"><img src="/images/2023/gorillaz.jpg" alt="Gorillaz music video still" /></a>
<em>Still from a Gorillaz music video</em></p>
</div>
</div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">PIL</span> <span class="kn">import</span> <span class="n">Image</span>
<span class="kn">from</span> <span class="nn">clip_interrogator</span> <span class="kn">import</span> <span class="n">Config</span><span class="p">,</span> <span class="n">Interrogator</span>
<span class="n">image</span> <span class="o">=</span> <span class="n">Image</span><span class="p">.</span><span class="nb">open</span><span class="p">(</span><span class="s">'gorillaz.jpg'</span><span class="p">).</span><span class="n">convert</span><span class="p">(</span><span class="s">'RGB'</span><span class="p">)</span>
<span class="n">ci</span> <span class="o">=</span> <span class="n">Interrogator</span><span class="p">(</span><span class="n">Config</span><span class="p">(</span><span class="n">clip_model_name</span><span class="o">=</span><span class="s">"ViT-L-14/openai"</span><span class="p">))</span>
<span class="k">print</span><span class="p">(</span><span class="n">ci</span><span class="p">.</span><span class="n">interrogate</span><span class="p">(</span><span class="n">image</span><span class="p">))</span>
</code></pre></div></div>
<p>Running this program prints:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>cartoonish illustration of a man in front of a table with a tablecloth,
2d gorillaz,
winning awards,
handsome hip hop young black man,
excited facial expression,
goro and kunkle,
victorious on a hill,
dvd cover,
table in front with a cup
</code></pre></div></div>
<p>Cool! You can imagine feeding text similar to this as a prompt to various text-to-image models to generate creative variations. Besides the obvious use-cases around accessibility, there are many interesting applications of this technology that revolve around prompt engineering and creating your own custom image models that we will discuss in future posts.</p>
<h2 id="text-to-video">Text-to-video</h2>
<p>Ok, we’ve already covered a lot! But stay with me: we have one more potential application to cover.</p>
<p>Generating full-fledged videos from a plain language description is a rapidly emerging field. While there are no readily available tools to do this today like there are for image generation, three key players to watch are:</p>
<ul>
<li>Meta AI kick-started innovation in this area and published impressive demos of their <a href="https://makeavideo.studio/">“make a video” research</a> late in 2022.</li>
<li>This was shortly followed by Google’s <a href="https://imagen.research.google/video/">Imagen video generation</a> demos about a month later, which are equally impressive.</li>
<li>RunwayML seems closest to providing a usable commercial tool. Their <a href="https://twitter.com/runwayml/status/1640337292542844928?s=20">“Gen1” product</a> allows video editing of existing videos through text prompts, while <a href="https://research.runwayml.com/gen2">“Gen2” promises</a> full video generation capabilities like the Meta and Google demos showed above.</li>
</ul>
<p>Some community members have hacked together a poor man’s version of text-to-video, by stitching together multiple generated images using a tool called <a href="https://github.com/lllyasviel/ControlNet">ControlNet</a> to more finely control the output. This is an advanced technique that is out of scope for this post, but we may discuss it in the future.</p>
<h2 id="ethical-considerations">Ethical considerations</h2>
<blockquote>
<p>“With great power comes great responsibility.”
<br />
-Uncle Ben</p>
</blockquote>
<p>We just walked through a lot of awesome capabilities and there is a lot to be optimistic about. Just imagine the reaction of a renaissance-era painter upon hearing that one day there will be a machine that can produce any imaginable art from just a plain language description of it!</p>
<p>But, as with any major technological leap, there are myriad ways to misuse these tools. We’ve seen this happen with PCs, the internet, mobile phones, and social media — the potential for abuse with AI is going to be much larger because these capabilities are so much more powerful. I believe the potential for abuse with image generation models is particularly high because pictures and video tend to be much more evocative and believable.</p>
<ul>
<li><strong>Impact on artists</strong>. These models gained their capabilities by learning from the vast amount of images generated and cataloged by artists and photographers who published their work on the internet. Some model creators were careful to only source licensed content (e.g. <a href="https://news.adobe.com/news/news-details/2023/Adobe-Unveils-Firefly-a-Family-of-new-Creative-Generative-AI/default.aspx">Adobe Firefly</a>) and give artists tools to exclude their work from training sets. Others (e.g. <a href="https://www.theverge.com/2023/1/17/23558516/ai-art-copyright-stable-diffusion-getty-images-lawsuit">Stable Diffusion</a>) were a bit more fast and loose in their image acquisition strategy.
<ul>
<li>The debate is nuanced, as human artists also learn from other works they have observed throughout their lifetime. There is an ongoing philosophical question on what is truly original work, and what is merely remixing past observations. Society must contend with how we can sustainably compensate artists, particularly for work that eventually leads to commercial outcomes for people and teams using models to generate art based on their work.</li>
<li>There is the question of how disruptive this technology will be to the livelihood of artists and illustrators. These tools certainly make them all much more productive, but the fact remains that as productivity rises, industries will need fewer people to produce the same amount of work. My suggestion for both aspiring and established creatives is to start mastering these tools.</li>
<li>The very top artists of the world will be in greater demand to create truly unique and creative work that the world has not seen before. This is already happening in the modern art world. Scarcity will generate value, but this value is likely to be captured by a handful of the world’s best.</li>
</ul>
</li>
<li><strong>Impact on society</strong>. This technology will very likely be used to generate fake news and misinformation, as we showed with the images of Trump and the Pope above. We’ve seen this happen with <a href="https://en.wikipedia.org/wiki/Deepfake">“deepfakes”</a> — fabricated images or videos portraying individuals saying things they never said, or in situations that never occurred in reality. Deepfakes were challenging and laborious to make by hand, until now. This will become a trivial process going forward, something that can be done in large quantities even by non-technical people. The “evidence of your eyes” will require much more scrutiny in this new era.
<ul>
<li>Abuse such as blackmail or <a href="https://en.wikipedia.org/wiki/Revenge_porn">revenge porn</a> based on fabricated images is highly likely to become more commonplace. Governments should move quickly to enact and enforce stringent laws to protect individuals from this type of misuse.</li>
<li>These models have been shown to reflect bias that exists in the world and in their training data. There is potential for unintentionally reinforcing harmful stereotypes. Indeed, in my own experience, I’ve noticed these models tend to generate humans with fair skin, light eyes, and blonde hair by default. Generating a diverse set of images of various body types, skin tones, and hair colors requires a lot more intentional effort.</li>
<li>On a larger scale, this technology will be weaponized to spread propaganda and influence politics worldwide. In the long run, society as a whole will need to adapt as a response to the inundation of fake content, as we grow to rely more on authoritative and trustworthy sources for information.</li>
</ul>
</li>
</ul>
<p>I urge you to keep these ethical considerations in mind as you wield these extremely powerful tools to create content, but also in consuming content that will increasingly be AI-generated.</p>
<p>Make sure to read and adhere to the licenses that you agree to when using these models, as they also reinforce the ethical considerations above. Be mindful of the impact that your creations will have on society and other individuals.</p>
<h2 id="model-comparison">Model comparison</h2>
<p>Ok, now that you’ve seen what the breadth of capabilities are, let’s summarize the most talked about tools available today!</p>
<table width="100%" class="pure-table pure-table-bordered">
<thead>
<tr>
<th>Model</th>
<th>Features</th>
<th>Price</th>
<th>Notes</th>
</tr>
</thead>
<tbody>
<tr>
<td><a href="https://runwayml.com/">RunwayML</a></td>
<td>
✅ Inpainting <br />
✅ Outpainting <br />
✅ Customization <br /><br />
See <a href="https://runwayml.com/ai-magic-tools/">full list</a> of their AI magic tools.
</td>
<td>
Free trial for the first 25 images.<br /><br />
<a href="https://runwayml.com/pricing/">$12/month</a> for 125 images/month thereafter.
</td>
<td>
Runway's claim to fame is their easy to use video editing tools. Their AI magic tools cover a wide range of capabilities, some of which are newer and less mature.
</td>
</tr>
<tr>
<td><a href="https://openai.com/product/dall-e-2">Dall-E 2</a></td>
<td>
✅ Inpainting <br />
✅ Outpainting <br />
❌ Customization <br />
</td>
<td>
Free for 15 images per month.<br /><br />
<a href="https://openai.com/pricing#image-models">$0.016-$0.020</a> per image thereafter.
</td>
<td>
Dall-E 2 is useful for generating abstract or artistic images. It is less competitive at photorealism.
</td>
</tr>
<tr>
<td><a href="https://firefly.adobe.com/">Adobe Firefly</a></td>
<td>
❌ Inpainting <br />
❌ Outpainting <br />
❌ Customization <br />
</td>
<td>
Free during the beta.
</td>
<td>
Adobe's focus is on generating content that is safe to use commercially. However, during the beta, no commercial use is allowed, and additional features such as inpainting and customization are still in the works.
</td>
</tr>
<tr>
<td><a href="https://www.bing.com/create">Dall-E 2 Exp (Bing Create)</a></td>
<td>
❌ Inpainting <br />
❌ Outpainting <br />
❌ Customization <br />
</td>
<td>
First 25 images are created fast.<br /><br />
Subsequent images are still free, but slower to generate.
</td>
<td>
Bing hosts an improved version of Dall-E 2 on their website. It produces higher quality images than stock Dall-E 2 but cannot be customized, and does not support inpainting.
</td>
</tr>
<tr>
<td><a href="https://stability.ai/blog/stable-diffusion-announcement">Stable Diffusion</a></td>
<td>
✅ Inpainting <br />
✅ Outpainting <br />
✅ Customization <br />
</td>
<td>
Free to run on your own computer.<br /><br />
Cloud versions offered by multiple providers cost in the range of $0.02-$0.06 per image.
</td>
<td>
Stable diffusion is an open image generation model that can be run locally on your computer with no restrictions and infinite usage.<br /><br />
<a href="https://stability.ai/">Stability AI</a> also offers a paid hosted version that runs in the cloud and is more user-friendly for non-technical audiences.
</td>
</tr>
<tr>
<td><a href="https://www.midjourney.com/">Midjourney</a></td>
<td>
❌ Inpainting <br />
❌ Outpainting <br />
⚠️ Customization* <br /><br />
* You can customize to a limited extent by uploading <a href="https://docs.midjourney.com/docs/image-prompts">your own image as part of the prompt</a> to do style transfers or redraws. Midjourney also launched
<code>/describe</code> which can <a href="https://twitter.com/midjourney/status/1643053450501169157">caption an image</a> (image-to-text).
</td>
<td>
Free trial discontinued <a href="https://decrypt.co/124972/midjourney-free-ai-image-generation-stopped-over-deepfakes">due to abuse</a>.<br /><br />
$10/month for ~200 images (varies by quality and operation).
</td>
<td>
Midjourney has the most unique and curated art style of all these models.
<br /><br />
They also boast a very active community and unique user interface through use of Discord.
</td>
</tr>
</tbody>
</table>
<p>My recommendations:</p>
<ul>
<li><strong>If you are just getting started</strong> in the world of AI image generation, start with the <em>Bing Create</em> tool for your first few images. It is free and easy to use, although it won’t produce the highest quality images and has no customization options.</li>
<li>If you want to <strong>increase the quality of your images with minimal effort</strong>, I recommend signing up for <em>Midjourney</em>. They have a very active community, and their models produce the best-looking images with minimal tweaking to prompts!</li>
<li>If you are <strong>fully invested in this space</strong> and don’t mind committing time to installing software and tweaking your prompts, I highly recommend <em>Stable Diffusion</em>. You can get the highest quality outputs with sufficient prompt engineering, and the customization options are unparalleled.</li>
</ul>
<p>I’ve been personally spending the most time with Stable Diffusion (and a little on Midjourney) — my next post will cover some ways in which you can fine-tune your own models with Stable Diffusion to include your own styles and subjects!</p>
<p>I hope you enjoyed this introduction, and that you’ll unleash your creativity with these newfound superpowers. Please do so responsibly ❤️</p>anantIt’s spring of 2023, and if you aren’t paying attention to what’s happening in the world of AI, you really should. The world is about to change in big ways!Migrating a G Suite Legacy Account2022-03-20T00:00:00+00:002022-03-20T00:00:00+00:00https://www.kix.in/2022/03/20/gsuite-migration<p>Google <a href="https://9to5google.com/2022/01/19/g-suite-legacy-free-edition/">recently announced</a>
that all “G Suite legacy free edition” (formerly known as “Google Apps”, currently known as “Google Workspace”)
accounts will need to transition to their paid workspace plans starting July 1, 2022. Legacy users will get access
to a discounted rate of $3/user/month, which will turn into $6/user/month starting July 2023 at the lowest tier.</p>
<p>I’m the sole user on my G Suite account, so the new rates aren’t a big issue per se. I’ve been getting a ton of
value from this service for over a decade — namely the ability to use Gmail and other Google services but
with my own custom domain.</p>
<p>The bigger issue is that Google Workspace accounts have long been denied access to several products.
You <a href="https://support.google.com/googlepay/answer/10191029?hl=en">can’t use the new GPay</a>, you cannot sign up for
<a href="https://one.google.com/about">Google One</a>, and you cannot sign up for a <a href="https://support.google.com/store/answer/11201976?hl=en">Pixel Pass</a>.
For some time you couldn’t sign up for Google Fi either, though <a href="https://workspaceupdates.googleblog.com/2017/06/project-fi-now-available-for-g-suite.html">that changed</a> a few years ago.</p>
<p>So with this news, I decided to bite the bullet and transition my account to a plain old consumer Google account.
Google already has a mechanism to <a href="https://support.google.com/a/answer/6364687?hl=en">transition educational accounts into personal ones</a>,
and it appears they might be working on a <a href="https://bit.ly/3qoBaNh">solution for all accounts “soon”</a>. However, I
didn’t want to wait for that solution, and I don’t have a ton of payments or purchases on my GSuite account. It’s
mostly the data that needed to be moved, and I figured this would be a good exercise to see how much of my life
exactly depends on my Google account.</p>
<p>I came across this <a href="https://www.39digits.com/migrate-g-suite-account-to-a-personal-google-account">excellent article</a> detailing the steps. I had to tweak some of the instructions
to fit my use-cases, and in some cases found simpler ways to migrate, which I’ll write about here.</p>
<h2 id="setup">Setup</h2>
<p>Before you begin, it is important to note what you <em>cannot</em> transfer from a G Suite account through a manual
migration:</p>
<ul>
<li>Any purchases or subscriptions made through Google Play (Apps, music, games).</li>
<li>Google pay subscriptions and payment methods.</li>
<li>Accounts on external websites that you signed into via Google.</li>
</ul>
<p>That last point varies from provider to provider; some will let you relink by verifying your email address, but
others won’t work at all. You can review all third-party apps you <a href="https://myaccount.google.com/permissions">logged into via Google here</a>.</p>
<p>I’m probably missing other things — but if any of these are important to you then this migration is not suitable.
You’re better off upgrading to the paid plan, or waiting for Google to roll out their migration tool.</p>
<p>I’ll refer to the old G Suite account as a “workspace” account here on out. The new plain old consumer Google
account will be called the “personal” account.</p>
<ol>
<li>
<p><strong>Start by backing up everything in your workspace account.</strong> You can do this via <a href="https://takeout.google.com/settings/takeout">Google Takeout</a>. It may take a day or two for them to generate the data depending on how much you have. I recommend storing these zip files somewhere safe.</p>
</li>
<li>
<p><strong>Review the services and data</strong> stored on your workspace account <a href="https://myaccount.google.com/dashboard">via this dashboard</a>. It gives you a nice overview of all your data and services used, which will come in handy as you decide how to migrate each one.</p>
</li>
<li>
<p><strong>Create a new browser profile</strong>, and sign up for a “plain” Google account, one with a @gmail.com suffix. Having both accounts logged in on two Chrome profiles is pretty handy as you follow the steps below.</p>
</li>
</ol>
<p>Now, we’ll start migrating data for each service one-by-one.</p>
<h2 id="gmail">Gmail</h2>
<p>This is arguably the most important service — and was the primary reason I signed up for a workspace account all
those years ago.</p>
<p>We’ll do this process in three phases:</p>
<h3 id="redirect-incoming-mail">Redirect incoming mail</h3>
<p>If you use Google Domains to manage your custom domain, this part is relatively easy. You can use the Google
Domains <a href="https://support.google.com/domains/answer/3251241?hl=en">email forwarding feature</a> to redirect all mail
from your workspace email address into your personal one.
If you don’t use Google Domains’ built-in DNS service, forwarding won’t work —
you will have to <a href="https://support.google.com/domains/answer/9428703?hl=en">update MX records at your DNS host</a>.
My DNS is run by Cloudflare, and updating my MX records to the ones listed on that support page worked well.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Name # Type # Priority # Value
@ MX 5 gmr-smtp-in.l.google.com
@ MX 10 alt1.gmr-smtp-in.l.google.com
@ MX 20 alt2.gmr-smtp-in.l.google.com
@ MX 30 alt3.gmr-smtp-in.l.google.com
@ MX 40 alt4.gmr-smtp-in.l.google.com
</code></pre></div></div>
<p>Send a test email from somewhere else to make sure you are able to receive email for your custom domain in the inbox of your personal account before proceeding.</p>
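<p>You can also confirm that the new records are visible before the test email arrives. A quick sketch using the dnspython package (<code class="language-plaintext highlighter-rouge">pip install dnspython</code>; the domain name is a placeholder for your own):</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Sketch: verify the MX records have propagated (pip install dnspython)
import dns.resolver

answers = dns.resolver.resolve("custom.domain", "MX")  # your domain here
for record in sorted(answers, key=lambda r: r.preference):
    print(record.preference, record.exchange)
# Expect the gmr-smtp-in.l.google.com hosts above, lowest priority first
</code></pre></div></div>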
<h3 id="configure-outgoing-mail">Configure outgoing mail</h3>
<p>Now let’s make it so you can send email for your custom domain but from your personal account.</p>
<ol>
<li>On the gmail screen in your personal email, go to “Settings”, then “Accounts and Import”.</li>
<li>Click on “Add another email address” under “Send mail as”.</li>
<li>Put in the email of your workspace account, with “Treat as an alias” checked.</li>
<li>Enter <code class="language-plaintext highlighter-rouge">smtp.gmail.com</code> as the SMTP server, set port to <code class="language-plaintext highlighter-rouge">465</code>, and put in the credentials to your <em>personal</em> account in username and password.</li>
<li>This won’t work the first time; you’ll be directed to enable “Less secure app access” for your personal Google account, which <a href="https://myaccount.google.com/lesssecureapps">you can do here</a>.</li>
<li>Retry with your credentials again and it should work this time. You can now set this email as the default through the “make default” link.</li>
</ol>
<p>Try sending an email and ensure everything is working properly.</p>
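<p>If you prefer verifying from a script instead of the Gmail UI, the same relay can be exercised with Python’s standard library. The addresses here are placeholders, and with 2-step verification you would log in using an app password:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Sketch: send a test mail through smtp.gmail.com as the custom-domain
# alias. Addresses and password below are placeholders.
import smtplib
from email.message import EmailMessage

msg = EmailMessage()
msg["From"] = "you@custom.domain"          # the workspace alias
msg["To"] = "someone-else@example.com"
msg["Subject"] = "Alias send test"
msg.set_content("If this arrives from the custom domain, sending works.")

with smtplib.SMTP_SSL("smtp.gmail.com", 465) as server:
    server.login("your-personal-account@gmail.com", "app-or-account-password")
    server.send_message(msg)
</code></pre></div></div>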
<h3 id="migrate-all-your-old-mail">Migrate all your old mail</h3>
<p>There are a few ways to move all your email: using POP/IMAP is a popular option but has a drawback that you can’t migrate labels if you made heavy use of them.</p>
<p>I decided to use a custom tool — <a href="https://github.com/GAM-team/got-your-back">“Got Your Back”</a> — which uses the Gmail API and preserves labels.</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># This is generally a pretty terrible way to install scripts, be warned</span>
bash <<span class="o">(</span>curl <span class="nt">-s</span> <span class="nt">-S</span> <span class="nt">-L</span> https://git.io/gyb-install<span class="o">)</span>
</code></pre></div></div>
<p>Start by downloading all your email from the workspace account:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># gyb installs into ~/bin by default</span>
~/bin/gyb <span class="nt">--email</span> <your-workspace-account@custom.domain>
</code></pre></div></div>
<p>This process will first ask for authorization to manage your Google cloud account (instructions provided when you run the command).
Once you’ve created the app, make sure to go into “APIs & Services” > “OAuth consent screen” and click “Make external”. You can make it external for just one “Test user”;
enter the email of your personal account there. This will be necessary later when you import your emails into the personal account.</p>
<p>After the cloud app is created and you’ve pasted in the requisite client ID & secret, GYB will request authorization to read your email. On this screen, it is sufficient
to grant just “readonly access”. GYB will then begin downloading your email — this process took around 2 hours for me, YMMV.</p>
<p>Once this is done, you can then upload the email to your new account:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>~/bin/gyb <span class="nt">--email</span> <your-personal-account@gmail.com> <span class="nt">--action</span> restore <span class="se">\</span>
<span class="nt">--local-folder</span> GYB-GMail-Backup-<your-workspace-account@custom.domain>
</code></pre></div></div>
<p>This process takes much longer (took around 14 hours for me). You can tend to your other services while this is happening, but you should see all your old email
start appearing in your personal inbox.</p>
<p>You can delete GYB from your cloud console when the process is complete.</p>
<h3 id="filters">Filters</h3>
<p>If you have a lot of filters, you can export them from the “Filters and Blocked Addresses” settings page. Select all the filters — or just ones you want to move — and click export.
You can then import this file into your personal account. I recommend doing this after GYB has finished uploading your new email and labels, as the filters will rely on them.</p>
<h2 id="calendar">Calendar</h2>
<p>Migrating this involves a simple export & import of your calendar events through a single file.</p>
<ol>
<li>From the workspace calendar, go to Settings (gear in the top-right), then “Import & Export”.</li>
<li>Click on “Export”, which will download a zip file.</li>
<li>Unzip the file and upload it in the “Import” section of settings on your personal account.</li>
</ol>
<p>Note that this will only copy the events, but not any linked users. You’ll also have to re-share any calendars imported if you had shared them previously.</p>
<h2 id="photos">Photos</h2>
<p>Your takeout zip file should contain all your photos which you can upload again. However, I found an easier way to accomplish this, by using the <a href="https://support.google.com/photos/answer/7378858">“partner sharing”</a> feature. On your workspace account, initiate a partner share from settings to your personal account and accept it on the other end. On the personal account, select the option to save all photos from your partner account into your library.</p>
<p>Once I did this, face recognition on my family and pets didn’t work out of the box. I had to disable and re-enable the feature from my personal account to get this to work. After about a day, the timeline and photos all appeared exactly as they did in the workspace account. You may disable partner sharing once all photos have been saved to the library on your personal account.</p>
<p>Keep in mind that any albums shared with your workspace account must also be shared to your personal account — manually and one at a time. This can be a time consuming process depending on how many albums were shared with you, but I found no way to automate this.</p>
<h2 id="drive">Drive</h2>
<p>I didn’t have much stuff in Google drive, so I ended up just uploading everything again manually. For any Google sheets, docs, or slides — just share them directly with your personal account. Google Drive does offer a <a href="https://www.google.com/drive/download/">desktop app</a> to make this process a bit easier.</p>
<h2 id="youtube">YouTube</h2>
<p>I was rather lucky in this regard. Back when Google asked all YouTube accounts to be migrated to a Google Plus account, the backlash was so immense that they quickly offered an option to keep your YouTube account separate from (but linked to) your Google account. I remember taking advantage of this option, which has since come to be known as a <a href="https://support.google.com/youtube/answer/9367690?hl=en">“brand” account</a>.</p>
<p>If you’re in a similar situation as me — transferring your YouTube uploads, watch history, and playlists becomes somewhat simple. You need to add your personal account as an admin to your channel — doing this is not obvious. Go to <a href="https://studio.youtube.com/">studio.youtube.com</a>, click “Settings” in the left pane, then “Permissions” > “Manage Permissions”. This takes you to a page where you can invite your personal account as an “Owner” to your channel.</p>
<p>After 7 days, you will be able to switch the personal account from just “Owner” to “Primary Owner”. At this point, you can remove the workspace account, and still retain your YouTube account.</p>
<h2 id="google-fi">Google Fi</h2>
<p>Moving a Fi mobile subscription to a personal Google account is thankfully a <a href="https://support.google.com/fi/answer/6201840#zippy=%2Cwhat-to-do-if-your-admin-turns-off-google-fi">documented and supported</a> process. Painful, but doable.</p>
<p>Contact <a href="https://support.google.com/fi/gethelp">Fi support</a> and they’ll walk you through the steps. It generally involves verifying both your workspace and personal accounts. You can only do this if you have fully paid off your phone. If you have additional friends or family on your account, you’ll have to remove them from your plan (temporarily). This was the most painful part of the process as additional Fi subscribers on my account basically lost cell service for around 2 hours.</p>
<p>Once it is moved over, you keep your original phone number, billing, and service. At this point you can re-add your friends & family, everything should be back to normal.</p>
<h2 id="analytics">Analytics</h2>
<p>If you use Google analytics, you can add your personal account as an admin for any properties you created. By giving this account full admin privileges, you retain basically the same functionality as before.</p>
<h2 id="cloud">Cloud</h2>
<p>In case you use Google Cloud or Firebase, these services are also tied to your workspace account. Similar to Analytics, adding your personal account as admin on projects you wanted to keep was a simple way of retaining access.</p>
<h2 id="alerts-groups-keep">Alerts, Groups, Keep</h2>
<p>I found no way to migrate these services to the new account. I manually recreated alerts from my personal account, and re-subbed to the groups I was interested in. For Keep, the notes you had are available in the takeout file as plain text.</p>
<h2 id="android-phone">Android Phone</h2>
<p>The final step was to switch my Android phone to my personal account. You can login with <a href="https://support.google.com/googleplay/answer/2521798?hl=en">multiple Google accounts</a> on Android which was helpful. I first moved my WhatsApp backup, by uploading it to my personal account (you can also use <a href="https://faq.whatsapp.com/android/chats/how-to-restore-your-chat-history">local backup</a>).</p>
<p>After about a week of using my phone this way to make sure all data was moved over, I did a factory reset and logged back in with only my personal account. It’s been a couple of weeks using it that way and I haven’t had to go back to my workspace account for anything!</p>
<h2 id="closing-thoughts">Closing Thoughts</h2>
<p>Switching from workspace to a personal account was a time consuming yet insightful process. As more and more of your personal data moves to the cloud, you end up being beholden to a single company for your whole digital life. This can be scary, and using a custom domain is one of the important ways in which I’m able to retain some control over my digital identity. Going through this process made me build some confidence in my ability to move things over to a new provider should the need arise in the future.</p>
<p>While being able to download your data through services like Takeout is a helpful start, we are still a long way from true data portability. As the process above has outlined, it’s not just about access to your <strong>raw data</strong> but <strong>metadata</strong> that may be provider specific — such as comments on photos, filters & labels on your email, and your video uploads and watch history. I dream of a future where you are able to seamlessly store, control, and move not just your raw data but also all digital interactions that you have had or others have had with content you create.</p>
<p>Google has been working on the <a href="https://datatransferproject.dev/">Data Transfer Project</a>, which Apple recently joined and includes contributions from Facebook, Microsoft, and Twitter. The project has similar goals but currently only works for moving Photos between a select few services. We shall see if this initiative will expand to more types of data in the future!</p>anantGoogle recently announced that all “G Suite legacy free edition” (formerly known as “Google Apps”, currently known as “Google Workspace”) accounts will need to transition to their paid workspace plans starting July 1, 2022. Legacy users will get access to a discounted rate of $3/user/month, which will turn into $6/user/month starting July 2023 at the lowest tier.Project Assemble Redux2020-05-24T00:00:00+00:002020-05-24T00:00:00+00:00https://www.kix.in/2020/05/24/project-assemble-redux<p>Last time I <a href="/2011/02/02/project-assemble/">posted</a> about building PCs was in 2011. <a href="https://proness.kix.in/misc/dream_comp.html">That PC</a> lasted me quite a while - 6 years - at which point it got an upgrade that I didn’t <a href="https://proness.kix.in/misc/dream_comp2.html">write about</a> (Intel 6700k on an Asus Z170, with a GTX 970). That build certainly held its own and even ran <a href="https://www.half-life.com/en/alyx/">Half-Life: Alyx</a> on a Rift just fine. But, the graphics card in particular is starting to show its age, and hey, with everyone stuck at home I figured it was time for another upgrade.</p>
<p>I haven’t really stopped using Macbook Pros for work - so my PC mostly gets used for gaming (and occasionally reusing my mouse/keyboard/monitor with a docked MBP, via <a href="https://symless.com/synergy">Synergy</a> for working on side projects). Gaming is usually synonymous with an Intel-based build (and all my builds so far have used their chips), however, I was pleasantly surprised to discover that AMD has been giving Intel a run for their money of late. Hooray for competition!</p>
<p>The <a href="https://www.theverge.com/circuitbreaker/2019/5/28/18642251/amd-ryzen-3000-cpus-3900x-3800x-3700x-3600x-3600-price-release-date-specs">Ryzen 3rd gen CPUs</a> are without a doubt the best bang for buck in the consumer CPU market. Even with <a href="https://www.anandtech.com/show/15758/intels-10th-gen-comet-lake-desktop">Intel’s 10th generation chips</a> launching, which bring market-leading raw clock speed performance, they still can’t match AMD’s price point to core count ratio. While Intel still dominates the highest end gaming segment, doing almost anything else on your computer (like streaming, multi tasking in your browser, or writing code) means that AMD pulls ahead of Intel quite handily at the cost of slightly lower gaming performance.</p>
<p>On the graphics card end, NVIDIA is still king of the hill. Wish there were more competition here, but the <a href="https://en.wikipedia.org/wiki/GeForce_20_series">Turing RTX 20 series</a> are the best consumer cards in market, and with the new Ampere architecture expected to launch with the <a href="https://www.tomshardware.com/news/nvidia-rtx-3080-ampere-all-we-know">RTX 30 series cards</a> later this year, there are no signs of them slowing down.</p>
<p>After a couple days of research, I settled on the final <a href="https://proness.kix.in/misc/dream_comp3.html">build spec</a>.</p>
<p><a href="/images/2020/pc-parts.jpg"><img src="/images/2020/pc-parts.jpg" alt="PC Parts" /></a></p>
<h2 id="cpu">CPU</h2>
<p><a href="https://www.amazon.com/gp/product/B07SXMZLPK/">Ryzen 3700x</a>. This is an 8-core/16-thread processor at a base clock of 3.6GHz but almost always runs at 4GHz by default, and can boost up to 4.4GHz occasionally. I probably could have gotten away with a 3600x just fine, but this buys me a smidge of future proofing given the graphics card I was going to pair with.</p>
<h2 id="motherboard">Motherboard</h2>
<p><a href="https://www.amazon.com/gp/product/B07SVRZGMX/">Gigabyte x570 Aorus Elite</a>. Again, I probably could have gotten away with an older B450-series motherboard, but the X570 comes with PCIe Gen 4 and USB 3.2 support. Not that much more expensive either.</p>
<p><a href="/images/2020/motherboard-and-cpu.jpg"><img src="/images/2020/motherboard-and-cpu.jpg" alt="Motherboard & CPU" /></a></p>
<h2 id="ram">RAM</h2>
<p><a href="https://www.amazon.com/gp/product/B07WTS8T2W/">G.Skill 2x16GB @ 3600MHz</a>. AMD builds have a reputation for being somewhat finicky with RAM setup. The Ryzen 7 is basically built to take advantage of 3600Mhz DDR4 and this kit is widely used with no known compatibility problems. 3200Mhz will also work just fine if you want to save a little.</p>
<p><a href="/images/2020/ram.jpg"><img src="/images/2020/ram.jpg" alt="RAM" /></a></p>
<h2 id="gpu">GPU</h2>
<p><a href="https://www.amazon.com/gp/product/B07VFKM4VQ/">Asus ROG Strix RTX 2080 Super</a>. The RTX 20 “Super” series is great value for money and beats every non “super” card (except for the very high-end 2080Ti). If you’re gaming at 1080p, a 2070 super is of great value, but I was going for a 1440p monitor (more on that below) and felt the 2080 would last me longer at that resolution.</p>
<p><a href="/images/2020/gpu.jpg"><img src="/images/2020/gpu.jpg" alt="GPU" /></a></p>
<h2 id="power-supply">Power Supply</h2>
<p><a href="https://www.amazon.com/gp/product/B07WDLTKNM/">EVGA Supernova 650 G5</a>. Something is off with the PSU supply chain, with most units sold out at all major retailers. Not sure if this is COVID-19 related or otherwise, but buying a PSU right now is mostly just a function of what you can get. My first choice was a Corsair, but this EVGA had “OK” reviews and was available. A few days after I ordered, a bunch of Seasonic units were back in stock, which would have been my second choice. One interesting thing is that power draw has actually gone down over the years: SLI is no longer in vogue and every component has just gotten more efficient. The units are also “modular” now, which means you only use the cables you actually need (this is a bigger deal than average for me: my previous full size “high air flow” Cooler Master tower with a non-modular power supply had a really bad dust problem).</p>
<h2 id="hard-drive">Hard Drive</h2>
<p><a href="https://www.amazon.com/gp/product/B07TJX83W2/">Aorus NVMe Gen4 M.2 2TB</a>. Ok, I’ll admit this drive is somewhat overkill, but if I’m getting a Gen 4 compatible CPU and motherboard, why not? These are some blazing fast read/write speeds that you’ll likely only notice with some very disk heavy workload (like compiling Firefox?). This is my first M.2 drive, and I quite like that it fits right on your motherboard. Very happily kissed my old set of SSDs and spinning drives goodbye!</p>
<p><a href="/images/2020/m2.jpg"><img src="/images/2020/m2.jpg" alt="M2 SSD" /></a></p>
<h2 id="tower">Tower</h2>
<p><a href="https://www.amazon.com/gp/product/B074DQVB97/">Fractal Meshify C</a>. Given nobody uses optical drives anymore, it just makes sense to go for a compact mid-tower case unless you’re planning to go crazy with expansion and storage. I liked the look and feel of the Meshify series, and Fractal are known for their great cable management and generally high quality cases. I’m not crazy about RGB in my build, so the dark tinted tempered glass it comes with works pretty well.</p>
<h2 id="monitor">Monitor</h2>
<p><a href="https://www.microcenter.com/product/484980/dell-alienware-aw3418dw-341-uw-qhd-120hz-hdmi-dp-g-sync-curved-ips-led-gaming-monitor">Alienware AW3418DW</a>. This monitor is what really started the whole upgrade idea in my head, as I happened to find a really good deal at $750 (it usually retails for around $999). Figured it was time to embrace the ultrawide 1440p experience!</p>
<h2 id="peripherals">Peripherals</h2>
<p><a href="https://www.amazon.com/gp/product/B07SXX7P6D/">Kinesis Gaming Freestyle</a> keyboard, and <a href="https://www.amazon.com/gp/product/B07BMGTR6D/">Kinesis Gaming Vektor</a> mouse. I use a <a href="https://kinesis-ergo.com/shop/advantage2/">Kinesis Advantage</a> for work, and really love their gaming products too. High quality, reliable hardware, what more to say?</p>
<h2 id="assembly">Assembly</h2>
<p>Actual assembly was easier than ever; with everything living on the board itself, cable management was trivial and the whole thing was up and running in just a couple of hours. I made one small change, which was to swap out the Wraith Prism cooler included with the CPU for a <a href="https://www.amazon.com/gp/product/B07H25DYM3">Hyper 212 (Black Edition)</a>. The Prism cooler worked fine from a thermal perspective, it was just too noisy for my taste.</p>
<p><a href="/images/2020/final-build.jpg"><img src="/images/2020/final-build.jpg" alt="Final Build" /></a></p>
<p>Final touches were on the actual desk setup. I had to figure out how to make use of my two old monitors alongside the new ultrawide. Decided to change out my desk to one <a href="https://www.amazon.com/gp/product/B000W8I1D8">that would fit</a> all the monitors side-by-side with a couple of monitor arms. This is how it all came together:</p>
<p><a href="/images/2020/final-setup.jpg"><img src="/images/2020/final-setup.jpg" alt="Final Setup" /></a></p>
<p>The PC’s been running smoothly and performed as expected on the benchmarks, though it runs a little hotter than I’m used to. Over a sustained gaming session, both the CPU and GPU stay just a little below 70°C which feels nominal, so I’m not too worried.</p>
<p>It felt like the whole PC building process has gotten much smoother over time, and the average consumer is a lot more informed (YouTube has thousands of videos on the topic these days). Definitely a great time to get into it as a hobby if that’s something you’ve always wanted to try!</p>anantLast time I posted about building PCs was in 2011. That PC lasted me quite a while - 6 years - at which point it got an upgrade that I didn’t write about (Intel 6700k on an Asus Z170, with a GTX 970). That build certainly held its own and even ran Half-Life: Alyx on a Rift just fine. But, the graphics card in particular is starting to show its age, and hey, with everyone stuck at home I figured it was time for another upgrade.Introducing ThinMusic2018-12-29T00:00:00+00:002018-12-29T00:00:00+00:00https://www.kix.in/2018/12/29/introducing-thinmusic<p>At the peak of my career as a software engineer, I spent most of my free time either playing video games or reading books about engineering management. These days, my day job is mostly engineering management, and so I find myself carving out play-time to write some code (and of course, still indulge in video games).</p>
<p>A result of that play-time over this winter break merits broader sharing than my usual side project. I built a web player for Apple Music, <a href="https://www.thinmusic.com">called ThinMusic</a>, to scratch two of my itches:</p>
<ol>
<li>As an Apple Music subscriber, I had no way to play songs on my Linux desktop.</li>
<li>Scrobbling my play history on Apple Music to <a href="http://last.fm">last.fm</a> has never worked reliably.</li>
</ol>
<p>The latter point irked me quite a bit (given I’ve been scrobbling consistently since 2006 or so), but not enough to make me switch to Spotify, which supports scrobbling natively. (Worth mentioning: Apple Music’s family plan is the best in the industry, especially with international family members, which adds additional inertia.)</p>
<p>However, at this year’s <a href="https://developer.apple.com/videos/wwdc2018/">WWDC</a>, Apple announced <a href="https://developer.apple.com/documentation/musickitjs">MusicKit JS</a> which was quite intriguing on its own, but also opened the doors to kill these two birds with one stone. Thus, ThinMusic was born:</p>
<p><a href="/images/2018/thinmusic.png"><img src="/images/2018/thinmusic.png" alt="ThinMusic Screenshot" /></a></p>
<p>ThinMusic requires an Apple Music subscription (and a Facebook account so it can store the authentication tokens securely). It supports all the basic features of a music player and works on any modern browser. It is also optimized for desktop use, since on mobile devices you’re probably better off with Apple Music’s native app (available on both iOS and Android). You can use it on mobile if you really want, but be warned the experience is not as good (mostly due to my laziness to optimize the layout and make a real responsive design).</p>
<p>As an added bonus, it appears this might be a good way to play songs on the <a href="https://portal.facebook.com">Portal</a>, since the <a href="https://support.apple.com/en-in/HT209250">Apple Music skill for Alexa</a> doesn’t work on it yet:</p>
<p><a href="/images/2018/thinmusic-portal.png"><img src="/images/2018/thinmusic-portal.png" alt="ThinMusic on Portal" /></a></p>
<p>Just open the browser app, navigate to <a href="https://www.thinmusic.com">thinmusic.com</a> and login. Since it is running inside the browser, there is no support for voice control (and who wants to type for extended periods on the Portal), but if you just want to queue up a playlist quickly, this setup can work pretty well.</p>
<p>If you’re an Apple Music subscriber, give ThinMusic a whirl and email <a href="mailto:support@thinmusic.com">support@thinmusic.com</a> with your questions or suggestions!</p>Anant NarayananAt the peak of my career as a software engineer, I spent most of my free time either playing video games or reading books about engineering management. These days, my day job is mostly engineering management, and so I find myself carving out play-time to write some code (and of course, still indulge in video games).Teaching Ozlo about Pokémon GO2016-08-04T00:00:00+00:002016-08-04T00:00:00+00:00https://www.kix.in/2016/08/04/ozlo-pokemon-go<p>Pokémon GO is all the rage these days. <a href="https://www.ozlo.com">Ozlo</a>, your friendly AI sidekick,
would be remiss if he didn’t help you catch them all!</p>
<p><a href="/images/2016/ozlo-pokemon.png"><img src="/images/2016/ozlo-pokemon.png" alt="Ozlo learns about Pokémon GO" /></a></p>
<p>Thanks to Ozlo’s unique, knowledge-based approach to the world, we were able to teach him about
Pokémon in just under a week, including how to find PokéStops and Pokémon Gyms near places you
might be going. In this blog post, we’ll take a look at some of Ozlo’s inner workings,
what goes into teaching him a completely new concept, and why his ability to learn quickly matters.</p>
<p>The process involves three high-level steps:</p>
<ul>
<li>Feeding Ozlo data about the new concept</li>
<li>Teaching Ozlo to understand how people talk about the concept</li>
<li>Teaching Ozlo how to talk to people about what it knows</li>
</ul>
<p>We’ll cover each of these steps one-by-one and then discuss why it’s important we do things this way —
and why that makes Ozlo fundamentally different than many other chatbots and AI assistants out there.</p>
<h2 id="data">Data</h2>
<p>Ozlo’s view of the world consists of <em>entities</em> (people, places, or things) and relationships among them.
Teaching Ozlo about something new begins with acquiring data about the subject so that we can augment
his knowledge of the world. This can happen by several means — crawling the web, hitting APIs,
and obtaining data from partners, for example.</p>
<p>In Pokémon GO’s case, we decided to focus on a use-case that helps you play the game effectively but
doesn’t break it or cheat in any way: finding PokéStops. PokéStops are places all around the world,
and they have certain attributes that identify them: coordinates, a name, a picture, and sometimes a description.</p>
<p>Once we found all the PokéStops in the US, we turned them into entities and started creating relationships.
Ozlo already knows about all the cities in the US as well as what landmarks and restaurants exist in each city.
With this knowledge, he can perform reasoning to know that if a given place is inside the polygon for a given
city’s boundary, then the place must be in the city (and so on…)</p>
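<p>That spatial step boils down to a point-in-polygon test. Here’s a toy sketch of the idea using the shapely library — not Ozlo’s actual implementation, and the coordinates are invented for illustration:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Toy version of the "is this place inside this city?" check.
from shapely.geometry import Point, Polygon

# A very rough city boundary as a list of (longitude, latitude) corners
city_boundary = Polygon([
    (-122.52, 37.70), (-122.35, 37.70),
    (-122.35, 37.83), (-122.52, 37.83),
])

pokestop = Point(-122.3937, 37.7955)  # a PokéStop near the Ferry Building

if city_boundary.contains(pokestop):
    print("PokéStop is inside the city: add an 'in city' relationship")
</code></pre></div></div>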
<p>When this process concludes, Ozlo has a mental map of where all the PokéStops in the US are located,
which of them are “gyms”, what cities they are in, and which landmarks and restaurants they are near.</p>
<h2 id="understanding">Understanding</h2>
<p>Next, we had to teach Ozlo some of the common ways in which humans might ask him about PokéStops.
In the beginning that involves just writing out some examples and telling Ozlo what each of them mean.</p>
<p>Consider the following sentence, resembling something a human might ask Ozlo:</p>
<blockquote>
<p>“Show me pokemon gyms near the ferry building”</p>
</blockquote>
<p>There’s a lot in that sentence that Ozlo can already understand! He has a basic understanding of the English language,
but also knows how people talk about restaurants and landmarks (since we taught him that earlier). What does Ozlo see
in that sentence?</p>
<blockquote>
<p><em>“show me”</em>: Here’s a hint that the answer to this question requires some sort of visual presentation.</p>
</blockquote>
<blockquote>
<p><em>“near”</em>: I’ve seen this word many times before and when it is followed by a name of a place, I know what that means.</p>
</blockquote>
<blockquote>
<p><em>“ferry building”</em>: Looks like I have many entities that match this name. But, I can rank all the places with this
name by their popularity and distance from the user’s current location to narrow down a likely candidate.</p>
</blockquote>
<p>The only part of that sentence Ozlo didn’t quite understand was “pokemon gyms”. This is where we step in and give him
some examples along with what they mean:</p>
<blockquote>
<p><em>“pokestops”</em>: This means entities that are PokéStops</p>
</blockquote>
<blockquote>
<p><em>“pokemon gyms”</em>: This means entities that are PokéStops of type “gym”</p>
</blockquote>
<p>We also added many more variations of the above to give him a basic understanding of PokéStops. And don’t forget —
Ozlo also keeps learning as you use him — so he’ll collect a lot more examples over time than what we just start
him off with!</p>
<h2 id="presentation">Presentation</h2>
<p>The final step was to teach Ozlo how to turn his answer into words and interactions that humans can understand.
In many ways this is exactly the reverse of Ozlo trying to translate what a human said into terms he can understand.</p>
<p>Ozlo already has a good knowledge of English, so he can mostly construct the sentence on his own. We just need to give
him a few hints and we get:</p>
<blockquote>
<p>“There are many Pokémon Gyms around Ferry Building”</p>
</blockquote>
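<p>Under the hood, a hint like that could be as simple as a sentence template plus a small quantifier rule. Here’s a sketch; none of these names come from Ozlo’s actual implementation:</p>
<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code>package main

import (
	"os"
	"text/template"
)

// One hypothetical "hint": a sentence template whose blanks are
// filled from the structured answer.
var reply = template.Must(template.New("reply").Parse(
	"There are {{.Quantifier}} {{.Kind}} around {{.Anchor}}\n"))

// quantifier maps a result count to a natural-sounding word.
func quantifier(n int) string {
	switch {
	case n == 0:
		return "no"
	case n > 3:
		return "many"
	default:
		return "a few"
	}
}

func main() {
	reply.Execute(os.Stdout, map[string]string{
		"Quantifier": quantifier(12), // pretend the search found 12 gyms
		"Kind":       "Pokémon Gyms",
		"Anchor":     "Ferry Building",
	})
}
</code></pre></div></div>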
<p>Then we construct the visual format of the response. In our iOS app we settled on using the “multi-pin map” element,
which is an easy way to view several points of interest in a given geographic area. For now, we just tell Ozlo what
type of visual result format to use, based on the user’s device.</p>
<p><a href="/images/2016/ozlo-pokemon-mpm.png"><img src="/images/2016/ozlo-pokemon-mpm.png" alt="Ozlo learns about Pokémon GO" /></a></p>
<p>Ozlo’s capabilities aren’t limited to rendering maps, though. He can choose from a variety of output formats,
and we pick the one best suited to the medium you’re using to communicate with him.</p>
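<p>Picking the format itself can be sketched as a tiny decision function; the format names below are made-up stand-ins for the real layout language:</p>
<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code>package main

import "fmt"

// pickLayout chooses a visual result format for a device and result
// count; a real system considers far more context than this.
func pickLayout(device string, results int) string {
	if device != "ios" {
		return "plain-text"
	}
	if results > 1 {
		return "multi-pin-map"
	}
	return "single-card"
}

func main() {
	fmt.Println(pickLayout("ios", 12)) // multi-pin-map
}
</code></pre></div></div>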
<h2 id="why-this-matters">Why This Matters</h2>
<p>Why go to all this effort to actually teach Ozlo about PokéStops instead of just having Ozlo redirect your question
to some other service? We’ve talked about the <a href="http://venturebeat.com/2016/07/17/personal-assistant-bots-like-siri-and-cortana-have-a-serious-problem/">multi-agent problem</a> before — and we believe there is a fundamental
difference between bots that <strong>know</strong> things and bots that <strong>guess</strong> what other services might know about things.</p>
<p>As Ozlo’s knowledge of the world grows, adding more data enriches his entire world view. There’s a network effect
among entities: because entities hold relationships with one another, each new entity can connect to every existing
one, so new data compounds Ozlo’s understanding rather than merely adding to it. This is what lets us leverage the
fact that Ozlo already knows about “the Ferry Building” to help you find the PokéStops near it with only a minimal
amount of effort.</p>
<p>We can’t wait for the day when we’re not the only ones teaching Ozlo about new concepts! In the meantime, please keep
using Ozlo and giving him feedback to help him continue to learn more about the world.</p>Anant NarayananPokémon GO is all the rage these days. Ozlo, your friendly AI sidekick, would be remiss if he didn’t help you catch them all!Meet Ozlo2016-05-12T00:00:00+00:002016-05-12T00:00:00+00:00https://www.kix.in/2016/05/12/meet-ozlo<p>Two days ago, a project I’ve been working on for a little over two years was unveiled to the world.
Meet Ozlo, your friendly AI sidekick!</p>
<p><a href="https://dribbble.com/shots/2704505-Meet-Ozlo"><img src="/images/2016/ozlo.gif" alt="Ozlo" /></a></p>
<p>First things first: if you haven’t signed up yet, hit up <a href="https://www.ozlo.com/?vip=ANANT">this link</a> which
includes a VIP code to fast-track you into our invite-only app.</p>
<p>A lot has been said about Ozlo already: <a href="https://medium.com/teamozlo/introducing-ozlo-d5cce73d7ba5">Charles Jolley</a> (co-founder),
<a href="https://news.greylock.com/our-investment-in-ozlo-a7f6eb9f61eb#.r1j0eeu8k">John Lilly</a> (investor), <a href="http://lloyd.io/meet-ozlo">Lloyd Hilaiel</a> (friend & colleague),
<a href="http://todd.agulnick.com/2016/05/11/what-ive-been-up-to-ozlo/">Todd Agulnick</a> (friend & colleague)
and even <a href="https://www.buzzfeed.com/alexkantrowitz/ozlo-the-ai-chatbot-wants-to-help-you-find-coffee-and-food">Buzzfeed</a>!
Here’s my perspective…</p>
<h2 id="why">Why</h2>
<p>It didn’t take me long after I first heard the idea for a better mobile search experience from
<a href="https://twitter.com/michaelrhanson">Mike</a> <sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup> and <a href="https://twitter.com/okito">Charles</a>
to stop what I was doing and jump on board.</p>
<p>The fundamental problem we’re trying to solve is that even though our smartphones enable us to do a lot more than we could before,
the process of finding people, places and things on them is not very different from how you would do it on a desktop.</p>
<p>That’s the natural course for most technology to take.<sup id="fnref:2" role="doc-noteref"><a href="#fn:2" class="footnote" rel="footnote">2</a></sup> The first application on any new platform is usually a v1, “available here too”, product. This first version often under-utilizes
the platform’s true capabilities, and its creators can quickly be lulled into thinking they’ve created the optimal experience
for the consumer.</p>
<p>What’s v2 for search on mobile devices? To answer this question is why we created Ozlo.</p>
<h2 id="what">What</h2>
<p>In attempting to answer this question, we built something that we thought might work. It didn’t work quite as well
as we’d have liked. So we did it again. And again. Fast-forward two years and you arrive at Ozlo: a <strong>personal</strong> and <strong>intelligent</strong>
companion that <em>helps you find things</em>.</p>
<p>The first manifestation of that idea is an <a href="http://ozlo.com/download">iOS application</a> that can help you find food.
In the app, you interact with Ozlo via a chat-like interface. Here I am trying to find that place that I can’t quite
remember the name of:</p>
<video controls="" autoplay="" loop="">
<source src="/images/2016/indian-pizza.mp4" type="video/mp4" />
</video>
<p>This iteration of the app is purposely focused on one goal – finding you food. But there are several underlying themes
that have the potential to pave Ozlo’s way to something grander:</p>
<h3 id="conversational">Conversational</h3>
<p>Searching for something is usually not a one-shot activity. Humans don’t work that way: we ask a question,
then follow up with more questions until we’ve refined our own thoughts and arrived at the answer we’re
looking for. It’s exciting that Ozlo has the potential to participate in this back-and-forth.</p>
<h3 id="personal">Personal</h3>
<p>Ozlo has the potential to know you over time, learn about your preferences and interests in a meaningful way.
To me, this brings a face to the otherwise utilitarian search box that feels disconnected and impersonal.</p>
<p>As a vegetarian, I can already appreciate Ozlo helping me find hidden gems at restaurants I’d usually dismiss.
What if Ozlo could also recommend movies for me to watch, grab that hard-to-get restaurant reservation,
and help me find the perfect anniversary gift?</p>
<h3 id="intelligent">Intelligent</h3>
<p>In the past few years, technology seems like it’s finally getting to the point where building an agent that can
really <em>understand</em> what humans say is tantalizingly close to being possible.<sup id="fnref:3" role="doc-noteref"><a href="#fn:3" class="footnote" rel="footnote">3</a></sup></p>
<p>Ozlo is different from typical search engines, which return results containing the same words as your query
without knowing what those words mean. Ozlo tries to understand what you said, and then tries to arrive at an answer.
To me, that makes Ozlo intelligent.</p>
<p>Training Ozlo to understand the nuances of human language is going to be a very difficult task. But it is by no
means impossible, given the resources we (as computer engineers and scientists) have at our disposal these days.</p>
<h2 id="how">How</h2>
<p>The really interesting bits are in the technology behind Ozlo and how we built it. This is some of the deepest
technology I’ve ever had a part in building and I’m extremely proud of it. To make Ozlo work, we’ve had to write
several pieces of software from scratch.</p>
<p>On the backend:</p>
<ul>
<li><em>Data Pipeline</em>:<br />to ingest, dedupe and glean structure from the mess of data we find, at scale and with speed.</li>
<li><em>Search Engine</em>:<br />to index the facts our data pipeline emits and allow us to efficiently query them, at scale and with speed.</li>
<li><em>Query Understanding</em>:<br />to turn human language into a series of structured queries machines can understand (see the sketch after these lists).</li>
<li><em>Dialog System</em>:<br />to keep track of the high-level structure of the conversation you’re having with Ozlo.</li>
</ul>
<p>On the frontend:</p>
<ul>
<li><em>Language Synthesis</em>:<br />to turn structured results back into friendly text humans can understand.</li>
<li><em>Layout Language</em>:<br />to efficiently and generatively render results as a graphical layout.</li>
<li><em>View Synthesis</em>:<br />to aggregate, refine and generate the final layout humans will see.</li>
<li><em>iOS App</em>:<br />to turn that layout back into pixels that are delightful to look at and interact with.</li>
</ul>
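<p>For a flavor of how these pieces might line up end to end, here’s a deliberately tiny sketch; every type, function, and result below is a hypothetical stand-in rather than our actual interfaces:</p>
<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code>package main

import "fmt"

// StructuredQuery is the sort of thing query understanding might
// emit: an intent plus machine-readable constraints, not raw keywords.
type StructuredQuery struct {
	Intent     string
	EntityKind string
	Near       string
}

// understand stands in for query understanding; a real system parses
// the utterance, here we hard-code the example's interpretation.
func understand(utterance string) StructuredQuery {
	return StructuredQuery{Intent: "find", EntityKind: "restaurant", Near: "the Ferry Building"}
}

// search stands in for the search engine over the fact index,
// returning made-up results.
func search(q StructuredQuery) []string {
	return []string{"a hypothetical pizzeria", "a hypothetical curry house"}
}

// synthesize stands in for language synthesis on the frontend.
func synthesize(q StructuredQuery, results []string) string {
	return fmt.Sprintf("I found %d %ss near %s.", len(results), q.EntityKind, q.Near)
}

func main() {
	q := understand("find me indian pizza near the ferry building")
	fmt.Println(synthesize(q, search(q)))
}
</code></pre></div></div>
<p>In reality, each of those three functions is a large system in its own right, with the dialog system and layout language sitting between them.</p>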
<p>We built most of our backend in Go. It’s no secret that I’ve been a <a href="https://www.kix.in/2009/11/11/go-why-i-e29da4-google/">fan of Go</a>
since its inception, primarily because of my affinity for Plan 9, but this is the first time I’ve been able to observe it being used at
large scale for a production-quality project. I couldn’t be happier with our choice, and I’ll admit that I’ve had some days where I
get into work only because I’m excited by the prospect of writing some Go.</p>
<p>We built most of our frontend in NodeJS (and ObjC for the iOS app, of course). It’s also no secret that I’ve been a huge
<a href="https://www.google.com/#safe=off&q=site:kix.in+javascript">proponent of Javascript</a> and our frontend has been chugging along
happily (we’ve had a few refactorings, but really, what JS code base doesn’t go through at least two?). Say what you will
about JS & NPM, especially in the recent past; one cannot deny the convenience and speed of development offered by the
JS runtime.
<p><a href="/images/2016/ozlo-cityscape.png"><img src="/images/2016/ozlo-cityscape.png" alt="Ozlo Cityscape" /></a></p>
<p>I hope you share my excitement about Ozlo. Looking forward to whatever comes next!</p>
<div class="footnotes" role="doc-endnotes">
<ol>
<li id="fn:1" role="doc-endnote">
<p>John calls Mike an <a href="http://techcrunch.com/2012/08/09/mike-hanson-joins-greylock-as-eir/">“anytime, anything, anywhere”</a> person, and it couldn’t be truer. <a href="#fnref:1" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:2" role="doc-endnote">
<p>Take publishing for instance – when tablets were first introduced – a publication’s first instinct was to just take what they had on paper and turn it into pixels. <a href="#fnref:2" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:3" role="doc-endnote">
<p>We’ve observed the resurrection of the term “AI” to refer to this sort of thing. It’s often an overloaded term, but there is no doubt that the industry as a whole has made big technological strides in deep learning and machine intelligence. <a href="#fnref:3" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
</ol>
</div>Anant NarayananTwo days ago, a project I’ve been working on for a little over two years was unveiled to the world. Meet Ozlo, your friendly AI sidekick!Amazon Echo2015-04-26T00:00:00+00:002015-04-26T00:00:00+00:00https://www.kix.in/2015/04/26/amazon-echo<p>I received my <a href="http://www.amazon.com/oc/echo/">Amazon Echo</a> recently. I ordered it merely as a curiosity and to generally stay aware of industry trends. But after just a few days of using it at home, I love it enough to prompt dusting off this blog after almost two years of no posts!</p>
<p>The first thing that caught my attention was the sheer accuracy of its voice recognition. The state of the art is already pretty good; the dual pillars of cheap, persistent computing power (i.e., “the cloud”) and renewed interest in machine learning brought us Siri (and its Microsoft & Google equivalents). In my experience, these have been accurate more than 95% of the time, a long way from the days of offline speech recognition (à la <a href="http://www.nuance.com/dragon/index.htm">Dragon Naturally Speaking</a>). The bar is already high, but I’m comfortable saying Amazon Echo’s voice recognition is definitely better than Siri’s or Google’s.</p>
<p>I’m not sure how they pull it off. Maybe it’s not because of better software, but simply better hardware. The Echo has an array of 7 microphones that are always listening and wake up as soon as you say “Alexa”…</p>
<p>There is an element of genius in packaging it as a standalone cylinder that sits in a somewhat central location in your home. Being able to talk to it, hands-free, from a wide range of places around the house sounds like it wouldn’t be that big a deal… until you actually do it. Then having to find your phone, pick it up, and push a button to make it do something seems archaic and boring.</p>
<p>Now, the only problem is that even though Echo understands what I’m saying, it doesn’t always know how to respond. The companion app often shows an accurate transcription of what I said, but when the request doesn’t fall into one of the categories the device is designed to handle at the moment, I don’t get a satisfactory response.</p>
<p>But there are relatively easy ways to fix that. The <a href="https://developer.amazon.com/public/solutions/devices/echo">Echo API</a> is currently invite-only, like the ability to purchase the device itself. As more developers get their hands on it, we’ll start to see many interesting things happen. Some developers already <a href="http://hackaday.com/2014/12/24/home-automation-with-the-amazon-echo/">hacked their way</a> into <a href="http://blog.zfeldman.com/2014-12-28-using-amazon-echo-to-control-lights-and-temperature/">controlling their home lighting</a> (Amazon now officially supports integrations with Philips Hue). I’m itching to make it control my home theater system, currently riddled with half a dozen remotes.</p>
<p>The idea of having a single hub in your home that’s always listening and can control all your other devices is incredibly appealing to me. For one, it means that your TV, Xbox, Nest, and other home devices don’t have to build in (often bad) voice recognition systems themselves.</p>
<p>Finally, I was impressed by how handily the Echo passed Larry Page’s “<a href="http://www.businessinsider.com/larry-page-toothbrush-test-google-acquisitions-2014-8">toothbrush test</a>”. Even with its somewhat limited functionality, my wife and I have already used it several times every day since we received it — for everything from compiling a shared grocery list and adding reminders to setting alarms and playing music.</p>
<p>I think Amazon is onto something big. Priced at an aggressive $99, the Echo has the potential to make it into the living rooms of a large majority of households in the US (and, eventually, the world). Will we now see a flurry of competitors from tech giants and startups?</p>
<p>Exciting times!</p>Anant NarayananI received my Amazon Echo recently. I ordered it merely as a curiosity and to generally stay aware of industry trends. But after just a few days of using it at home; I love it enough to prompt dusting off this blog after almost two years of no posts!