Chapter 3: Basic Tools

This chapter covers practical tools that every developer should be able to use with a high level of proficiency. They aren’t the be-all and end-all of being a developer, but they are very good at what they do.

Plain Text

Thomas and Hunt make a compelling argument for keeping knowledge in plain text. It can be structured (JSON, YAML) or unstructured. In either form, it should be written in a `human-readable` format, not just be human understandable. The difference is that a human can make out a random string of letters and numbers, but can’t read it in a way that yields insight or understanding.

Shells

They advise favoring the shell over an IDE as much as possible. The shell is the `workbench` for manipulating files, and it can be customized to your preferences. For example, my shell in Terminal (macOS) displays the current folder and git branch on one line above the cursor.

You can also use aliases to help automate repeated operations to avoid having to retype them manually. For example, I used ZSH at Amazon to create shortcuts for specific terminal commands and options to improve my workflow.

Another useful tip is to leverage command completion. I use command completion all the time, especially when working with Git. I can easily recall previous commands and do path completion when navigating folders and files in Terminal.

My training and work as an engineer at Amazon were more focused on learning to use an IDE (IntelliJ). There was some command-line work, but it wasn’t the bulk of it. As one engineer mentioned to me, today, it’s easier to use a chatbot to learn command-line prompts and shortcuts. We no longer have to memorize them.

Debugging

They offer some sage advice on how to approach debugging without panicking. When I was a new developer, it was definitely panic-inducing to get paged for an issue in a large codebase I was unfamiliar with. I can recall the feelings of helplessness even today. 

Their advice: “Don’t panic!”

Instead, start to develop a problem-solving strategy for how to squash bugs: 

  • First, develop a debugging mindset. Remind yourself that it’s only problem-solving.
  • Start with a clean build. If the build isn’t clean, don’t even bother starting to debug.
  • Gather relevant data. Depending on the issue, it could be dashboards, logs, or user-reported information.
  • Reproduce the bug. Recreating it helps you understand what’s going on and where to begin investigating.
  • Write a failing test before fixing a bug. This one is self-explanatory.
  • Figure out if it’s a crash, a bad result, or input-value sensitivity. Knowing which one it is gives you a good starting point.
  • Use binary chop. Similar to a binary search, halve the stack until you isolate the problem.

Binary Chop

I hadn’t heard of this methodology until now. It’s a simple idea that makes sense. Instead of trying to trace a stack with hundreds of lines of information, cut it in half. See if the problem occurs in that half, then continue dividing the stack until the bug’s origin is found. It’s the debugging version of a binary search method.
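Here is a minimal sketch of the idea in Python. It assumes the failure can be checked by a function, and that once a prefix of the input fails, every longer prefix fails too. The names `binary_chop`, `records`, and `crashes` are hypothetical and are not taken from the book.

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def binary_chop(items: Sequence[T], is_bad: Callable[[Sequence[T]], bool]) -> int:
    """Return the index of the first item whose inclusion makes is_bad true.

    Assumes is_bad(items[:0]) is False and is_bad(items) is True.
    """
    # Invariant: items[:lo] is known good, items[:hi] is known bad.
    lo, hi = 0, len(items)
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(items[:mid]):
            hi = mid  # the culprit is somewhere in the first half
        else:
            lo = mid  # the first half is clean, so look in the second half
    return hi - 1


# Hypothetical input: eight records, one of which triggers the failure.
records = ["ok", "ok", "ok", "ok", "ok", "corrupt", "ok", "ok"]


def crashes(batch: Sequence[str]) -> bool:
    # Stand-in for "run the program on this batch and see whether it fails".
    return "corrupt" in batch


culprit = binary_chop(records, crashes)
print(culprit)  # 5
```

With eight records the culprit is found after three checks instead of up to eight, and the gap widens quickly as the input grows.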

Many times, I remember seeing a stack trace of an issue at Amazon, consisting of hundreds of lines. This strategy would have been useful back then, but of course, hindsight is 20/20.

Process of Elimination

Identify whether the bug is in your code, a framework from a vendor, or the environment the code runs in. Don’t skip testing a line of code because you assume it’s too simple to cause a bug. Such assumptions cost time and effort that could be avoided.

Engineer’s Daybook

One piece of advice I enjoyed reading is to keep an engineering daybook. In today’s parlance, it’s a bullet journal, or something similar. It’s a place where you can jot down notes as you go through the day. Later, you can refer back to them to look back in time or to discover new ideas and solutions. I give a big plus one to this.

Conclusion

I found chapter three to be very practical, with tangible advice that an engineer can start implementing immediately. Most of the tools mentioned are free or easy to get. They do require time and commitment to reach a high level of fluency, but are worth the effort, even on a basic level.

Apprends Lundi 2025-04-28

I continued learning how to talk about food. I now know how to use ‘plein’, which is the opposite of ‘vide’. I also learned how to use ‘ça a l’air’ with food. It took me some time to learn the difference between ‘en face de’ and ‘d’en face’. ‘En face de’ is close to ‘in front of’, whereas ‘d’en face’ is like the left hand and the right hand.

Expressions | Notes
Il n’y a plus d’eau dans la bouteille. Elle est <vide>. | There is nothing left in the bottle.
Je viens de faire les courses. Il y a beaucoup de choses dans le frigo. <Il est plein>. | The opposite of ‘vide’.
<Mon bol> est plein. | Soup is eaten from a bowl.
J’adore <les sushi>. | The plural of ‘sushi’ is written ‘sushi’ or ‘sushis’, regardless of count or gender.
Si tu veux de la glace, <il y en a ici>. |
Non merci, ma tasse est <déjà pleine>. | The cup already contains something.
On mange avec <une fourchette> et <un couteau>. |
<Le poivre> est noir et on l’utilise souvent avec le sel quand on cuisine. | Pepper is a spice.
Il y a beaucoup <d’épices> dans le curry. |
J’aime tous les fruits sauf les poires. Je déteste les poires. | ‘Sauf’ means the same as ‘excepté’: I eat everything, but I never eat pears.
Je veux acheter une baguette mais <il n’y en a plus> à la boulangerie. | In English: ‘there isn’t any more’.
On peut manger <de la chantilly> avec de la glace ou des fraises. | Whipped cream is made with heavy cream.
<La recette> explique comment on doit préparer un plat. |
<Le boulanger> vient de faire ce pain. C’est du pain frais. | Someone who makes bread, croissants, biscuits, etc.
Je regarde tous ces gâteaux et je pense que <ça a l’air délicieux>. | Used for how something looks when you see it.
J’ai <tout mélangé>. | Putting the ingredients together to make something new.
Il ajoute les légumes et elle les mélange avec <une grosse cuillère>. | Cereal is eaten with a spoon.
J’adore tes pâtes d’habitude, mais <celles-ci sont sucrées>. |
C’est <une omelette> au fromage. | An omelette is made with eggs.
Béa mange dans un restaurant, à Rome. Un serveur <passe à côté d’elle>. | When a person walks near someone.
J’ai mangé ce plat. Il y a sept ans et elles n’avaient pas le même goût. | We taste the food we eat.
Sept ans, c’est long. Vous êtes différente maintenant, <vos goûts> ont peut-être changé. |
Non, il n’a jamais travaillé ici. Il travaille dans le restaurant <d’en face>. | There is a difference between ‘en face de’ and ‘d’en face’. ‘En face de’ is used when someone or something is in front of another, for example: ma voiture est en face du magasin. ‘D’en face’ is ‘opposite’ in English, for example: ma main droite est d’en face de ma main gauche.
<Les vegan> aiment les légumes. |
Le sel n’est pas bon <pour la santé>. |
Le frigo est vide et Léo <n’a pas envie de> cuisiner. Le restaurant japonais de son quartier est très bon, alors il va commander des sushis. | When you do not want to do something.
Elle adore la course à pied. | ‘La course à pied’ is the noun for ‘courir’ (running).

Apprends Dimanche 2025-04-27

I finished the lessons on money and started the ones on food. I learned how to ask for information and how to withdraw money from an ATM. I also learned a few expressions about ‘monnaie’. That was easy for me. I started learning the names of dishes and food ingredients. So I find that learning and speaking French is getting easier.

Expressions | Notes
<Il faut> économiser. | How to use <il faut>, from the verb ‘falloir’.
Le dollar est <la monnaie> des États-Unis. | Know the meaning of <monnaie>, and that the word is feminine.
<Gagner sa vie>, c’est gagner assez d’argent pour vivre. | Talking about how you earn money for a living.
Je gagne bien ma vie. | Another use of <gagner sa vie>.
Nous gagnons bien notre vie. | Same as the row above.
Je dois remplir <un formulaire>. | Not a chemistry formula. It is a paper on which you write your name, phone number, and home address.
Vous n’économisez pas assez d’argent. | About not spending money.
Vous économisez assez d’argent <pour acheter> une voiture. |
Toutes ces choses sont gratuites. | Agreement between the noun and the adjective ‘gratuit(es)’.
Vous pouvez <faire votre code>. |
Attention, il ne faut pas oublier ton passeport. |
Tu as fait les courses, Sophie. Oui, mais j’ai perdu <toute la monnaie>. |
Tu retires combien d’argent? | Withdrawing money from an ATM.
Vous pouvez payer <par chèque> ou <en espèces>? | With ‘espèces’, you pay with paper money.
On doit vraiment aller au mariage de tante Élisabeth? |
C’est stupide. Je ne vais porter cette robe qu’une seule fois. | Know the meaning of <stupide>.
<En fait>, je pense que c’est parfait! | In English: ‘in fact’.
enfin | In English: ‘finally’.
Vous pouvez payer seulement en espèces. | ‘Espèces’ in English: cash.
Excusez-moi, j’ai besoin d’une information. | ‘Une information’ in English: a piece of information.
Avez-vous toutes les informations? |
Je ne perds jamais rien. | A sentence I wanted to know.
Tu en ajoutes. |
Il y a trop de sel dans cette soupe. Elle est trop salée. | ‘Sel’ is ‘salt’ and ‘salée’ is ‘salty’ in English.
Ces biscuits sont sucrés. | ‘Sucrés’ in English is ‘sweet’.
L’alcool est peut-être dangereux. | An easy one: ‘l’alcool’ is ‘alcohol’ in English.

Chapter 2: A Pragmatic Approach

I found this chapter to be insightful and can relate it to my experience as an engineer, both good and bad. The chapter discusses good design, DRY, tracer code, prototyping, and estimating. I’ll share what I learned along with my thoughts.

Easy To Change (ETC)

The chapter begins with a discussion of what makes good design. They emphasize that a good design can be changed without changing any adjacent, upstream, or downstream code. In essence, follow ETC: make code easier to change by using techniques such as modularization and interfaces to keep responsibilities separate.

ETC is more than just applying the Single Responsibility Principle. *Thomas* and *Hunt* advocate that it’s a value for decision-making. It won’t tell you how to design and implement systems, but it will help you make better decisions between design and implementation options. 

When dealing with unknown paths or novel projects, their sage advice is to make it replaceable. That way, if a better design option is presented later, that chunk of code can be quickly replaced. In their words, your initial design or implementation decision “won’t be a roadblock” to implementing the better option. 
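As a rough illustration of “make it replaceable”, here is a small Python sketch. The `NoteStore` interface, the `InMemoryNoteStore` class, and the `Journal` class are hypothetical names invented for this example and do not come from the book. The point is only that `Journal` depends on an interface, so the storage behind it can be swapped without touching `Journal`.

```python
from typing import Protocol


class NoteStore(Protocol):
    """Hypothetical interface: anything that can save and load text by key."""

    def save(self, key: str, text: str) -> None: ...

    def load(self, key: str) -> str: ...


class InMemoryNoteStore:
    """First, simplest implementation. Easy to replace later."""

    def __init__(self) -> None:
        self._notes: dict[str, str] = {}

    def save(self, key: str, text: str) -> None:
        self._notes[key] = text

    def load(self, key: str) -> str:
        return self._notes[key]


class Journal:
    """Knows about the NoteStore interface only, not about any concrete store."""

    def __init__(self, store: NoteStore) -> None:
        self._store = store

    def record(self, day: str, text: str) -> None:
        self._store.save(day, text)

    def read(self, day: str) -> str:
        return self._store.load(day)


journal = Journal(InMemoryNoteStore())
journal.record("2025-04-28", "Chose an in-memory store to start; revisit if notes must persist.")
print(journal.read("2025-04-28"))
```

If a file-backed or database-backed store turns out to be the better option later, only the object passed to `Journal` changes.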

Finally, be sure to note the situation and reasoning in a journal and code to be able to understand and reflect on your decision later. This is wise advice for me. I didn’t keep meta notes, and I feel this bit of wisdom is helpful to grow as an engineer.

Don’t Repeat Yourself (DRY)

They emphasize that DRY, or Don’t Repeat Yourself, applies to both code and documentation. If a change has to be made in multiple locations, then the code isn’t DRY. For example, consider a function in which several lines repeat the same steps to perform a calculation. In this case, remove the repetition by moving those steps to a separate function that each operation can call. Even low-level operations such as number and text formatting should be converted to a reusable function call, rather than repeated throughout the code.
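A small Python sketch of the formatting case, with hypothetical names (`format_currency`, `invoice_summary`) chosen for illustration. Instead of repeating the same format string on every line, the formatting rule lives in one function, so a change to how money is displayed happens in one place.

```python
def format_currency(amount: float) -> str:
    """The single place that knows how money is displayed."""
    return f"${amount:,.2f}"


def invoice_summary(subtotal: float, tax_rate: float) -> list[str]:
    """Build the summary lines, reusing format_currency for every amount."""
    tax = subtotal * tax_rate
    total = subtotal + tax
    return [
        f"Subtotal: {format_currency(subtotal)}",
        f"Tax: {format_currency(tax)}",
        f"Total: {format_currency(total)}",
    ]


for line in invoice_summary(1000.0, 0.25):
    print(line)
# Subtotal: $1,000.00
# Tax: $250.00
# Total: $1,250.00
```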

This also means that documentation in the code shouldn’t be a copy of what’s in the code. This will lead to the documentation and the code becoming misaligned in the future because, as the code changes, it’s more likely that the documentation won’t be updated. Instead, use documentation to highlight an exception, a known issue to fix later, or to explain an engineering decision. 

Duplication should also be avoided between APIs (internal and external) and data sources. The goal is to find a neutral standard to specify APIs or data.

Orthogonality

They introduce the concept of *orthogonality* in programming with an example of the systems used to control a helicopter in flight. The controls on a helicopter aren’t independent; moving one lever will affect how the other controls should be manipulated. You cannot move the cyclic to get it to move forward without having to adjust the pitch lever, throttle, and foot pedals. 

The interplay of a helicopter’s systems is an example of a non-orthogonal system where each is intricately intertwined such that you can’t adjust one without adjusting the others. This is not how code should perform.

Each system should operate independently of the other. This makes code changes easier, saves time, and reduces risk because an engineer will understand that they only need to create and test the change in one system (or subsystem). It also promotes component reuse, where new and creative combinations can be made in the future.

During the design process, maintain orthogonality by using implementation layers. That is, the user interface is developed separately from the data access layer, which is separate from authorization and any business logic. A quick check they advise is to ask yourself, ‘If I dramatically change the requirements behind a particular function, how many modules are affected?’

Also, be mindful of how well your design is decoupled from the real world. For example, use randomly generated IDs for user accounts instead of real-world information such as phone numbers, which can change and which you do not control.
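One way to sketch this in Python, using the standard-library `uuid` module. The `UserAccount` class and its fields are hypothetical and exist only to illustrate the point: the identifier is generated once and never depends on data the user can change.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class UserAccount:
    """Hypothetical account record keyed by a random ID, not by a phone number."""

    phone_number: str
    # Generated once at creation; independent of any real-world attribute.
    account_id: str = field(default_factory=lambda: str(uuid.uuid4()))


account = UserAccount(phone_number="555-0100")
original_id = account.account_id

# The user changes their phone number; the account's identity is unaffected.
account.phone_number = "555-0199"
print(account.account_id == original_id)  # True
```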

And of course, this also applies to documentation where content and presentation are separated. It is best to have a platform that handles the presentation layer, such as Markdown, so that you can focus on the content.

Tracer Code

An interesting tool they mention is to use tracer code. Tracer code is where a system is developed just enough to get each layer working together end-to-end to show that the system as a whole can be integrated. This is especially important for novel projects where the possibilities are unknown. The tracer code is developed in the same environment it needs to run in, within the same constraints. It gets from the requirement to an operational but simplified version of the system running quickly. From there, developers can add to each subsystem until all requirements are implemented.

It avoids the burden of developing all the requirements at once without anything to demo until much later. Tracer code helps to demonstrate to all stakeholders that the project is viable and will encourage buy-in and support for the rest of the implementation. Tracer code is skeleton system code that can be used to implement the rest of the functionality.

Prototypes

Prototypes, on the other hand, are meant to be disposable. It is a way to demo or work out a specific problem without producing production code. It can be done on sticky notes, a computer, or a small-scale model. The ultimate difference between tracer code and a prototype is that a prototype can be discarded.

Estimating

Finally, they discuss estimating and how to get a better handle on setting time frames for project task completion. They suggest referencing prior projects or talking to engineers who have experience with this type of project, which is important wisdom to have. 

They also explain how to phrase an estimate so that it sets the right expectations. For example, if a project is estimated to take 25 weeks to complete, quote the duration as about six months. That way, the expectation is to have it done in roughly five to seven months. The margin is then measured in months rather than weeks, which is a significant difference.

They also suggest building a model of the steps needed to understand what needs to be done. The goal is to make sure you have a good understanding of how your team or organization develops projects.

They advise keeping a journal of your estimations to reference and reuse later. As an engineer, this is an area that I struggled with. It is challenging to set a task estimate of work you haven’t done before on a team that provides little guidance and shared estimating knowledge. It can feel daunting. 

They advocate for my preferred method of project task planning and estimation. This method broadly defines all but the initial tasks. Then, as project implementation progresses, progressively refine the remaining tasks in parallel. In my role at Amazon, I was expected to create a firm, detailed task plan as a new engineer, regardless of my experience and uncertainty, and then not deviate from that plan. This method meant that revisions to my timeline had a high negative impact without much guidance.

Conclusion

Overall, what Thomas and Hunt advise resonated with my previous experience. I honestly feel that this book should be required reading for new engineers because it gives them access to wisdom gained from years of experience, whether their team can provide it or not. It is a way to build a strong foundation as an engineer that can be referred to and refined over time.

I am looking forward to reading each chapter of this book. Coming up next, chapter three, entitled The Basic Tools.

Pragmatic Programmer Notes #2

Continuing with chapter one, section seven (7) on communication. Thomas and Hunt offer advice on how to be a better engineer by communicating more effectively.

The section begins by advising that it’s “not just what you’ve got, but also how you package it”. They caution that even the best code or ideas are useless unless other people are aware of them.

Developers communicate in many ways: in meetings and through code, proposals, and reports. When preparing non-code communication, treat your native language like a programming language by honoring the DRY principle and leveraging automation. Automation in this case can include documentation templates.

For code, inline documentation should focus on why a decision is made. It shouldn’t explain how because that’s what the code is for.

  • Know your audience: They caution, “just talking isn’t enough”. You must understand the needs and capabilities of your audience, and request audience feedback to gauge their level of understanding and engagement.
  • Know what you want to say: Create a plan for what you want to say to ensure it expresses what you want in all communication types, including verbal communication.
  • Choose your moment: Figure out if your audience is receptive to your ideas before you start sharing them by understanding their priorities.
  • Choose a style: Understand how your audience wants the information delivered – a formal document, quick details, or a verbal discussion.
  • Make it look good: A good-looking document matters, so take the time to edit and format your communication to make an impact.
  • Involve your audience: Get readers engaged early in the documentation process.
  • Be a listener: Listen to others as you would have them listen to you.
  • Get back to people: Whether it’s email, social media, or documentation comments, there’s no excuse not to respond.

Documentation Best Practices
They devote a separate section to documentation, and as a fan of technical writing, I was pleased to see it.

  • Don’t waste time documenting how in your code, document why.
  • Comment source code to explain parts of a project or engineering trade-offs.
  • Plan your documentation from the start, not as an afterthought.

It’s generally understood that we all need to communicate better. It’s a persistent challenge given all the other priorities we face every day. But it’s worth taking the time to communicate effectively. These engineering communication tips are shared with a voice of experience from both authors.

Pragmatic Programmer Notes #1

I finally got the chance to start reading The Pragmatic Programmer, 20th Anniversary Edition by David Thomas and Andrew Hunt. These are my initial notes from this first reading.

These are my rough notes and are intended for me to reflect on what I read. I capture the key points that I feel are important to pay attention to.

Attitude and Style

  • Think beyond the immediate problem.
  • Place it in a larger context.
  • Seek the bigger picture.

Team Trust

  • The team needs to trust me.

Own It

  • Look for risks beyond my control.
  • Have contingency plans for risks.
  • Know my options such as: prototyping, testing, automation, and learning.

Software Entropy

  • Fix ‘broken windows’ quickly.
  • Document issues as soon as they are known.
  • Don’t do additional harm while fixing additional issues.

Handling Change

  • Be a catalyst for change.
  • Make reasonable asks.
  • Avoid a narrow focus.
  • Know the big picture.
  • Use situational awareness.

Software Quality

  • Write ‘good enough’ software.
  • Meet user and system requirements.
  • Let users participate.
  • Consider modularization or microservices.

Keep Skills Fresh

  • Learn a new language annually.
  • Read a technical book monthly.
  • Take classes.
  • Participate in user groups.
  • Try coding in different environments.
  • Read current news and events.

Apprends Samedi 2025-04-26

Today I studied better than yesterday. I don’t know why, but I’m happy about it. I was a little worried because I love French and I don’t want to forget anything. I also have CEFR forty-six, and I find that I can talk about many parts of life. I started the lessons on money, with words such as ‘monnaie’, ‘l’espèce’, and ‘code’.

Expressions | Notes
J’<ai envie de> faire du sport. | Same as ‘je <veux> faire du sport’.
J’ai mal <au genou>. | The knee is in the middle of the leg.
Tu devrais manger plus de légumes pour avoir bonne santé! |
<A-t-elle envie de> faire de la course? | <Est-ce qu’elle veut>
Vous devrez <prendre soin> de vous. | To stay in good health.
Sarah, tu as honte de toi? |
Paul a honte parce que son maillot de bain est trop petit! |
Vous souhaitez payer <en espèces>? Non, je préfère toujours payer <par carte>. |
Si vous voulez retirer de l’argent, il y a un distributeur. | Banks have ATMs.
Il remplit un document pour ouvrir un compte dans une banque française. |
Vous avez économisé assez d’argent pour acheter une voiture. |
Elle veut gagner de l’argent pour acheter une voiture rouge. |
Je gagne bien ma vie! |
J’achète un pull dans un magasin. Je paie, puis on me donne mon pull et <le ticket>. | ‘Le ticket’ is what you receive after paying for something.
Ça coûte dix euros et quinze centimes. Tu as quinze centimes? J’ai un billet de dix euros mais je n’ai pas de (la) monnaie. |

Google Gen AI 5-Day Intensive: Day Four – Part 2 (4/5)

Codelab #2 – Use Google Search In Generation

This is the second assigned codelab on day four of the intensive. Download it from GitHub to run locally, or run it in the Kaggle notebook linked in the docstring below.

"""Use Google Search in Generation

Google Gen AI 5-Day Intensive Course
Host: Kaggle

Day: 4

Codelab: https://www.kaggle.com/code/markishere/day-4-google-search-grounding
"""

import io
import os
from pprint import pprint

from google import genai
from google.api_core import retry
from google.genai import types
from IPython.display import HTML, Image, Markdown, display

client = genai.Client(api_key=os.environ["GOOGLE_API_KEY"])

# Define a retry policy. The model might make multiple consecutive calls automatically
# for a complex query, this ensures the client retries if it hits quota limits.
is_retriable = lambda e: (
    isinstance(e, genai.errors.APIError) and e.code in {429, 503}
)

if not hasattr(genai.models.Models.generate_content, "__wrapped__"):
    genai.models.Models.generate_content = retry.Retry(predicate=is_retriable)(
        genai.models.Models.generate_content
    )

# To enable search grounding, specify it as a tool 'google_search'
# as a parameter in `GenerateContentConfig` passed to `generate_content`

# Ask for information without search grounding
response = client.models.generate_content(
    model="gemini-2.0-flash",
    contents="When and where is Billie Eilish's next concert?",
)
Markdown(response.text)

# And now rerun the same query with search grounding enabled.
config_with_search = types.GenerateContentConfig(
    tools=[types.Tool(google_search=types.GoogleSearch())]
)


def query_with_grounding():
    response = client.models.generate_content(
        model="gemini-2.0-flash",
        contents="When and where is Billie Eilish's next concert?",
        config=config_with_search,
    )
    # Grounding metadata lives on the candidate, not on the response itself.
    return response.candidates[0]


rc = query_with_grounding()
Markdown(rc.content.parts[0].text)


# Response metadata
# Get links to search suggestions, supporting documents and information
# on how they were used.
while (
    not rc.grounding_metadata.grounding_supports
    or not rc.grounding_metadata.grounding_chunks
):
    # If incomplete grounding data was returned, retry.
    rc = query_with_grounding()

chunks = rc.grounding_metadata.grounding_chunks
for chunk in chunks:
    print(f"{chunk.web.title}: {chunk.web.url}")

HTML(rc.grounding_metadata.search_entry_point.rendered_content)

supports = rc.grounding_metadata.grounding_supports
for support in supports:
    pprint(support.to_json_dict())

markdown_buffer = io.StringIO()

# Print the text with footnote markers.
markdown_buffer.write("Supported text:\n\n")
for support in supports:
    markdown_buffer.write(" * ")
    markdown_buffer.write(
        rc.content.parts[0].text[
            support.segment.start_index : support.segment.end_index
        ]
    )

    for i in support.grounding_chunk_indices:
        chunk = chunks[i].web
        markdown_buffer.write(f"<sup>[{i + 1}]</sup>")

    markdown_buffer.write("\n\n")

# Print footnotes.
markdown_buffer.write("Citations:\n\n")
for i, chunk in enumerate(chunks, start=1):
    markdown_buffer.write(f"{i}. [{chunk.web.title}]({chunk.web.url})\n")

Markdown(markdown_buffer.getvalue())


# Search with tools
# Use Google search grounding and code generation tools
def show_response(response):
    for p in response.candidates[0].content.parts:
        if p.text:
            display(Markdown(p.text))
        elif p.inline_data:
            display(Image(p.inline_data.data))
        else:
            print(p.to_json_dict())
        
        display(Markdown('----'))
        
config_with_search = types.GenerateContentConfig(
    tools=[types.Tool(google_search=types.GoogleSearch())],
    temperature=0.0
)

chat = client.chats.create(model='gemini-2.0-flash')

response = chat.send_message(
    message="What were the medal tallies, by top-10 countries, for the 2024 Olympics?",
    config=config_with_search
)

show_response(response)

config_with_code = types.GenerateContentConfig(
    tools=[types.Tool(code_execution=types.ToolCodeExecution())],
    temperature=0.0
)

response = chat.send_message(
    message="Now plot this as a Seaborn chart. Break out the medals too.",
    config=config_with_code
)

show_response(response)

Google Gen AI 5-Day Intensive: Day Four – Part 1 (4/5)

Codelab #1 – Tune A Gemini Model

This is the first assigned codelab on day four of the intensive. Download it here from Github to run locally or run in this Kaggle notebook.

"""Tune Gemini Model for Custom Function

Google Gen AI 5-Day Intensive Course
Host: Kaggle

Day: 4

Codelab: https://www.kaggle.com/code/markishere/day-4-fine-tuning-a-custom-model
"""

import datetime
import email
import os
import re
import time
import warnings
from collections.abc import Iterable

import pandas as pd
import tqdm
from google import genai
from google.api_core import retry
from google.genai import types
from sklearn.datasets import fetch_20newsgroups
from tqdm.rich import tqdm as tqdmr

client = genai.Client(api_key=os.environ["GOOGLE_API_KEY"])

for model in client.models.list():
    if "createTunedModel" in model.supported_actions:
        print(model.name)
        
newsgroups_train = fetch_20newsgroups(subset='train')
newsgroups_test = fetch_20newsgroups(subset='test')

# View list of class names for dataset
newsgroups_train.target_names
print(newsgroups_train.data[0])

def preprocess_newsgroup_row(data):
    # Extract only the subject and body.
    msg = email.message_from_string(data)
    text = f'{msg["Subject"]}\n\n{msg.get_payload()}'
    # Strip any remaining email addresses
    text = re.sub(r"[\w\.-]+@[\w\.-]+", "", text)
    # Truncate the text to fit within the input limits
    text = text[:40000]

    return text

def preprocess_newsgroup_data(newsgroup_dataset):
    # Put the points into a DataFrame
    df = pd.DataFrame(
        {
            'Text': newsgroup_dataset.data,
            'Label': newsgroup_dataset.target
        }
    )
    # Clean up the text
    df['Text'] = df['Text'].apply(preprocess_newsgroup_row)
    # Match label to target name index
    df['Class Name'] = df['Label'].map(lambda l: newsgroup_dataset.target_names[l])

    return df

# Apply preprocessing to training and test datasets
df_train = preprocess_newsgroup_data(newsgroups_train)
df_test = preprocess_newsgroup_data(newsgroups_test)

df_train.head()

def sample_data(df, num_samples, classes_to_keep):
    # Sample rows, selecting num_samples of each label.
    df = (
        df.groupby('Label')[df.columns]
        .apply(lambda x: x.sample(num_samples))
        .reset_index(drop=True)
    )
    
    df = df[df['Class Name'].str.contains(classes_to_keep)]
    df['Class Name'] = df['Class Name'].astype('category')
    
    return df

TRAIN_NUM_SAMPLES = 50
TEST_NUM_SAMPLES = 10
# Keep rec.* and sci.*
CLASSES_TO_KEEP = '^rec|^sci'

df_train = sample_data(df_train, TRAIN_NUM_SAMPLES, CLASSES_TO_KEEP)
df_test = sample_data(df_test, TEST_NUM_SAMPLES, CLASSES_TO_KEEP)

# Evaluate baseline performance
sample_idx = 0
sample_row = preprocess_newsgroup_row(newsgroups_test.data[sample_idx])
sample_label = newsgroups_test.target_names[newsgroups_test.target[sample_idx]]

print(sample_row)
print('---')
print('Label:', sample_label)

response = client.models.generate_content(
    model='gemini-1.5-flash-001',
    contents=sample_row
)
print(response.text)


# Ask the model directly in a zero-shot prompt.

prompt = "From what newsgroup does the following message originate?"
baseline_response = client.models.generate_content(
    model="gemini-1.5-flash-001",
    contents=[prompt, sample_row])
print(baseline_response.text)


# You can use a system instruction to do more direct prompting, and get a
# more succinct answer.

system_instruct = """
You are a classification service. You will be passed input that represents
a newsgroup post and you must respond with the newsgroup from which the post
originates.
"""

# Define a helper to retry when per-minute quota is reached.
is_retriable = lambda e: (isinstance(e, genai.errors.APIError) and e.code in {429, 503})

# If you want to evaluate your own technique, replace this body of this function
# with your model, prompt and other code and return the predicted answer.
@retry.Retry(predicate=is_retriable)
def predict_label(post: str) -> str:
    response = client.models.generate_content(
        model="gemini-1.5-flash-001",
        config=types.GenerateContentConfig(
            system_instruction=system_instruct),
        contents=post)

    rc = response.candidates[0]

    # Any errors, filters, recitation, etc we can mark as a general error
    if rc.finish_reason.name != "STOP":
        return "(error)"
    else:
        # Clean up the response.
        return response.text.strip()


prediction = predict_label(sample_row)

print(prediction)
print()
print("Correct!" if prediction == sample_label else "Incorrect.")


# Enable tqdm features on Pandas.
tqdmr.pandas()

# But suppress the experimental warning
warnings.filterwarnings("ignore", category=tqdm.TqdmExperimentalWarning)


# Further sample the test data to be mindful of the free-tier quota.
df_baseline_eval = sample_data(df_test, 2, '.*')

# Make predictions using the sampled data.
df_baseline_eval['Prediction'] = df_baseline_eval['Text'].progress_apply(predict_label)

# And calculate the accuracy.
accuracy = (df_baseline_eval["Class Name"] == df_baseline_eval["Prediction"]).sum() / len(df_baseline_eval)
print(f"Accuracy: {accuracy:.2%}")
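
# Sketch: the accuracy calculation above, redone on hand-made lists so the
# arithmetic is easy to follow. The labels and predictions are illustrative
# values only, not results from the model.
demo_labels = ["sci.space", "rec.autos", "sci.med", "sci.space"]
demo_predictions = ["sci.space", "rec.autos", "(error)", "sci.electronics"]

# A prediction of "(error)" never equals a real label, so it counts as a miss.
demo_correct = sum(
    label == pred for label, pred in zip(demo_labels, demo_predictions)
)
demo_accuracy = demo_correct / len(demo_labels)
print(f"Accuracy: {demo_accuracy:.2%}")  # 2 of 4 correct -> 50.00%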


# Tune a custom model
# Convert the data frame into a dataset suitable for tuning.
input_data = {
    'examples': df_train[['Text', 'Class Name']]
        .rename(columns={'Text': 'textInput', 'Class Name': 'output'})
        .to_dict(orient='records')
}
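
# Sketch: the same reshaping done with plain dicts, to show the record shape
# that `input_data` ends up with. The two rows are made-up examples, and
# `demo_rows` / `demo_input_data` are hypothetical names for illustration.
demo_rows = [
    {"Text": "Which launch vehicle reaches orbit?", "Class Name": "sci.space"},
    {"Text": "My engine stalls at idle.", "Class Name": "rec.autos"},
]
demo_input_data = {
    "examples": [
        {"textInput": row["Text"], "output": row["Class Name"]}
        for row in demo_rows
    ]
}
print(demo_input_data["examples"][0])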

# If you are re-running this lab, add your model_id here.
model_id = None

# Otherwise, try to find a recent tuning job.
if not model_id:
  queued_model = None
  # Newest models first.
  for m in reversed(client.tunings.list()):
    # Only look at newsgroup classification models.
    if m.name.startswith('tunedModels/newsgroup-classification-model'):
      # If there is a completed model, use the first (newest) one.
      if m.state.name == 'JOB_STATE_SUCCEEDED':
        model_id = m.name
        print('Found existing tuned model to reuse.')
        break

      elif m.state.name == 'JOB_STATE_RUNNING' and not queued_model:
        # If there's a model still queued, remember the most recent one.
        queued_model = m.name
  else:
    # for-else: runs only if the loop finished without finding a completed model.
    if queued_model:
      model_id = queued_model
      print('Found queued model, still waiting.')
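
# Sketch: the selection logic above on plain data, to show how `for ... else`
# behaves. The `else` branch runs only when the loop ends without `break`.
# `DemoJob`, `pick_model` and the job names are hypothetical.
from collections import namedtuple

DemoJob = namedtuple("DemoJob", ["name", "state"])


def pick_model(jobs_newest_first):
    """Return the newest succeeded job, else the newest running one, else None."""
    queued = None
    for job in jobs_newest_first:
        if job.state == "JOB_STATE_SUCCEEDED":
            chosen = job.name
            break
        elif job.state == "JOB_STATE_RUNNING" and not queued:
            queued = job.name
    else:
        # No succeeded job was found, so fall back to the queued one (if any).
        chosen = queued
    return chosen


demo_jobs = [
    DemoJob("model-c", "JOB_STATE_RUNNING"),
    DemoJob("model-b", "JOB_STATE_FAILED"),
    DemoJob("model-a", "JOB_STATE_SUCCEEDED"),
]
print(pick_model(demo_jobs))      # model-a: a succeeded job wins
print(pick_model(demo_jobs[:2]))  # model-c: only a running job is available
print(pick_model([]))             # None: nothing to reuse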


# Upload the training data and queue the tuning job.
if not model_id:
    tuning_op = client.tunings.tune(
        base_model="models/gemini-1.5-flash-001-tuning",
        training_dataset=input_data,
        config=types.CreateTuningJobConfig(
            tuned_model_display_name="Newsgroup classification model",
            batch_size=16,
            epoch_count=2,
        ),
    )

    print(tuning_op.state)
    model_id = tuning_op.name

print(model_id)


MAX_WAIT = datetime.timedelta(minutes=10)

while not (tuned_model := client.tunings.get(name=model_id)).has_ended:

    print(tuned_model.state)
    time.sleep(60)

    # Don't wait too long. Use a public model if this is going to take a while.
    if datetime.datetime.now(datetime.timezone.utc) - tuned_model.create_time > MAX_WAIT:
        print("Taking a shortcut, using a previously prepared model.")
        model_id = "tunedModels/newsgroup-classification-model-ltenbi1b"
        tuned_model = client.tunings.get(name=model_id)
        break
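
# Sketch: the same "poll until finished, but give up after a deadline" pattern
# as a reusable function. It measures elapsed time from the start of the wait
# (the loop above measures from the job's creation time instead). The state
# strings and `wait_until_done` are hypothetical, not SDK names.
import time


def wait_until_done(get_state, max_wait_s, poll_s):
    """Poll `get_state()` until it reports a finished state or time runs out.

    Returns a `(last_state, timed_out)` tuple.
    """
    done_states = {"SUCCEEDED", "FAILED"}
    start = time.monotonic()
    while (state := get_state()) not in done_states:
        if time.monotonic() - start >= max_wait_s:
            return state, True
        time.sleep(poll_s)
    return state, False


demo_states = iter(["RUNNING", "RUNNING", "SUCCEEDED"])
demo_state, demo_timed_out = wait_until_done(
    lambda: next(demo_states), max_wait_s=600, poll_s=0
)
print(demo_state, demo_timed_out)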


print(f"Done! The model state is: {tuned_model.state.name}")

if not tuned_model.has_succeeded and tuned_model.error:
    print("Error:", tuned_model.error)
    

# Use the new model
new_text = """
First-timer looking to get out of here.

Hi, I'm writing about my interest in travelling to the outer limits!

What kind of craft can I buy? What is easiest to access from this 3rd rock?

Let me know how to do that please.
"""

response = client.models.generate_content(
    model=model_id, contents=new_text)

print(response.text)


@retry.Retry(predicate=is_retriable)
def classify_text(text: str) -> str:
    """Classify the provided text into a known newsgroup."""
    response = client.models.generate_content(
        model=model_id, 
        contents=text)
    rc = response.candidates[0]

    # Treat any error, safety filter, recitation block, etc. as a general error.
    if rc.finish_reason.name != "STOP":
        return "(error)"
    else:
        # Strip whitespace so the exact-match accuracy check is not thrown off.
        return rc.content.parts[0].text.strip()


# The sampling here is just to minimise your quota usage. If you can, you should
# evaluate the whole test set with `df_model_eval = df_test.copy()`.
df_model_eval = sample_data(df_test, 4, '.*')

df_model_eval["Prediction"] = df_model_eval["Text"].progress_apply(classify_text)

accuracy = (df_model_eval["Class Name"] == df_model_eval["Prediction"]).sum() / len(df_model_eval)
print(f"Accuracy: {accuracy:.2%}")


# Compare token usage
# Calculate the *input* cost of the baseline model with system instructions.
sysint_tokens = client.models.count_tokens(
    model='gemini-1.5-flash-001', contents=[system_instruct, sample_row]
).total_tokens
print(f'System instructed baseline model: {sysint_tokens} (input)')

# Calculate the input cost of the tuned model.
tuned_tokens = client.models.count_tokens(model=tuned_model.base_model, contents=sample_row).total_tokens
print(f'Tuned model: {tuned_tokens} (input)')

# Express the savings as a fraction of the baseline's input tokens.
savings = (sysint_tokens - tuned_tokens) / sysint_tokens
print(f'Token savings: {savings:.2%}')  # Note that this is only n=1.
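
# Sketch: the token difference can be normalized two ways, and the two give
# different percentages. The counts below are illustrative, not measured, and
# both function names are hypothetical.
def savings_vs_baseline(baseline_tokens, tuned_tokens):
    """Fraction of the baseline's input tokens that the tuned model avoids."""
    return (baseline_tokens - tuned_tokens) / baseline_tokens


def overhead_vs_tuned(baseline_tokens, tuned_tokens):
    """How many more input tokens the baseline uses, relative to the tuned model."""
    return (baseline_tokens - tuned_tokens) / tuned_tokens


print(f"{savings_vs_baseline(200, 150):.2%}")  # 50 / 200 -> 25.00%
print(f"{overhead_vs_tuned(200, 150):.2%}")    # 50 / 150 -> 33.33%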


# Tweak output token quantity
baseline_token_output = baseline_response.usage_metadata.candidates_token_count
print('Baseline (verbose) output tokens:', baseline_token_output)

tuned_model_output = client.models.generate_content(
    model=model_id, contents=sample_row)
tuned_tokens_output = tuned_model_output.usage_metadata.candidates_token_count
print('Tuned output tokens:', tuned_tokens_output)