Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ci: Add extra flags for Engine & CLI tests #7416

Merged
merged 2 commits into from
May 22, 2024

Conversation

gerhard
Copy link
Member

@gerhard gerhard commented May 20, 2024

--parallel now defaults to the number of CPUs. Before, this was hard-coded to 16 which was set the number of CPUs in the larger GitHub Runners, the ubuntu-22.04-16c-64g-600gb instance size specifically. We still want 16 in CI - we set this explicitly in the GitHub workflows - but are likely to want less locally where some of us only have 8 or 10 CPUs.

--timeout defaults to 30m. This was changed not that long ago - the value was set to 1h when using race. The bottom line is that none of our test suites should take more than 10 minutes. This is something that we are actively working towards. For now, until the pipeline becomes more reliable, we will stick with the current 30m default. When running locally with fewer CPUs, or when hunting deadlocks, experience recommends a value of 180m.

--failfast is a new one. It sets --max-fails=1 in gotestsum. FTR:
https://github.com/gotestyourself/gotestsum/blob/e9677fb405a5c0dc0b9d7217a7c73c3a20c5cb66/cmd/testdata/gotestsum-help-text#L20


Part of:

FWIW, I am using the following shell function with this change:

dagger-test-all() {
        local parallel="${1:-16}"
        local timeout="${2:-180m}"
        time dagger call --debug --source=".:default" test all --race=true --failfast=true --parallel=$parallel --timeout=$timeout 2>&1 | ts | tee test.all.race.failfast.parallel${parallel}.timeout${timeout}.$(date "+%Y-%m-%d.%H-%M-%S").txt
}

@jedevc
Copy link
Member

jedevc commented May 20, 2024

I'm not sure we should set failfast by default - this is very helpful when making large refactorings, and trying to observe all the genuine failures. It's annoying in the case of flakes, or going for a fully green run, but getting all the failures is genuinely useful a lot of the time in my experience.

Comment on lines +107 to +112
// Default timeout to 30m
// No test suite should take more than 30 minutes to run
if timeout == "" {
timeout = "30m"
}
args = append(args, fmt.Sprintf("-timeout=%s", timeout))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should use // +default for this one.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we did this, we would have the default in 3 places:

  1. All()
  2. Important()
  3. Custom()

Since the race flag uses the same approach - default in code rather than // +default - I think that we should leave this as is. WDYT?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, alternatively, we should refactor all of these into one function - then which tests to run should probably be flags on in.

Not a blocker though, happy as is.

@gerhard
Copy link
Member Author

gerhard commented May 20, 2024

@jedevc I'm not sure we should set failfast by default - this is very helpful when making large refactorings, and trying to observe all the genuine failures.

Makes sense. Changing the default to false.

`--parallel` now defaults to the number of CPUs. Before, this was
hard-coded to `16` which was set the number of CPUs in the larger GitHub
Runners, the `ubuntu-22.04-16c-64g-600gb` instance size specifically. If
this new default value does not work well on the larger CI runners (they
have 48CPUs & we run multiple pipelines on them at the same time), we
can specify this value explicitly in the workflows.

`--timeout` defaults to `30m`. This was changed not that long ago - the
value was set to 1h when using race. The bottom line is that none of our
test suites should take more than 10 minutes. This is something that we
are actively working towards. For now, until the pipeline becomes more
reliable, we will stick with the current 30m default. When running
locally with fewer CPUs, or when hunting deadlocks, experience
recommends a value of 180m.

`--failfast` is a new one. It sets `--max-fails=1` in `gotestsum`. FTR:
https://github.com/gotestyourself/gotestsum/blob/e9677fb405a5c0dc0b9d7217a7c73c3a20c5cb66/cmd/testdata/gotestsum-help-text#L20

Signed-off-by: Gerhard Lazu <gerhard@dagger.io>
@gerhard gerhard force-pushed the add-extra-flags-to-engine-and-cli-tests branch from 2aabe60 to bc6ad3e Compare May 20, 2024 14:11
If it defaults to 64 CPUs, as it did here, it's more likely to fail:
https://dagger.cloud/dagger/traces/8a089c6b4c74959d5a3aef1d8aeded74#b2d066943ad6dbc4:L1051

Signed-off-by: Gerhard Lazu <gerhard@dagger.io>
Comment on lines +107 to +112
// Default timeout to 30m
// No test suite should take more than 30 minutes to run
if timeout == "" {
timeout = "30m"
}
args = append(args, fmt.Sprintf("-timeout=%s", timeout))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, alternatively, we should refactor all of these into one function - then which tests to run should probably be flags on in.

Not a blocker though, happy as is.

@@ -25,7 +25,7 @@ jobs:
uses: ./.github/workflows/_dagger_call.yml
secrets: inherit
with:
function: test all --race=true
function: test all --race=true --parallel=16
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this different from NumCPUs? I think we can leave this unset.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, we are explicitly setting this to 16 in CI.

The new CI runners usually have 48 and even 64 CPUs which makes tests fail. Before setting this, checks failed.

@gerhard
Copy link
Member Author

gerhard commented May 22, 2024

Going to merge this since the failing check is due to:

The check is currently showing as failing:
image

But the last run - attempt #3 - succeeded 🤷‍♂️
image

@gerhard gerhard merged commit f1cbd7c into dagger:main May 22, 2024
93 of 94 checks passed
@gerhard gerhard deleted the add-extra-flags-to-engine-and-cli-tests branch May 22, 2024 13:15
vikram-dagger pushed a commit to vikram-dagger/dagger that referenced this pull request May 24, 2024
* ci: Add extra flags for Engine & CLI tests

`--parallel` now defaults to the number of CPUs. Before, this was
hard-coded to `16` which was set the number of CPUs in the larger GitHub
Runners, the `ubuntu-22.04-16c-64g-600gb` instance size specifically. We
still want `16` in CI - we set this explicitly in the GitHub workflows - but are 
likely to want less locally where some of us only have `8` or `10` CPUs.

`--timeout` defaults to `30m`. This was changed not that long ago - the
value was set to 1h when using race. The bottom line is that none of our
test suites should take more than 10 minutes. This is something that we
are actively working towards. For now, until the pipeline becomes more
reliable, we will stick with the current 30m default. When running
locally with fewer CPUs, or when hunting deadlocks, experience
recommends a value of 180m.

`--failfast` is a new one. It sets `--max-fails=1` in `gotestsum`. FTR:
https://github.com/gotestyourself/gotestsum/blob/e9677fb405a5c0dc0b9d7217a7c73c3a20c5cb66/cmd/testdata/gotestsum-help-text#L20

Signed-off-by: Gerhard Lazu <gerhard@dagger.io>

* ci: Default Engine & CLI test parallel to 16 in CI

If it defaults to 64 CPUs, as it did here, it's more likely to fail:
https://dagger.cloud/dagger/traces/8a089c6b4c74959d5a3aef1d8aeded74#b2d066943ad6dbc4:L1051

Signed-off-by: Gerhard Lazu <gerhard@dagger.io>

---------

Signed-off-by: Gerhard Lazu <gerhard@dagger.io>
vikram-dagger pushed a commit to vikram-dagger/dagger that referenced this pull request May 24, 2024
* ci: Add extra flags for Engine & CLI tests

`--parallel` now defaults to the number of CPUs. Before, this was
hard-coded to `16` which was set the number of CPUs in the larger GitHub
Runners, the `ubuntu-22.04-16c-64g-600gb` instance size specifically. We
still want `16` in CI - we set this explicitly in the GitHub workflows - but are
likely to want less locally where some of us only have `8` or `10` CPUs.

`--timeout` defaults to `30m`. This was changed not that long ago - the
value was set to 1h when using race. The bottom line is that none of our
test suites should take more than 10 minutes. This is something that we
are actively working towards. For now, until the pipeline becomes more
reliable, we will stick with the current 30m default. When running
locally with fewer CPUs, or when hunting deadlocks, experience
recommends a value of 180m.

`--failfast` is a new one. It sets `--max-fails=1` in `gotestsum`. FTR:
https://github.com/gotestyourself/gotestsum/blob/e9677fb405a5c0dc0b9d7217a7c73c3a20c5cb66/cmd/testdata/gotestsum-help-text#L20

Signed-off-by: Gerhard Lazu <gerhard@dagger.io>

* ci: Default Engine & CLI test parallel to 16 in CI

If it defaults to 64 CPUs, as it did here, it's more likely to fail:
https://dagger.cloud/dagger/traces/8a089c6b4c74959d5a3aef1d8aeded74#b2d066943ad6dbc4:L1051

Signed-off-by: Gerhard Lazu <gerhard@dagger.io>

---------

Signed-off-by: Gerhard Lazu <gerhard@dagger.io>
Signed-off-by: Vikram Vaswani <vikram@dagger.io>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants